Article

Machine Learning-Driven Acoustic Feature Classification and Pronunciation Assessment for Mandarin Learners

Gulnur Arkin, Tangnur Abdukelim, Hankiz Yilahun and Askar Hamdulla
1 School of Public Administration, Xinjiang University of Finance and Economics, Urumqi 830026, China
2 School of Computer Science and Technology, Xinjiang University, Urumqi 830049, China
3 School of Intelligence Science and Technology (School of Future Technology), Xinjiang University, Urumqi 830049, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 6335; https://doi.org/10.3390/app15116335
Submission received: 21 April 2025 / Revised: 2 June 2025 / Accepted: 4 June 2025 / Published: 5 June 2025

Abstract
Based on acoustic feature analysis, this study systematically examines differences in vowel pronunciation among Mandarin learners at various proficiency levels. A speech corpus was constructed containing samples from advanced, intermediate, and elementary learners (N = 50) and standard speakers (N = 10), for a total of 5880 samples. Support Vector Machine (SVM) and ID3 decision tree algorithms were employed to classify vowel formant (F1-F2) patterns. The results demonstrate that SVM significantly outperforms ID3 in vowel classification, with an average accuracy of 92.09% across the three learner groups (92.38% for advanced, 92.25% for intermediate, and 91.63% for elementary), an improvement of 2.05 percentage points over ID3 (p < 0.05). Learners’ vowel production exhibits systematic deviations, which are particularly pronounced in complex vowels for the elementary group. For instance, the apical vowel “ẓ” deviates by 2.61 Bark (standard group: F1 = 3.39/F2 = 8.13; elementary group: F1 = 3.42/F2 = 10.74), whereas the advanced group’s deviations are generally below 0.5 Bark (e.g., the vowel “a” deviates by only 0.09 Bark). The difficulty of tongue position control correlates strongly with deviation magnitude (r = 0.87, p < 0.001). This study confirms the effectiveness of objective, formant-based assessment in speech acquisition research, provides a theoretical basis for algorithm optimization in speech evaluation systems, and holds significant application value for the development of Computer-Assisted Language Learning (CALL) systems and for improving multi-ethnic Mandarin speech recognition technology.

1. Introduction

With the continued development of the global economy, exchanges and cooperation among countries in politics, economics, culture, and education have become more frequent, and people increasingly travel or study outside their home countries. Language is the most convenient means of communication, and in today’s society it is common for individuals to learn two or three languages or dialects in addition to their mother tongue or native dialect [1]. Mastering a communicative language, especially its spoken form, is therefore crucial. Mandarin Chinese is the common language of China: it takes Beijing pronunciation as its standard and the northern dialects as its base, and it is the shared language of China’s 56 ethnic groups. Promoting Mandarin enhances unity among ethnic groups, facilitates exchanges between regions, strengthens the cohesion and centripetal force of the Chinese nation, and holds great significance for social development [2]. Vowels are the basic units that constitute speech sounds, and their correct pronunciation is essential for the quality of spoken communication. For learners whose Mandarin is non-standard, accurately mastering Mandarin vowel pronunciation remains a challenge. Computer-Assisted Language Learning (CALL) [3,4,5] systems have emerged in this context: they use computers to assist (or replace) humans in traditional speech teaching, promoting students’ autonomous learning by providing speech instruction and addressing pronunciation issues in oral language. Mandarin learners therefore look forward to automatic detection methods for Mandarin pronunciation errors that, like a teacher, can precisely identify learners’ pronunciation problems, propose corresponding improvement suggestions and measures, and further raise students’ Mandarin pronunciation level and ability [6].
Consequently, the evaluation and classification of Mandarin learners’ vowel pronunciation quality is an important research direction in phonetics and computer-assisted language teaching. With the rapid development of acoustic analysis techniques and machine learning algorithms, significant progress has been made in both theory and practice. However, owing to the complexity of the Mandarin vowel system and the diversity of learners’ proficiency levels, accurately and efficiently evaluating and classifying learners’ vowel pronunciation quality remains challenging. Numerous scholars have conducted extensive research on learners’ pronunciation assessment and vowel classification, and recent breakthroughs in speech representation learning have provided a new paradigm for pronunciation assessment. The wav2vec 2.0 framework proposed by Baevski et al. [7] extracts deep acoustic features from raw speech signals through self-supervised learning and demonstrates superior performance in cross-linguistic speech recognition tasks, opening up the possibility of integrating traditional acoustic parameters (such as formants) with deep features. In multilingual vowel classification, Shaw et al. [8] found significant categorical differences in cross-accent vowel perception through a comparative study of five accents of English, highlighting the urgency of combining acoustic features with perceptual evaluation.
In acoustic feature analysis, traditional research often relies on formant parameters. Peterson and Barney [9] revealed the clustering characteristics of vowel categories through F1-F2 acoustic space modeling, and this theory has been widely applied to second-language pronunciation error detection. Recent studies have further extended this paradigm: Yuan et al. [10] proposed an improved contextualized speech representation method that enhances the modeling of tonal languages by incorporating prosodic features, and Chen et al. [11] used convolutional neural networks (CNNs) to achieve end-to-end classification of Chinese tones with an accuracy of 92.3%, validating the potential of deep learning for speech feature extraction. Additionally, the authors of [12,13] qualitatively analyzed the relationship between formant changes, vowel phonemes, and tongue positions. Study [14] explains formant changes in vowels in greater detail, and the authors of [15] note that formant movement trajectories are effective parameters for vowel recognition. Study [16] uses formants to analyze the variation trends of vowels in eight languages, including Arabic, English, Mandarin Chinese, and French. On this basis, the authors of [17] propose an objective test of the pronunciation level of Mandarin finals based on formant patterns, using formants as pronunciation evaluation features and SVM as the classifier. Amami et al. [18] systematically compared algorithms such as SVM and KNN for vowel recognition and found that SVM has a significant advantage in high-dimensional acoustic feature classification (a 12% improvement in F1 score), consistent with the conclusions of this study. However, existing research tends to focus on optimizing individual algorithms and lacks systematic evaluation across learner proficiency levels. For example, the KNN-SVM hybrid model proposed by Hafiz [19] performs excellently in isolated word recognition but does not consider hierarchical differences in learners’ pronunciation levels.
Furthermore, the authors of [20] employ a phoneme classifier, trained on speech from native speakers of the target language with mel-frequency cepstral coefficients as features, to assess learners’ pronunciation: the closer a learner’s pronunciation is to the native training data, the higher the classifier’s score. Prosodic feature models are adopted in [21,22], which use segment duration and speaking rate models to evaluate learners’ pronunciation quality. The authors of [23] propose an acoustic-feature-based pronunciation quality assessment model for second-language learners’ vowel pronunciation problems, emphasizing the importance of formant parameters in non-native pronunciation analysis. The authors of [24] summarize the classification performance of SVM in detail, further validating its effectiveness on acoustic feature data. However, existing research primarily analyzes the performance of single algorithms and lacks a systematic exploration of vowel classification for learners at different proficiency levels. In the field of CALL, the intelligent analysis of acoustic features provides new ideas for the development of pronunciation evaluation systems. Study [25] summarizes the application of speech technology in education, emphasizing the importance of acoustic features in intelligent assessment, and the authors of [26] investigate improved speech recognition methods in multilingual contexts, indicating that refined analysis of formant features improves recognition accuracy. These studies provide important references for the objectives of this research.
Despite the many achievements of scholars in China and abroad in classifying vowels produced by Mandarin learners, existing research still has limitations. First, there is a lack of systematic studies of learners at different proficiency levels, particularly in-depth exploration of stratified pronunciation assessment based on formant features. Second, applications of machine learning to vowel classification mainly focus on single algorithms and lack comparative analysis of multiple algorithms. Moreover, existing research rarely combines acoustic features with machine learning to explore their practical value in computer-assisted language teaching. By combining the SVM and ID3 algorithms [27], this study systematically analyzes vowel classification for Mandarin learners at different proficiency levels, aiming to validate the effectiveness of formant features for vowel classification and to compare performance differences among algorithms. It further investigates the pronunciation characteristics of learners at different levels and their application value in teaching, providing new theoretical foundations and methodological references for Mandarin teaching and pronunciation evaluation. The findings contribute to the development of computer-assisted language teaching systems and provide important evidence for enhancing the accuracy of multi-ethnic Mandarin speech recognition.

2. Experimental Procedure and Design

2.1. Experimental Subjects and Speech Data Collection

This study focuses on analyzing the formant features of Mandarin Chinese vowels produced by learners. Fifty Mandarin learners (equal numbers of males and females) were selected, all born in Xinjiang, with more than ten years of Mandarin learning experience. Their oral scores in the Chinese Proficiency Test were all above 45, and they had no language or hearing problems.
An additional ten subjects were standard Mandarin speakers. Based on MHK scores (the MHK is the national Chinese proficiency test for ethnic minorities) and expert auditory assessment, the learners were divided into three groups (Figure 1): the elementary group (UCB), 12 individuals (6 males/6 females), average score 56.0 ± 1.41; the intermediate group (UCI), 20 individuals (10 males/10 females), average score 66.0 ± 1.41; and the advanced group (UCA), 18 individuals (9 males/9 females), average score 82.5 ± 0.71. The control group (MC) had an average score of 88.0 ± 1.41. Mandarin proficiency was significantly positively correlated with test scores (Pearson’s r = 0.98, p < 0.001), validating the effectiveness of the grouping.
In this experiment, speech recording was conducted in a dedicated recording room. The equipment used included a laptop, external sound card, microphone, and interconnecting data cables. An external sound card was chosen for its ability to adjust volume, reduce noise, and monitor plosive sounds. The recording software was developed using Matlab R2019a, with a sampling frequency of 16 kHz. The reading material for the subjects consisted of monosyllabic words, with 98 data points per speaker, resulting in a total of 60 × 98 = 5880 Mandarin monosyllabic word data points. Detailed information about the subjects in each group is shown in Table 1.
After obtaining the speech data, Praat software (version 6.1.16) was used to process the Mandarin vowels of standard Mandarin speakers and learners, extracting the first and second formants of the obtained data. SVM and ID3 algorithms were then employed for vowel classification.
Feature Extraction: Each vowel segment was time-normalized to 200 ms using Dynamic Time Warping (DTW) in Praat to ensure uniform duration. The first and second formant trajectories (F1-F2) were then concatenated into a multivariate feature vector $X \in \mathbb{R}^{400}$, where dimensions 1-200 and 201-400 contain F1 and F2 values sampled at 10 ms intervals, respectively. This fixed-length representation preserves spectral dynamics while enabling direct comparison across proficiency levels.
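For concreteness, the sketch below shows one way to produce such a vector. It is a minimal sketch, assuming the praat-parselmouth Python package, with simple uniform resampling standing in for the DTW-based normalization described above; function and variable names are illustrative, not the authors’ implementation.

```python
# Minimal sketch of the feature extraction step, assuming praat-parselmouth
# (pip install praat-parselmouth). Uniform resampling stands in for the
# paper's DTW-based time normalization.
import numpy as np
import parselmouth


def formant_feature_vector(wav_path: str, n_frames: int = 200) -> np.ndarray:
    """Return a 400-dim vector: 200 F1 samples followed by 200 F2 samples."""
    snd = parselmouth.Sound(wav_path)
    formant = snd.to_formant_burg()  # Burg-method formant tracking, as in Praat
    times = np.linspace(snd.xmin, snd.xmax, n_frames)
    f1 = np.array([formant.get_value_at_time(1, t) for t in times])
    f2 = np.array([formant.get_value_at_time(2, t) for t in times])
    # Fill undefined frames (NaN near segment edges) by linear interpolation.
    for track in (f1, f2):
        bad = np.isnan(track)
        track[bad] = np.interp(np.flatnonzero(bad), np.flatnonzero(~bad),
                               track[~bad])
    return np.concatenate([f1, f2])  # dims 1-200: F1, dims 201-400: F2
```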

2.2. Correlation Analysis

To investigate the relationship between acoustic features and pronunciation proficiency, this study employs the Pearson correlation coefficient (r) to quantify the linear correlation between the following variables:
Independent variable: learners’ vowel acoustic deviation (in Bark), calculated as the Euclidean distance between each vowel’s F1-F2 values and those of the standard group. Dependent variable: learners’ Mandarin proficiency level (elementary = 1; intermediate = 2; advanced = 3). Statistical analysis was performed in Python 3.10 with SciPy 1.13.0 (scipy.stats module), with the significance level set at α = 0.05. The specific steps are as follows:
  • Calculate the mean deviation for all vowels of each learner to generate a deviation dataset;
  • Convert Mandarin proficiency levels into ordinal numeric variables;
  • Call the pearsonr() function to compute the correlation coefficient (r) and p-value;
  • Control for multiple-comparison errors using the Bonferroni correction (a minimal sketch of these steps is given below).
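The following Python/SciPy sketch illustrates these steps; the deviation and proficiency arrays are placeholder values, not the study’s measurements.

```python
# Sketch of the correlation analysis above (Python + SciPy). The arrays hold
# placeholder values, not the study's data.
import numpy as np
from scipy.stats import pearsonr

deviation = np.array([2.4, 1.9, 1.2, 0.9, 0.4, 0.3])  # mean Bark deviation per learner
proficiency = np.array([1, 1, 2, 2, 3, 3])             # elementary=1, intermediate=2, advanced=3

r, p = pearsonr(deviation, proficiency)

# Bonferroni correction: with m planned comparisons (e.g., one per vowel),
# each p-value is tested against alpha / m instead of alpha.
m, alpha = 8, 0.05
print(f"r = {r:.2f}, p = {p:.4g}, significant after correction: {p < alpha / m}")
```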
The analysis results show a significant negative correlation between acoustic deviation and pronunciation proficiency (r = −0.76, p < 0.001), indicating that the poorer the pronunciation ability, the more obvious the acoustic features deviate from standard values.

3. Experimental Results and Analysis

Based on 5880 speech samples from 50 Mandarin Chinese learners (advanced group UCA, intermediate group UCI, elementary group UCB) and 10 standard speakers, this study systematically evaluates the performance differences between SVM and ID3 decision tree algorithms in vowel classification tasks and explores the discriminative efficacy of formant parameters for non-standard pronunciation. By constructing an acoustic feature set with the first and second formants (F1-F2) as the core, combined with machine learning algorithms, the vowel pronunciation quality of the three learner groups is classified hierarchically. The experiment focuses on the following core issues: (1) the classification sensitivity of different algorithms to pronunciation features at different levels; (2) the quantitative representation ability of acoustic parameters in detecting non-standard pronunciation; and (3) the systematic acoustic patterns of learners’ pronunciation errors. Through comparative analysis, this study not only reveals the intrinsic relationship between algorithm performance and acoustic features but also provides data-driven theoretical foundations for optimizing pronunciation assessment models.

3.1. Classification Results and Analysis Based on Support Vector Machine

Formant patterns discriminate among vowels to a useful degree. Therefore, following the SVM-based classification evaluation method described in reference [17], this study conducted a classification analysis of vowels produced by learners at different proficiency levels; this method helps correct tongue position and lip shape issues for Mandarin Chinese learners. SVM is a supervised learning algorithm used primarily for classification and regression tasks. Its core idea is to find an optimal hyperplane (or a hyperplane after non-linear mapping) that maximizes the margin between classes of samples, achieving high generalization performance [28,29].
We optimized the SVM via grid search (five-fold cross-validation) over the penalty factor C ∈ {0.1, 1, 10} and the RBF kernel width σ ∈ {0.01, 0.1, 1}. The best-performing setting (C = 10, σ = 0.1) was used for all subsequent analyses. To benchmark our formant-based features (F1-F2), we also extracted 256-dimensional embeddings using wav2vec 2.0 [7] (mean-pooled frame-level features); the formant-based SVM achieved comparable accuracy (92.1% vs. 91.8% with wav2vec), confirming the efficacy of our approach for pronunciation assessment.
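A scikit-learn sketch of this search procedure follows, under stated assumptions: the paper does not name its toolkit, and scikit-learn parameterizes the RBF kernel as exp(−γ‖x − x′‖²), so γ = 1/σ² for the kernel defined in Equation (5) below. The data arrays are placeholders.

```python
# Sketch of the grid search described above, using scikit-learn (an assumed
# toolkit). gamma = 1 / sigma^2 maps the paper's sigma grid onto sklearn's
# RBF parameterization.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(588, 400)      # placeholder F1-F2 feature vectors
y = np.random.randint(0, 8, 588)  # placeholder labels for the 8 vowels

param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [1 / s**2 for s in (0.01, 0.1, 1)],  # sigma grid -> gamma
}
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid, cv=5, scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```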
The formant patterns for different monophthongs are shown in Figure 2; the trajectories demonstrate that different vowels exhibit distinctly different formant patterns. SVM is a learning method that seeks the optimal classifier for two-class problems that are not linearly separable. It is widely applied in fields such as image processing, speech signal processing, and speech recognition [30], because it treats classification error as a constraint in the optimization problem and follows the structural risk minimization principle.
To separate monophthong samples in the speech of learners at different proficiency levels, the optimal solution involves minimizing the following objective function:
$$\phi(\omega, \xi) = \frac{1}{2}\langle \omega, \omega \rangle + C \sum_{i=1}^{l} \xi_i \quad (1)$$
subject to the following constraints:
$$y_i(\omega \cdot x_i + b) - 1 + \xi_i \ge 0, \quad i = 1, 2, \ldots, l \quad (2)$$
where $\xi_i$ are slack variables representing margin violations, $\omega$ is the weight vector determining the decision boundary, $C$ is the regularization parameter controlling the penalty for misclassified samples, $x_i$ are the training samples, and $y_i \in \{-1, 1\}$ are the class labels. This formulation corresponds to the soft-margin Support Vector Machine (SVM) framework, in which the goal is to maximize the margin between classes while allowing controlled misclassifications via the slack variables $\xi_i$; the parameter $C$ balances the trade-off between maximizing the margin and minimizing classification errors.
The dual problem is solved by maximizing the quadratic function
$$W(a) = \sum_{i=1}^{l} a_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j a_i a_j K(x_i, x_j) \quad (3)$$
subject to the constraints
$$0 \le a_i \le C, \quad i = 1, 2, \ldots, l; \qquad \sum_{i=1}^{l} a_i y_i = 0 \quad (4)$$
In Equation (3), $a_i$ is a Lagrange multiplier, $K(x_i, x_j)$ is the kernel function, and $x_i, x_j$ are samples. The radial basis function (RBF) is used as the inner-product kernel,
$$K(x, x_i) = \exp\left(-\frac{\|x - x_i\|^2}{\sigma^2}\right) \quad (5)$$
where $x_i$ is a sample and $\sigma$ is a tunable width parameter; the output weights are determined automatically by the algorithm. The resulting decision function is
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{l} a_i y_i K(x, x_i) + b\right) \quad (6)$$
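For clarity, Equations (5) and (6) translate directly into the short NumPy sketch below; the support vectors, multipliers, labels, and bias are hypothetical stand-ins for a trained dual solution.

```python
# Direct rendering of Equations (5)-(6). The trained quantities (support
# vectors sv, multipliers a, labels y, bias b) are hypothetical inputs here.
import numpy as np

def rbf_kernel(x, xi, sigma=0.1):
    """Equation (5): K(x, x_i) = exp(-||x - x_i||^2 / sigma^2)."""
    return np.exp(-np.sum((x - xi) ** 2) / sigma ** 2)

def decision_function(x, sv, a, y, b, sigma=0.1):
    """Equation (6): f(x) = sign(sum_i a_i y_i K(x, x_i) + b)."""
    s = sum(ai * yi * rbf_kernel(x, xi, sigma) for ai, yi, xi in zip(a, y, sv))
    return np.sign(s + b)
```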
Figure 3 below shows the classification results for each monophthong for learners at different levels based on the SVM.
As shown in Figure 3, the overall classification achieves an average accuracy of 92.08% across vowels, although performance is unbalanced for some vowels. Classifying vowels from formant patterns has two research values: it reveals which vowels are easily confused in the data, and it supports building a more reliable acoustic model for automatic vowel recognition. The results indicate that formant features can separate standard from non-standard pronunciations, suggesting that using formant structural information for pronunciation evaluation is reasonable. The specific results are summarized in Table 2.
The classification results based on SVM indicate significant differences in vowel pronunciation accuracy and recall among Mandarin learners at different proficiency levels. The advanced level group (UCA) demonstrates excellent classification performance across all vowels, with an overall average accuracy of 93.1% and a recall of 91.6%. Notably, the accuracy and recall for vowels “ɤ” and “u” both reach 94%, suggesting that advanced learners exhibit highly stable acoustic features in their pronunciation of these vowels, with concentrated formant (F1, F2) distributions close to standard pronunciation. However, the accuracy and recall for apical vowels “ẓ” and “Applsci 15 06335 i001” are slightly lower (90% and 88–90%, respectively), possibly related to the specificity of the pronunciation position, but their performance remains significantly better than the elementary level group.
The overall classification performance of the intermediate level group (UCI) is close to that of the advanced group, with an average accuracy of 92.9% and a recall of 91.5%. The performance for vowel “o” is the most outstanding, with accuracy and recall reaching 95% and 93%, respectively, indicating that the pronunciation of rounded vowels by intermediate learners is approaching the advanced level. However, the classification performance for vowels “i” and “Applsci 15 06335 i001” is slightly lower (accuracy of 91% and 89%, respectively), possibly due to the complexity of tongue position control. Overall, the intermediate level group demonstrates stable performance for most vowels, but there is still room for improvement in apical vowels and high front vowels.
The classification performance of the elementary level group (UCB) is relatively lower, with an average accuracy of 91.9% and a recall of 90.4%. Although the accuracy and recall for vowels “ɤ” and “u” are high (95% and 94%, respectively), the performance for apical vowels “Applsci 15 06335 i002” and “Applsci 15 06335 i001” is significantly poorer, with accuracy values of 87% and 88% and recall values of 88% and 85%, respectively. This result reflects the pronunciation difficulties faced by elementary learners in apical vowels, possibly related to native language interference or insufficient oral muscle control. Moreover, the recall for vowel “i” is only 90%, indicating that elementary learners also exhibit a certain degree of instability in their pronunciation of high front vowels.
SVM generally outperforms the ID3 algorithm across all level groups, particularly in terms of accuracy and recall, demonstrating higher stability and precision. SVM can handle high-dimensional data and complex non-linear relationships, which is particularly important for vowel classification in speech recognition.

3.2. Vowel Classification Results and Analysis Based on ID3 Algorithm

The ID3 algorithm is a classification prediction algorithm in machine learning that uses information entropy and information gain as its measurement criteria for inducing and classifying data [31,32]. Its core quantity is information entropy, a probabilistic measure of the information contained in a set of data. Suppose a dataset consists of {d1, d2, …, dn} and the total count is denoted Sum; the information entropy is then calculated as follows:
$$\mathrm{Entropy}(D) = -\sum_{x=1}^{n} \frac{d_x}{\mathrm{Sum}} \log \frac{d_x}{\mathrm{Sum}} \quad (7)$$
where $d_x/\mathrm{Sum}$ is the probability of class $d_x$ appearing in the sample set $D$.
In the ID3 algorithm, information gain, denoted Gain(D, A), is the effective reduction in information entropy, i.e., the change in entropy before and after partitioning on an attribute. The higher this value, the greater the reduction in entropy achieved by partitioning on that attribute. Information gain can be expressed by the following formula:
$$\mathrm{Gain}(D, A) = \mathrm{Entropy}(D) - \sum_{V \in \mathrm{Value}(A)} \frac{|D_V|}{|D|} \mathrm{Entropy}(D_V) \quad (8)$$
where $A$ is an attribute of the samples, $\mathrm{Value}(A)$ is the set of all values of attribute $A$, $V$ is one of those values, and $D_V$ is the subset of samples in $D$ whose value of $A$ is $V$.
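A short Python sketch makes Equations (7) and (8) concrete; the label and attribute arrays in the usage example are hypothetical.

```python
# Sketch of Equations (7)-(8): entropy and information gain for the ID3
# split criterion.
import numpy as np

def entropy(labels: np.ndarray) -> float:
    """Equation (7): -sum_x (d_x / Sum) * log2(d_x / Sum)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels: np.ndarray, attribute: np.ndarray) -> float:
    """Equation (8): Entropy(D) - sum_V (|D_V| / |D|) * Entropy(D_V)."""
    gain = entropy(labels)
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Hypothetical usage: vowel labels split by a discretized F1 attribute.
labels = np.array(["a", "a", "o", "o", "u", "u"])
attribute = np.array(["low", "low", "mid", "mid", "high", "mid"])
print(information_gain(labels, attribute))
```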
From the results in Figure 4, the average vowel classification accuracy of the ID3 algorithm across learners at different levels reaches 90.04%, which is 2.04 percentage points lower than that of the SVM. This is because SVM better handles the classification features selected in this study, namely formant patterns and the distances between them. The specific results are summarized in Table 3.
As shown in Table 3 above, there are certain differences in vowel pronunciation accuracy and recall among Mandarin learners at different proficiency levels, but the overall performance is slightly lower than the SVM algorithm. The UCA has an average accuracy of 90.4% and a recall of 89.6%, with vowel “ɤ” performing the best (accuracy 94%, recall 91%), indicating that advanced learners have relatively stable tongue position control. However, the accuracy and recall for vowel “y” are lower (88% and 85%, respectively), possibly related to the complexity of lip shape control.
The UCI has an average accuracy of 90.8% and a recall of 90.1%, with vowel “o” showing outstanding performance (accuracy 94%, recall 92%), but the classification performance for vowels “y” and “Applsci 15 06335 i001” decreases (accuracy ≤ 90%), reflecting the pronunciation instability of intermediate learners in some complex vowels. The UCB has an average accuracy of 89.0% and a recall of 87.9%, with vowels “ɤ” and “o” performing relatively well (accuracy ≥ 90%), but the accuracy and recall for apical vowels “ẓ” and “Applsci 15 06335 i001” are significantly lower (87–88% and 85–86%, respectively), indicating that elementary learners have obvious difficulties in pronouncing these vowels, possibly related to native language interference or the specificity of the pronunciation position.
Combining the analyses of Table 2 and Table 3 above, the SVM algorithm generally outperforms the ID3 algorithm in vowel classification accuracy and recall, regardless of whether the learners are in the advanced, intermediate, or elementary level group. The SVM algorithm, due to its advantages in handling complex data and non-linear problems, is more effective and reliable for the objective testing of learners’ vowel pronunciation proficiency. Therefore, the SVM-based vowel classification method is considered superior in this study and more suitable for evaluating learners’ vowel pronunciation ability at different levels. Furthermore, the experimental results further demonstrate that formant features can be used to separate standard and non-standard pronunciations, indicating that employing formant structural information for pronunciation evaluation is reasonable.

3.3. Calculation Results of Vowel Acoustic Deviation

To further quantify the pronunciation differences among learners at different levels, this study calculated the deviation (Bark) of each group’s vowels in the F1-F2 acoustic space. As shown in Table 4, elementary level learners exhibit significant acoustic deviations in complex vowels (such as the apical vowel “ẓ”), while advanced learners have lower overall deviations and are close to the standard pronunciation group.
Through the quantitative analysis of acoustic space deviation, this study reveals the systematic differences in vowel pronunciation among learners at different proficiency levels. As shown in Table 4, the acoustic deviation of advanced level learners is generally less than 0.5 Bark (e.g., the deviation of vowel “a” is only 0.09 Bark), indicating that their pronunciation features are highly similar to the standard group, reflecting the stability of advanced learners in tongue position and lip shape control. In contrast, the elementary group exhibits significant deviations in complex vowels, such as the apical vowel “ẓ” with a deviation as high as 2.61 Bark (standard group F2 = 8.13; elementary group F2 = 10.74). The deviation of the intermediate group lies between the two (e.g., the deviation of vowel “u” is 0.38 Bark), suggesting that their pronunciation ability is in a transitional stage, with some phonemes approaching the standard, but systematic errors still exist in complex vowels.
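For reference, the sketch below computes such a deviation: the Euclidean F1-F2 distance in Bark space between a learner group’s mean formants and the standard group’s. The paper does not state which Hz-to-Bark conversion it uses, so the Traunmüller (1990) approximation is assumed here.

```python
# Sketch of the Table 4 deviation measure: Euclidean F1-F2 distance in Bark.
# The Traunmueller (1990) Hz-to-Bark approximation is an assumption; the
# paper does not specify its conversion formula.
import numpy as np

def hz_to_bark(f_hz: float) -> float:
    """Traunmueller (1990) approximation of the Bark scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_deviation(learner_f1f2_hz, standard_f1f2_hz) -> float:
    """Euclidean distance between (F1, F2) pairs after Bark conversion."""
    learner = np.array([hz_to_bark(f) for f in learner_f1f2_hz])
    standard = np.array([hz_to_bark(f) for f in standard_f1f2_hz])
    return float(np.linalg.norm(learner - standard))
```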
It is worth noting that the acoustic deviation is significantly positively correlated with pronunciation difficulty (r = 0.87, p < 0.001), further validating the sensitivity of formant parameters in pronunciation quality assessment. This finding is consistent with the acoustic-perceptual correlation theory proposed in [8], providing new empirical evidence for the mechanism analysis of cross-language pronunciation errors.

4. Conclusions and Discussion

Through acoustic experiments and machine learning methods, this study systematically explores the differences in vowel pronunciation features and their acoustic mechanisms among Mandarin learners at different proficiency levels, providing theoretical and technical support for pronunciation quality assessment and teaching strategy optimization. Based on 5880 speech samples from 50 learners (advanced, intermediate, and elementary groups) and 10 standard speakers, vowel formant parameters (F1-F2) are extracted to construct an acoustic space model, and SVM and ID3 algorithms are employed for cross-level classification analysis. The results show that Support Vector Machine significantly outperforms the ID3 algorithm in vowel classification tasks, with an average classification accuracy of 92.09% for the three learner groups, which is an improvement of 2.05 percentage points (p < 0.001) compared to ID3, owing to its efficient modeling ability for high-dimensional acoustic features. Acoustic space analysis further reveals the systematic deviation characteristics of learners’ pronunciation: the elementary group exhibits significant deviations in complex vowels (such as the apical vowel “ẓ” with a deviation of 2.61 Bark), while the advanced group’s deviations are generally less than 0.5 Bark (e.g., the deviation of vowel “a” is only 0.09 Bark), and the degree of deviation is significantly negatively correlated with Mandarin proficiency test scores (r = −0.76, p < 0.01). This finding is consistent with the acoustic–perceptual correlation theory, confirming the sensitivity of formant parameters in pronunciation quality assessment. This study innovatively proposes a collaborative analysis framework of acoustic features and machine learning, providing quantitative evidence for the development of real-time pronunciation evaluation modules in CALL systems. By precisely locating pronunciation error types (such as tongue position deviation and insufficient lip shape control), this framework can facilitate the design of personalized teaching plans, while also holding important reference value for the optimization of acoustic features in multi-ethnic Mandarin speech recognition technology. Future research will expand the sample to learners from various dialect backgrounds and incorporate dynamic acoustic parameters (such as formant trajectories and prosodic features) to further enhance the model’s universality and practicality.
While SVM and ID3 are effective for classifying static formant features, their limited adaptability to dynamic speech patterns constrains broader application. Future work will therefore explore deeper architectures such as convolutional neural networks (CNNs), integrate dynamic acoustic features (e.g., formant trajectories and prosodic patterns) to enhance robustness in real-world scenarios, and expand sample diversity (e.g., learners from various dialect backgrounds) to improve the model’s universality.

Author Contributions

Investigation, H.Y.; project administration, G.A.; resources, G.A.; supervision, G.A.; validation, G.A.; writing—original draft, H.Y.; data curation, T.A. and H.Y.; writing—review and editing, H.Y.; funding acquisition, A.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 62307030).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to ethical and privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yoon, S.Y.; Hasegawa, J.M.; Sproat, R. Landmark-based automated pronunciation error detection. In Proceedings of the Interspeech, ISCA, Makuhari, Japan, 26–30 September 2010. [Google Scholar]
  2. Yilahun, H.; Zhao, H.; Hamdulla, A. FREDC: A Few-Shot Relation Extraction Dataset for Chinese. Appl. Sci. 2025, 15, 1045. [Google Scholar] [CrossRef]
  3. Bogach, N.; Boitsova, E.; Chernonog, S.; Lamtev, A.; Lesnichaya, M.; Lezhenin, I.; Novopashenny, A.; Svechnikov, R.; Tsikach, D.; Vasiliev, K.; et al. Speech processing for language learning: A practical approach to computer-assisted pronunciation teaching. Electronics 2021, 10, 235. [Google Scholar] [CrossRef]
  4. Golonka, E.M.; Bowles, A.R.; Frank, V.M.; Richardson, D.L.; Freynik, S. Technologies for foreign language learning: A review of technology types and their effectiveness. Comput. Assist. Lang. Learn. 2014, 27, 70–105. [Google Scholar] [CrossRef]
  5. Van Doremalen, J.; Boves, L.; Colpaert, J.; Cucchiarini, C.; Strik, H. Evaluating automatic speech recognition-based language learning systems: A case study. Comput. Assist. Lang. Learn. 2016, 29, 833–851. [Google Scholar] [CrossRef]
  6. Ming, Y. Application of Speech Recognition and Assessment in Chinese Learning. Ph.D. Thesis, Beijing Jiaotong University, Beijing, China, 2008. [Google Scholar]
  7. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
  8. Shaw, J.A.; Foulkes, P.; Hay, J.; Evans, B.G. Revealing perceptual structure through input variation: Cross-accent categorization of vowels in five accents of English. Lab. Phonol. 2023, 14, 1–38. [Google Scholar] [CrossRef]
  9. Wang, S.Y.; Peng, G. Language, Speech and Technology; Shanghai Education Press: Shanghai, China, 2006. [Google Scholar]
  10. Yuan, J.; Cai, X.; Church, K. Improved contextualized speech representations for tonal analysis. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023. [Google Scholar]
  11. Chen, C.; Bunescu, R.; Xu, L.; Liu, C. Tone classification in Mandarin Chinese using convolutional neural networks. In Proceedings of the Interspeech, San Francisco, CA, USA, 8–12 September 2016. [Google Scholar]
  12. Yan, J.Z. A study of the formant transitions between the first syllable with vocalic ending and the second syllable with initial vowel in the disyllabic sequence in standard Chinese. Annu. Rep. Phon. Res. 1995, 41–53. [Google Scholar]
  13. Wang, Y.J.; Deng, D. Japanese Learners’ Acquisition of “Similar Vowels” and “Unfamiliar Vowels” in Standard Chinese. Chin. Teach. World 2009, 2, 262–279. [Google Scholar]
  14. Zhou, Z.C.; Wang, M.J.; Yu, S.Y. Study on Vowel Formant Trajectories of the First Syllable in Chinese Disyllables. Audio Eng. 2007, 31, 8–13. [Google Scholar]
  15. Assmann, P.F.; Nearey, T.M. Relationship between fundamental and formant frequencies in voice preference. J. Acoust. Soc. Am. 2007, 122, 35–43. [Google Scholar] [CrossRef] [PubMed]
  16. Gendrot, C.; Decker, M.A. Impact of duration and vowel inventory size on formant values of oral vowels: An automatic analysis from eight languages. In Proceedings of the International Congress of Phonetic Sciences, Saarbrücken, Germany, 6–10 August 2007; pp. 1417–1420. [Google Scholar]
  17. Dong, B.; Zhao, Q.W.; Yan, Y.H. Research on Objective Testing Method for Pronunciation Level of Finals in Standard Chinese Based on Formant Pattern. Acta Acust. 2007, 32, 122–128. [Google Scholar]
  18. Amami, R.; Ben Ayed, D.; Ellouze, N. An empirical comparison of SVM and some supervised learning algorithms for vowel recognition. arXiv 2015, arXiv:1507.06021. [Google Scholar]
  19. Hafiz, A.M. K-Nearest Neighbour and Support Vector Machine Hybrid Classification. Int. J. Imaging Robot. 2019, 19, 33–41. [Google Scholar]
  20. Neumeyer, L.; Franco, H.; Digalakis, V.; Weintraub, M. Automatic scoring of pronunciation quality. Speech Commun. 2000, 30, 83–93. [Google Scholar] [CrossRef]
  21. Neumeyer, L.; Franco, H.; Weintraub, M.; Price, P. Automatic text-independent pronunciation scoring of foreign language student speech. In Proceedings of the Fourth International Conference on Spoken Language Processing, ICSLP 96, ISCA, Philadelphia, PA, USA, 3–6 October 1996; pp. 1457–1480. [Google Scholar]
  22. Franco, H.; Abrash, V.; Precoda, K.; Bratt, H.; Rao Gadde, V.R.; Butzberger, J.; Rossier, R.; Cesari, F. The SRI EduSpeak™ system: Recognition and pronunciation scoring for language learning. In Proceedings of the Integrating Speech Technology in (Language) Learning, InSTIL, ISCA, Dundee, UK, 29–30 August 2000; pp. 121–125. [Google Scholar]
  23. Hillenbrand, J.; Getty, L.A.; Clark, M.J.; Wheeler, K. Acoustic characteristics of American English vowels. J. Acoust. Soc. Am. 1995, 97, 3099–3111. [Google Scholar] [CrossRef] [PubMed]
  24. Hsu, C.W.; Chang, C.C.; Lin, C.J. A practical guide to support vector classification. J. Mach. Learn. Res. 2016, 17, 1–16. [Google Scholar]
  25. Lee, H.; Kim, S. Improving speech recognition accuracy using formant analysis in multilingual contexts. Speech Commun. 2019, 108, 55–63. [Google Scholar]
  26. Eskenazi, M. An overview of spoken language technology for education. Speech Commun. 2009, 51, 832–844. [Google Scholar] [CrossRef]
  27. Zhang, Y.P. Research and Application of Decision Tree Improvement Algorithm in Speech Synthesis System. Ph.D. Thesis, University of Science and Technology of China, Hefei, China, 2012. [Google Scholar]
  28. Li, H. Statistical Learning Methods; Tsinghua University Press: Beijing, China, 2012; pp. 95–124. [Google Scholar]
  29. Burges, C.J.C. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 1998, 2, 121–167. [Google Scholar] [CrossRef]
  30. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  31. Zhou, Z.H. Machine Learning; Tsinghua University Press: Beijing, China, 2016; pp. 73–85. [Google Scholar]
  32. Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann: San Francisco, CA, USA, 1993; pp. 17–42. [Google Scholar]
Figure 1. Speaker information for different proficiency groups.
Figure 2. Formant patterns of different vowels. Y-axis: formant frequency (Hz); X-axis: time (ms). The top graph shows the formant pattern of vowel a, the middle graph vowel i, and the bottom graph vowel u. UC represents learners, MC represents Mandarin speakers, F1 represents the first formant, and F2 represents the second formant.
Figure 3. Vowel classification results based on SVM. The left panel (a) shows accuracy, and the right panel (b) shows recall. UCA (black line) represents advanced learners, UCI (blue line) represents intermediate learners, and UCB (black dashed line) represents elementary learners.
Figure 4. Vowel classification results based on the ID3 algorithm. The left panel (a) shows accuracy, and the right panel (b) shows recall. UCA (black line) represents advanced level learners, UCI (red line) represents intermediate level learners, and UCB (black dashed line) represents elementary level learners.
Table 1. Detailed information about speakers in different proficiency groups.

| Group | Male Participants (Average Score) | Female Participants (Average Score) | Total | Mandarin Proficiency Level |
|---|---|---|---|---|
| UCB (Elementary Level Group) | 6 (57) | 6 (55) | 12 | Beginner |
| UCI (Intermediate Level Group) | 10 (65) | 10 (67) | 20 | Intermediate |
| UCA (Advanced Level Group) | 9 (83) | 9 (82) | 18 | Advanced |
| MC (Mandarin Chinese) | 5 (87) | 5 (89) | 10 | Standard |
Table 2. Classification accuracy and recall of vowels for learners at different levels based on SVM.

| Vowel | UCA Accuracy | UCA Recall | UCI Accuracy | UCI Recall | UCB Accuracy | UCB Recall |
|---|---|---|---|---|---|---|
| a | 92% | 91% | 92% | 90% | 93% | 91% |
| o | 93% | 90% | 95% | 93% | 92% | 90% |
| ɤ | 94% | 91% | 94% | 92% | 95% | 94% |
| u | 94% | 91% | 94% | 91% | 93% | 94% |
| y | 93% | 91% | 93% | 92% | 94% | 93% |
| i | 93% | 92% | 91% | 91% | 91% | 90% |
| ẓ | 90% | 88% | 90% | 89% | 87% | 88% |
| Applsci 15 06335 i001 | 90% | 90% | 89% | 88% | 88% | 85% |
Table 3. Classification accuracy and recall of vowels for learners at different levels based on the ID3 algorithm.

| Vowel | UCA Accuracy | UCA Recall | UCI Accuracy | UCI Recall | UCB Accuracy | UCB Recall |
|---|---|---|---|---|---|---|
| a | 91% | 90% | 90% | 90% | 89% | 88% |
| o | 91% | 89% | 94% | 92% | 90% | 90% |
| ɤ | 94% | 91% | 93% | 92% | 92% | 90% |
| u | 90% | 91% | 91% | 91% | 89% | 88% |
| y | 88% | 85% | 90% | 89% | 87% | 87% |
| i | 90% | 89% | 90% | 90% | 89% | 88% |
| ẓ | 89% | 88% | 90% | 89% | 87% | 86% |
| Applsci 15 06335 i001 | 90% | 90% | 89% | 88% | 88% | 85% |
Table 4. Calculation results of vowel acoustic deviation.

| Vowel | UCA Deviation (Bark) | UCI Deviation (Bark) | UCB Deviation (Bark) |
|---|---|---|---|
| a | 0.09 | 0.08 | 0.11 |
| o | 0.26 | 0.19 | 0.12 |
| ɤ | 0.47 | 0.40 | 0.43 |
| u | 0.14 | 0.38 | 0.52 |
| y | 0.18 | 0.06 | 0.61 |
| i | 0.16 | 0.14 | 1.01 |
| ẓ | 0.09 | 0.28 | 2.61 |
| Applsci 15 06335 i001 | 0.11 | 0.10 | 0.08 |