1. Introduction
Speech is the most natural and fundamental means of human communication, and vowels, as the core components of speech, play a crucial role in its intelligibility and clarity. For children with normal hearing, vowel production and acquisition typically follow a specific developmental trajectory [
1]. Early-acquired vowels (e.g., /a/, /i/, /u/) generally approach adult-like patterns by around two years of age, whereas more complex vowels or those constrained by specific phonetic contexts (e.g., /ɤ/, /y/, /ɿ/, /ʅ/) are typically mastered at later stages. However, children with hearing impairment, due to the lack of effective auditory feedback, often exhibit disrupted speech development characterized by unstable articulatory control, reduced vowel distinctiveness, abnormal phonation duration, and deviated formant distributions [
2,
3]. These issues not only affect their language development and daily communication but may also exert long-lasting negative impacts on cognition, social interaction, and educational progress [
4].
In recent years, the widespread application of cochlear implants and high-performance hearing aids has provided hearing-impaired children with alternative channels of linguistic input, enabling them to gain partial auditory perception and improve speech production. Nevertheless, limitations in electrode number, spectral resolution, and individual differences in the auditory neural system prevent full restoration of normal speech characteristics. Previous studies have shown that, even after cochlear implantation, the vowel space of hearing-impaired children is generally reduced compared to their normal-hearing peers, with marked deviations particularly in the production of high tongue-position and apical vowels [
5,
6]. Baudonck et al. [
7] clearly demonstrated that both prelingually deaf children with cochlear implants and those with hearing aids exhibit significantly smaller vowel spaces than normal-hearing children, directly contributing to reduced vowel clarity. Uchanski and Geers [
8] found through acoustic analyses that the vowel production ability of cochlear-implanted children is closely associated with their accumulated auditory experience, with earlier implantation leading to vowel production patterns more closely resembling those of normal-hearing peers. Liker et al. [
9], in a study of Croatian-speaking children, revealed another characteristic of vowel production in hearing-impaired children: greater variability in formant frequencies and substantial acoustic overlap across vowel categories. Similarly, Ertmer and Goffman [
10] summarized in their review that young cochlear implant users display notable deficits in speech production accuracy and stability compared to typically developing peers, with these difficulties being particularly evident in the vowel system. Yang et al. [
5,
6] further reported that the vowel space of Mandarin-speaking children with cochlear implants is reduced and more dispersed, with vowel space size negatively correlated with age at implantation, highlighting the importance of early intervention. Story and Bunton [
11] proposed a new spectral filtering algorithm that effectively addresses the inaccuracy of formant tracking in high-pitched children’s speech, thereby improving the reliability of acoustic analyses. Hung et al. [
12] found that the vowel space of Mandarin-speaking hearing aid users with conductive or mixed hearing loss is significantly compressed, with a salient acoustic feature being the lowered F2 of /i/. Nevertheless, current research on the acoustic characteristics of Mandarin vowels in hearing-impaired children remains limited. Most studies focus predominantly on traditional formant parameters, with insufficient attention paid to multidimensional features such as spectral envelope and short-term dynamics. Moreover, certain Mandarin vowels (e.g., /ɿ/, /ʅ/), which occur in specific consonantal environments, have rarely been investigated.
The extraction of speech feature parameters is of great importance for analyzing vowel production in Mandarin-speaking children with cochlear implants. Commonly used acoustic features can describe speech characteristics from different perspectives. At the physiological level, formants reflect the resonance properties of the vocal tract and serve as key parameters for distinguishing vowel quality. Their estimation methods primarily include cepstral analysis, linear predictive coding (LPC), and wavelet analysis [
13,
14,
15]. At the perceptual level, Mel-frequency cepstral coefficients (MFCCs), which simulate the human auditory perception mechanism, are widely applied in speech and speaker recognition tasks [
16,
17]. At the physical modeling level, linear predictive cepstral coefficients (LPCCs), derived from autoregressive models of speech signals through linear predictive analysis, offer high computational efficiency and stable performance in vowel modeling and speech analysis [
18,
19].
Recent research has provided quantitative comparisons among LPC/LPCC and MFCC features, highlighting their respective strengths and supporting the selection of feature parameters in speech analysis tasks. In terms of computational efficiency, LPCC extraction benefits from the Levinson–Durbin recursion, achieving a computational complexity of O(p²), whereas MFCC extraction requires FFT-based spectral analysis with a complexity of O(N log N). Empirical evaluations have shown that LPCC computation is 1.8–2.3 times faster than MFCC for typical frame settings (25 ms, 16 kHz) [
20]. Yang (2019) further reported that LPCC extraction can reduce processing time by 30–40% compared with MFCC under identical experimental conditions [
21].
Regarding recognition performance, existing studies indicate that MFCC generally provides superior discriminative power, though LPCC maintains competitive performance in specific scenarios. Zailan et al. reported that MFCC features achieved 99.09% accuracy in a speaker recognition task, outperforming LPC/LPCC features (94.75%) [
22]. Conversely, in fixed-phrase speaker verification, LPCC achieved a lower equal error rate (EER) of 12.96%, compared with 15.55% obtained by MFCC, suggesting that LPCC can provide advantages in short-utterance modeling, albeit at the cost of higher computational time [
23]. In emotion recognition studies, MFCC achieved the highest accuracy (81–85%), while LPCC reached a comparable level (78–82%), demonstrating a favorable trade-off between accuracy and computational efficiency [
24].
These findings demonstrate that MFCC provides strong perceptual discrimination for speech classification, whereas LPCC offers measurable efficiency advantages and stable modeling performance. Incorporating both feature types therefore enables a more comprehensive characterization of vowel production, particularly in studies involving pediatric cochlear implant users.
In addition, to mitigate the influence of individual articulation differences on the analysis results, this study employs vowel normalization methods to standardize the acoustic parameters [
25]. At the feature modeling level, a single feature often fails to comprehensively capture the complex differences in vowel production among children with cochlear implants. To address this limitation, this study introduces a multi-view fusion framework, in which acoustic features from different levels are regarded as complementary views for joint modeling. Multi-view learning not only preserves the independent representational strengths of each feature set but also achieves information complementarity and enhancement through a shared latent representation space. This approach improves the robustness of clustering and classification results and provides a more accurate characterization of the deviations and patterns in vowel production across different feature dimensions in children with cochlear implants. The framework of the proposed multi-view clustering approach for vowel analysis is presented in
Figure 1. First, the Mandarin vowel signals of the speakers are preprocessed and subjected to feature extraction. Subsequently, multiple acoustic parameters, including formants, MFCC, and LPCC, are extracted, and K-means clustering is individually performed on each feature set to obtain preliminary results. Finally, the clustering outcomes from multiple views are integrated through feature fusion, thereby enabling a comparative analysis of Mandarin vowel production between children with cochlear implants and children with normal hearing. The main contributions of this study are as follows.
Multi-dimensional feature fusion: Vowel characteristics are comprehensively represented at the physiological level (formants), perceptual level (MFCC), and physical modeling level (LPCC).
Multi-view clustering analysis: Given that a single perspective cannot fully capture all characteristics, we adopt a multi-view integration strategy to aggregate multi-dimensional feature information, thereby enabling a comprehensive analysis of the vowel production of children with CIs.
Experimental results demonstrate that the proposed method provides a useful tool for researchers analyzing Mandarin vowels in children with CIs. The comparative analysis based on multi-view clustering highlights the differences in vowel production between children with CIs and children with normal hearing.
2. Materials and Methods
The speech productions were collected from 14 children with cochlear implants and 56 age-matched children with normal hearing. All participants were recruited from northern China and spoke Mandarin as their native language. The children with CIs (five girls, nine boys) were aged between 3.25 and 6.42 years (M = 4.65, SD = 0.98). They were all prelingually deafened with bilateral severe-to-profound sensorineural hearing loss, and received unilateral cochlear implantation combined with a contralateral hearing aid. The implantation age ranged between 1.25 and 4.92 years (M = 2.76, SD = 1.25), with the length of CI use ranging between 0.33 and 3.50 years (M = 1.89, SD = 0.83).
CI participants were implanted with devices from multiple manufacturers, including Cochlear, MED-EL, and Advanced Bionics, and all children received fully inserted intracochlear electrode arrays, with no extracochlear electrodes observed in this group. However, detailed electrode-level information, including array type, array length, and insertion depth, was not accessible.
All children with cochlear implants received regular auditory and speech therapy at certified rehabilitation centers following implantation. They also received regular clinical follow-ups, during which certified audiologists performed device mapping using standard clinical protocols to establish individualized and audiologically appropriate MAPs.
The children with NH (33 girls, 23 boys) were aged between 3.17 and 6.42 years (M = 4.63, SD = 0.99). They all demonstrated normal hearing and had no history of speech or language impairments.
2.1. Speech Materials
Each participant produced a list of monosyllabic words that contained seven Mandarin vowels /a, i, u, ɤ, y, ɿ, ʅ/ in a picture-naming task. For each target vowel, four words corresponding to the four Mandarin tones were selected, resulting in a total of 7 × 4 = 28 items.
Recordings were conducted in a quiet room with background noise below 40 dB(A). Speech samples were recorded using Praat software at a sampling rate of 44.1 kHz and 16-bit quantization, with a professional condenser microphone directly connected to a computer. To minimize experimental variation, all recordings were conducted by experimenters following standardized instructions. Participants performed warm-up exercises before formal recording to ensure familiarity with the task. After recording, Praat was used for preprocessing: for each vowel, a stable segment at the midpoint of its pronunciation was selected for analysis. Praat extracted the first formant (F1) and second formant (F2) parameters; the speech data from this stable segment were also exported for subsequent extraction and analysis of Mel-frequency cepstral coefficient (MFCC) and linear predictive cepstral coefficient (LPCC) features in MATLAB R2023b.
2.2. Vowel Feature Extraction
2.2.1. Formant Extraction and Normalization
Formants are generated by the resonances of the vocal tract, and their frequency distribution varies across vowels. The first formant (F1) primarily reflects vowel height: the lower the tongue position and the wider the oral opening, the higher the F1 value. The second formant (F2) mainly reflects tongue advancement: the more fronted the tongue position, the higher the F2 value [
26]. In this study, the first and second formants (F1 and F2) were extracted from the steady-state portion at the midpoint of each vowel token using Praat software (version 6.4.13).
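The formant measurements in this study were taken directly in Praat. For readers who script this step, the sketch below illustrates the same midpoint measurement using Praat's Burg algorithm through the praat-parselmouth Python interface; this interface, the file name, the segment times, and the formant ceiling are illustrative assumptions rather than the exact toolchain and settings used here.

```python
# Minimal sketch of midpoint F1/F2 extraction via the Praat engine
# (praat-parselmouth). File path, segment times, and the formant ceiling
# are hypothetical placeholders.
import parselmouth

def midpoint_formants(wav_path, seg_start, seg_end, max_formant_hz=8000.0):
    """Return (F1, F2) in Hz at the temporal midpoint of a vowel segment."""
    snd = parselmouth.Sound(wav_path)
    # Burg-method formant tracking; a raised ceiling is a common choice for child speech.
    formant = snd.to_formant_burg(maximum_formant=max_formant_hz)
    t_mid = 0.5 * (seg_start + seg_end)
    f1 = formant.get_value_at_time(1, t_mid)
    f2 = formant.get_value_at_time(2, t_mid)
    return f1, f2

f1, f2 = midpoint_formants("vowel_a_tone1.wav", 0.12, 0.34)
print(f"F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz")
```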
To reduce the effect of age-related vocal tract length differences in children with cochlear implants, the Lobanov normalization method [
27] was applied to convert formant frequencies into
z-scores, a widely recognized technique for vowel normalization. Following Thomas and Kendall [
28], these
z-scores were subsequently rescaled into Hertz values, as expressed below:
$$F_{1}' = 250 + 500 \cdot \frac{F_{1}^{N} - F_{1,\min}^{N}}{F_{1,\max}^{N} - F_{1,\min}^{N}}, \qquad F_{2}' = 850 + 1400 \cdot \frac{F_{2}^{N} - F_{2,\min}^{N}}{F_{2,\max}^{N} - F_{2,\min}^{N}}$$
where $F_{i}'$ denotes the rescaled normalized formant frequency of the $i$-th formant, $F_{i}^{N}$ represents the Lobanov-normalized value of the $i$-th formant for an individual speaker, and $F_{i,\min}^{N}$ and $F_{i,\max}^{N}$ denote, respectively, the minimum and maximum Lobanov-normalized values of the $i$-th formant across the entire dataset. All subsequent formant analyses were conducted based on these rescaled normalized values.
In Thomas and Kendall’s method, the reference values 250, 500, 850, and 1400 reflect the typical physiological and acoustic ranges of vowel formants in natural speech. These values allow the rescaled outputs to approximate perceptually interpretable Hz-like intervals. Given that the formant patterns of children with cochlear implants differ from those of normal-hearing peers, this rescaling approach provides a standardized vowel space that preserves inter-speaker contrasts while enhancing the interpretability of the formant distributions.
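As a minimal illustration of this two-step procedure (per-speaker Lobanov z-scoring followed by rescaling with the reference values quoted above), the Python sketch below uses hypothetical formant values; in the actual analysis the minima and maxima are taken over the entire dataset rather than a single speaker's tokens.

```python
import numpy as np

def lobanov(f):
    """Per-speaker z-score (Lobanov) normalization of one formant track."""
    f = np.asarray(f, dtype=float)
    return (f - f.mean()) / f.std(ddof=1)

def rescale_to_hz(z, low, span, z_min, z_max):
    """Map Lobanov z-scores onto an Hz-like interval [low, low + span]."""
    return low + span * (z - z_min) / (z_max - z_min)

# Hypothetical F1/F2 values (Hz) from one speaker.
f1_z = lobanov([780, 350, 420, 610])
f2_z = lobanov([1250, 2600, 980, 1900])
# z_min / z_max should come from the entire dataset; the same arrays are reused here for brevity.
f1_hz = rescale_to_hz(f1_z, 250.0, 500.0, f1_z.min(), f1_z.max())
f2_hz = rescale_to_hz(f2_z, 850.0, 1400.0, f2_z.min(), f2_z.max())
print(np.round(f1_hz), np.round(f2_hz))
```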
When plotting the distribution ellipse of a given vowel, all F1 and F2 values that had undergone Lobanov normalization and rescaling were first taken as sample points. Using F2 as the x-axis and F1 as the y-axis, principal component analysis (PCA) was performed on all sample points of that vowel. The orientation of the first principal component was adopted as the rotation angle of the ellipse’s major axis. The ellipse center was determined by the mean F1 and F2 values of the sample points, while the lengths of the major and minor axes were set to twice the standard deviations of the first and second principal components, respectively (corresponding to an approximate 95% confidence interval). Based on these parameters, vowel distribution ellipses were drawn to represent the acoustic distribution of each vowel.
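A compact sketch of the ellipse construction just described (PCA on the F2-F1 sample points, a mean-based center, and axis lengths of twice the principal-component standard deviations) is given below; the sample values are hypothetical.

```python
import numpy as np

def vowel_ellipse(f2, f1):
    """Ellipse parameters for one vowel category, following the PCA-based
    procedure described above (x = F2, y = F1)."""
    pts = np.column_stack([f2, f1])
    center = pts.mean(axis=0)                      # ellipse center: mean F2 and F1
    eigvals, eigvecs = np.linalg.eigh(np.cov(pts, rowvar=False))
    order = eigvals.argsort()[::-1]                # PC1 first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    angle = np.degrees(np.arctan2(eigvecs[1, 0], eigvecs[0, 0]))  # major-axis rotation
    major_len, minor_len = 2.0 * np.sqrt(eigvals)  # axis length = 2 x PC standard deviation
    return center, major_len, minor_len, angle

# Hypothetical rescaled F1/F2 samples (Hz) for one vowel.
center, major_len, minor_len, angle = vowel_ellipse(
    f2=[2400, 2550, 2300, 2480, 2350], f1=[350, 420, 380, 400, 360])
print(center, round(major_len, 1), round(minor_len, 1), round(angle, 1))
```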
2.2.2. Extraction of MFCC
Mel-frequency cepstral coefficients (MFCCs) capture the short-time spectral envelope characteristics of speech signals by mapping the linear spectrum onto the Mel frequency scale, which reflects the nonlinear frequency perception characteristics of the human auditory system. The computation of MFCCs involves several key steps [
16,
17].
First, the speech signal undergoes preprocessing, including pre-emphasis, framing, and windowing. During pre-emphasis, a first-order high-pass filter is applied to amplify high-frequency components, compensating for the attenuation caused by the glottal excitation and the lip–nasal radiation effects. In the framing stage, the speech signal is divided into short-time frames, typically 20–30 ms in length with a 10 ms frame shift, to satisfy the short-time stationarity assumption. Each frame is then windowed (commonly using a Hamming window) to reduce spectral leakage.
Next, a fast Fourier transform (FFT) is applied to each frame to convert the time-domain signal into the frequency domain, and the corresponding power spectrum is computed. The power spectrum is then passed through a set of Mel-scale filter banks, which provide higher frequency resolution in the low-frequency region and lower resolution in the high-frequency region, consistent with human auditory perception. The output energy of the $i$-th filter, $E_i$, is calculated as:
$$E_i = \sum_{k} \left| X(k) \right|^{2} H_i(k), \qquad i = 1, 2, \ldots, M$$
where $X(k)$ denotes the spectrum of the signal, $H_i(k)$ represents the transfer function of the $i$-th Mel filter, and $M$ is the total number of filters.
The logarithm of the filter bank output energies is then computed to simulate the nonlinear perception of sound intensity by the human auditory system:
$$S_i = \ln\left(E_i\right), \qquad i = 1, 2, \ldots, M$$
Finally, the logarithmic filter bank energies are transformed using the discrete cosine transform (DCT) to obtain the cepstral coefficients:
$$C_n = \sum_{i=1}^{M} S_i \cos\left[\frac{\pi n \left(i - 0.5\right)}{M}\right], \qquad n = 1, 2, \ldots, K$$
where $K$ denotes the number of cepstral coefficients extracted, typically ranging from 12 to 13. MFCCs effectively capture the spectral shape of speech and exhibit good robustness in noisy environments. Therefore, they are widely used in speech recognition and classification tasks.
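The MFCCs in this study were computed in MATLAB; the Python sketch below merely illustrates the processing chain described above (pre-emphasis, framing, Hamming windowing, FFT power spectrum, Mel filter bank, logarithm, and DCT). The frame settings, FFT length, and filter count are typical values, not necessarily those used in the study.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular Mel filters H_i(k) defined on the one-sided FFT grid."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=0.025, frame_shift=0.010,
         n_fft=512, n_filters=26, n_ceps=12, pre_emph=0.97):
    """MFCCs per frame: pre-emphasis, Hamming window, power spectrum,
    Mel filter bank, log, and DCT, as described in the text."""
    sig = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])   # pre-emphasis
    flen, fshift = int(frame_len * sr), int(frame_shift * sr)
    window = np.hamming(flen)
    fbank = mel_filterbank(n_filters, n_fft, sr)
    feats = []
    for start in range(0, len(sig) - flen + 1, fshift):
        frame = sig[start:start + flen] * window                      # framing + windowing
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2                # |X(k)|^2
        log_e = np.log(np.maximum(fbank @ power, 1e-12))              # S_i = ln(E_i)
        feats.append(dct(log_e, type=2, norm='ortho')[1:n_ceps + 1])  # C_1 ... C_K
    return np.array(feats)

# Hypothetical usage on a short synthetic vowel-like signal.
t = np.arange(0, 0.2, 1.0 / 16000)
x = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 2200 * t)
print(mfcc(x).shape)  # (n_frames, 12)
```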
2.2.3. Extraction of LPCC
Linear Predictive Cepstral Coefficients (LPCCs) are derived from the source–filter model of speech production. They estimate the vocal tract transfer function via linear predictive analysis, thereby extracting cepstral coefficients that reflect the resonant characteristics of the vocal tract. The theoretical foundation is that a speech signal can be predicted as a linear combination of a number of its past samples, which is equivalent to approximating the vocal tract transfer function using an all-pole model [
18,
19].
The basic idea of linear predictive analysis is to predict the current speech sample as a linear combination of the previous p speech samples:
$$\hat{s}(n) = \sum_{k=1}^{p} a_k \, s(n-k), \qquad e(n) = s(n) - \hat{s}(n)$$
where $a_k$ ($k = 1, 2, \ldots, p$) denotes the linear prediction coefficients, and $e(n)$ is the prediction error. The optimal LPC coefficients can be obtained by minimizing the mean squared prediction error [13,14,15].
Once the LPC coefficients are obtained, they can be converted into cepstral coefficients using a recursive formula:
$$c_n = a_n + \sum_{k=1}^{n-1} \frac{k}{n} \, c_k \, a_{n-k}, \qquad 1 \le n \le p$$
To improve computational efficiency, the LPC coefficients can alternatively be transformed using the discrete cosine transform (DCT):
$$C_n = \sum_{i=1}^{N} a_i \cos\left[\frac{\pi n \left(i - 0.5\right)}{N}\right], \qquad n = 1, 2, \ldots, K$$
where $N$ denotes the number of LPC coefficients, and $K$ is the number of cepstral coefficients. LPCCs effectively capture the formant structure of speech signals and perform well in quiet environments, making them commonly used in tasks such as speech recognition and pronunciation quality assessment.
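The sketch below illustrates this pipeline under stated assumptions: LPC coefficients are obtained with the autocorrelation method (a Toeplitz solve standing in for the Levinson–Durbin recursion mentioned earlier), and the recursion above converts them to cepstral coefficients. The synthetic frame and the order of 12 are illustrative only.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order=12):
    """Autocorrelation-method LPC; the Toeplitz solve stands in for Levinson-Durbin."""
    frame = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return a  # prediction coefficients a_1 ... a_p

def lpc_to_cepstrum(a, n_ceps=12):
    """LPC-to-cepstrum conversion using the recursion given above."""
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

# Hypothetical usage on one 25 ms frame at 16 kHz (synthetic vowel-like signal).
sr = 16000
t = np.arange(int(0.025 * sr)) / sr
rng = np.random.default_rng(0)
frame = (np.sin(2 * np.pi * 300 * t) + 0.4 * np.sin(2 * np.pi * 2300 * t)
         + 0.01 * rng.standard_normal(t.size))
lpcc = lpc_to_cepstrum(lpc_coefficients(frame, order=12), n_ceps=12)
print(np.round(lpcc, 3))
```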
Regarding the pronunciation characteristics of children with cochlear implants, MFCCs and LPCCs provide complementary acoustic information. MFCCs, based on the auditory perception model, are sensitive to variations in timbre and voice quality, and are suitable for evaluating pronunciation clarity and naturalness [
29]. In contrast, LPCCs, derived from the vocal tract model, more directly reflect changes in the shape of the vocal tract and are suitable for analyzing the motor function of articulatory organs [
30].
In practical applications, these two types of features are often combined. Early studies have shown that feature vectors integrating MFCCs and LPCCs provide a more comprehensive representation of speech characteristics and achieve better performance in pronunciation quality assessment [
31]. In recent years, with the development of deep learning techniques, these traditional features are frequently combined with neural network models, further improving the accuracy of pronunciation disorder recognition in children with cochlear implants [
32].
In this study, 12-dimensional MFCCs and 12-dimensional LPCCs will be extracted and combined with formant parameters to construct multidimensional feature vectors. This approach aims to comprehensively capture the acoustic characteristics of Mandarin vowel production in children with cochlear implants, providing a quantitative basis for pronunciation assessment and rehabilitation training.
2.3. Clustering Algorithms
K-means clustering is a commonly used unsupervised learning algorithm for partitioning data points into K clusters. The algorithm minimizes the sum of squared distances within clusters through iterative optimization of cluster centers and data point assignments. In this study, K-means clustering was employed to classify speech features of children with cochlear implants and children with normal hearing, aiming to evaluate the distinguishability of speech features between the two groups. The specific steps are as follows: randomly initialize K cluster centers; assign each data point to the nearest cluster center; update each cluster center to the mean of all data points within that cluster; repeat the above steps until cluster centers no longer change or the maximum iteration count is reached.
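For illustration, a from-scratch Python sketch of these four steps follows; in practice a library implementation (e.g., scikit-learn's KMeans) behaves equivalently, and the data below are hypothetical placeholders.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means following the steps above: initialize, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]           # 1. random initialization
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                                # 2. nearest-center assignment
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])  # 3. mean update
        if np.allclose(new_centers, centers):                        # 4. stop when centers stabilize
            break
        centers = new_centers
    return labels, centers

# Hypothetical two-cluster example in a 2-D feature space.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)), rng.normal(3.0, 0.5, (30, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```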
Single acoustic features usually capture only one aspect of speech signals. For example, formant frequencies reflect vocal tract morphology, Mel-frequency cepstral coefficients (MFCCs) capture auditory perception, and linear predictive cepstral coefficients (LPCCs) represent linear predictive characteristics. However, vowel abnormalities in children with cochlear implants may manifest across multiple dimensions, such as tongue position control and spectral dynamics. Therefore, this study constructs a multidimensional feature model using feature-level fusion strategies to more comprehensively represent speech characteristics. The specific methods are as follows:
In this study, multiple acoustic features are concatenated to construct a high-dimensional representation of speech signals. Specifically, the normalized formant features $\mathbf{f} = [F_{1}', F_{2}']$, the 12-dimensional Mel-frequency cepstral coefficients (MFCC) $\mathbf{m}$, and the 12-dimensional linear predictive cepstral coefficients (LPCC) $\mathbf{l}$ are combined. The fused feature vector is formulated as
$$\mathbf{P} = [\mathbf{f}, \mathbf{m}, \mathbf{l}]$$
where $\mathbf{f} \in \mathbb{R}^{2}$, $\mathbf{m} \in \mathbb{R}^{12}$, and $\mathbf{l} \in \mathbb{R}^{12}$. Consequently, the dimensionality of the concatenated feature vector is $2 + 12 + 12 = 26$.
This fused high-dimensional feature vector is subsequently employed as the input for the clustering algorithm. The concatenated feature vector P is standardized by applying z-score normalization, in which each feature dimension is transformed to have zero mean and unit variance. This procedure eliminates the influence of differences in feature scales on the clustering results.
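A minimal sketch of this fusion-and-clustering step is shown below, assuming the per-token formant, MFCC, and LPCC vectors have already been reduced to the fixed dimensions stated above; the function name and the placeholder data are illustrative, not part of the original analysis code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def fuse_and_cluster(formants, mfcc, lpcc, n_clusters=7, seed=0):
    """Concatenate the three feature views, z-score every dimension,
    and run K-means on the fused 26-dimensional representation.
    formants: (n_tokens, 2), mfcc: (n_tokens, 12), lpcc: (n_tokens, 12)."""
    fused = np.hstack([formants, mfcc, lpcc])        # (n_tokens, 26)
    fused = StandardScaler().fit_transform(fused)    # zero mean, unit variance per dimension
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(fused)
    return labels, fused

# Hypothetical inputs: 28 vowel tokens with placeholder values.
rng = np.random.default_rng(0)
labels, fused = fuse_and_cluster(rng.normal(size=(28, 2)),
                                 rng.normal(size=(28, 12)),
                                 rng.normal(size=(28, 12)))
print(fused.shape, np.bincount(labels))
```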
The advantage of the fused feature model lies in its ability to capture speech characteristics from multiple perspectives by integrating information from physiological (formants), auditory perception (MFCC), and physical model (LPCC) levels. This enables a more comprehensive characterization of vowel differences between children with cochlear implants and children with normal hearing. By combining diverse features, the fused feature clustering enhances both accuracy and robustness. Specifically, feature fusion clustering better captures the multidimensional characteristics of speech signals, enabling more accurate differentiation between the speech features of children with cochlear implants and children with normal hearing.
2.4. Performance Metrics
When evaluating the performance of clustering or classification models, precision, recall, and F-score are three core metrics. They measure the model’s effectiveness from different perspectives, helping us gain a more comprehensive understanding of its performance.
Precision focuses on “whether the correct items were identified.” It measures the proportion of truly correct results among all outcomes identified by the model, calculated as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
where TP represents true positives, i.e., the number of samples correctly identified as positive by the model; FP represents false positives, i.e., the number of samples incorrectly identified as positive by the model. A high precision indicates that the model’s judgments are highly reliable with a low false alarm rate.
Recall focuses on “how thoroughly it finds them.” It measures the proportion of all true positive samples that the model successfully identifies, calculated as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
FN (False Negative) represents the number of positive samples incorrectly classified as negative by the model. A high recall indicates the model effectively identifies most relevant targets with minimal omissions.
The F-score is a balanced evaluation metric that integrates both precision and recall. Because these two measures can be mutually constraining (e.g., an increase in recall may reduce precision), considering either metric in isolation may yield an incomplete assessment of model performance. The F-score addresses this issue by computing the harmonic mean of precision and recall, thereby providing a comprehensive measure of accuracy and completeness. It is formally defined as follows:
$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Therefore, an ideal clustering model should possess both high precision and high recall, resulting in a correspondingly high F-score.
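Because K-means produces arbitrary cluster indices, each cluster must first be mapped to a ground-truth vowel category before these metrics can be computed. The sketch below uses a majority-vote mapping and macro-averaged scores as one plausible realization; the paper does not specify its exact matching procedure, so this mapping is an assumption, and the labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def cluster_metrics(true_labels, cluster_labels):
    """Map each cluster to its majority ground-truth class, then compute
    macro-averaged precision, recall, and F-score."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    mapped = np.empty_like(true_labels)
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        mapped[cluster_labels == c] = np.bincount(members).argmax()  # majority vote
    return (precision_score(true_labels, mapped, average='macro', zero_division=0),
            recall_score(true_labels, mapped, average='macro', zero_division=0),
            f1_score(true_labels, mapped, average='macro', zero_division=0))

# Hypothetical labels for 10 tokens of 2 vowel categories.
p, r, f = cluster_metrics([0, 0, 0, 1, 1, 1, 1, 0, 0, 1],
                          [2, 2, 2, 5, 5, 2, 5, 2, 5, 5])
print(f"precision={p:.2f}, recall={r:.2f}, F-score={f:.2f}")
```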
3. Results
To investigate the influence of different acoustic features on vowel clustering, this study conducted separate cluster analyses using four approaches: formants, LPCC, MFCC, and multi-feature fusion.
3.1. Single-View Clustering Analysis
The F1-F2 vowel space serves to visualize the acoustic properties of vowels, providing an intuitive representation of tongue position, facilitating comparisons of pronunciation differences across individuals or groups, supporting normalization and statistical analyses, and assisting in the assessment of speech development and hearing research. F1 and F2 correspond to the vertical (height) and horizontal (front–back) dimensions of tongue placement, respectively. An F1-F2 vowel space chart therefore offers a tangible representation of vowel articulation, transforming abstract acoustic data into an interpretable visual form.
K-means clustering revealed markedly different vowel-space distributions for the two groups (
Figure 2 and
Figure 3). Data from the normal-hearing children (
n = 56) formed tight, elliptical clusters with crisp boundaries and minimal overlap, whereas the CI group (
n = 14) produced looser, more diffuse clusters whose ellipses were larger and showed substantial overlap across vowel categories.
Between-group comparisons along the F1-F2 vowel space chart revealed that CI children exhibited markedly greater variability along the F2 dimension. The scatter in the 1000–2000 Hz range was substantially broader than that of the control group, and the cluster ellipses were notably elongated along the F2 axis, whereas F1 dispersion remained relatively constrained. This divergence was most pronounced in the high-F2 region, corresponding to front vowels. Separability measures further showed that NH children’s vowel categories were well discriminated. These findings demonstrate a systematic difference in vowel acoustics: CI children show larger F2 variability and diminished vowel distinctiveness. Notably, this pattern does not necessarily indicate compromised accuracy of tongue front/back positioning. Instead, the divergence is more likely attributable to the distinct acoustic stimulation experienced by CI children (compared to NH children), which in turn shapes differences in their articulatory patterns. Caution is warranted because the CI cohort is highly heterogeneous in degree of loss, age at onset, and intervention history, all of which may modulate speech development and contribute to the observed intra-group variance. The relatively small sample size (n = 14) may also limit the stability of these results.
To quantify this difference in vowel category distinctiveness, we calculated the variance explanation rate of the vowel data in the F1-F2 feature space. Specifically, the variance explanation rate of vowel data for NH children reached 84.73%, while that for CI children was 79.57%. This difference directly reflects the gap in the standardization of vowel articulation between the two groups. A higher variance explanation rate indicates that the seven vowel categories of NH children have good separability and discriminability, and their articulation patterns are closer to the standard phonetic system. In contrast, due to hearing loss, CI children exhibit greater individual differences in articulation, with unclear boundaries between vowel categories, which consequently leads to a lower variance explanation rate. This result is consistent with the characteristic of missing auditory feedback in the speech acquisition process of hearing-impaired children.
To reduce the dimensionality of LPCCs and MFCCs while preserving maximum variability in the original acoustic features, principal component analysis (PCA) was employed. The variance contribution ratio of each principal component (PC)—a metric reflecting the extent to which the component captures original feature variation—was calculated as the ratio of the target PC’s eigenvalue to the sum of all eigenvalues from the feature covariance matrix. PCA-based clustering of LPCC and MFCC features revealed distinct acoustic-space distributions for the two groups (
Figure 4,
Figure 5,
Figure 6 and
Figure 7). For LPCC, PC1 accounted for 31.3% of the variance in NH children and 31.0% in the CI group, indicating comparable explanatory power. Nevertheless, the NH children formed markedly tighter clusters in the PC1-PC2 plane, with high within-class cohesion and well-defined between-class boundaries. In contrast, the children with CIs showed a more dispersed pattern, larger inter-category overlap, and fuzzier decision boundaries. An identical trend was observed for MFCC: PC1 accounted for 21.9% of the variance in the NH group versus 24.3% in the CI group, yet the former showed denser, better-demarcated clusters, while the latter was characterized by lower feature-dimensional complexity stemming from articulatory deviations or noise dominance, ultimately resulting in prominent category overlap. To further quantify the clustering results, we show the clustering indicators for NH and CI children in
Table 1 and
Table 2.
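As a brief illustration of the variance contribution computation described above (each PC's eigenvalue divided by the sum of all eigenvalues), the following sketch uses scikit-learn's PCA on a placeholder LPCC (or MFCC) matrix; the data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 12-dimensional coefficient matrix: tokens x coefficients.
X = np.random.default_rng(0).normal(size=(200, 12))
pca = PCA(n_components=2).fit(X)
# explained_variance_ratio_ gives each PC's eigenvalue divided by the eigenvalue sum.
pc1, pc2 = pca.explained_variance_ratio_
print(f"PC1 explains {pc1:.1%}, PC2 explains {pc2:.1%} of the variance")
scores = pca.transform(X)   # PC1-PC2 coordinates of the kind used for the clustering plots
```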
3.2. Multi-View Clustering Analysis
To enhance representational capacity, we conducted clustering on the fused-feature set (
Figure 8 and
Figure 9). After min–max normalization, the combined representation yielded clearer separation in the principal-component space than any single-feature model. Clustering accuracy rose to 65.82% for normal-hearing children (PC1 variance = 62.4%) and to 56.12% for the cochlear-implant group (PC1 variance = 55.5%), surpassing the corresponding single-feature results. Visually, fusion tightened within-class cohesion and enlarged between-class margins along PC1-PC2, indicating that integrating multiple acoustic domains provides a more comprehensive description of the speech signal. Nevertheless, the implant group remained markedly less accurate than the normal-hearing group, and the inter-group gap did not appreciably narrow, underscoring that the intrinsic variability inherent to pediatric cochlear-implant users persists even after feature fusion. The modest absolute accuracies further highlight the complexity of the problem and suggest that more sophisticated feature-extraction or classification techniques will be required.
In this study, we systematically evaluated the discriminative efficacy of several widely used acoustic features, namely formants, MFCCs, LPCCs, and their combinations, for clustering children’s speech. Considering both children with cochlear implants and children with normal hearing, we compared the resulting cluster assignments with ground-truth labels and computed precision, recall, and F-score. The results, summarized in
Table 1 and
Table 2, demonstrate the effectiveness and robustness of each feature set in capturing phonetic distinctions between the two groups, providing empirical guidance for the selection of an optimal feature extraction strategy.
As shown in
Table 1, the clustering results for normal-hearing children’s speech exhibit substantial variability in performance across different acoustic features. Using formants alone as input features yields an F-score of 0.5677, with both precision and recall maintained at relatively high levels. This observation aligns with phonetic theory, as formant frequencies serve as primary acoustic cues for vowel differentiation. They accurately reflect tongue position (height and frontness) and oral cavity configuration, thereby providing strong discriminative power in vowel category clustering. In contrast, clustering based on MFCCs or LPCCs as individual features is markedly less effective. The F-score for MFCCs is only 0.0591, approaching random assignment, indicating that, without additional dimensionality reduction or appropriate modeling, MFCCs alone struggle to capture distinctions among children’s vowels. LPCCs perform slightly better, with an F-score of 0.3705, yet still exhibit limited discriminative capability for this task. This limitation likely arises from the fact that MFCCs and LPCCs are primarily designed to capture the overall spectral envelope of speech signals, rather than vowel-specific formant information, which reduces their effectiveness in distinguishing vowel categories within an unsupervised clustering framework. When formants, MFCCs, and LPCCs are combined, clustering performance improves substantially. The fused feature set achieves an F-score of 0.6630, with both precision and recall enhanced, surpassing the performance of any individual feature. This improvement suggests the presence of complementary information among the different features: formants effectively capture primary articulatory differences of vowels, while MFCCs and LPCCs encode additional spectral or dynamic characteristics. Their integration allows clustering to represent speech acoustics across multiple dimensions, providing a more comprehensive acoustic representation and thereby enhancing both the accuracy and robustness of the clustering results.
Table 2 presents the clustering results for children with CIs, showing trends similar to those of the normal-hearing group but with overall lower performance. Specifically, when using formants alone, the F-score is 0.5082, which is noticeably lower than the 0.5677 achieved by the normal-hearing group. This discrepancy suggests that children with cochlear implants generally exhibit less stable and less consistent vowel production, leading to reduced separability in the acoustic feature space. The clustering performance of LPCC is also weak, with values lower than those of the normal-hearing group, which aligns with the greater variability typically observed in the speech of children with cochlear implants. It is noteworthy that even under the condition of unstable vowel production in children with cochlear implants, feature fusion still yields improvements. The F-score rises to 0.5534, indicating that fused features can partially compensate for the limitations of single features and facilitate better differentiation of speech categories.
Taken together, the results from both groups lead to the following conclusions: Formants are the core features for vowel differentiation, achieving the best performance under single-feature conditions. MFCC and LPCC exhibit limited discriminative power individually, but enhance the overall performance when combined with formants. Feature fusion consistently outperforms single features for both children with normal hearing and children with cochlear implants, demonstrating its robustness and generalizability in speech clustering tasks. These findings carry not only methodological significance but also potential practical implications. For normal-hearing children, feature fusion can more comprehensively capture vowel category differences, thereby offering a more precise tool for speech research. For children with cochlear implants, the method enhances reliability in the presence of high variability, providing more sensitive and objective acoustic indicators for quantitative assessment of speech disorders, as well as for monitoring and tracking rehabilitation outcomes.
4. Discussion
This study provides a systematic comparison of the acoustic–phonetic properties of vowels produced by NH children and CI children, corroborating the efficacy of multi-feature fusion for vowel clustering and revealing robust between-group differences. Owing to degraded auditory feedback, CI children exhibit reduced articulatory precision and coordination, resulting in notably lower within-category stability and between-category separability relative to their NH peers.
Quantitative evaluation further demonstrates that unimodal cues (e.g., formants alone) afford only modest discriminative power, whereas fusion of formants with MFCC and LPCC coefficients yields substantial gains in clustering accuracy and robustness. By integrating complementary spectral, cepstral and articulatory information, the fused representation captures higher-order acoustic distinctions inaccessible to single-feature models, thereby furnishing a more comprehensive phonetic profile.
Nevertheless, even with the application of the multi-feature fusion method, the vowel clustering performance of CI children remained inferior to that of NH controls, with a quantifiable performance gap between the two groups. This finding underscores the multifactorial nature of HI-related speech deficits. The persistent performance gap between the two groups stems not only from isolated acoustic parameter deviations, but also from systemic disruptions across motor planning, somatosensory monitoring, and auditory–motor integration. It is important to emphasize, however, that such phonetic articulatory differences—manifested in vowel production in the present study—do not inherently equate to functional impairment in either speech understanding or hearing skills. Phonemic variations between HI and NH children may exist as a subtle, measurable acoustic phenomenon while still allowing CI children to achieve adequate communicative competence. This distinction aligns with the ultimate goal of cochlear implantation and hearing intervention: to support CI individuals in attaining a normal quality of life with satisfactory hearing and speech function, rather than pursuing identical acoustic-phonetic patterns to NH peers. Thus, multi-feature fusion not only provides methodological innovation for vowel acoustic analysis, but also furnishes a clinically practical objective quantitative metric for profiling and tracking speech development in CI populations—one that can distinguish between benign articulatory differences and clinically meaningful deficits requiring targeted intervention.
While the present study yields valuable insights into vowel production deficits in CI children and the utility of multi-feature fusion, it is important to acknowledge several inherent limitations that should be addressed in future research. First, the sample size of pediatric CI users remains relatively small, which may constrain the generalizability of our conclusions. Second, multiple confounding variables associated with CI devices and clinical protocols could not be fully controlled, including heterogeneity in device manufacturers, electrode array configurations, and sound-processing strategies, the absence of patient-specific data (e.g., electrode insertion depth, preoperative/postoperative imaging records) and detailed rehabilitation trajectories (e.g., frequency and intensity of speech therapy), as well as unaccounted variations in clinical mapping parameters, all of which hinder the identification of specific factors driving inter-individual variability in vowel production. Third, the acoustic analysis was confined to spectral and cepstral features (formants, MFCCs, LPCCs) without integration with articulatory physiological data (e.g., tongue movement trajectories captured via ultrasound or electromagnetic articulography), preventing a comprehensive understanding of the link between acoustic deviations and underlying motor control deficits in CI children. Fourth, the cross-sectional study design only provides a snapshot of vowel production abilities at a single time point, failing to capture the dynamic changes in speech development over the course of CI use and rehabilitation. Finally, the proposed multi-feature clustering framework remains a preliminary analytical tool and has not yet been validated in clinical settings, limiting its immediate translational utility for speech assessment and intervention.
To address these limitations and advance the field, future research should prioritize several key directions. First, expanding the sample size to include a larger, more diverse cohort of pediatric CI users is essential. This should be accompanied by systematic collection of detailed clinical data, including device specifications, mapping parameters, and rehabilitation histories, enabling stratified analyses to isolate the specific impacts of each variable on vowel production outcomes. Future prospective studies should incorporate detailed preoperative and postoperative imaging data, patient-specific electrode array characteristics, and our acoustic analysis framework. Integrating these components will allow for a more precise examination of how place–pitch alignment influences speech production outcomes in children with cochlear implants. Therefore, future studies that control for or document detailed mapping parameters would be valuable in analyzing the specific contribution of this electrophysiological factor to speech production outcomes. Second, integrating acoustic analyses with articulatory physiological measurements will facilitate a mechanistic understanding of how auditory deprivation disrupts the coordination of vocal tract movements, bridging the gap between acoustic patterns and underlying motor processes. Third, adopting a longitudinal study design to track speech development in CI children over extended periods will uncover the dynamic relationships between implantation age, CI usage duration, rehabilitation intensity, and improvements in vowel articulation precision. Fourth, validating the multi-feature clustering framework in clinical practice is critical. Future studies should test its efficacy as a quantitative assessment tool for evaluating intervention outcomes, and further optimize the framework by incorporating advanced machine learning algorithms (e.g., deep learning-based clustering models) to enhance classification accuracy and robustness. Fifth, a particularly valuable extension of the present work would be to replicate this acoustic analysis in children with hearing aids (HAs), followed by a three-way comparative investigation involving HA users, CI children, and their normal-hearing peers. This comparative framework holds clinical relevance, as Portelli et al. [
33] utilized the LittlEARS Auditory Questionnaire (LEAQ) to demonstrate comparable hearing performance between children fitted with HAs and those with CIs. Such a replication and extension would not only verify the generalizability of our multi-feature fusion framework across different hearing intervention modalities but also clarify whether the vowel production deficits observed in CI children are shared or distinct in hearing aid users, thereby providing more targeted insights for clinical rehabilitation. Finally, extending the research scope to include more complex speech tasks (e.g., connected speech, conversational speech) rather than isolated vowels will provide a more ecologically valid assessment of speech function in CI children, thereby enhancing the real-world applicability of the findings.
Collectively, this study demonstrates the clinical potential of the multi-view fusion model for enhancing the management of speech production in children with CIs. The model translates into practice by providing an objective assessment tool that quantifies vowel production to pinpoint specific acoustic deficits beyond subjective ratings, thereby enabling personalized intervention through the identification of acoustically confusable vowels. Furthermore, it establishes a quantitative framework for longitudinal monitoring, allowing for the data-driven tracking of rehabilitation progress and the evaluation of intervention efficacy over time. In summary, this method offers a pathway toward more precise and individualized clinical management of speech production in pediatric CI users.
5. Conclusions
This study systematically investigated the acoustic characteristics of vowel production in CI children by comparing them with age-matched NH peers, and verified the effectiveness of the multi-view fusion model for vowel clustering analysis. In terms of clustering performance, the multi-view fusion model outperformed single-feature approaches across precision, recall, and F-score: the fused method achieved a precision of 0.6845, recall of 0.6582, and F-score of 0.6630 for NH children, which surpassed the performance of single features such as formants (0.6245/0.6008/0.5677). For CI children, the fused model also yielded a precision of 0.5557, recall of 0.5612, and F-score of 0.5534, which was likewise higher than the clustering results of single-feature methods. These findings confirm that the multi-view fusion strategy exhibits superior robustness compared with single-feature clustering approaches. Despite limitations including a small sample size, lack of integration with articulatory physiological data, and insufficient clinical validation, the proposed multi-view fusion framework provides an objective and quantitative tool for speech assessment and rehabilitation monitoring in CI children. Future research can further optimize this framework by expanding sample diversity, integrating physiological indicators such as tongue movements, adopting longitudinal tracking designs, and advancing clinical translation, thereby offering more precise support for personalized speech rehabilitation in CI children.
Author Contributions
Conceptualization, Q.X., J.W., L.D., L.Z. and Y.L.; Data curation, Q.X., L.Z. and Y.L.; Investigation, Q.X., L.Z. and Y.L.; Methodology, Q.X., J.W. and L.D.; Project administration, Q.X. and L.D.; Resources, Q.X., J.W., L.D., L.Z. and Y.L.; Software, Q.X., J.W. and L.D.; Supervision, Q.X. and L.D.; Validation, Q.X., J.W., L.D., L.Z. and Y.L.; Writing—original draft, Q.X., J.W., L.D., L.Z. and Y.L.; Writing—review and editing, Q.X., J.W., L.D., L.Z. and Y.L. All authors have read and agreed to the published version of the manuscript.
Funding
This study was supported by the Fundamental Research Funds for the Central Universities JBKYKJCX2024-3.
Data Availability Statement
The data supporting the findings of this study are not publicly available due to privacy and ethical considerations related to the participation of children with hearing impairment and age-matched children with normal hearing in this research.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- McGowan, R.W.; McGowan, R.S.; Denny, M.; Nittrouer, S. A longitudinal study of very young children’s vowel production. J. Speech Lang. Hear. Res. 2014, 57, 1–15. [Google Scholar] [CrossRef]
- Haider, F.; Farooq, N.; Fatima, H.; Mansoor, R.; Bashir, N.; Qadeer, A.; Saqulain, G. Acoustic characteristics of vowel production in Urdu-speaking children with cochlear implant. Kurdish Stud. 2024, 12, 514–521. [Google Scholar]
- Fagniart, S.; Delvaux, V.; Harmegnies, B.; Huberlant, A.; Huet, K.; Piccaluga, M.; Watterman, I.; Charlier, B. Nasal/oral vowel perception in French-speaking children with cochlear implants and children with typical hearing. J. Speech Lang. Hear. Res. 2024, 67, 1243–1267. [Google Scholar] [CrossRef]
- Geers, A.E.; Davidson, L.S.; Uchanski, R.M.; Nicholas, J.G. Interdependence of linguistic and indexical speech perception skills in school-age children with early cochlear implantation. Ear Hear. 2013, 34, 562–574. [Google Scholar] [CrossRef]
- Yang, J.; Brown, E.; Fox, R.A.; Xu, L. Acoustic properties of vowel production in prelingually deafened Mandarin-speaking children with cochlear implants. J. Acoust. Soc. Am. 2015, 138, 2791–2799. [Google Scholar] [CrossRef]
- Yang, J.; Fox, R.A. Acoustic development of vowel production in native Mandarin-speaking children. J. Int. Phon. Assoc. 2019, 49, 33–51. [Google Scholar] [CrossRef]
- Baudonck, N.; Van Lierde, K.; Dhooge, I.; Corthals, P. A comparison of vowel productions in prelingually deaf children using cochlear implants, severe hearing-impaired children using conventional hearing aids and normal-hearing children. Folia Phoniatr. Logop. 2011, 63, 154–160. [Google Scholar] [CrossRef]
- Uchanski, R.M.; Geers, A.E. Acoustic characteristics of the speech of young cochlear implant users: A comparison with normal-hearing age-mates. Ear Hear. 2003, 24, 90S–105S. [Google Scholar] [CrossRef]
- Liker, M.; Mildner, V.; Šindija, B. Acoustic analysis of the speech of children with cochlear implants: A longitudinal study. Clin. Linguist. Phon. 2007, 21, 1–11. [Google Scholar] [CrossRef] [PubMed]
- Ertmer, D.J.; Goffman, L.A. Speech production accuracy and variability in young cochlear implant recipients: Comparisons with typically developing age-peers. J. Speech Lang. Hear. Res. 2011, 54, 177–189. [Google Scholar] [CrossRef] [PubMed]
- Story, B.H.; Bunton, K. A spectral filtering method for tracking formants in children’s speech. Proc. Mtgs. Acoust. 2015, 23, 060002. [Google Scholar] [CrossRef]
- Hung, Y.-C.; Lee, Y.-J.; Tsai, L.-C. Vowel production of Mandarin-speaking hearing aid users with different types of hearing loss. PLoS ONE 2017, 12, e0178588. [Google Scholar] [CrossRef]
- Gargouri, D.; Kammoun, M.A.; Hamida, A.B. A comparative study of formant frequencies estimation techniques. In Proceedings of the 5th WSEAS International Conference on Signal Processing, Istanbul, Turkey, 27–29 May 2006; pp. 15–19. [Google Scholar]
- Simon, O.; Ugarte, J.P. Vowel characterization of Spanish speakers from Antioquia-Colombia using a specific-parameterized discrete wavelet transform analysis. Appl. Acoust. 2021, 172, 107635. [Google Scholar] [CrossRef]
- Loni, D.Y.; Subbaraman, S. Formant estimation of speech and singing voice by combining wavelet with LPC and Cepstrum techniques. In Proceedings of the 2014 9th International Conference on Industrial and Information Systems (ICIIS), Gwalior, India, 15–17 December 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 1–7. [Google Scholar] [CrossRef]
- Dhingra, S.D.; Nijhawan, G.; Pandit, P. Isolated speech recognition using MFCC and DTW. Int. J. Adv. Res. Electr. Electron. Instrum. Eng. 2013, 2, 4085–4092. [Google Scholar]
- Aouani, H.; Ben Ayed, Y. Emotion recognition in speech using MFCC with SVM, DSVM and auto-encoder. In Proceedings of the 2018 4th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Sousse, Tunisia, 21–24 March 2018; pp. 1–5. [Google Scholar] [CrossRef]
- You, S.D.; Wu, Y.C.; Peng, S.H. Comparative study of singing voice detection methods. Multimed. Tools Appl. 2016, 75, 15509–15524. [Google Scholar] [CrossRef]
- Glowacz, A.; Glowacz, W. Diagnostic of synchronous motor based on analysis of acoustic signals with application of LPCC and nearest mean classifier with cosine distance. Arch. Metall. Mater. 2010, 55, 563–569. [Google Scholar]
- Trabelsi, I.; Ben Ayed, D. On the use of different feature extraction methods for linear and non linear kernels. arXiv 2014, arXiv:1406.7314. [Google Scholar] [CrossRef]
- Yang, H.; Deng, Y.; Zhao, H.-A. A comparison of MFCC and LPCC with deep learning for speaker recognition. In Proceedings of the 4th International Conference on Big Data and Computing (ICBDC ’19), Guangzhou, China, 10–12 May 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 160–164. [Google Scholar] [CrossRef]
- Zailan, M.K.N.; Ali, Y.M.; Noorsal, E.; Abdullah, M.H.; Saad, Z.; Leh, A.M. Comparative analysis of LPC and MFCC for male speaker recognition in text-independent context. ESTEEM Acad. J. 2023, 19, 101–112. [Google Scholar] [CrossRef]
- Misra, S.; Das, T.K.; Saha, P.; Baruah, U.; Laskar, R.H. Comparison of MFCC and LPCC for a fixed phrase speaker verification system, time complexity and failure analysis. In Proceedings of the 2015 International Conference on Circuits, Power and Computing Technologies (ICCPCT-2015), Nagercoil, India, 19–20 March 2015; pp. 1–4. [Google Scholar] [CrossRef]
- Patil, S.; Kharate, G. Comparison of speech emotion recognition performance with different system features. In Proceedings of the 3rd International Conference on Contents, Computing & Communication (ICCCC-2022), Bangalore, India, 15–16 December 2022. SSRN 4043566. [Google Scholar] [CrossRef]
- Adank, P.; Smits, R.; van Hout, R. A comparison of vowel normalization procedures for language variation research. J. Acoust. Soc. Am. 2004, 116, 3099–3107. [Google Scholar] [CrossRef]
- Lee, J.; Shaiman, S.; Weismer, G. Relationship between tongue positions and formant frequencies in female speakers. J. Acoust. Soc. Am. 2016, 139, 426–440. [Google Scholar] [CrossRef]
- Lobanov, B.M. Classification of Russian vowels spoken by different listeners. J. Acoust. Soc. Am. 1971, 49, 606–608. [Google Scholar] [CrossRef]
- Thomas, E.R.; Kendall, T. NORM: The Vowel Normalization and Plotting Suite. 2007. Available online: https://lingtools.uoregon.edu/norm/ (accessed on 14 December 2025).
- Terasawa, H.; Slaney, M.; Berger, J. A timbre space for speech. In Proceedings of the Interspeech 2005, Lisbon, Portugal, 4–8 September 2005; pp. 1729–1732. [Google Scholar] [CrossRef]
- Rabiner, L.R.; Juang, B.-H. Fundamentals of Speech Recognition; Prentice Hall: Englewood Cliffs, NJ, USA, 1993. [Google Scholar]
- Shahin, I. Speaker identification in emotional environments. Iran. J. Electr. Comput. Eng. 2009, 8, 41–46. [Google Scholar]
- Sinha, R.; Azadpour, M. Employing deep learning model to evaluate speech information in acoustic simulations of Cochlear implants. Sci. Rep. 2024, 14, 24056. [Google Scholar] [CrossRef] [PubMed]
- Portelli, D.; Lombardo, C.; Loteta, S.; Galletti, C.; Azielli, C.; Ciodaro, F.; Mento, C.; Aguennouz, M.; Rosa, G.D.; Alibrandi, A.; et al. Exploring the hearing improvement and parental stress in children with hearing loss using hearing aids or cochlear implants. J. Clin. Med. 2025, 14, 2. [Google Scholar] [CrossRef]