Article

Speaker Profiling Based on the Short-Term Acoustic Features of Vowels

by Mohammad Ali Humayun 1,2,*, Junaid Shuja 3 and Pg Emeroylariffion Abas 2,*

1 Department of Computer Science, Information Technology University, Lahore 54000, Pakistan
2 Faculty of Integrated Technologies, Universiti Brunei Darussalam, Bandar Seri Begawan BE1410, Brunei
3 Department of Computer and Information Sciences, Universiti Teknologi Petronas, Seri Iskandar 32610, Malaysia
* Authors to whom correspondence should be addressed.

Technologies 2023, 11(5), 119; https://doi.org/10.3390/technologies11050119
Submission received: 5 July 2023 / Revised: 26 August 2023 / Accepted: 4 September 2023 / Published: 7 September 2023
(This article belongs to the Section Information and Communication Technologies)

Abstract

Speech samples can provide valuable information regarding speaker characteristics, including their social backgrounds. Accent variations associated with speaker backgrounds are reflected in the acoustic features of speech, and these acoustic variations can be analyzed to assist in tracking down criminals from speech samples available as forensic evidence. Speech accent identification has recently received significant consideration in the speech forensics research community. However, most works have utilized long-term temporal modelling of acoustic features for accent classification and disregarded the stationary acoustic characteristics of particular phoneme articulations. This paper analyzes short-term acoustic features extracted from a central time window of English vowel speech segments for accent discrimination. Various feature computation techniques have been compared for the accent classification task. It has been found that using spectral features as input gives better performance than using cepstral features, with the lower filters contributing more significantly to the classification task. Moreover, a detailed analysis of the time window durations and frequency bin resolutions used to compute short-term spectral features is presented with respect to accent discrimination. Longer time durations generally require higher frequency resolution to optimize classification performance. These results are significant, as they show the benefits of using spectral features for speaker profiling despite the popularity of cepstral features for other speech-related tasks.

1. Introduction

Modern communication tools can easily be misused by criminal elements to disguise their identities while committing a crime. Coordination of criminal activities, threats, and ransom calls are a few common examples of such misuse. In this regard, profiling an unseen speaker becomes critical. The problem is challenging because the speech used for criminal purposes is commonly of short duration. Traditionally, speech accent experts have worked with forensic investigation agencies to identify suspects or associate them with geographical regions based on speech samples available as crime evidence. With the success of deep learning in speech and speaker recognition, the process of speaker profiling can be automated to assist the forensic community. Understanding how acoustic characteristics vary with speech accents, and hence with speaker backgrounds, is critical for automatic speaker profiling.
Human speech is the output of the vocal tract's response to glottal excitation signals. Pronunciation, or accent, mainly depends on the acoustics produced by varying vocal tract shapes, known as phoneme articulation attributes. The glottal excitation signal comprises multiple discrete-frequency sound waves generated by the vibrating vocal cords, whilst the vocal tract shapes produce different standing waves corresponding to the phonemes for these excitation signals. The standing waves arise because the vocal tract acts like a closed or half-open tube, causing reflected waves to interfere with each other and allowing only particular frequencies, determined by the length of the tube, to be produced [1].
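As an illustration of this tube model (a standard acoustic-phonetics approximation rather than a result from this study), a uniform vocal tract of length L, closed at the glottis and open at the lips, resonates at odd multiples of the quarter-wavelength frequency:

```latex
% Resonant frequencies of a uniform tube closed at one end and open at the other;
% c denotes the speed of sound in air.
f_n = \frac{(2n-1)\,c}{4L}, \qquad n = 1, 2, 3, \ldots
% For L \approx 17 cm and c \approx 343 m/s this gives
% f_1 \approx 500 Hz, f_2 \approx 1500 Hz, f_3 \approx 2500 Hz,
% the classic formant estimates for a neutral vowel.
```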
Since the vocal tract shape and resulting acoustic wave keep changing during speech production, the spectral representation for the complete signal cannot reflect the temporal changes. Hence, input features for speech processing models are usually extracted from overlapping time windows of a small duration. The time duration of the feature extraction window has to be small enough for the vocal tract shape to be assumed constant within the window. Acoustic features extracted from these small quasi-stationary time windows are termed short-term features. Speaker-level classification models compute a long-term temporal representation for the sequence of short-term features, as individual phoneme boundaries are irrelevant for output classification [2,3]. However, the acoustic discrimination between accents for individual vowel pronunciations can be analyzed more effectively using a single stationary window. For instance, formant positions corresponding to vowel articulations can be represented by the spectrum of the middle time window from the speech segment corresponding to the vowel. The length of the time window for extracting these spectral characteristics has to be just enough to represent attributes for the vowel only. A large window time frame results in time segment uncertainty for acoustics representing the vowel, as it will capture the neighbouring transients alongside the vowel articulation resonant frequencies. On the other hand, a small window time frame results in frequency uncertainty for acoustic representation as it might not be long enough to capture lower frequencies that change over a longer duration of time [4].
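For concreteness, the sketch below (a minimal illustration, not the authors' implementation; the function name and the synthetic signal are assumptions) extracts a single window of a chosen duration from the middle of a vowel segment and prints the resulting FFT bin spacing, making the time-frequency trade-off explicit: halving the window duration doubles the spacing between frequency bins.

```python
import numpy as np

def central_window(segment: np.ndarray, sr: int, win_ms: float) -> np.ndarray:
    """Return a single window of win_ms milliseconds taken from the middle
    of a vowel segment (1-D array of samples at sampling rate sr)."""
    win_len = int(sr * win_ms / 1000)
    centre = len(segment) // 2
    start = max(0, centre - win_len // 2)
    return segment[start:start + win_len]

# Time-frequency trade-off: the spacing between FFT bins is 1/T Hz, so a
# 12.5 ms window resolves frequencies only to ~80 Hz, whereas a 50 ms window
# resolves them to ~20 Hz (at a 16 kHz sampling rate).
sr = 16000                                   # assumed sampling rate
segment = np.random.randn(sr // 10)          # stand-in for a 100 ms vowel segment
for win_ms in (12.5, 25.0, 50.0):
    win = central_window(segment, sr, win_ms)
    spectrum = np.abs(np.fft.rfft(win))
    bin_spacing_hz = sr / len(win)
    print(f"{win_ms} ms window -> {len(spectrum)} bins, {bin_spacing_hz:.1f} Hz per bin")
```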
This paper evaluates the acoustic features extracted from a single time window in the middle of speech segments representing different vowels. The length of the quasi-stationary time windows and multiple types of acoustic features computed from these windows have been analyzed in terms of their effectiveness in discriminating between speakers with different Indian native languages based on their second-language English speech. The main contributions of this paper can be summarized as follows:
  • This paper compares and analyzes multiple spectral and cepstral features of short speech segments corresponding to English vowels. The speech features have been evaluated for speaker profiling tasks with various time segment lengths and frequency resolution bins;
  • The evaluation has been performed using feature importance measures as well as classification performance;
  • Results indicate that certain vowels are more effective than others for speaker profiling and that spectral features are more useful than cepstral features. Performance generally improves with longer time windows and more frequency bins, but the optimal number of bins and the time segment length depend on each other;
  • Finally, the implementation code for the presented analysis has been made publicly available.
This paper is organized as follows. Section 2 reviews the recent literature regarding effective speech features for accent classification. Section 3 explains the methodology used for this research. Section 4 presents the key findings of this research, and Section 5 presents the conclusion and future direction.

2. Literature Review

Input feature representation is significant for biometric recognition, and considerable research has been devoted to investigating suitable feature representations [5]. Speech processing tasks have mainly used spectral or cepstral features as input. Recently, Watts et al. compared the utility of spectral and cepstral features for classifying vowel speech samples [6]. Another study analyzed the effectiveness of the sliding window for extracting spectral or cepstral features [7], whilst Paliwal et al. measured the impact of time window duration on speech recognition [8]. Research also shows that unsupervised compression of cepstral speech features further enhances classification accuracy for certain classification tasks [9].
Acoustic features of speech vary significantly for various speaker accents. Considerable research works have studied the variations in speech features with speaker accents [10,11]. Huang et al. [12] have also presented the challenges of accent variation for speech recognition models. On the other hand, Najafian et al. have shown that adding an accent classification module to adapt the speech recognition model significantly improves classification accuracy [13].
Research has indicated that vowels are more effective than consonants for distinguishing speaker accents, and hence, most models only use vowel segments for phoneme-based accent classification [14]. Another study [15] considerably improved accent classification accuracy by complementing a text-independent continuous speech accent classifier with a vowel-specific accent classifier. Phoneme-level accent classification models extract segments from speech corresponding to individual phonemes as input for the accent classification model. For un-transcribed speech, automatic speech recognition models are generally used to obtain the phoneme segments [16], whereas forced segmentation and alignment models can be used to obtain the phoneme segments for transcribed speech [17].
Najafian et al. [18] have used Support Vector Machines (SVMs) and Parallel Phone Recognition and Language Modeling (PPRLM) to classify the ‘Accents of the British Isles’ (ABI) dataset. Their model achieves an accuracy of 84.87%. On the other hand, Brown [19] has used the Accent Distance (ACCDIST)-based SVM classifier on the ‘Accent and Identity of Scottish English Border’ (AISEB) dataset for forensic accent profiling. The SVM classifier uses a spatial distance matrix between the vowel acoustic features, with the mean of midpoint Mel Frequency Cepstral Coefficients (MFCC) vectors for each vowel used as acoustic features to compute the spatial distance matrix [19]. The classifier achieves an accuracy of 86.7%. Rizwan and Anderson [20] have used word-based classification using MFCC features to classify American English accents in the ‘Texas Instruments, Massachusetts Institute of Technology (TIMIT)’ dataset, with the model achieving 77.88% accuracy. More recently, Ubale et al. [21] have proposed applying a CNN model directly to a raw audio waveform, and the model achieves an accuracy of 86.05% after fusion with i-vector classification on the Educational Testing Service (ETS) native language dataset. Table 1 summarizes the list of various speech features used for the speaker profiling task in the recent literature with the corresponding accuracies (Acc.) achieved by the chosen classification models.
Although significant research has focused on the utility of speech features for speaker profiling using long-term acoustic representations [3,23], few works have targeted analysis of single time window features without long-term modelling. Accent profiling from individual time windows for each vowel can more effectively highlight the discrimination in stationary phoneme articulation characteristics for particular accents.

3. Methodology

This study has analyzed the acoustic features of spoken English vowels in terms of profiling the speaker’s social origin. Figure 1 depicts a general model for speaker profiling using speech samples.

3.1. Dataset

This research has analyzed selected vowels of an English speech corpus of Indian speakers [24,25]. The speech dataset used in this study has been collected by Kalluri et al. [24] and comprises 345 speakers spread across India, with the speakers having different native languages. Speaker profiling models previously tested on this dataset have estimated the age and body parameters of the speakers; however, the dataset has not been used for accent profiling.
The data subset utilized for this research comprises English speech samples spoken by 345 Indian speakers. English is the second language of these speakers, and they belong to diverse administrative states within India, having distinct regional languages as their primary or native language. Consequently, the objective of this study is the identification of the native language of these speakers, considering the speech samples in their secondary language, i.e., English.
The speakers are distributed into five distinct categories, each aligned with a specific native language, namely, Hindi, Telugu, Tamil, Malayalam, and Kannada. These native languages are linked to distinct Indian states. The statistical breakdown of speakers belonging to the various native language groups is presented in Table 2.
For this research, the speech utterances from the dataset have been segmented at phoneme boundaries using the Montreal Forced Aligner speech alignment and segmentation model developed by McAuliffe et al. [17]. The model was fed with the speech utterances and their corresponding sentence transcripts, and it returned a TextGrid of start and stop times with phoneme annotations.
Vowels are produced by the variation of vocal tract configurations, leading to variations in oral cavity dimensions that result in divergent resonant frequencies characterized by substantial amplitudes. The location and strength of these resonant frequencies play a pivotal role in identifying the specific vowel phoneme. The Montreal forced alignment model used for phoneme segregation capitalizes on these intrinsic acoustic features of speech, along with their corresponding transcribed linguistic content, thus providing the segregation of vowel phonemes within the time domain. Five of the most popular and distinct vowels have been selected in this research for discriminatory feature analysis. The five selected vowels in the ARPABET representation [26] are AE (e.g., Bat), AH (e.g., Butt), IH (e.g., Bit), IY (e.g., Beat), and UW (e.g., Boot). While the ARPABET vowel set encompasses a total of 15 distinct vowel sounds, the analysis has focused on preserving the five most prominent ones. This selection is based on the frequency of vowel occurrences, a metric determined post-segmentation. Consequently, opting for the top five most frequently appearing vowels contributes to achieving a balanced distribution of vowel representation within the dataset. The number of instances segmented for each of the 5 vowels, along with their example usage, are given in Table 3.
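The sketch below illustrates how vowel segments could be sliced out of an utterance once the forced-alignment boundaries are available. It is a simplified, assumption-laden example (the phoneme list, stress-digit stripping, and helper names are hypothetical, and TextGrid parsing is omitted), not the exact pipeline used in this study.

```python
import numpy as np

SELECTED_VOWELS = {"AE", "AH", "IH", "IY", "UW"}   # ARPABET labels retained in this study

def vowel_segments(audio: np.ndarray, sr: int, phones):
    """Slice audio into segments for the selected vowels.

    `phones` is assumed to be a list of (label, start_s, end_s) tuples obtained
    by parsing the aligner's TextGrid output (parsing itself is omitted here).
    ARPABET labels in TextGrids often carry stress digits (e.g. 'AE1'), so any
    trailing digit is stripped before matching.
    """
    segments = []
    for label, start_s, end_s in phones:
        base = label.rstrip("012")
        if base in SELECTED_VOWELS:
            segments.append((base, audio[int(start_s * sr):int(end_s * sr)]))
    return segments

# Hypothetical example with made-up boundaries:
sr = 16000
audio = np.random.randn(sr)                        # stand-in for 1 s of speech
phones = [("B", 0.05, 0.10), ("AE1", 0.10, 0.22), ("T", 0.22, 0.27)]
print([(v, len(seg)) for v, seg in vowel_segments(audio, sr, phones)])
```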
Figure 2 presents the mean of the raw audio segments and the corresponding frequency transform for the 25-millisecond window from the middle of the vowel segments for all five vowels separately. The mean for every native language partition has been displayed with a separate colour. The overlapping plots for all native languages indicate that the native accents do not have marked segregation in raw audio and spectral representations. Figure 3 presents the mean of log filter bank energy features and MFCC features for the 5 vowels, with the mean of each native language represented by a unique colour. The segregation of plots indicates that log filter bank energy features are clearly more distinct between native languages as compared to MFCC or raw and spectral representations.

3.2. Speech Features Analysis

The performance of multiple acoustic features and their parameters has been compared on the selected vowel segments for native language identification of speakers. The log of Mel-scaled filter bank energies as the spectral representation [27] and MFCC as the cepstral representation [28] have been used in this research as acoustic features for the selected time frames. The effectiveness of the log filter bank energies as crafted input features for identifying native language has been compared with that of MFCC features. The cepstral coefficients, being the inverse transform of the spectral features, capture the envelope of the spectrum, with the lower coefficients representing energies in the smoother parts of the spectrum and the higher coefficients representing discrete spectral energies. Generally, automatic speech recognition (ASR) models are only concerned with vocal tract articulation when identifying the spoken phonemes. Hence, they mostly utilize only the smooth part of the spectral envelope, captured by the lower cepstral coefficients. However, accent classification or speaker profiling is a somewhat different task. As such, a comparison of the effectiveness of spectral features against the cepstral representation, specifically for the accent classification task, represents an important contribution of this research.
The filter bank energies are warped to the Mel-scale, which is based on experimental observations of human speech perception. The Mel-scale is approximately linear below 1 kHz and logarithmic above 1 kHz. Moreover, human perception of the loudness of a sound is also logarithmically related to its intensity; hence, the logarithm of the energies in the Mel-scale filter banks is utilized as the spectral feature, referred to as log filter bank features in this analysis.
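For reference, a commonly used analytic form of this warping (one of several conventions in the literature) maps a frequency f in Hz to the Mel scale as:

```latex
m(f) = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right)
% approximately linear below 1 kHz and logarithmic above it
```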
On the other hand, MFCC represents the discrete cosine transform coefficients of the log spectral energies; hence, the lower coefficients represent energies in smoother parts of the spectrum, whereas the higher coefficients represent more discrete spectral energies. Physically, the smooth or slowly changing spectral components correspond to the vocal tract response, whilst the discontinuous energies represent glottal excitations corresponding to the harmonics of the pitch of a sound. Since human hearing emphasizes different frequency ranges with different weights, a pre-emphasis stage is usually applied to the signal before the spectral and cosine transform computations used to obtain the cepstral coefficients.
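A minimal sketch of how both representations could be computed for a single quasi-stationary window is given below. The paper does not specify an implementation; this example assumes the python_speech_features package, with the window length and filter count mirroring values explored later (25 ms, 128 filters) and a synthetic signal standing in for a real vowel window.

```python
import numpy as np
from python_speech_features import logfbank, mfcc

def short_term_features(window: np.ndarray, sr: int, n_filters: int = 128):
    """Compute spectral (log Mel filter bank) and cepstral (MFCC) features
    for a single quasi-stationary window of speech samples."""
    win_s = len(window) / sr                   # window length in seconds
    # winstep == winlen so that exactly one frame is produced for the window.
    fbank = logfbank(window, samplerate=sr, winlen=win_s, winstep=win_s,
                     nfilt=n_filters, nfft=2048)
    ceps = mfcc(window, samplerate=sr, winlen=win_s, winstep=win_s,
                numcep=n_filters, nfilt=n_filters, nfft=2048)
    return fbank[0], ceps[0]                   # 1-D feature vectors

sr = 16000
window = np.random.randn(int(0.025 * sr))      # stand-in for a 25 ms window
spectral, cepstral = short_term_features(window, sr)
print(spectral.shape, cepstral.shape)          # (128,) (128,)
```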
Figure 4 displays a sample of the audio waveform and features for the utterance of the vowel ‘AH’, with a segmented time frame from the middle of the audio: (a) and (b) present the raw audio for vowel segment and the middle window, respectively, whilst (c), (d), (e), and (f) represent frequency transform, filter bank energies, log filter bank energies, and MFCC of the same audio waveform, respectively.
Long-term features are representations of the complete speech utterances obtained through the temporal modelling of a sequence of short-term features. Classification models based on such long-term feature modelling often overlook local boundaries of linguistic entities, including phonemes, which can be particularly crucial for accent classification, as the articulation of specific phonemes is significantly influenced by the social background of speakers. Figure 5b–d illustrate examples of the long-term sequence of spectral filter bank energies, spectral log filter bank energies, and cepstral MFCC, respectively, extracted from the raw audio of a speech sample in Figure 5a, with the intensity of the blue colour indicating the magnitude.
Initially, the speech samples are segmented into individual phonemes using the Montreal forced alignment model before the audio segments corresponding to selected vowels are transformed into their corresponding spectral and cepstral features representation. Time windows of multiple durations and varying resolutions have been considered for the segmentation and feature representations. These features have been individually fed into a Support Vector Machine (SVM) classifier [29] for native language identification. Performances of the different features, effects of time window duration, and resolutions have been compared. Additionally, the importance of various ranges of frequency filters in spectral features concerning the native language categories has been compared using statistical metrics.
Performance comparison between the log filter bank (spectral) and MFCC (cepstral) representations of selected time windows from vowel segments has been performed by feeding the representations as features to a Support Vector Machine (SVM) model for the native language identification task. The SVM model has been trained on 80% of the dataset, with 20% of the dataset reserved for testing, to compare the performance of the different features. The length of the time window used for extracting features from the middle of the vowel segments has also been varied to analyze the impact of window duration on accent classification accuracy. In particular, window lengths of 12.5, 25, and 50 milliseconds have been taken from the middle of the segmented vowel utterances before extracting log filter bank features from the segments and feeding them into the SVM classifier for the accent classification task. The number of frequency bins for the filter bank energies and the number of cepstral coefficients for the MFCC have also been varied to analyze the impact of frequency resolution on accent classification; specifically, 32, 64, 128, and 256 filters for the log filter bank and the same numbers of coefficients for the MFCC have been used to compare the effect of frequency resolution for each time frame length. Additionally, individual filters within the log filter bank features and cepstral coefficients within the MFCC have been analyzed in terms of their relative importance for native language classification, using mutual information as the statistical score.
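The following sketch outlines this evaluation loop under stated assumptions: features are stacked row-wise in X with native-language labels y (synthetic data here), an SVM is trained on an 80/20 split as in the paper, and mutual information scores the individual filters. It is an illustrative scikit-learn example, not the authors' released code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_selection import mutual_info_classif

# X: one row of 128 log filter bank energies per vowel segment (synthetic data
# stands in for the real features); y: native-language label per segment.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))
y = rng.integers(0, 5, size=1000)            # 5 native-language classes

# 80/20 split, as used for the segment-level feature comparison.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = SVC(kernel="rbf")                      # kernel choice is compared later in the paper
clf.fit(X_train, y_train)
print("segment-level accuracy:", clf.score(X_test, y_test))

# Mutual information between each filter and the accent label, used to rank
# the importance of individual filters.
mi = mutual_info_classif(X_train, y_train, random_state=0)
print("top 5 filters by mutual information:", np.argsort(mi)[::-1][:5])
```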

3.3. Accent Classification Model

Finally, speaker-level classification accuracy has also been reported for each vowel by majority voting over the predictions for the speaker's individual vowel segments, using the short-term features found to perform best in the feature comparison as classifier input. Twelve speakers from each native language have been reserved as test speakers to report the speaker-level accuracy. The speaker-level model tags the accent as the statistical mode of the predictions for all of a speaker's vowel segments from two SVM classifiers: one using all 128 log filter bank filters and another using only the first 43 filters of the 128-filter resolution. This classification using majority voting represents the proposed classification model. Since most benchmark accent classification models have used 13 MFCC coefficients [30,31,32], the results have been compared with the classification accuracy obtained using 13 MFCC features as input to a similar SVM classifier in order to evaluate the proposed model. Moreover, multiple kernel functions for the SVM classifier, including linear, radial basis function (RBF), and high-degree polynomial kernels, have also been compared in terms of speaker-level accuracy. Figure 6 illustrates the method adopted to compare the speech features for the accent classification task.
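A minimal sketch of this speaker-level decision rule is shown below, assuming two fitted SVM classifiers (all 128 filters and the lowest 43 filters) and a per-speaker feature matrix; the function and variable names are hypothetical and the data are synthetic.

```python
from collections import Counter
import numpy as np
from sklearn.svm import SVC

def speaker_label(segs_full, segs_low, svm_full, svm_low):
    """Majority vote over segment-level predictions from two classifiers:
    one fed all 128 log filter bank filters, the other only the lowest 43."""
    votes = list(svm_full.predict(segs_full)) + list(svm_low.predict(segs_low))
    return Counter(votes).most_common(1)[0][0]

# Self-contained demo with synthetic data standing in for the real features.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 128)), rng.integers(0, 5, size=500)
svm_full = SVC(kernel="poly", degree=12).fit(X, y)           # all 128 filters
svm_low = SVC(kernel="poly", degree=12).fit(X[:, :43], y)    # lowest 43 filters

X_speaker = rng.normal(size=(30, 128))       # one test speaker's vowel segments
print(speaker_label(X_speaker, X_speaker[:, :43], svm_full, svm_low))
```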

4. Results and Discussion

Section 4 has been organized into three subsections. Section 4.1 compares the time–frequency resolutions for accent profiling, whilst Section 4.2 evaluates different spectral ranges for accent discrimination. The last sub-section (Section 4.3) compares multiple kernels for the SVM classifier applied to accent classification and reports the speaker-level accent classification results.

4.1. Time and Frequency Resolution Analysis

Figure 7 shows the average accuracies of the native language identification task across all five vowels for different resolutions of both types of features: log filter banks and MFCC features computed from multiple durations of time frames. SVM has been used as the classification model for the accent classification task. Accuracies are presented as bar charts as well as in tabular form, with accuracy values ranging between 0% and 100%. The vertical axis represents accuracies, with the bar colours representing either the log filter bank or MFCC features. The duration of time frames (12.5 ms, 25 ms, or 50 ms) and the number of filters or cepstral coefficients (32, 64, 128, or 256 filters/cepstral coefficients) for the log filter bank and MFCC features, respectively, are presented on the horizontal axes. The variations in the number of filters/coefficients have been presented for each time window duration.
It can be seen that using the log filter bank features as input to the SVM classification model generally gives higher accuracy as compared to using the MFCC features as input for all lengths of time frames. The highest average accuracy of 54.3% is achieved using log filter bank features with 128 filters on a 25 ms time window, in contrast to using MFCC with the same number of coefficients and window time frame, which gives an average accuracy of 53.5%. This indicates the benefits of using spectral features from a single time frame as input for the classification model for accent profiling or speaker native language identification task, in contrast to most speech recognition models, which perform better using cepstral features [33]. The higher accuracy using the log filter bank can be attributed to the fact that spectral representation contains more information as compared to cepstral coefficients. Non-smooth spectral elements representing glottal information, such as pitch, are useful for accent identification. On the other hand, disregarding speaker characteristic variability benefits speech recognition models.

4.2. Comparison of Spectral Ranges

Comparing the effect of different time frames for a given number of filters indicates that the 25 ms window gives the highest accuracy of 54.3%, using 128 log filter bank filters. In contrast, with 32 and 64 filters, the highest accuracies of 45.0% and 53.8%, respectively, are achieved with a shorter 12.5 ms window, whilst with 256 filters, the highest accuracy of 54.0% is achieved with a longer 50 ms window. A longer time window therefore requires a higher frequency resolution, and hence more filters, for improved accent discrimination, whereas a shorter window requires a lower frequency resolution and fewer filters.
Figure 8 shows accuracies for the same analysis but calculated separately for the five vowels, using the log filter bank features only. The same general trend as for the average accuracy is observed for each vowel individually, which confirms the finding on the relationship between window duration and spectral resolution: a shorter time frame requires a lower frequency resolution and vice versa. Two of the highest segment-level accuracies across the vowels are achieved using the 128 log filter bank energies from a 25 ms window (55.2%) and the 256 filters from a 50 ms window (55.5%). For the AE, AH, IH, IY, and UW vowels, respectively, the best accuracies are 55.2%, 54.6%, 54.3%, 53.5%, and 53.4% using the 128 filters from a 25 ms window, and 55.5%, 53.9%, 53.5%, 53.6%, and 53.6% using the 256 filters from a 50 ms window.
The relative importance of the different banks of the spectral filters is presented in Figure 9a,b for the 64 and 128 filter bins of the log filter bank, respectively, using mutual information with accent categories. The importance of the different filter bins, based on mutual information, is evenly distributed among the 64 filter bins of the log filter bank; however, the lower bins have significant importance at the higher frequency resolution of 128 filter bins. This implies that as the filter resolution is increased with the lower filters capturing energy from a more specific or narrower frequency range, they become significantly more discriminating for the native accent identification task. This also explains why the higher frequency resolution achieves significantly higher accuracies for all vowels.
To further validate the statistical findings regarding the importance of the filter ranges, an SVM model has been trained and tested with different subsets of the 128 log filter bank filters as input, and their performances compared against each other. The 128 filters were divided into the lowest, middle, and highest ranges and fed separately as input to the SVM classifier for the native language identification task. Table 4 presents the accuracies of the SVM classifier using the different filter ranges individually for each vowel. The SVM classifier gives the highest accuracy when fed with the lowest filter range for all vowels, followed by the middle filter range, with the lowest accuracy obtained from the highest filter range. The average accuracies are 52.8%, 41.8%, and 37.1% for the lowest, middle, and highest filter ranges, respectively. This is significant, as merely using the lowest range of filters provides 52.8% accuracy, whereas using all 128 filters only increases the average accuracy by 1.5%, to 54.3%.
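A short sketch of this range-wise comparison, under the same synthetic-data assumption as the earlier SVM example, is given below; the column slices follow the ranges reported in Table 4.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))             # stand-in for 128 log filter bank energies
y = rng.integers(0, 5, size=1000)            # 5 native-language classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# One SVM per filter range (lowest, middle, and highest portions of the 128 filters).
ranges = {"0-43": slice(0, 43), "43-86": slice(43, 86), "86-128": slice(86, 128)}
for name, cols in ranges.items():
    clf = SVC(kernel="rbf").fit(X_tr[:, cols], y_tr)
    print(name, clf.score(X_te[:, cols], y_te))
```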

4.3. Speaker-Level Accent Classification

Table 5 reports the speaker-level classification accuracies for all the vowels with the different kernel functions considered for the SVM classifier [34]. The vowel AE gives the highest accuracy among the vowels irrespective of the kernel used, with an average accuracy of 65.4%, whilst the lowest average accuracy is obtained for the vowel UW. Comparing the different kernels, both polynomial kernels, with degrees 9 and 12, give considerably higher accuracies than the linear and RBF kernels. The polynomial kernel of degree 9 gives the highest average accuracy of 64.7% and achieves the highest individual vowel accuracies among the different kernels, except for the vowel UW. It is followed by the polynomial kernel of degree 12, which gives an average accuracy of 63.7%.
Finally, speaker-level accuracies have been reported using the proposed majority voting over SVM classifiers with a degree 12 polynomial kernel. Although the degree 9 polynomial achieves a higher average accuracy, the degree 12 polynomial achieves above 60% accuracy for every vowel; hence, the speaker-level results have been reported using the degree 12 polynomial kernel. Classification accuracy using the proposed method based on log filter bank ranges has been compared with the accuracy obtained by majority voting over predictions from a similar SVM classifier using MFCC features. Two SVMs, one using all 128 log filter bank filters and another using only the first 43 filters, have been combined through majority voting to give the proposed model. Table 6 compares the speaker-level accuracy of the proposed model against an SVM using the 13 MFCC features commonly used as benchmark features for the same dataset [30,31]. The proposed model significantly outperforms the 13 MFCC features for all vowels in the considered dataset, achieving 63.3% accuracy for 'AH' and 65% or higher for all other vowels, whereas the accuracies of the model utilizing the 13 MFCC features as input range from 31.7% to 36.7% only.
F-scores and confusion matrices have been reported for the native language identification of the 12 test speakers per class using the SVM classifier. The performance measures across the different native languages and vowels aim to analyze the discriminative nature of each vowel across the native languages of India and to identify confusing pairs. Table 7 presents the F-score for each native language for all five vowels individually. The model consistently achieves the highest scores for Telugu native speakers across all vowels, with an average score of 82.8%, followed by Hindi, with an average score of 69.1%. The lowest score is for Malayalam native speakers, with an average score of only 17.8%. For a given native language, the F-scores do not vary significantly across vowels; however, 'AE' generally gives the highest score, whilst 'UW' gives the lowest, for all native languages.
Figure 10a–e present the confusion matrices for the vowels 'AE', 'AH', 'IH', 'IY', and 'UW' for the 12 test speakers per native language. The confusion matrices are colour-coded from light blue to dark blue, with darker blue representing higher counts. All vowels show similar confusion pairs between native languages, with the most significant being the misidentification of Malayalam speakers as Hindi speakers and of Tamil speakers as Kannada speakers.

5. Conclusions

There has been significant interest from the research community in profiling speakers through their speech, with different features and models proposed for the task. This paper contributes to this active research area by comparing the efficacy of short-term spectral and cepstral features for the native language identification of Indian speakers. Both spectral and cepstral features have been extracted from a single time window at the middle of vowel segments, and the impact of the window duration and the number of frequency filters used to compute the spectral energies has been analyzed specifically for the accent classification task using an SVM classifier. Interestingly, it has been found that spectral features, i.e., the log of the filter bank energies, perform better than MFCC or cepstral features for the accent classification task, in contrast to other speech-related tasks, for which cepstral features have generally proven more effective.
Log filter bank features with 128 filters, computed from a 25-millisecond time window, achieve the highest classification accuracy of 54.3%. However, as the number of frequency filters increases, the optimal time window duration also increases; equivalently, using a larger time window necessitates a larger number of frequency filters to optimize accuracy. Of the 128 log filter bank filters, the lower range of filters is most relevant to the native language categories, with the lowest filter range alone giving a classification accuracy of 52.8%, compared with 54.3% when using all 128 filters. Finally, majority voting over SVM predictions using the complete 128 filter banks and the first 43 filters of the 128 filter banks achieves around 65% accuracy for all vowels, which is significantly higher than the accuracy obtained using the lower 13 MFCC coefficients normally used for speech classification models.
Additionally, multiple kernel functions for the SVM have been compared in terms of speaker-level classification accuracy, and it has been shown that higher-degree polynomial kernels are considerably better than linear and RBF kernels. Using a polynomial of degree 12 for the SVM kernel provides accuracies of 65%, 63%, 65%, 65%, and 60% for AE, AH, IH, IY, and UW, respectively, giving an average accuracy of 63.7% over all vowels. With the accent classification task considering only five native Indian languages, Hindi, Kannada, Malayalam, Telugu, and Tamil, it has been shown that the misidentification of Malayalam speakers as Hindi speakers and of Tamil speakers as Kannada speakers represents the most significant confusing pairs.
It is noteworthy that this research has considered static spectral characteristics from the midpoint of the vowel duration while disregarding the complete utterance, and consequently, temporal transitions of the spectral features have been disregarded. Hence, a logical future direction for this research is to analyze spectral characteristics with long-term temporal modelling. Additionally, the results have been reported for a dataset of English speech by Indian speakers, which might not be valid for different languages and speaker backgrounds. Hence, another direction for the future can be to conduct a similar analysis for multiple datasets with other languages and speaker backgrounds.

Author Contributions

Conceptualization, M.A.H. and P.E.A.; Methodology, M.A.H. and J.S.; Software, M.A.H. and P.E.A.; Formal analysis, M.A.H. and J.S.; Investigation, M.A.H. and J.S.; Resources, P.E.A.; Data curation, M.A.H.; Writing—original draft, M.A.H.; Writing—review & editing, P.E.A.; Visualization, M.A.H. and J.S.; Supervision, P.E.A.; Project administration, P.E.A.; Funding acquisition, P.E.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the Universiti Brunei Darussalam Research Grant UBD//RSCH/URC/NIG/3.0/2022/002.

Data Availability Statement

No new data were created or analyzed in this study. The code for reproducing the analysis results can be found at ‘https://github.com/mohammadalihumayun/vowel-acoustics/tree/main’ (accessed on 6 September 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Deng, L.; Yu, D.; Acero, A. Structured speech modeling. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1492–1503. [Google Scholar] [CrossRef]
  2. Dehak, N.; Kenny, P.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-End Factor Analysis For Speaker Verification. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 788–798. [Google Scholar] [CrossRef]
  3. Snyder, D.; Garcia-Romero, D.; Povey, D.; Khudanpur, S. Deep neural network embeddings for text-independent speaker verification. In Proceedings of the Annual Conference of the International Speech Communication Association Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 999–1003. [Google Scholar]
  4. Flandrin, P.; Chassande-Mottin, É.; Auger, F. Uncertainty and spectrogram geometry. In Proceedings of the European Signal Processing Conference, Bucharest, Romania, 27–31 August 2012. [Google Scholar]
  5. Rida, I.; Al-Maadeed, N.; Al-Maadeed, S.; Bakshi, S. A comprehensive overview of feature representation for biometric recognition. Multimed. Tools Appl. 2020, 79, 4867–4890. [Google Scholar] [CrossRef]
  6. Watts, C.R.; Awan, S.N. Use of spectral/cepstral analyses for differentiating normal from hypofunctional voices in sustained vowel and continuous speech contexts. J. Speech Lang. Hear. Res. 2011, 54, 1525–1537. [Google Scholar] [CrossRef] [PubMed]
  7. Shu, M.; Jiang, J.J.; Willey, M. The Effect of Moving Window on Acoustic Analysis. J. Voice 2016, 30, 5–10. [Google Scholar] [CrossRef] [PubMed]
  8. Paliwal, K.; Wojcicki, K. Effect of analysis window duration on speech intelligibility. IEEE Signal Process. Lett. 2008, 15, 785–788. [Google Scholar] [CrossRef]
  9. Humayun, M.A.; Yassin, H.; Abas, P.E. Spatial position constraint for unsupervised learning of speech representations. PeerJ Comput. Sci. 2021, 7, e650. [Google Scholar] [CrossRef] [PubMed]
  10. Huang, C.; Chen, T.; Li, S.; Chang, E.; Zhou, J. Analysis of speaker variability. In Proceedings of the Eurospeech 2001—Scandinavia—7th European Conference on Speech Communication and Technology, Aalborg, Denmark, 3–7 September 2001. [Google Scholar]
  11. Benzeghiba, M.; De Mori, R.; Deroo, O.; Dupont, S.; Erbes, T.; Jouvet, D.; Fissore, L.; Laface, P.; Mertins, A.; Ris, C.; et al. Automatic speech recognition and speech variability: A review. Speech Commun. 2007, 49, 763–786. [Google Scholar] [CrossRef]
  12. Huang, C.; Chen, T.; Chang, E. Accent Issues in Large Vocabulary Continuous Speech Recognition. Int. J. Speech Technol. 2004, 7, 141–153. [Google Scholar] [CrossRef]
  13. Najafian, M.; Russell, M. Automatic accent identification as an analytical tool for accent robust automatic speech recognition. Speech Commun. 2020, 122, 44–55. [Google Scholar] [CrossRef]
  14. Suzuki, M.; Minematsu, N.; Luo, D.; Hirose, K. Sub-structure-based estimation of pronunciation proficiency and classification of learners. In Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU, Merano, Italy, 13–17 December 2009; pp. 574–579. [Google Scholar]
  15. Humayun, M.A.; Yassin, H.; Abas, P.E. Native language identification for Indian-speakers by an ensemble of phoneme-specific, and text-independent convolutions. Speech Commun. 2022, 139, 92–101. [Google Scholar] [CrossRef]
  16. Khurana, S.; Ali, A. QCRI advanced transcription system (QATS) for the Arabic Multi-Dialect Broadcast media recognition: MGB-2 challenge. In Proceedings of the 2016 IEEE Workshop on Spoken Language Technology, SLT 2016—Proceedings, San Diego, CA, USA, 13–16 December 2016; pp. 292–298. [Google Scholar]
  17. McAuliffe, M.; Socolof, M.; Mihuc, S.; Wagner, M.; Sonderegger, M. Montreal forced aligner: Trainable text-speech alignment using kaldi. In Proceedings of the Annual Conference of the International Speech Communication Association Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 498–502. [Google Scholar]
  18. Najafian, M.; Safavi, S.; Weber, P.; Russell, M. Identification of British English regional accents using fusion of i-vector and multi-accent phonotactic systems. In Proceedings of the Odyssey 2016: Speaker and Language Recognition Workshop, Bilbao, Spain, 21–24 June 2016; pp. 132–139. [Google Scholar]
  19. Brown, G. Automatic accent recognition systems and the effects of data on performance. In Proceedings of the Odyssey 2016: The Speaker and Language Recognition Workshop, Bilbao, Spain, 21–24 June 2016; pp. 94–100. [Google Scholar]
  20. Rizwan, M.; Anderson, D.V. A weighted accent classification using multiple words. Neurocomputing 2018, 277, 120–128. [Google Scholar] [CrossRef]
  21. Ubale, R.; Ramanarayanan, V.; Qian, Y.; Evanini, K.; Leong, C.W.; Lee, C.M. Native Language Identification from Raw Waveforms Using Deep Convolutional Neural Networks with Attentive Pooling. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019—Proceeding, Singapore, 14–18 December 2019; pp. 403–410. [Google Scholar]
  22. Ubale, R.; Qian, Y.; Evanini, K. Exploring End-To-End Attention-Based Neural Networks for Native Language Identification. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop SLT 2018—Proceeding, Athens, Greece, 18–21 December 2018; pp. 84–91. [Google Scholar]
  23. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-Vectors: Robust DNN Embeddings for Speaker Recognition. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing—Proceeding, Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar]
  24. Kalluri, S.B.; Vijayasenan, D.; Ganapathy, S.M.R.R.; Krishnan, P. NISP: A Multi-lingual Multi-accent Dataset for Speaker Profiling. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021. [Google Scholar]
  25. Kalluri, S.B.; Vijayasenan, D.; Ganapathy, S. Automatic speaker profiling from short duration speech data. Speech Commun. 2020, 121, 16–28. [Google Scholar] [CrossRef]
  26. Jurafsky, D.; Martin, J. Speech and Language Processing, 2nd ed.; Prentice Hall: Hoboken, NJ, USA, 2008. [Google Scholar]
  27. Humayun, M.A.; Yassin, H.; Abas, P.E. Dialect classification using acoustic and linguistic features in Arabic speech. IAES Int. J. Artif. Intell. 2023, 12, 739. [Google Scholar] [CrossRef]
  28. Milani, M.G.M.; Abas, P.E.; De Silva, L.C.; Nanayakkara, N.D. Abnormal heart sound classification using phonocardiography signals. Smart Health 2021, 21, 100194. [Google Scholar] [CrossRef]
  29. Ramashini, M.; Abas, P.E.; Mohanchandra, K.; De Silva, L.C. Robust cepstral feature for bird sound classification. Int. J. Electr. Comput. Eng. 2022, 12, 1477. [Google Scholar] [CrossRef]
  30. Brown, G.; Wormald, J. Automatic sociophonetics: Exploring corpora with a forensic accent recognition system. J. Acoust. Soc. Am. 2017, 142, 422–433. [Google Scholar] [CrossRef]
  31. Upadhyay, R.; Lui, S. Foreign English Accent Classification Using Deep Belief Networks. In Proceedings of the 12th IEEE International Conference on Semantic Computing, ICSC 2018, Laguna Hills, CA, USA, 31 January–2 February 2018. [Google Scholar]
  32. Saleem, S.; Subhan, F.; Naseer, N.; Bais, A.; Imtiaz, A. Forensic speaker recognition: A new method based on extracting accent and language information from short utterances. Forensic. Sci. Int. Digit. Investig. 2020, 34, 300982. [Google Scholar] [CrossRef]
  33. Zheng, F.; Zhang, G.; Song, Z. Comparison of different implementations of MFCC. J. Comput. Sci. Technol. 2001, 16, 582–589. [Google Scholar] [CrossRef]
  34. Karthikeyan, V.; Suja Priyadharsini, S. Adaptive Boosted Random Forest-Support Vector Machine Based Classification Scheme for Speaker Identification. Appl. Soft Comput. 2022, 131, 109826. [Google Scholar]
Figure 1. Speaker profiling model.
Figure 2. Waves (a,c,e,g,i) and spectra (b,d,f,h,j) of vowels.
Figure 3. Log filter bank (a,c,e,g,i) and MFCC (b,d,f,h,j).
Figure 4. Speech segment and corresponding short-term features. (a) Raw audio; (b) spectral extraction window; (c) frequency spectrum; (d) frequency bin energies; (e) log filter energies; (f) MFCC.
Figure 5. Speech segment and its corresponding long-term features. (a) Raw audio, (b) mel-spectrogram (FBANK), (c) LOG-FBANK, (d) MFCC.
Figure 6. Proposed methodology.
Figure 7. Average accuracy for different time durations, features, and feature resolutions.
Figure 8. Vowel-specific accuracies across window duration, features, and resolutions.
Figure 9. The statistical importance of spectral filters. (a) Mutual information for 64 filters; (b) mutual information for 128 filters.
Figure 10. Confusion matrices for vowels (a) AE, (b) AH, (c) IH, (d) IY, and (e) UW for different native Indian languages.
Table 1. Speech features used for speaker profiling.

Reference | Dataset | Features | Acc.
[21] | TOEFL | Raw Wave, I-vector | 86.05%
[22] | TOEFL | Log Filter bank, I-vector | 83.32%
[20] | TIMIT | MFCC, deltas | 77.88%
[19] | AISEB | MFCC, ACCDIST | 86.70%
[18] | ABI | PPRLM, I-vector | 84.87%
Table 2. Speaker distribution across native languages.

Native Language | Speakers
Hindi | 103
Kannada | 60
Malayalam | 60
Telugu | 57
Tamil | 65
Total Speakers | 345
Table 3. Vowel frequency.

Vowel | Example Usage | Instances
AE | Bat | 37,498
AH | Butt | 45,720
IH | Bit | 45,238
IY | Beat | 43,856
UW | Boot | 32,842
Table 4. Accuracy (%) for filter ranges.

Vowel | Filters 0–43 | Filters 43–86 | Filters 86–128
AE | 53.6 | 42.5 | 37.1
AH | 53.5 | 41.6 | 37.8
IH | 51.8 | 42.8 | 37.7
IY | 51.4 | 42.2 | 37.0
UW | 53.4 | 39.8 | 35.9
Average | 52.8 | 41.8 | 37.1
Table 5. Accuracy (%) for SVM kernels.

Vowel | Linear | RBF | Poly_9 | Poly_12
AE | 61.7 | 66.6 | 68.3 | 65.0
AH | 56.7 | 60.0 | 65.0 | 63.3
IH | 60.0 | 60.0 | 65.0 | 65.0
IY | 58.3 | 58.3 | 66.7 | 65.0
UW | 56.7 | 60.0 | 58.3 | 60.0
Average | 58.7 | 61.0 | 64.7 | 63.7
Table 6. Accuracy (%) of the proposed model vs. SVM with MFCC (as the benchmark).

Vowel | Proposed | MFCC
AE | 65.0 | 35.0
AH | 63.3 | 33.3
IH | 65.0 | 36.7
IY | 66.7 | 33.3
UW | 65.0 | 31.7
Table 7. F-score (%) across native languages.

Vowel | Hin | Kan | Mal | Tam | Tel
AE | 68.7 | 66.7 | 28.6 | 57.1 | 84.6
AH | 68.7 | 64.3 | 15.4 | 50.0 | 88.9
IH | 70.6 | 66.7 | 15.4 | 57.1 | 88.0
IY | 68.7 | 72.0 | 15.4 | 69.6 | 74.1
UW | 68.6 | 61.5 | 14.3 | 54.5 | 78.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
