Search Results (18)

Search Parameters:
Keywords = acoustic tokenizer

22 pages, 3293 KiB  
Article
Phonetically Based Corpora for Anglicisms: A Tijuana–San Diego Contact Outcome
by Ruben Roberto Peralta-Rivera, Carlos Ivanhoe Gil-Burgoin and Norma Esthela Valenzuela-Miranda
Languages 2025, 10(6), 143; https://doi.org/10.3390/languages10060143 - 16 Jun 2025
Viewed by 1096
Abstract
Research in Loanword Phonology has extensively examined the adaptation processes of Anglicisms into recipient languages. In the Tijuana–San Diego border region, where English and Spanish have long coexisted, Anglicisms exhibit two main phonetic patterns: some structures display Spanish phonetic properties, while others preserve English phonetic features. This study analyzes 131 vowel tokens drawn from spontaneous conversations with 28 bilingual speakers in Tijuana, recruited via the sociolinguistic ‘friend-of-a-friend’ approach. Specifically, it focuses on monosyllabic Anglicisms with monophthongs by examining the F1 and F2 values using Praat. The results were compared with theoretical vowel targets in English and Spanish through Euclidean distance analysis. Dispersion plots generated in R further illustrate the acoustic distribution of vowel realizations. The results reveal that some vowels closely match Spanish targets, others align with English targets, and several occupy intermediate acoustic spaces. Based on these patterns, the study proposes two phonetically based corpora—Phonetically Adapted Anglicisms (PAA) and Phonetically Non-Adapted Anglicisms (PNAA)—to capture the nature of Anglicisms in this contact setting. This research offers an empirically grounded basis for cross-dialectal comparison and language contact studies from a phonetically based approach. Full article
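
The Euclidean-distance step is simple enough to sketch: each token's (F1, F2) pair is compared against theoretical targets, and the nearest target wins. A minimal illustration in Python; the target values below are hypothetical, not the paper's.

```python
import math

# Illustrative monophthong targets in Hz (hypothetical values, not the paper's):
# theoretical F1/F2 targets for a Spanish /i/ and an English /ɪ/.
TARGETS = {
    "spanish_i": (280.0, 2250.0),
    "english_ih": (400.0, 1900.0),
}

def euclidean_distance(token, target):
    """Distance between a token and a target in F1/F2 space (Hz)."""
    return math.hypot(token[0] - target[0], token[1] - target[1])

def classify(token_f1_f2):
    """Label a vowel token by its nearest theoretical target."""
    return min(TARGETS, key=lambda name: euclidean_distance(token_f1_f2, TARGETS[name]))

# A measured token with F1 = 350 Hz, F2 = 2000 Hz:
print(classify((350.0, 2000.0)))  # -> "english_ih"
```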

18 pages, 2018 KiB  
Article
Adapting a Large-Scale Transformer Model to Decode Chicken Vocalizations: A Non-Invasive AI Approach to Poultry Welfare
by Suresh Neethirajan
AI 2025, 6(4), 65; https://doi.org/10.3390/ai6040065 - 25 Mar 2025
Cited by 2 | Viewed by 1362
Abstract
Natural Language Processing (NLP) and advanced acoustic analysis have opened new avenues in animal welfare research by decoding the vocal signals of farm animals. This study explored the feasibility of adapting a large-scale Transformer-based model, OpenAI’s Whisper, originally developed for human speech recognition, to decode chicken vocalizations. Our primary objective was to determine whether Whisper could effectively identify acoustic patterns associated with emotional and physiological states in poultry, thereby enabling real-time, non-invasive welfare assessments. To achieve this, chicken vocal data were recorded under diverse experimental conditions, including healthy versus unhealthy birds, pre-stress versus post-stress scenarios, and quiet versus noisy environments. The audio recordings were processed through Whisper, producing text-like outputs. Although these outputs did not represent literal translations of chicken vocalizations into human language, they exhibited consistent patterns in token sequences and sentiment indicators strongly correlated with recognized poultry stressors and welfare conditions. Sentiment analysis using standard NLP tools (e.g., polarity scoring) identified notable shifts in “negative” and “positive” scores that corresponded closely with documented changes in vocal intensity associated with stress events and altered physiological states. Despite the inherent domain mismatch—given Whisper’s original training on human speech—the findings clearly demonstrate the model’s capability to reliably capture acoustic features significant to poultry welfare. Recognizing the limitations associated with applying English-oriented sentiment tools, this study proposes future multimodal validation frameworks incorporating physiological sensors and behavioral observations to further strengthen biological interpretability. To our knowledge, this work provides the first demonstration that Transformer-based architectures, even without species-specific fine-tuning, can effectively encode meaningful acoustic patterns from animal vocalizations, highlighting their transformative potential for advancing productivity, sustainability, and welfare practices in precision poultry farming. Full article
(This article belongs to the Special Issue Artificial Intelligence in Agriculture)
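
For readers who want to try the general recipe, here is a minimal sketch assuming the openai-whisper and textblob packages; the file name is hypothetical, and the text-like output is not a literal translation of the calls.

```python
import whisper
from textblob import TextBlob

model = whisper.load_model("base")                     # pre-trained, no fine-tuning
result = model.transcribe("chicken_pen_recording.wav")  # hypothetical recording
text_like_output = result["text"]

# Polarity in [-1, 1]; shifts in this score are the kind of signal the study
# correlates with documented stress events.
polarity = TextBlob(text_like_output).sentiment.polarity
print(text_like_output, polarity)
```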

36 pages, 4793 KiB  
Article
Cross-Regional Patterns of Obstruent Voicing and Gemination: The Case of Roman and Veneto Italian
by Angelo Dian, John Hajek and Janet Fletcher
Languages 2024, 9(12), 383; https://doi.org/10.3390/languages9120383 - 20 Dec 2024
Viewed by 1779
Abstract
Italian has a length contrast in its series of voiced and voiceless obstruents while also presenting phonetic differences across regional varieties. Northern varieties of the language, including Veneto Italian (VI), are described as maintaining the voicing contrast but, in some cases, not the length contrast. In central and southern varieties, the opposite trend may occur. For instance, Roman Italian (RI) is reported to optionally pre-voice intervocalic voiceless singleton obstruents whilst also maintaining the length contrast for this consonant class. This study looks at the acoustic realization of selected obstruents in VI and RI and investigates (a) prevoicing patterns and (b) the effects and interactions of regional variety, gemination, and (phonological and phonetic) voicing on consonant (C) and preceding-vowel (V) durations, as well as the ratio between the two (C/V), with a focus on that particular measure. An acoustic phonetic analysis is conducted on 3703 tokens from six speakers from each variety, producing eight repetitions of 40 real CV́C(C)V and CVC(C)V́CV words embedded in carrier sentences, with /p, pp, t, tt, k, kk, b, bb, d, dd, ɡ, ɡɡ, f, ff, v, vv, t∫, tt∫, dʒ, ddʒ/ as the target intervocalic consonants. The results show that both VI and RI speakers produce geminates, yielding high C/V ratios in both varieties, although there are cross-regional differences in the realization of singletons. On the one hand, RI speakers tend to pre-voice voiceless singletons and produce overall shorter C durations and lower C/V ratios for these consonants. On the other hand, VI speakers produce longer C durations and higher C/V ratios for all voiceless singletons, triggering some overlap between the C length categories, which results in partial degemination through singleton lengthening, although only for voiceless obstruents. The implications of a trading relationship between phonetic voicing and duration of obstruents in Italian gemination are discussed. Full article
(This article belongs to the Special Issue Speech Variation in Contemporary Italian)
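
The study's central measure, the C/V ratio, is the consonant duration divided by the preceding-vowel duration; geminates push the ratio up. A trivial sketch with hypothetical durations:

```python
def cv_ratio(c_duration_ms, v_duration_ms):
    """Consonant-to-preceding-vowel duration ratio; geminates yield high values."""
    return c_duration_ms / v_duration_ms

# Hypothetical durations: a geminate /tt/ after a short stressed vowel
# versus a singleton /t/ after a longer one.
print(cv_ratio(180.0, 80.0))   # ~2.25, typical geminate territory
print(cv_ratio(90.0, 110.0))   # ~0.82, singleton territory
```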

23 pages, 3561 KiB  
Article
“It’s a Bit Tricky, Isn’t It?”—An Acoustic Study of Contextual Variation in /ɪ/ in the Conversational Speech of Young People from Perth
by Gerard Docherty, Paul Foulkes and Simon Gonzalez
Languages 2024, 9(11), 343; https://doi.org/10.3390/languages9110343 - 31 Oct 2024
Viewed by 1230
Abstract
This study presents an acoustic analysis of vowel realisations in contexts where, in Australian English, a historical contrast between unstressed /ɪ/ and /ə/ has largely diminished in favour of a central schwa-like variant. The study is motivated by indications that there is greater complexity in this area of vowel variation than has been conventionally set out in the existing literature, and our goal is to shed new light by studying a dataset of conversational speech produced by 40 young speakers from Perth, WA. In doing so, we also offer some critical thoughts on the use of Wells’ lexical sets as a framework for analysis in work of this kind, in particular with reference to the treatment of items in unstressed position, and of grammatical (or function) words. The acoustic analysis focused on the realisation in F1/F2 space of a range of /ɪ/ and /ə/ variants in both accented and unaccented syllables (thus a broader approach than a focus on stressed kit vowels). For the purposes of comparison, we also analysed tokens of the fleece and happy-tensing lexical sets. Grammatical and non-grammatical words were analysed independently in order to understand the extent to which a high-frequency grammatical word such as it might contribute to the overall pattern of vowel alternation. Our findings are largely consistent with the small amount of previous work that has been carried out in this area, pointing to a continuum of realisations across a range of accented and unaccented contexts. The data suggest that the reduced historical /ɪ/ vowel encountered in unaccented syllables cannot be straightforwardly analysed as a merger with /ə/. We also highlight the way in which the grammatical word it participates in this alternation. Full article
(This article belongs to the Special Issue Advances in Australian English)
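
Comparing F1/F2 realisations across 40 speakers requires some form of speaker normalization; Lobanov z-scoring is one standard option. A generic sketch, not necessarily the authors' procedure:

```python
import statistics

def lobanov(formant_values):
    """Z-score one speaker's formant measurements (Hz -> dimensionless),
    making vowel spaces comparable across different vocal tracts."""
    mean = statistics.mean(formant_values)
    sd = statistics.stdev(formant_values)
    return [(v - mean) / sd for v in formant_values]

speaker_f1 = [340.0, 420.0, 610.0, 380.0]  # hypothetical /ɪ/ and /ə/ tokens
print(lobanov(speaker_f1))
```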

39 pages, 6630 KiB  
Article
‘No’ Dimo’ par de Botella’ y Ahora Etamo’ Al Garete’: Exploring the Intersections of Coda /s/, Place, and the Reggaetón Voice
by Derrek Powell
Languages 2024, 9(9), 292; https://doi.org/10.3390/languages9090292 - 30 Aug 2024
Viewed by 2813
Abstract
The rebranding of reggaetón towards Latin urban has been criticized for tokenizing Afro-Caribbean linguistic and cultural practices as symbolic resources recruitable by non-Caribbean artists/executives in the interest of profit. Consumers are particularly critical of an audible phonological homogeneity in the performances of ethnonationally distinct mainstream performers, framed as a form of linguistic minstrelsy popularly termed a ‘Caribbean Blaccent’ that facilitates capitalization on the genre’s popularity by tapping into the covert prestige of distinctive phonological elements of Insular Caribbean Spanish otherwise stigmatized. This work pairs acoustic analysis with quantitative statistical modeling to compare the use of lenited coronal sibilant allophones popularly considered indexical of Hispano-Caribbean origins in the spoken and sung speech of four of the genre’s top-charting female performers. A general pattern of style-shifting from interview to sung speech wherein sibilance is favored in the former and phonetic zeros in the latter is revealed. Moreover, a statistically significant increase in the incidence of the phonetic zero [∅] across time shows the most recent records to uniformly deploy near-categorical reduction independent of artists’ sociocultural and linguistic backgrounds. The results support the enregisterment of practices popularized by the genre’s San Juan-based pioneers as a stylistic resource—a reggaetón voice—for engaging the images of vernacularity sustaining and driving the contemporary, mainstream popularity of música urbana. Full article
(This article belongs to the Special Issue Interface between Sociolinguistics and Music)
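
The abstract does not name the statistical model, but the style and time effects it reports can be illustrated over coded tokens with pandas; all data below are hypothetical:

```python
import pandas as pd

# Hypothetical coded tokens: zero = 1 means coda /s/ realized as a phonetic zero.
tokens = pd.DataFrame({
    "style": ["interview", "interview", "interview", "sung", "sung", "sung", "sung", "sung"],
    "year":  [2014, 2015, 2016, 2018, 2019, 2021, 2022, 2022],
    "zero":  [0, 0, 1, 1, 0, 1, 1, 1],
})

# Style-shifting: rate of zeros by speech style.
print(tokens.groupby("style")["zero"].mean())

# Change over time within sung speech.
sung = tokens[tokens.style == "sung"]
print(sung.groupby(sung.year >= 2020)["zero"].mean())
```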

23 pages, 5891 KiB  
Article
The Role of (Re)Syllabification on Coarticulatory Nasalization: Aerodynamic Evidence from Spanish
by Ander Beristain
Languages 2024, 9(6), 219; https://doi.org/10.3390/languages9060219 - 17 Jun 2024
Viewed by 2045
Abstract
Tautosyllabic segment sequences exhibit greater gestural overlap than heterosyllabic ones. In Spanish, it is presumed that word-final consonants followed by a word-initial vowel undergo resyllabification, and generative phonology assumes that canonical CV.CV# and derived CV.C#V onsets are structurally identical. However, recent studies have not found evidence of this structural similarity in the acoustics. The current goal is to investigate anticipatory and carryover vowel nasalization patterns in tautosyllabic, heterosyllabic, and resyllabified segment sequences in Spanish. Nine native speakers of Peninsular Spanish participated in a read-aloud task. Nasal airflow data were extracted using pressure transducers connected to a vented mask. Each participant produced forty target tokens with CV.CV# (control), CVN# (tautosyllabic), CV.NV# (heterosyllabic), and CV.N#V (resyllabification) structures. Forty timepoints were obtained from each vowel to observe airflow dynamics, resulting in a total of 25,200 datapoints analyzed. Regarding anticipatory vowel nasalization, the CVN# sequence shows an earlier onset of nasalization, while CV.NV# and CV.N#V sequences illustrate parallel patterns among them. Carryover vowel nasalization exhibited greater nasal spreading than anticipatory nasalization, and vowels in CV.NV# and CV.N#V structures showed symmetrical nasalization patterns. These results imply that syllable structure affects nasal gestural overlap and that aerodynamic characteristics of vowels are unaffected across word boundaries. Full article
(This article belongs to the Special Issue Phonetics and Phonology of Ibero-Romance Languages)
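
Sampling 40 evenly spaced airflow values per vowel can be done with linear interpolation over the vowel interval; a minimal sketch with a fake signal:

```python
import numpy as np

def vowel_timepoints(airflow, fs, v_onset_s, v_offset_s, n=40):
    """Return n evenly spaced airflow values across one vowel interval."""
    t = np.arange(len(airflow)) / fs
    query = np.linspace(v_onset_s, v_offset_s, n)
    return np.interp(query, t, airflow)

fs = 11025
airflow = np.random.default_rng(0).normal(0.2, 0.05, fs)  # 1 s of fake nasal airflow
print(vowel_timepoints(airflow, fs, 0.30, 0.45).shape)     # (40,)
```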

18 pages, 3932 KiB  
Article
Phonation Patterns in Spanish Vowels: Spectral and Spectrographic Analysis
by Carolina González, Susan L. Cox and Gabrielle R. Isgar
Languages 2024, 9(6), 214; https://doi.org/10.3390/languages9060214 - 12 Jun 2024
Viewed by 2067
Abstract
This article provides a detailed examination of voice quality in word-final vowels in Spanish. The experimental task involved the pronunciation of words in two prosodic contexts by native Spanish speakers from diverse dialects. A total of 400 vowels (10 participants × 10 words × 2 contexts × 2 repetitions) were analyzed acoustically in Praat. Waveforms and spectrograms were inspected visually for voice, creak, breathy voice, and devoicing cues. In addition, the relative amplitude difference between the first two harmonics (H1–H2) was obtained via FFT spectra. The findings reveal that while creaky voice is pervasive, breathy voice is also common, and devoicing occurs in 11% of tokens. We identify multiple phonation types (up to three) within the same vowel, of which modal voice followed by breathy voice was the most common combination. While creaky voice was more frequent overall for males, modal voice tended to be more common in females. In addition, creaky voice was significantly more common at the end of higher prosodic constituents. The analysis of spectral tilt shows that H1–H2 clearly distinguishes breathy voice from modal voice in both males and females, while H1–H2 values consistently discriminate creaky and modal voice in male participants only. Full article
(This article belongs to the Special Issue Phonetics and Phonology of Ibero-Romance Languages)
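
H1–H2 is the amplitude difference in dB between the first two harmonics of an FFT spectrum, given a known f0. A bare-bones sketch (real measurements would also correct for formant influence):

```python
import numpy as np

def h1_h2(signal, fs, f0):
    """Spectral tilt: H1 minus H2 in dB from an FFT of a windowed frame."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    def amp_db(f):  # amplitude at the bin nearest frequency f
        return 20 * np.log10(spectrum[np.argmin(np.abs(freqs - f))])
    return amp_db(f0) - amp_db(2 * f0)

fs, f0 = 16000, 200.0
t = np.arange(0, 0.05, 1 / fs)
vowel = np.sin(2 * np.pi * f0 * t) + 0.3 * np.sin(2 * np.pi * 2 * f0 * t)
print(h1_h2(vowel, fs, f0))  # large positive -> breathier; near zero -> creakier
```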

15 pages, 1575 KiB  
Article
Cluster-Based Pairwise Contrastive Loss for Noise-Robust Speech Recognition
by Geon Woo Lee and Hong Kook Kim
Sensors 2024, 24(8), 2573; https://doi.org/10.3390/s24082573 - 17 Apr 2024
Viewed by 1678
Abstract
This paper addresses a joint training approach applied to a pipeline comprising speech enhancement (SE) and automatic speech recognition (ASR) models, where an acoustic tokenizer is included in the pipeline to transfer linguistic information from the ASR model to the SE model. The acoustic tokenizer takes the outputs of the ASR encoder and provides a pseudo-label through K-means clustering. To transfer the linguistic information, represented by pseudo-labels, from the acoustic tokenizer to the SE model, a cluster-based pairwise contrastive (CBPC) loss function is proposed; this self-supervised contrastive loss function is combined with an information noise contrastive estimation (InfoNCE) loss function. This combined loss function prevents the SE model from overfitting to outlier samples and represents the pronunciation variability in samples with the same pseudo-label. The effectiveness of the proposed CBPC loss function is evaluated on a noisy LibriSpeech dataset by measuring both the speech quality scores and the word error rate (WER). The experimental results reveal that the proposed joint training approach using the described CBPC loss function achieves a lower WER than the conventional joint training approaches. In addition, it is demonstrated that the speech quality scores of the SE model trained using the proposed training approach are higher than those of the standalone-SE model and SE models trained using conventional joint training approaches. An ablation study is also conducted to investigate the effects of different combinations of loss functions on the speech quality scores and WER. Here, it is revealed that the proposed CBPC loss function combined with InfoNCE contributes to a reduced WER and an increase in most of the speech quality scores. Full article
(This article belongs to the Special Issue Feature Papers in Intelligent Sensors 2024)
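
The CBPC idea (frames that share a K-means pseudo-label act as positives for the SE model's embeddings) can be sketched as a supervised-contrastive objective in PyTorch; this is a generic formulation, not the authors' exact loss:

```python
import torch
import torch.nn.functional as F

def cluster_contrastive_loss(embeddings, pseudo_labels, temperature=0.1):
    """embeddings: (N, D) SE-model frame embeddings; pseudo_labels: (N,)
    K-means cluster ids from the acoustic tokenizer."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                      # (N, N) scaled cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    pos_mask = (pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)) & ~self_mask
    # InfoNCE-style log-probability of each candidate, excluding self-similarity.
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, -1e9), dim=1, keepdim=True)
    # Average over each anchor's positives; anchors with no positive are skipped.
    has_pos = pos_mask.any(dim=1)
    loss = -(log_prob * pos_mask)[has_pos].sum(dim=1) / pos_mask[has_pos].sum(dim=1)
    return loss.mean()

emb = torch.randn(8, 64, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])      # K-means pseudo-labels
print(cluster_contrastive_loss(emb, labels))
```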

16 pages, 6324 KiB  
Article
Simultaneous High-Speed Video Laryngoscopy and Acoustic Aerodynamic Recordings during Vocal Onset of Variable Sound Pressure Level: A Preliminary Study
by Peak Woo
Bioengineering 2024, 11(4), 334; https://doi.org/10.3390/bioengineering11040334 - 29 Mar 2024
Cited by 3 | Viewed by 1541
Abstract
Voicing requires frequent starts and stops at various sound pressure levels (SPL) and frequencies. Prior investigations using rigid laryngoscopy with oral endoscopy have shown variations in the duration of the vibration delay between normal and abnormal subjects. However, these studies were not physiological because the larynx was viewed using rigid endoscopes. We adapted a method to perform high-speed naso-endoscopic video while simultaneously acquiring the sound pressure, fundamental frequency, airflow rate, and subglottic pressure. This study aimed to investigate voice onset patterns in normophonic males and females during the onset of variable SPL and correlate them with acoustic and aerodynamic data. Materials and Methods: Three healthy males and three healthy females were studied by simultaneous high-speed video laryngoscopy and recording with the production of the gesture [pa:pa:] at soft, medium, and loud voices. The fiber optic endoscope was threaded through a pneumotachograph mask for the simultaneous recording and analysis of acoustic and aerodynamic data. Results: The average increase in the sound pressure level (SPL) for the group was 15 dB, from 70 to 85 dB. The fundamental frequency increased by an average of 10 Hz. The flow was increased in two subjects, reduced in two subjects, and remained the same in two subjects as the SPL increased. There was a steady increase in the subglottic pressure from soft to loud phonation. Compared with soft-to-medium phonation, a significant increase in glottal resistance was observed with medium-to-loud phonation. Videokymogram analysis showed the onset of vibration for all voiced tokens without the need for full glottis closure. In loud phonation, there is a more rapid onset of a larger amplitude and prolonged closure of the glottal cycle; however, more cycles are required to achieve the intended SPL. There was a prolonged closed phase during loud phonation. Fast Fourier transform (FFT) analysis of the kymography waveform signal showed more significant second- and third-harmonic energy above the fundamental frequency with loud phonation. There was an increase in the adjustments in the pharynx, with the base of the tongue tilting, shortening of the vocal folds, and pharyngeal constriction. Conclusion: Voice onset occurs in all modalities, without the need for full glottal closure. There was a more significant increase in glottal resistance with loud phonation than with soft or medium phonation. Vibration analysis of the voice onset showed that more time was required during loud phonation before the oscillation stabilized to a steady state. With increasing SPL, there were significant variations in vocal tract adjustments. The most apparent change was the increase in tongue tension with posterior displacement of the epiglottis. There was an increase in pre-phonation time during loud phonation. Patterns of muscle tension dysphonia with laryngeal squeezing, shortening of the vocal folds, and epiglottis tilting with increasing loudness are features of loud phonation. These observations show that flexible high-speed video laryngoscopy can reveal phenomena that cannot be observed with rigid video laryngoscopy. An objective analysis of the digital kymography signal can be conducted in selected cases. Full article
(This article belongs to the Special Issue The Biophysics of Vocal Onset)

19 pages, 4712 KiB  
Article
Analysis and Investigation of Speaker Identification Problems Using Deep Learning Networks and the YOHO English Speech Dataset
by Nourah M. Almarshady, Adal A. Alashban and Yousef A. Alotaibi
Appl. Sci. 2023, 13(17), 9567; https://doi.org/10.3390/app13179567 - 24 Aug 2023
Cited by 8 | Viewed by 3084
Abstract
The rapid momentum of deep neural networks (DNNs) in recent years has yielded state-of-the-art performance in various machine-learning tasks using speaker identification systems. Speaker identification is based on the speech signals and the features that can be extracted from them. In this article, we propose a speaker identification system built on the developed DNN models. The system is based on the acoustic and prosodic features of the speech signal, such as pitch frequency (the vocal-cord vibration rate), energy (the loudness of speech), their derivatives, and any additional acoustic and prosodic features. Additionally, the article investigates existing recurrent neural network (RNN) models and adapts them to design a speaker identification system using the public YOHO LDC dataset. The average accuracy of the system was 91.93% in the best speaker identification experiment. Furthermore, this paper analyzes the speakers and tokens that yield major errors, helping to increase the system’s robustness with respect to feature selection and system tune-up. Full article
(This article belongs to the Special Issue Automatic Speech Signal Processing)
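
The feature-plus-RNN recipe can be sketched with librosa and PyTorch; the audio, feature set, and model sizes below are illustrative, not the authors' configuration:

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

y, sr = librosa.load(librosa.ex("trumpet"))          # stand-in for a YOHO utterance
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)        # frame-level pitch track (Hz)
rms = librosa.feature.rms(y=y)[0][: len(f0)]         # frame-level energy
feats = np.stack([f0, rms, np.gradient(f0), np.gradient(rms)], axis=1)

class SpeakerRNN(nn.Module):
    def __init__(self, n_speakers=138):               # YOHO has 138 speakers
        super().__init__()
        self.rnn = nn.LSTM(input_size=4, hidden_size=64, batch_first=True)
        self.out = nn.Linear(64, n_speakers)
    def forward(self, x):
        _, (h, _) = self.rnn(x)                        # last hidden state summarizes the utterance
        return self.out(h[-1])

logits = SpeakerRNN()(torch.tensor(feats, dtype=torch.float32).unsqueeze(0))
print(logits.shape)  # (1, 138) speaker scores
```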

14 pages, 3419 KiB  
Article
DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer
by Junxiao Yu, Zhengyuan Xu, Xu He, Jian Wang, Bin Liu, Rui Feng, Songsheng Zhu, Wei Wang and Jianqing Li
Entropy 2023, 25(1), 41; https://doi.org/10.3390/e25010041 - 26 Dec 2022
Cited by 9 | Viewed by 5044
Abstract
Text-to-speech (TTS) synthesizers have been widely used as a vital assistive tool in various fields. Traditional sequence-to-sequence (seq2seq) TTS models such as Tacotron2 use a single soft attention mechanism for encoder and decoder alignment, whose biggest shortcoming is that they incorrectly or repeatedly generate words when dealing with long sentences. They may also generate sentences with run-on or misplaced breaks regardless of punctuation marks, which causes the synthesized waveform to lack emotion and sound unnatural. In this paper, we propose an end-to-end neural generative TTS model that is based on the deep-inherited attention (DIA) mechanism along with an adjustable local-sensitive factor (LSF). The inheritance mechanism allows multiple iterations of the DIA by sharing the same training parameters, which tightens the token–frame correlation and speeds up the alignment process. In addition, the LSF is adopted to enhance the context connection by expanding the DIA concentration region. Furthermore, a multi-RNN block is used in the decoder for better acoustic feature extraction and generation. Hidden-state information driven from the multi-RNN layers is utilized for attention alignment. The collaborative work of the DIA and multi-RNN layers contributes to outperformance in the high-quality prediction of the phrase breaks of the synthesized speech. We used WaveGlow as a vocoder for real-time, human-like audio synthesis. Human subjective experiments show that DIA-TTS achieved a mean opinion score (MOS) of 4.48 in terms of naturalness. Ablation studies further prove the superiority of the DIA mechanism for the enhancement of phrase breaks and attention robustness. Full article
(This article belongs to the Special Issue Machine and Deep Learning for Affective Computing)
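
The inheritance mechanism, as described, amounts to re-running one attention module for several refinement passes so that all passes share the same weights; a generic sketch of that idea, not the authors' DIA implementation:

```python
import torch
import torch.nn as nn

class InheritedAttention(nn.Module):
    def __init__(self, dim=128, iterations=3):
        super().__init__()
        # One attention module reused every pass: all iterations share weights.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.iterations = iterations

    def forward(self, decoder_state, encoder_out):
        context = decoder_state
        for _ in range(self.iterations):               # each pass refines the alignment
            context, weights = self.attn(context, encoder_out, encoder_out)
        return context, weights

dec = torch.randn(1, 10, 128)                          # 10 decoder frames
enc = torch.randn(1, 50, 128)                          # 50 encoder tokens
print(InheritedAttention()(dec, enc)[0].shape)         # (1, 10, 128)
```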

15 pages, 1964 KiB  
Article
Ethnicity and Tone Production on Singlish Particles
by Ying Qi Soh, Junwen Lee and Ying-Ying Tan
Languages 2022, 7(3), 243; https://doi.org/10.3390/languages7030243 - 19 Sep 2022
Cited by 2 | Viewed by 4878
Abstract
Recent research on Singlish, also known as Colloquial Singapore English, suggests that it is subject to ethnic variation across the three major ethnic groups in Singapore, namely Chinese, Malay, and Indian. Discourse particles, said to be one of the most distinctive features of the language, are nevertheless commonly used by bilinguals across all three ethnic groups. This study seeks to determine whether there are ethnic differences in the pitch contours of Singlish discourse particles produced by Singlish speakers. Four hundred and forty-four tokens of three Singlish particles, sia24, meh55, and what21, produced by the three groups of English-speaking bilingual speakers were drawn from the National Speech Corpus, and the pitch contours extracted and normalized. Statistical analysis of the overall pitch contours, the three acoustic parameters of mean pitch, pitch range, and pitch movement, and the variability of these parameters showed that the effect of ethnicity on the three acoustic parameters was not statistically significant and that the pitch contours of each particle were generally similar across ethnic groups. The results of this study suggest that Singlish may be acquired as a first language by Singaporean speakers, pre-empting any ethnic differences in the production of the particles that might otherwise have resulted from the speakers’ differing language repertoires. Full article
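
The three acoustic parameters are straightforward to compute from an extracted f0 track; the operationalization of pitch movement below (final minus initial f0) is one plausible reading, and the values are hypothetical:

```python
import numpy as np

# Hypothetical f0 track for one particle token (normalized values work the same way).
f0 = np.array([180.0, 190.0, 205.0, 220.0, 238.0, 251.0])

mean_pitch = f0.mean()
pitch_range = f0.max() - f0.min()
pitch_movement = f0[-1] - f0[0]          # rise (+) or fall (-) over the token
print(mean_pitch, pitch_range, pitch_movement)
```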

15 pages, 1829 KiB  
Article
LGCCT: A Light Gated and Crossed Complementation Transformer for Multimodal Speech Emotion Recognition
by Feng Liu, Si-Yuan Shen, Zi-Wang Fu, Han-Yang Wang, Ai-Min Zhou and Jia-Yin Qi
Entropy 2022, 24(7), 1010; https://doi.org/10.3390/e24071010 - 21 Jul 2022
Cited by 17 | Viewed by 3530
Abstract
Semantic-rich speech emotion recognition has a high degree of popularity in a range of areas. Speech emotion recognition aims to recognize human emotional states from utterances containing both acoustic and linguistic information. Since both textual and audio patterns play essential roles in speech emotion recognition (SER) tasks, various works have proposed novel modality-fusing methods to exploit text and audio signals effectively. However, the high performance of most existing models depends on a great number of learnable parameters, and they can only work well on data with fixed length. Therefore, minimizing computational overhead and improving generalization to unseen data of various lengths while maintaining a certain level of recognition accuracy is an urgent application problem. In this paper, we propose LGCCT, a light gated and crossed complementation transformer for multimodal speech emotion recognition. First, our model is capable of fusing modality information efficiently. Specifically, the acoustic features are extracted by CNN-BiLSTM while the textual features are extracted by BiLSTM. The modality-fused representation is then generated by the cross-attention module. We apply the gate-control mechanism to achieve the balanced integration of the original modality representation and the modality-fused representation. Second, the degree of attention focus must be considered: the entropy of the attention distribution over a given token should converge to the same value regardless of sequence length. To improve the generalization of the model to various testing-sequence lengths, we adopt the length-scaled dot product to calculate the attention score, which can be interpreted from a theoretical view of entropy. The operation of the length-scaled dot product is cheap but effective. Experiments are conducted on the benchmark dataset CMU-MOSEI. Compared to the baseline models, our model achieves an 81.0% F1 score with only 0.432 M parameters, showing an improvement in the balance between performance and the number of parameters. Moreover, the ablation study signifies the effectiveness of our model and its scalability to various input-sequence lengths, wherein the relative improvement over the baseline without the length-scaled dot product is almost 20%. Full article
(This article belongs to the Topic Complex Systems and Artificial Intelligence)
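
Length-scaled attention replaces the fixed 1/√d logit scale with one that grows with log n, keeping attention entropy roughly stable across sequence lengths. A generic sketch of that idea; normalizing against a reference training length is an assumption here, not necessarily the paper's exact formula:

```python
import math
import torch
import torch.nn.functional as F

def length_scaled_attention(q, k, v, train_len=512):
    """q, k, v: (batch, seq, dim). Logits scaled by log n instead of a fixed constant."""
    n, d = k.size(1), q.size(-1)
    scale = math.log(n) / (math.log(train_len) * math.sqrt(d))
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

q = k = v = torch.randn(1, 100, 32)
print(length_scaled_attention(q, k, v).shape)  # (1, 100, 32)
```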

16 pages, 1569 KiB  
Article
Learning to Perceive Non-Native Tones via Distributional Training: Effects of Task and Acoustic Cue Weighting
by Liquan Liu, Chi Yuan, Jia Hoong Ong, Alba Tuninetti, Mark Antoniou, Anne Cutler and Paola Escudero
Brain Sci. 2022, 12(5), 559; https://doi.org/10.3390/brainsci12050559 - 27 Apr 2022
Cited by 4 | Viewed by 3131
Abstract
As many distributional learning (DL) studies have shown, adult listeners can achieve discrimination of a difficult non-native contrast after a short repetitive exposure to tokens falling at the extremes of that contrast. Such studies have shown, using behavioural methods, that short distributional training can induce perceptual learning of vowel and consonant contrasts. However, much less is known about the neurological correlates of DL, and few studies have examined non-native lexical tone contrasts. Here, Australian-English speakers underwent DL training on a Mandarin tone contrast using behavioural (discrimination, identification) and neural (oddball-EEG) tasks, with listeners hearing either a bimodal or a unimodal distribution. Behavioural results show that listeners learned to discriminate tones after both unimodal and bimodal training, while EEG responses revealed more learning for listeners exposed to the bimodal distribution. Thus, perceptual learning through exposure to brief sound distributions (a) extends to non-native tonal contrasts, and (b) is sensitive to task, phonetic distance, and acoustic cue weighting. Our findings have implications for models of how auditory and phonetic constraints influence speech learning. Full article
(This article belongs to the Special Issue Auditory and Phonetic Processes in Speech Perception)
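
The bimodal/unimodal manipulation is easy to picture as two sampling distributions over an eight-step tonal continuum; the weights below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
steps = np.arange(1, 9)                       # continuum: tone A ... tone B

# Bimodal exposure peaks near the category extremes; unimodal peaks in the middle.
bimodal_weights = np.array([1, 5, 3, 1, 1, 3, 5, 1], dtype=float)
unimodal_weights = np.array([1, 2, 4, 6, 6, 4, 2, 1], dtype=float)

def sample_training_tokens(weights, n=64):
    p = weights / weights.sum()
    return rng.choice(steps, size=n, p=p)

print(np.bincount(sample_training_tokens(bimodal_weights), minlength=9)[1:])
print(np.bincount(sample_training_tokens(unimodal_weights), minlength=9)[1:])
```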

9 pages, 1103 KiB  
Article
Acoustic Word Embeddings for End-to-End Speech Synthesis
by Feiyu Shen, Chenpeng Du and Kai Yu
Appl. Sci. 2021, 11(19), 9010; https://doi.org/10.3390/app11199010 - 27 Sep 2021
Cited by 3 | Viewed by 3563
Abstract
The most recent end-to-end speech synthesis systems use phonemes as acoustic input tokens and ignore the information about which word the phonemes come from. However, many words have their own specific prosody type, which may significantly affect naturalness. Prior works have employed pre-trained linguistic word embeddings as TTS system input. However, since linguistic information is not directly relevant to how words are pronounced, the TTS quality improvement of these systems is mild. In this paper, we propose a novel and effective way of jointly training acoustic phone and word embeddings for end-to-end TTS systems. Experiments on the LJSpeech dataset show that the acoustic word embeddings dramatically decrease both the training and validation loss in phone-level prosody prediction. Subjective evaluations of naturalness demonstrate that the incorporation of acoustic word embeddings significantly outperforms both the pure phone-based system and the TTS system with pre-trained linguistic word embeddings. Full article
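
The joint phone-and-word embedding input can be sketched as two lookup tables whose outputs are combined per phone token; sizes and names below are illustrative:

```python
import torch
import torch.nn as nn

class PhoneWordEmbedding(nn.Module):
    def __init__(self, n_phones=70, n_words=10000, dim=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, dim)
        self.word_emb = nn.Embedding(n_words, dim)   # acoustic, learned end to end

    def forward(self, phone_ids, word_ids):
        """word_ids[i] is the word that phone_ids[i] belongs to."""
        return self.phone_emb(phone_ids) + self.word_emb(word_ids)

# A four-phone word: every phone points at the same word id.
phones = torch.tensor([[11, 24, 8, 15]])
words = torch.tensor([[3051, 3051, 3051, 3051]])
print(PhoneWordEmbedding()(phones, words).shape)     # (1, 4, 256)
```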
