Search Results (7)

Search Parameters:
Keywords = talker learning

12 pages, 2278 KB  
Article
Gated Cross-Attention for Universal Speaker Extraction: Toward Real-World Applications
by Yiru Zhang, Bijing Liu, Yong Yang and Qun Yang
Electronics 2024, 13(11), 2046; https://doi.org/10.3390/electronics13112046 - 24 May 2024
Cited by 1 | Viewed by 2153
Abstract
Current target-speaker extraction (TSE) models perform well at separating target speech from highly overlapped multi-talker speech. In real-world applications, however, multi-talker speech is often only sparsely overlapped, and the target speaker may be absent from the mixture, making it difficult for the model to extract the desired speech. Universal speaker extraction has been proposed to optimize models for these varied scenarios, but current models do not distinguish between the presence and absence of the target speaker, resulting in suboptimal performance. In this paper, we propose a gated cross-attention network for universal speaker extraction. In our model, the cross-attention mechanism learns the correlation between the target speaker and the speech to determine whether the target speaker is present. Based on this correlation, the gate mechanism enables the model to focus on extracting speech when the target is present and to filter out features when the target is absent. Additionally, we propose a joint loss function that evaluates both the reconstructed target speech and silence. Experiments on the WSJ0-2mix-extr and LibriMix datasets show that the proposed method outperforms comparison approaches in terms of SI-SDR and WER.
(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)
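To make the gating idea concrete, the sketch below shows one plausible way to combine cross-attention with a learned gate in PyTorch. The paper's actual layer sizes, attention layout, and loss are not given in the abstract, so every name and dimension here (GatedCrossAttention, mix_feats, spk_emb) is a hypothetical illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Hypothetical sketch: the mixture features query the target-speaker
    embedding; a sigmoid gate derived from the attended features scales the
    mixture, pushing frames toward zero when the target seems absent."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, mix_feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # mix_feats: (batch, time, dim) encoded speech mixture
        # spk_emb:   (batch, 1, dim) enrollment embedding of the target speaker
        attended, _ = self.attn(query=mix_feats, key=spk_emb, value=spk_emb)
        gate = self.gate(attended)   # per-frame gate values in (0, 1)
        return gate * mix_feats      # suppress frames lacking the target

# Usage: two mixtures of 100 frames with 256-dim features
block = GatedCrossAttention(dim=256)
out = block(torch.randn(2, 100, 256), torch.randn(2, 1, 256))  # (2, 100, 256)
```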

18 pages, 4569 KB  
Article
Deep Learning for Neuromuscular Control of Vocal Source for Voice Production
by Anil Palaparthi, Rishi K. Alluri and Ingo R. Titze
Appl. Sci. 2024, 14(2), 769; https://doi.org/10.3390/app14020769 - 16 Jan 2024
Cited by 2 | Viewed by 2946
Abstract
A computational neuromuscular control system that generates lung pressure and three intrinsic laryngeal muscle activations (cricothyroid, thyroarytenoid, and lateral cricoarytenoid) to control the vocal source was developed. In the current study, LeTalker, a biophysical computational model of the vocal system, was used as the physical plant. In the LeTalker, a three-mass vocal fold model was used to simulate self-sustained vocal fold oscillation, a constant /ə/ vowel was used for the vocal tract shape, and the trachea was modeled after MRI measurements. The neuromuscular control system generates control parameters to achieve four acoustic targets (fundamental frequency, sound pressure level, normalized spectral centroid, and signal-to-noise ratio) and four somatosensory targets (vocal fold length and longitudinal fiber stress in the three vocal fold layers). The deep-learning-based control system comprises one acoustic feedforward controller and two feedback (acoustic and somatosensory) controllers. Fifty thousand steady speech signals were generated using the LeTalker to train the control system. The results demonstrated that the control system was able to generate the lung pressure and the three muscle activations such that the four acoustic and four somatosensory targets were reached with high accuracy. After training, the motor command corrections from the feedback controllers were minimal compared to those from the feedforward controller, except for thyroarytenoid muscle activation.
(This article belongs to the Special Issue Computational Methods and Engineering Solutions to Voice III)
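As a rough illustration of the feedforward mapping described above (eight targets in, four motor commands out), here is a minimal PyTorch sketch. The layer sizes, normalization, and output ranges are assumptions for illustration only, not values from the paper.

```python
import torch
import torch.nn as nn

class FeedforwardVoiceController(nn.Module):
    """Hypothetical sketch of a feedforward controller: maps 8 normalized
    targets (f0, SPL, spectral centroid, SNR, vocal fold length, and fiber
    stress in 3 vocal fold layers) to 4 motor commands (lung pressure plus
    CT, TA, and LCA muscle activations)."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(8, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4), nn.Sigmoid(),  # commands scaled to (0, 1)
        )

    def forward(self, targets: torch.Tensor) -> torch.Tensor:
        # targets: (batch, 8), each feature pre-normalized to a common range
        return self.net(targets)

# One batch of 16 target vectors -> 16 motor-command vectors
commands = FeedforwardVoiceController()(torch.rand(16, 8))  # (16, 4)
```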

17 pages, 2857 KB  
Article
The Benefit of Bimodal Training in Voice Learning
by Serena Zadoorian and Lawrence D. Rosenblum
Brain Sci. 2023, 13(9), 1260; https://doi.org/10.3390/brainsci13091260 - 30 Aug 2023
Cited by 3 | Viewed by 1458
Abstract
It is known that talkers can be recognized by listening to their specific vocal qualities, such as breathiness and fundamental frequency. However, talker identification can also occur by attending to a talker's unique articulatory style, which is available both auditorily and visually and can be shared across modalities. Evidence shows that voices heard while seeing the talkers' faces are later recognized better on their own than voices heard alone. The present study investigated whether this facilitation of voice learning through facial cues relies on talker-specific articulatory or nonarticulatory facial information. Participants were first trained to learn the voices of ten talkers presented either on their own or together with (a) an articulating face, (b) a static face, or (c) an isolated articulating mouth. Participants were then tested on recognizing the voices on their own, regardless of their training modality. Consistent with previous research, voices learned with articulating faces were recognized better on their own than voices learned alone. However, isolated articulating mouths did not provide an advantage in learning the voices. The results demonstrate that learning voices while seeing full faces leads to better voice learning than learning the voices alone.

11 pages, 1736 KB  
Article
Benefits of Auditory Training with an Open-Set Sentences-in-Babble-Noise
by Ayelet Barda, Yair Shapira and Leah Fostick
Appl. Sci. 2023, 13(16), 9126; https://doi.org/10.3390/app13169126 - 10 Aug 2023
Cited by 2 | Viewed by 2319
Abstract
Auditory training (AT) has limited generalization to non-trained stimuli. Therefore, in the current study, we tested the effect of stimuli similar to those used in daily life: sentences in background noise. The sample consisted of 15 Hebrew-speaking adults aged 61–88 years with bilateral hearing impairment who engaged in computerized auditory training at home four times per week over a two-month period. Significant improvements were observed in sentence comprehension (Hebrew AzBio (HeBio) sentences test), with both four-talker babble noise (4TBN) and speech-shaped noise (SSN), and in word comprehension (consonant-vowel-consonant (CVC) words test) following one month of AT. These improvements were sustained for two months after the AT ended. No evidence of spontaneous learning was observed in the month preceding training, nor was there an additional training effect in the additional month. Participants' baseline speech perception abilities predicted their post-training improvements on the generalization tasks. The findings suggest that top-down generalization occurs from sentences to words and from babble noise to SSN and quiet conditions. Consequently, synthetic training tasks focusing on sentence-level comprehension in multi-talker babble noise should be prioritized. Moreover, an individualized approach to AT has demonstrated effectiveness and should be considered in both clinical and research settings.
(This article belongs to the Special Issue Audio, Speech and Language Processing)

14 pages, 341 KB  
Review
The Role of Talking Faces in Infant Language Learning: Mind the Gap between Screen-Based Settings and Real-Life Communicative Interactions
by Joan Birulés, Louise Goupil, Jérémie Josse and Mathilde Fort
Brain Sci. 2023, 13(8), 1167; https://doi.org/10.3390/brainsci13081167 - 5 Aug 2023
Cited by 8 | Viewed by 3691
Abstract
Over the last few decades, developmental (psycho)linguists have demonstrated that perceiving talking faces audio-visually is important for early language acquisition. Using mostly well-controlled, screen-based laboratory approaches, this line of research has shown that attending to talking faces is likely one of the powerful strategies infants use to learn their native language(s). In this review, we combine evidence from these screen-based studies with another line of research that has studied how infants learn novel words and deploy their visual attention during naturalistic play. In our view, this is an important step toward developing an integrated account of how infants effectively extract audiovisual information from talkers' faces during early language learning. We identify three factors that have been understudied so far, despite being likely to have an important impact on how infants deploy their attention (or not) toward talking faces during social interactions: social contingency, speaker characteristics, and task dependencies. Last, we propose ideas to address these issues in future research, with the aim of reducing the existing knowledge gap between current experimental studies and the many ways infants can and do effectively rely on the audiovisual information extracted from talking faces in their real-life language environment.
22 pages, 420 KB  
Article
Integrating Gestures and Words to Communicate in Full-Term and Low-Risk Preterm Late Talkers
by Chiara Suttora, Annalisa Guarini, Mariagrazia Zuccarini, Arianna Aceti, Luigi Corvaglia and Alessandra Sansavini
Int. J. Environ. Res. Public Health 2022, 19(7), 3918; https://doi.org/10.3390/ijerph19073918 - 25 Mar 2022
Cited by 3 | Viewed by 3424
Abstract
Young children use gestures to practice communicative functions that foster their receptive and expressive linguistic skills. Studies investigating the use of gestures by late talkers are limited. This study aimed to investigate late talkers' use of gestures and gesture–word combinations and their associations with word comprehension and with word and sentence production. A further purpose was to examine whether a set of individual and environmental factors accounted for interindividual differences in late talkers' gesture and gesture–word production. Sixty-one late talkers, including 35 full-term and 26 low-risk preterm children, participated in the study. Parents filled out the Italian short forms of the MacArthur–Bates Communicative Development Inventories (MB–CDI), "Gesture and Words" and "Words and Sentences," when their children were 30 months old, and were then invited to participate in a book-sharing session with their child. Children's gestures and words produced during the book-sharing session were transcribed and coded in the CHAT format of CHILDES and analyzed with CLAN. Types of spontaneous gestures (pointing and representational gestures) and gesture–word combinations (complementary, equivalent, and supplementary) were coded, and measures of word tokens and MLU were computed. Correlational analyses documented that children's use of gesture–word combinations, particularly complementary and supplementary forms, in the book-sharing session was positively associated with linguistic skills both observed during the session (word tokens and MLU) and reported by parents (word comprehension, word production, and sentence production on the MB–CDI). Concerning individual factors, male gender was negatively associated with gesture and gesture–word use, as well as with MB–CDI action/gesture production. In contrast, having a low-risk preterm condition and being later-born were positively associated with the use of gestures and pointing gestures, and having a family history of language and/or learning disorders was positively associated with the use of representational gestures. Furthermore, low-risk preterm status and a higher cognitive score were positively associated with gesture–word combinations, particularly complementary and supplementary types. With regard to environmental factors, older parental age was negatively associated with late talkers' use of gestures and pointing gestures. Interindividual differences in late talkers' gesture and gesture–word production were thus related to several intertwined individual and environmental factors. Among late talkers, the use of gestures and gesture–word combinations represents a strength that promotes receptive and expressive language acquisition.
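Since the abstract mentions computing MLU from the transcribed sessions, a minimal word-based version of that measure looks like the sketch below. CLAN's own MLU command typically counts morphemes from CHAT transcripts; this simplified word-token variant is only meant to make the measure concrete.

```python
def mean_length_of_utterance(utterances: list[str]) -> float:
    """Word-based MLU: mean number of word tokens per utterance.
    (Clinical MLU is usually counted in morphemes; word tokens are
    used here to keep the example self-contained.)"""
    if not utterances:
        return 0.0
    return sum(len(u.split()) for u in utterances) / len(utterances)

# Three child utterances from a hypothetical book-sharing session
print(mean_length_of_utterance(["look doggy", "doggy run", "more"]))  # ~1.67
```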
22 pages, 372 KB  
Article
Speech and Language Skills of Low-Risk Preterm and Full-Term Late Talkers: The Role of Child Factors and Parent Input
by Chiara Suttora, Annalisa Guarini, Mariagrazia Zuccarini, Arianna Aceti, Luigi Corvaglia and Alessandra Sansavini
Int. J. Environ. Res. Public Health 2020, 17(20), 7684; https://doi.org/10.3390/ijerph17207684 - 21 Oct 2020
Cited by 29 | Viewed by 5351
Abstract
Late talkers comprise 9% to 20% of children in their third year of life, and this proportion appears to be higher among preterm children. This study examined video-recorded spontaneous child speech during parent–child book sharing, as well as linguistic skills reported through the MacArthur–Bates Communicative Development Inventories (MB-CDI) Short Form, in 61 late talkers aged 30 months (26 low-risk preterm, 8 females; 35 full-term, 12 females). Differences between low-risk preterm and full-term late talkers in child language measures and parental speech input were tested, as were the roles of child and parent factors in child language. Low-risk preterm and full-term late talkers showed similar speech and language skills. Similarly, no between-group differences were found in measures of parental speech. Child cognitive score, chronological age, and low-risk preterm status were positively associated with the lexical diversity, rate, and composition of child speech production, whereas a family history of language and/or learning disorders, as well as parent measures of lexical diversity, rate, and grammatical complexity, were negatively associated with these child variables. In addition, child cognitive score and low-risk preterm status were positively associated with the MB-CDI measures of word and sentence production. Findings are discussed in terms of the need for good practices when following up on low-risk preterm children and for interventions targeting parents' input to preterm and full-term late talkers.
(This article belongs to the Special Issue Preterm Birth: Research, Intervention and Developmental Outcomes)