Search Results (91)

Search Parameters:
Keywords = audiovisual speech

18 pages, 697 KiB  
Review
Lip-Reading: Advances and Unresolved Questions in a Key Communication Skill
by Martina Battista, Francesca Collesei, Eva Orzan, Marta Fantoni and Davide Bottari
Audiol. Res. 2025, 15(4), 89; https://doi.org/10.3390/audiolres15040089 - 21 Jul 2025
Viewed by 356
Abstract
Lip-reading, i.e., the ability to recognize speech using only visual cues, plays a fundamental role in audio-visual speech processing, intelligibility, and comprehension. This capacity is integral to language development and functioning; it emerges in early development, and it slowly evolves. By linking psycholinguistics, psychophysics, and neurophysiology, the present narrative review explores the development and significance of lip-reading across different stages of life, highlighting its role in human communication in both typical and atypical development, e.g., in the presence of hearing or language impairments. We examined how relying on lip-reading becomes crucial when communication occurs in noisy environments and, on the contrary, the impacts that visual barriers can have on speech perception. Finally, this review highlights individual differences and the role of cultural and social contexts for a better understanding of the visual counterpart of speech. Full article

17 pages, 439 KiB  
Article
MultiAVSR: Robust Speech Recognition via Supervised Multi-Task Audio–Visual Learning
by Shad Torrie, Kimi Wright and Dah-Jye Lee
Electronics 2025, 14(12), 2310; https://doi.org/10.3390/electronics14122310 - 6 Jun 2025
Viewed by 838
Abstract
Speech recognition approaches typically fall into three categories: audio, visual, and audio–visual. Visual speech recognition, or lip reading, is the most difficult because visual cues are ambiguous and data is scarce. To address these challenges, we present a new multi-task audio–visual speech recognition framework, MultiAVSR, for training a model on all three types of speech recognition simultaneously, primarily to improve visual speech recognition. Unlike prior works, which use separate models or complex semi-supervision, our framework employs a supervised multi-task hybrid Connectionist Temporal Classification/Attention loss, cutting training exaFLOPs to just 18% of that required by semi-supervised multi-task models. MultiAVSR achieves a state-of-the-art visual speech recognition word error rate of 21.0% on the LRS3-TED dataset. Furthermore, it exhibits robust generalization, achieving a remarkable 44.7% word error rate on the WildVSR dataset. Our framework also demonstrates reduced dependency on external language models, which is critical for real-time visual speech recognition. For the audio and audio–visual tasks, our framework improves robustness in various noisy environments, with average relative word error rate improvements of 16% and 31%, respectively. These improvements across all three tasks illustrate the robust results our supervised multi-task speech recognition framework enables. Full article
(This article belongs to the Special Issue Advances in Information, Intelligence, Systems and Applications)
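For readers unfamiliar with the hybrid Connectionist Temporal Classification (CTC)/attention objective mentioned in the abstract, the sketch below shows one common way such a multi-task loss is combined in PyTorch. The weighting factor, tensor shapes, and padding conventions are illustrative assumptions and are not taken from the MultiAVSR paper.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_logits, ctc_input_lengths,
                              dec_logits, dec_targets,
                              ctc_targets, target_lengths,
                              ctc_weight=0.3, blank_id=0, pad_id=-100):
    """Weighted sum of a CTC loss on encoder outputs and a cross-entropy
    (attention decoder) loss on autoregressive decoder outputs."""
    # CTC expects (T, N, C) log-probabilities.
    log_probs = ctc_logits.log_softmax(dim=-1).transpose(0, 1)
    ctc = F.ctc_loss(log_probs, ctc_targets, ctc_input_lengths,
                     target_lengths, blank=blank_id, zero_infinity=True)
    # Attention branch: token-level cross-entropy, ignoring padding.
    att = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                          dec_targets.reshape(-1), ignore_index=pad_id)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att

# Toy shapes: batch of 2, 50 encoder frames, 12 target tokens, 30-symbol vocab.
B, T, U, V = 2, 50, 12, 30
loss = hybrid_ctc_attention_loss(
    torch.randn(B, T, V), torch.full((B,), T, dtype=torch.long),
    torch.randn(B, U, V), torch.randint(1, V, (B, U)),
    torch.randint(1, V, (B, U)), torch.full((B,), U, dtype=torch.long))
print(float(loss))
```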

15 pages, 4273 KiB  
Article
Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models
by Jamsher Bhanbhro, Asif Aziz Memon, Bharat Lal, Shahnawaz Talpur and Madeha Memon
Signals 2025, 6(2), 22; https://doi.org/10.3390/signals6020022 - 9 May 2025
Cited by 1 | Viewed by 1749
Abstract
Speech Emotion Recognition (SER) technology helps computers understand human emotions in speech, filling a critical niche in advancing human–computer interaction and mental health diagnostics. The primary objective of this study is to enhance SER accuracy and generalization through innovative deep learning models. Despite its importance, accurately identifying emotions from speech remains challenging due to differences in speakers, accents, and background noise. This work proposes two deep learning models to improve SER accuracy: a CNN-LSTM model and an Attention-Enhanced CNN-LSTM model. These models were tested on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), collected between 2015 and 2018, which comprises 1440 audio files of male and female actors expressing eight emotions. Both models achieved accuracy rates of over 96% in classifying emotions into eight categories. By comparing the CNN-LSTM and Attention-Enhanced CNN-LSTM models, this study offers comparative insights into modeling techniques, contributes to the development of more effective emotion recognition systems, and has practical implications for real-time applications in healthcare and customer service. Full article
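As a rough illustration of the attention-enhanced CNN-LSTM idea described above, the following PyTorch sketch pools bidirectional LSTM states with a simple additive attention layer. The log-mel input format, layer sizes, and pooling choices are assumptions for illustration, not the architecture evaluated in the study.

```python
import torch
import torch.nn as nn

class AttnCNNLSTM(nn.Module):
    def __init__(self, n_mels=64, n_classes=8, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(64 * (n_mels // 4), hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # additive attention scores
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                           # x: (B, 1, n_mels, T)
        f = self.cnn(x)                             # (B, 64, n_mels/4, T/4)
        f = f.permute(0, 3, 1, 2).flatten(2)        # (B, T/4, 64 * n_mels/4)
        h, _ = self.lstm(f)                         # (B, T/4, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)      # attention weights over time
        ctx = (w * h).sum(dim=1)                    # attention-pooled summary
        return self.head(ctx)                       # (B, n_classes) emotion logits

x = torch.randn(4, 1, 64, 200)                      # 4 toy log-mel spectrograms
print(AttnCNNLSTM()(x).shape)                       # torch.Size([4, 8])
```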

20 pages, 20407 KiB  
Article
VAD-CLVA: Integrating CLIP with LLaVA for Voice Activity Detection
by Andrea Appiani and Cigdem Beyan
Information 2025, 16(3), 233; https://doi.org/10.3390/info16030233 - 16 Mar 2025
Viewed by 1626
Abstract
Voice activity detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in visual-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments focusing on the upper body of an individual, while the text encoder processes textual descriptions generated by a Generative Large Multimodal Model, i.e., the Large Language and Vision Assistant (LLaVA). Subsequently, embeddings from these encoders are fused through a deep neural network to perform VAD. Our experimental analysis across three VAD benchmarks showcases the superior performance of our method compared to existing visual VAD approaches. Notably, our approach outperforms several audio-visual methods despite its simplicity and without requiring pretraining on extensive audio-visual datasets. Full article
(This article belongs to the Special Issue Application of Machine Learning in Human Activity Recognition)
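The fusion idea described above can be sketched as follows: a CLIP image embedding of an upper-body crop and a CLIP text embedding of a (precomputed) LLaVA caption are concatenated and scored by a small classifier. The Hugging Face model name, the caption input, and the untrained fusion head are illustrative assumptions; this is not the authors' implementation.

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Small fusion head; in a real system this would be trained on VAD labels.
fusion_head = nn.Sequential(
    nn.Linear(512 + 512, 256), nn.ReLU(), nn.Linear(256, 1))

def vad_score(frame: Image.Image, llava_caption: str) -> float:
    """Return a speaking-probability score for one upper-body crop."""
    inputs = proc(text=[llava_caption], images=frame,
                  return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
    fused = torch.cat([img_emb, txt_emb], dim=-1)     # concatenate modalities
    return torch.sigmoid(fusion_head(fused)).item()

# Hypothetical usage (file name and caption are made up for illustration):
# score = vad_score(Image.open("frame_000.jpg"),
#                   "A person facing the camera with their mouth open, talking.")
```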

25 pages, 2851 KiB  
Article
Self-Created Film as a Resource in a Multimodal Conversational Narrative
by Mascha Legel, Stijn R. J. M. Deckers, Gloria Soto, Nicola Grove, Annalu Waller, Hans van Balkom, Ronald Spanjers, Christopher S. Norrie and Bert Steenbergen
Multimodal Technol. Interact. 2025, 9(3), 25; https://doi.org/10.3390/mti9030025 - 10 Mar 2025
Cited by 1 | Viewed by 966
Abstract
When access to natural speech is limited or challenging, as is the case for people with complex communication needs, self-created digital film can be practical to use as one of the resources within a multimodal conversation about a personal experience. The detailed and contextual information such audiovisual media offers with today’s available technology may assist the utility of other communication modes, such as (computerized) spoken, written or signed language, to foster mutual understanding and story growth. To promote the use of self-created film, here named a personal-video-scene (PVS), in the practice of augmentative and alternative communication (AAC), a greater understanding is required of how such media can operate as a resource within social interactions, such as daily conversations. This study therefore introduces a multimodal coding model developed to study the employment of a PVS within a film-elicited conversational narrative, relating to four aspects of conversational control: (a) topic development, (b) conversational structure, (c) conversational repair and (d) conversational maintenance. A case study illustrates how the use of a PVS in story-sharing was instrumental in establishing common ground between narrators, boosting the frequency of comments and questions, mitigating instances of conversational repair and expanding topic development. Full article

19 pages, 1687 KiB  
Article
Impact of Gentle Touch Stimulation Combined with Advanced Sensory Stimulation in Patients in a Minimally Conscious State: A Quasi-Randomized Clinical Trial
by Mirjam Bonanno, Antonio Gangemi, Rosa Angela Fabio, Marco Tramontano, Maria Grazia Maggio, Federica Impellizzeri, Alfredo Manuli, Daniele Tripoli, Angelo Quartarone, Rosaria De Luca and Rocco Salvatore Calabrò
Life 2025, 15(2), 280; https://doi.org/10.3390/life15020280 - 11 Feb 2025
Viewed by 1640
Abstract
Touch, particularly affective touch mediated by C-tactile fibers, plays a key role in emotional regulation and therapeutic interventions. However, tactile stimulation is underutilized in sensory stimulation (SS) protocols for brain injury patients, despite its potential to enhance consciousness and promote recovery through neural and autonomic regulation. Tools like the Neurowave enable advanced multisensory stimulation, including audio-visual and emotional inputs, but lack tactile components. Integrating gentle touch stimulation with such systems could further enhance neuroplasticity, improve heart rate regulation, and support recovery in patients with disorders of consciousness. In this study, twenty patients in a minimally conscious state (MCS) were divided into two groups: an experimental group (EG, n = 10) and a control group (CG, n = 10). Both groups underwent standard neurorehabilitation, including conventional physiotherapy and speech therapy. The key difference was the type of sensory stimulation. The EG received advanced sensory stimulation with the Neurowave system (which provides audio-visual and emotional sensory stimulation) in addition to gentle touch stimulation. The CG received conventional sensory stimulation without the Neurowave and neutral gentle touch stimulation. Each patient was evaluated by a multidisciplinary rehabilitation team using clinical scales such as the Coma Recovery Scale-Revised (CRS-R) and the Level of Cognitive Functioning (LCF) scale, before (T0) and after (T1) treatment. Additionally, heart rate (HR) and neurophysiological outcomes (P300) were recorded for both groups (EG and CG). The MANOVA model revealed a significant interaction effect between group and phase on P300 latency (F(1, 18) = 10.23, p < 0.001, η2 = 0.09), indicating that the intervention involving gentle touch stimulation significantly influenced P300 latency in the EG. The findings of this study contribute to our understanding of the therapeutic potential of emotional multisensory stimulation, which also includes gentle touch stimulation, in MCS rehabilitation. By demonstrating significant effects on both neurophysiological and functional measures, our results support the integration of tactile interventions into comprehensive neurorehabilitation programs. Full article
(This article belongs to the Special Issue Innovative Perspectives in Physical Therapy and Health)
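As an illustration of the group-by-phase analysis reported above, the sketch below runs a mixed ANOVA (between-subject group, within-subject phase) on synthetic P300 latencies with pingouin; it is a simplified stand-in for the MANOVA used in the study, and the data frame is synthetic, mirroring only the design (EG/CG, T0/T1), not the trial data.

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
rows = []
for i in range(20):
    group = "EG" if i < 10 else "CG"
    base = rng.normal(380, 20)                         # hypothetical latency in ms
    gain = rng.normal(-25, 8) if group == "EG" else rng.normal(-5, 8)
    rows.append({"subject": f"s{i:02d}", "group": group, "phase": "T0", "p300": base})
    rows.append({"subject": f"s{i:02d}", "group": group, "phase": "T1", "p300": base + gain})
df = pd.DataFrame(rows)

# Mixed ANOVA: the group-by-phase interaction is the row of interest.
aov = pg.mixed_anova(data=df, dv="p300", within="phase",
                     between="group", subject="subject")
print(aov[["Source", "F", "p-unc", "np2"]])
```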

20 pages, 917 KiB  
Article
Developing a Dataset of Audio Features to Classify Emotions in Speech
by Alvaro A. Colunga-Rodriguez, Alicia Martínez-Rebollar, Hugo Estrada-Esquivel, Eddie Clemente and Odette A. Pliego-Martínez
Computation 2025, 13(2), 39; https://doi.org/10.3390/computation13020039 - 5 Feb 2025
Cited by 1 | Viewed by 2601
Abstract
Emotion recognition in speech has gained increasing relevance in recent years, enabling more personalized interactions between users and automated systems. This paper presents the development of a dataset of features obtained from RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) to classify emotions in speech. The paper highlights audio processing techniques such as silence removal and framing to extract features from the recordings. The features are extracted from the audio signals using spectral techniques, time-domain analysis, and the discrete wavelet transform. The resulting dataset is used to train a neural network and a support vector machine. Cross-validation is employed during model training. The models were optimized using a software package that performs hyperparameter tuning to improve results. Finally, the emotion classification results of the two models were compared. The results showed an emotion classification accuracy of 0.654 for the perceptron neural network and 0.724 for the support vector machine, demonstrating satisfactory performance in emotion classification. Full article
(This article belongs to the Section Computational Engineering)
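A minimal, hedged sketch of the kind of pipeline described above follows: silence trimming, spectral and time-domain descriptors, and discrete-wavelet statistics per recording, then an SVM with cross-validation. File handling, feature choices, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
import librosa
import pywt
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def extract_features(path, sr=22050):
    y, _ = librosa.load(path, sr=sr)
    y, _ = librosa.effects.trim(y, top_db=30)           # silence removal
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # spectral features
    zcr = librosa.feature.zero_crossing_rate(y)         # time-domain feature
    coeffs = pywt.wavedec(y, "db4", level=4)            # discrete wavelet transform
    dwt_stats = np.hstack([[c.mean(), c.std()] for c in coeffs])
    return np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1),
                      zcr.mean(), zcr.std(), dwt_stats])

# Hypothetical usage on a list of (wav_path, emotion_label) pairs:
# X = np.vstack([extract_features(p) for p, _ in items])
# y = np.array([label for _, label in items])
# print(cross_val_score(SVC(kernel="rbf", C=10), X, y, cv=5).mean())
```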

13 pages, 2145 KiB  
Article
Visual and Acoustic Aspects of Face Masks Affect Speech Intelligibility in Listeners with Different Hearing Statuses
by Pauline Rohner, Rasmus Sönnichsen, Sabine Hochmuth and Andreas Radeloff
Audiol. Res. 2025, 15(1), 7; https://doi.org/10.3390/audiolres15010007 - 21 Jan 2025
Viewed by 1300
Abstract
Background: When speaking while wearing a face mask, sound transmission is attenuated, and visual cues are lost because the speaker's facial movements are covered. In this study, we investigated the extent to which different face masks alter speech intelligibility in individuals with different degrees of hearing impairment. Methods: A total of fifty participants were divided into four groups according to hearing status: normal hearing (16), mild (13), moderate (11), and severe (10) hearing loss. A modified version of the Audiovisual German Matrix Sentence Test (AV-OLSA) was used to assess speech perception in noise in five conditions (audiovisual, audio-only, visual-only, surgical mask, and FFP2 mask). Results: Our results show that the acoustic attenuation of face masks causes a small but similar decrease in speech reception thresholds (SRTs) in listeners of different hearing statuses. The effect of visual cues (visual benefit) on SRTs was stronger than the effect of acoustic attenuation but also did not differ significantly between the hearing status groups, with a median difference of 1.5 dB for mild hearing loss, 2.9 dB for moderate hearing loss, and 2.7 dB for severe hearing loss. The best-aided hearing status did not correlate with visual benefit. Conclusions: Our research confirms the importance of providing visual cues for speech reception in noisy environments, especially for individuals with impaired hearing, regardless of their degree of hearing loss. Full article

18 pages, 5732 KiB  
Article
AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio–Visual Speech Recognition
by Na Che, Yiming Zhu, Haiyan Wang, Xianwei Zeng and Qinsheng Du
Appl. Sci. 2025, 15(1), 199; https://doi.org/10.3390/app15010199 - 29 Dec 2024
Cited by 2 | Viewed by 1863
Abstract
To address the serious information redundancy, complex inter-modal interactions, and difficult multimodal fusion that audio–visual speech recognition systems face when dealing with complex multimodal information, this paper proposes an adaptive fusion transformer algorithm based on a sparse attention mechanism (AFT-SAM). The algorithm adopts a sparse attention mechanism in the feature-encoding process to reduce excessive attention to non-important regions and dynamically adjusts the attention weights through adaptive fusion to capture and integrate multimodal information more effectively and reduce the impact of redundant information on model performance. Experiments on the audio–visual speech recognition dataset LRS2 show that, compared with other algorithms, the proposed algorithm achieves significantly lower WERs in the audio-only, visual-only, and audio–visual bimodal cases. Full article
(This article belongs to the Special Issue Advances in Audio/Image Signals Processing)
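The abstract does not specify the exact sparsity pattern, so the sketch below illustrates the general idea with a generic top-k sparse attention step in PyTorch: scores outside the top k per query are suppressed before the softmax. The adaptive fusion weighting is not reproduced.

```python
import torch

def topk_sparse_attention(q, k, v, top_k=8):
    """q, k, v: (batch, heads, seq, dim). Keep only the top_k scores per query."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5         # (B, H, Tq, Tk)
    kth = scores.topk(min(top_k, scores.size(-1)), dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < kth, float("-inf"))  # drop non-top-k keys
    return torch.softmax(scores, dim=-1) @ v

# Example: 2 sequences, 4 heads, 32 audio/visual tokens, 64-dim heads.
q = k = v = torch.randn(2, 4, 32, 64)
out = topk_sparse_attention(q, k, v, top_k=8)
print(out.shape)                                         # torch.Size([2, 4, 32, 64])
```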

25 pages, 2085 KiB  
Article
How Much Does the Dynamic F0 Curve Affect the Expression of Emotion in Utterances?
by Tae-Jin Yoon
Appl. Sci. 2024, 14(23), 10972; https://doi.org/10.3390/app142310972 - 26 Nov 2024
Viewed by 1134
Abstract
The modulation of vocal elements, such as pitch, loudness, and duration, plays a crucial role in conveying both linguistic information and the speaker’s emotional state. While acoustic features like fundamental frequency (F0) variability have been widely studied in emotional speech analysis, accurately classifying emotion remains challenging due to the complex and dynamic nature of vocal expressions. Traditional analytical methods often oversimplify these dynamics, potentially overlooking intricate patterns indicative of specific emotions. This study examines the influences of emotion and temporal variation on dynamic F0 contours within a single analytical framework, utilizing a dataset valuable for its diverse emotional expressions. However, the analysis is constrained by the limited variety of sentences employed, which may affect the generalizability of the findings to broader linguistic contexts. We utilized the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), focusing on eight distinct emotional states performed by 24 professional actors. Sonorant segments were extracted, and F0 measurements were converted into semitones relative to a 100 Hz baseline to standardize pitch variations. By employing Generalized Additive Mixed Models (GAMMs), we modeled non-linear trajectories of F0 contours over time, accounting for fixed effects (emotions) and random effects (individual speaker variability). Our analysis revealed that incorporating emotion-specific, non-linear time effects and individual speaker differences significantly improved the model’s explanatory power, ultimately explaining up to 66.5% of the variance in F0. The inclusion of random smooths for time within speakers captured individual temporal modulation patterns, providing a more accurate representation of emotional speech dynamics. The results demonstrate that dynamic modeling of F0 contours using GAMMs enhances the accuracy of emotion classification in speech. This approach captures the nuanced pitch patterns associated with different emotions and accounts for individual variability among speakers. The findings contribute to a deeper understanding of the vocal expression of emotions and offer valuable insights for advancing speech emotion recognition systems. Full article
(This article belongs to the Special Issue Advances and Applications of Audio and Speech Signal Processing)
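The semitone conversion mentioned above is a simple formula, st = 12 · log2(F0 / 100), for a 100 Hz baseline; a small worked example follows. The GAMM fitting itself, typically done with dedicated statistical tooling, is not reproduced here.

```python
import numpy as np

def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Express F0 values in semitones relative to a reference frequency."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    return 12.0 * np.log2(f0_hz / ref_hz)

print(hz_to_semitones([100, 200, 150]))   # ≈ [ 0.   12.    7.02]
```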

19 pages, 3739 KiB  
Article
Segmenting Speech: The Role of Resyllabification in Spanish Phonology
by Iván Andreu Rascón
Languages 2024, 9(11), 346; https://doi.org/10.3390/languages9110346 - 7 Nov 2024
Cited by 1 | Viewed by 1631
Abstract
Humans segment speech naturally based on the transitional probabilities between linguistic elements. For bilingual speakers navigating between a first (L1) and a second language (L2), L1 knowledge can influence their perception, leading to transfer effects based on phonological similarities or differences. Specifically, in Spanish, resyllabification occurs when consonants at the end of a syllable or word boundary are repositioned as the onset of the subsequent syllable. While the process can lead to ambiguities in perception, current academic discussions debate the duration of canonical and resyllabified productions. However, the role of bilingualism in the visual perception of syllable and word segmentation remains unknown to date. The present study explores the use of bilingual skills in the perception of articulatory movements and visual cues in speech perception, addressing the gap in the literature regarding the visibility of syllable pauses in lipreading. The participants in this study, 80 native Spanish speakers and 195 L2 learners, were subjected to audio, visual-only, and audiovisual conditions to assess their segmentation accuracy. The results indicated that both groups could segment speech effectively, with audiovisual cues providing the most significant benefit. Native speakers performed more consistently, while proficiency influenced L2 learners’ accuracy. The results show that aural syllabic segmentation is acquired at early stages of proficiency, while visual syllabic segmentation is acquired at higher levels of proficiency. Full article

15 pages, 1799 KiB  
Article
Heterogeneous Hierarchical Fusion Network for Multimodal Sentiment Analysis in Real-World Environments
by Ju Huang, Wenkang Chen, Fangyi Wang and Haijun Zhang
Electronics 2024, 13(20), 4137; https://doi.org/10.3390/electronics13204137 - 21 Oct 2024
Viewed by 1854
Abstract
Multimodal sentiment analysis models can determine users’ sentiments by utilizing rich information from various sources (e.g., textual, visual, and audio). However, there are two key challenges when deploying the model in real-world environments: (1) the limitations of relying on the performance of automatic speech recognition (ASR) models can lead to errors in recognizing sentiment words, which may mislead the sentiment analysis of the textual modality, and (2) variations in information density across modalities complicate the development of a high-quality fusion framework. To address these challenges, this paper proposes a novel Multimodal Sentiment Word Optimization Module and a heterogeneous hierarchical fusion (MSWOHHF) framework. Specifically, the proposed Multimodal Sentiment Word Optimization Module optimizes the sentiment words extracted from the textual modality by the ASR model, thereby reducing sentiment word recognition errors. In the multimodal fusion phase, a heterogeneous hierarchical fusion network architecture is introduced, which first utilizes a Transformer Aggregation Module to fuse the visual and audio modalities, enhancing the high-level semantic features of each modality. A Cross-Attention Fusion Module then integrates the textual modality with the audiovisual fusion. Next, a Feature-Based Attention Fusion Module is proposed that enables fusion by dynamically tuning the weights of both the combined and unimodal representations. It then predicts sentiment polarity using a nonlinear neural network. Finally, the experimental results on the MOSI-SpeechBrain, MOSI-IBM, and MOSI-iFlytek datasets show that the MSWOHHF outperforms several baselines, demonstrating better performance. Full article
(This article belongs to the Special Issue New Advances in Affective Computing)
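As a rough sketch of the cross-attention fusion step described above, the PyTorch module below lets a textual sequence query a fused audio-visual sequence through multi-head attention with a residual connection. Dimensions and the single-block design are illustrative assumptions, not the MSWOHHF architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, av_feats):
        # text_feats: (B, Tt, dim) queries; av_feats: (B, Tav, dim) keys/values.
        fused, _ = self.attn(text_feats, av_feats, av_feats)
        return self.norm(text_feats + fused)    # residual connection, then norm

text = torch.randn(2, 20, 256)                  # toy textual features
audio_visual = torch.randn(2, 50, 256)          # toy fused audio-visual features
print(CrossAttentionFusion()(text, audio_visual).shape)   # torch.Size([2, 20, 256])
```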

19 pages, 11059 KiB  
Article
VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization
by Li Liu, Jinhui Wang, Shijuan Chen and Zongmei Li
Electronics 2024, 13(18), 3657; https://doi.org/10.3390/electronics13183657 - 14 Sep 2024
Cited by 1 | Viewed by 5174
Abstract
Speech-driven lip synchronization is a crucial technology for generating realistic facial animations, with broad application prospects in virtual reality, education, training, and other fields. However, existing methods still face challenges in generating high-fidelity facial animations, particularly in addressing lip jitter and facial motion instability issues in continuous frame sequences. This study presents VividWav2Lip, an improved speech-driven lip synchronization model. Our model incorporates three key innovations: a cross-attention mechanism for enhanced audio-visual feature fusion, an optimized network structure with Squeeze-and-Excitation (SE) residual blocks, and the integration of the CodeFormer facial restoration network for post-processing. Extensive experiments were conducted on a diverse dataset comprising multiple languages and facial types. Quantitative evaluations demonstrate that VividWav2Lip outperforms the baseline Wav2Lip model by 5% in lip sync accuracy and image generation quality, with even more significant improvements over other mainstream methods. In subjective assessments, 85% of participants perceived VividWav2Lip-generated animations as more realistic compared to those produced by existing techniques. Additional experiments reveal our model’s robust cross-lingual performance, maintaining consistent quality even for languages not included in the training set. This study not only advances the theoretical foundations of audio-driven lip synchronization but also offers a practical solution for high-fidelity, multilingual dynamic face generation, with potential applications spanning virtual assistants, video dubbing, and personalized content creation. Full article
(This article belongs to the Section Artificial Intelligence)
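The Squeeze-and-Excitation (SE) residual block mentioned above is a standard building block; a compact PyTorch sketch is given below. Channel counts and the reduction ratio are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SEResidualBlock(nn.Module):
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
        self.se = nn.Sequential(                     # squeeze-and-excitation gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.body(x)
        y = y * self.se(y)           # reweight channels by learned importance
        return self.act(x + y)       # residual connection

x = torch.randn(1, 64, 48, 48)
print(SEResidualBlock()(x).shape)    # torch.Size([1, 64, 48, 48])
```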

18 pages, 7475 KiB  
Data Descriptor
BELMASK—An Audiovisual Dataset of Adversely Produced Speech for Auditory Cognition Research
by Cleopatra Christina Moshona, Frederic Rudawski, André Fiebig and Ennes Sarradj
Data 2024, 9(8), 92; https://doi.org/10.3390/data9080092 - 24 Jul 2024
Viewed by 1997
Abstract
In this article, we introduce the Berlin Dataset of Lombard and Masked Speech (BELMASK), a phonetically controlled audiovisual dataset of speech produced in adverse speaking conditions, and describe the development of the related speech task. The dataset contains in total 128 min of audio and video recordings of 10 German native speakers (4 female, 6 male) with a mean age of 30.2 years (SD: 6.3 years), uttering matrix sentences in cued, uninstructed speech in four conditions: (i) with a Filtering Facepiece P2 (FFP2) mask in silence, (ii) without an FFP2 mask in silence, (iii) with an FFP2 mask while exposed to noise, (iv) without an FFP2 mask while exposed to noise. Noise consisted of mixed-gender six-talker babble played over headphones to the speakers, triggering the Lombard effect. All conditions are readily available in face-and-voice and voice-only formats. The speech material is annotated, employing a multi-layer architecture, and was originally conceptualized to be used for the administration of a working memory task. The dataset is stored in a restricted-access Zenodo repository and is available for academic research in the area of speech communication, acoustics, psychology and related disciplines upon request, after signing an End User License Agreement (EULA). Full article

16 pages, 1243 KiB  
Article
5G Enabled Dual Vision and Speech Enhancement Architecture for Multimodal Hearing-Aids
by Xianpo Ni, Yang Cen, Tushar Tyagi, Godwin Enemali and Tughrul Arslan
Electronics 2024, 13(13), 2588; https://doi.org/10.3390/electronics13132588 - 1 Jul 2024
Cited by 1 | Viewed by 1358
Abstract
This paper presents the algorithmic framework for a multimodal hearing aid (HA) prototype designed on a Field Programmable Gate Array (FPGA), specifically the AMD RFSoC 4x2 FPGA, and evaluates the transmitter performance through simulation studies. The proposed architecture integrates audio and video inputs, processes them using advanced algorithms, and employs the 5G New Radio (NR) communication protocol for uploading the processed signal to the cloud. The core transmission utilizes Orthogonal Frequency Division Multiplexing (OFDM), which multiplexes the processed signals onto orthogonal subcarriers, enhancing bandwidth efficiency and reducing interference. The design is divided into modules such as the sounding reference signal (SRS), demodulation reference signal (DMRS), physical broadcast channel (PBCH), and physical uplink shared channel (PUSCH). The modulation algorithm has been optimized for the FPGA's parallel processing capabilities, making it better suited to the hearing aid's low-latency requirements. The optimized algorithm achieves a transmission time of only 4.789 ms and uses fewer hardware resources, enhancing performance in a cost-effective and energy-efficient manner. Full article
(This article belongs to the Section Circuit and Signal Processing)
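As a minimal illustration of the OFDM step described above, the NumPy sketch below maps bits to QPSK symbols, multiplexes them onto orthogonal subcarriers with an IFFT, and prepends a cyclic prefix. The subcarrier count and prefix length are illustrative and do not follow the 5G NR numerology used on the FPGA prototype.

```python
import numpy as np

def ofdm_symbol(bits, n_subcarriers=64, cp_len=16):
    assert bits.size == 2 * n_subcarriers, "QPSK: 2 bits per subcarrier"
    b = bits.reshape(-1, 2)
    qpsk = ((1 - 2 * b[:, 0]) + 1j * (1 - 2 * b[:, 1])) / np.sqrt(2)  # QPSK mapping
    time = np.fft.ifft(qpsk) * np.sqrt(n_subcarriers)   # multiplex onto orthogonal subcarriers
    return np.concatenate([time[-cp_len:], time])       # prepend cyclic prefix

rng = np.random.default_rng(1)
sym = ofdm_symbol(rng.integers(0, 2, 128))
print(sym.shape)   # (80,) = 64 samples + 16-sample cyclic prefix
```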
