Search Results (6)

Search Parameters:
Keywords = cocktail party problem

16 pages, 7008 KiB  
Article
Improving Top-Down Attention Network in Speech Separation by Employing Hand-Crafted Filterbank and Parameter-Sharing Transformer
by Aye Nyein Aung and Jeih-weih Hung
Electronics 2024, 13(21), 4174; https://doi.org/10.3390/electronics13214174 - 24 Oct 2024
Viewed by 1249
Abstract
The “cocktail party problem”, the challenge of isolating individual speech signals from a noisy mixture, has traditionally been addressed using statistical methods. However, deep neural networks (DNNs), with their ability to learn complex patterns, have emerged as superior solutions. DNNs excel at capturing intricate relationships between mixed audio signals and their respective speech sources, enabling them to effectively separate overlapping speech signals in challenging acoustic environments. Recent advances in speech separation systems have drawn inspiration from the brain’s hierarchical sensory information processing, incorporating top-down attention mechanisms. The top-down attention network (TDANet) employs an encoder–decoder architecture with top-down attention to enhance feature modulation and separation performance. By leveraging attention signals from multi-scale input features, TDANet effectively modifies features across different scales using a global attention (GA) module in the encoder–decoder design. Local attention (LA) layers then convert these modulated signals into high-resolution auditory characteristics. In this study, we propose two key modifications to TDANet. First, we substitute the fully trainable convolutional encoder with a deterministic hand-crafted multi-phase gammatone filterbank (MP-GTF), which mimics human hearing. Experimental results demonstrated that this substitution yielded comparable or even slightly superior performance to the original TDANet with a trainable encoder. Second, we replace the single multi-head self-attention (MHSA) layer in the global attention module with a transformer encoder block consisting of multiple MHSA layers. To optimize GPU memory utilization, we introduce a parameter sharing mechanism, dubbed “Reverse Cycle”, across layers in the transformer-based encoder. Our experimental findings indicated that these proposed modifications enabled TDANet to achieve competitive separation performance, rivaling state-of-the-art techniques, while maintaining superior computational efficiency.
(This article belongs to the Special Issue Natural Language Processing Method: Deep Learning and Deep Semantics)
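The abstract describes sharing multi-head self-attention parameters across transformer layers to reduce GPU memory use. As a rough illustration of cross-layer weight sharing, the PyTorch sketch below reuses a small set of encoder layers in a mirrored order; the "reverse cycle" pattern, layer count, and dimensions are assumptions for illustration, not the paper's actual configuration.

```python
# A minimal PyTorch sketch of cross-layer parameter sharing in a transformer
# encoder. The "reverse cycle" reuse order (0,1,2 then 2,1,0) is an assumption
# for illustration; the paper's actual sharing scheme may differ.
import torch
import torch.nn as nn

class SharedTransformerEncoder(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_unique=3):
        super().__init__()
        # Only `n_unique` layers hold parameters; the remaining depths reuse them.
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads,
                                       dim_feedforward=4 * d_model,
                                       batch_first=True)
            for _ in range(n_unique)
        )
        # Hypothetical "reverse cycle": forward pass order 0,1,2,2,1,0.
        self.order = list(range(n_unique)) + list(reversed(range(n_unique)))

    def forward(self, x):
        for idx in self.order:
            x = self.layers[idx](x)   # the same weights are applied at mirrored depths
        return x

if __name__ == "__main__":
    enc = SharedTransformerEncoder()
    feats = torch.randn(2, 50, 128)           # (batch, time frames, channels)
    print(enc(feats).shape)                    # torch.Size([2, 50, 128])
    n_params = sum(p.numel() for p in enc.parameters())
    print(f"parameters: {n_params}")           # roughly 3 layers' worth, not 6
```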

16 pages, 1437 KiB  
Article
Effective Monoaural Speech Separation through Convolutional Top-Down Multi-View Network
by Aye Nyein Aung, Che-Wei Liao and Jeih-Weih Hung
Future Internet 2024, 16(5), 151; https://doi.org/10.3390/fi16050151 - 28 Apr 2024
Cited by 2 | Viewed by 1842
Abstract
Speech separation, sometimes known as the “cocktail party problem”, is the process of separating individual speech signals from an audio mixture that includes ambient noise and several speakers. The goal is to extract the target speech from this complicated sound scene and either make it easier to understand or improve its quality for subsequent processing. Speech separation of overlapping audio is important for many speech-processing tasks, including natural language processing, automatic speech recognition, and intelligent personal assistants. New speech separation algorithms are often built on a deep neural network (DNN) structure, which seeks to learn the complex relationship between the speech mixture and a specific speech source of interest. DNN-based speech separation algorithms outperform conventional statistics-based methods, although they typically require substantial computation and/or larger model sizes. This study presents a new end-to-end speech separation network called ESC-MASD-Net (effective speaker separation through convolutional multi-view attention and SuDoRM-RF network), which has fewer model parameters than state-of-the-art speech separation architectures. The network is partly inspired by the SuDoRM-RF++ network, which uses multiple time-resolution features with downsampling and resampling for effective speech separation. ESC-MASD-Net incorporates multi-view attention and residual conformer modules into SuDoRM-RF++. Additionally, the U-Convolutional block in ESC-MASD-Net is refined with a conformer layer. Experiments conducted on the WHAM! dataset show that ESC-MASD-Net significantly outperforms SuDoRM-RF++ on the SI-SDRi metric, and the conformer layer further improves its performance.
(This article belongs to the Special Issue AI and Security in 5G Cooperative Cognitive Radio Networks)
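The abstract reports results in terms of SI-SDRi. For readers unfamiliar with the metric, the following NumPy sketch implements the standard scale-invariant SDR formula (SI-SDRi is simply the SI-SDR of the estimate minus that of the unprocessed mixture); it is a generic reference implementation, not code from the paper, and the toy signals are purely illustrative.

```python
# Scale-invariant SDR and its improvement over the mixture (SI-SDRi), in dB.
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio in dB."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))

def si_sdr_improvement(estimate, reference, mixture):
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

if __name__ == "__main__":
    t = np.linspace(0, 1, 8000)
    clean = np.sin(2 * np.pi * 220 * t)        # target speaker (toy signal)
    interferer = np.sin(2 * np.pi * 333 * t)   # competing source
    mixture = clean + interferer
    estimate = clean + 0.1 * interferer        # imperfect separation result
    print(f"SI-SDRi: {si_sdr_improvement(estimate, clean, mixture):.2f} dB")
```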

17 pages, 2715 KiB  
Article
A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model
by Guizhu Li, Min Fu, Mengnan Sun, Xuefeng Liu and Bing Zheng
Sensors 2023, 23(21), 8770; https://doi.org/10.3390/s23218770 - 27 Oct 2023
Viewed by 1656
Abstract
The cocktail party problem can be addressed more effectively by leveraging the speaker’s visual and audio information. This paper proposes a method that improves audio separation using two visual cues: facial features and lip movement. First, residual connections are introduced in the audio separation module to extract detailed features. Second, because the video stream contains information beyond the face that has minimal correlation with the audio, an attention mechanism is employed in the face module to focus on the crucial information. Finally, the loss function incorporates audio-visual similarity to fully exploit the relationship between the audio and visual modalities. Experimental results on the public VoxCeleb2 dataset show that the proposed model significantly improves SDR, PESQ, and STOI, with a gain of about 4 dB in SDR.
(This article belongs to the Section Sensing and Imaging)
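The abstract mentions a loss function that incorporates audio-visual similarity alongside the separation objective. The PyTorch sketch below shows one plausible way to combine the two terms; the L1 separation loss, the cosine-similarity term, and the weight lambda_av are illustrative assumptions rather than the paper's exact formulation.

```python
# A hedged sketch of a combined objective: a separation loss plus an
# audio-visual similarity term. The specific form is an illustrative
# assumption, not the paper's loss function.
import torch
import torch.nn.functional as F

def audio_visual_loss(est_audio, ref_audio, audio_emb, visual_emb, lambda_av=0.1):
    # Separation term: L1 distance between estimated and reference waveforms.
    sep_loss = F.l1_loss(est_audio, ref_audio)
    # Similarity term: encourage audio and visual embeddings of the same
    # speaker to align (higher cosine similarity -> lower loss).
    av_sim = F.cosine_similarity(audio_emb, visual_emb, dim=-1).mean()
    return sep_loss + lambda_av * (1.0 - av_sim)

if __name__ == "__main__":
    est = torch.randn(4, 16000)     # batch of estimated waveforms
    ref = torch.randn(4, 16000)     # reference (clean) waveforms
    a_emb = torch.randn(4, 256)     # pooled audio embeddings (hypothetical size)
    v_emb = torch.randn(4, 256)     # pooled visual embeddings
    print(audio_visual_loss(est, ref, a_emb, v_emb))
```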

14 pages, 2596 KiB  
Article
Audio–Visual Sound Source Localization and Tracking Based on Mobile Robot for The Cocktail Party Problem
by Zhanbo Shi, Lin Zhang and Dongqing Wang
Appl. Sci. 2023, 13(10), 6056; https://doi.org/10.3390/app13106056 - 15 May 2023
Cited by 9 | Viewed by 4242
Abstract
Locating the sound source is one of the most important capabilities of robot audition. In recent years, single-source localization techniques have matured considerably. However, localizing and tracking specific sound sources in multi-source scenarios, known as the cocktail party problem, remains unresolved. To address this challenge, this paper proposes a system for dynamically localizing and tracking sound sources based on audio–visual information that can be deployed on a mobile robot. The system first locates specific targets using pre-registered voiceprint and face features. Subsequently, guided by the motion module, the robot moves to track the target while keeping away from other sound sources in the surroundings, which helps it gather clearer audio data of the target and perform downstream tasks better. The system's effectiveness has been verified in extensive real-world experiments, showing a 20% improvement in the success rate of specific speaker localization and a 14% reduction in speech recognition word error rate compared with its counterparts.
(This article belongs to the Special Issue Advances in Speech and Language Processing)
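The abstract states that targets are located by matching observations against pre-registered voiceprint and face features. The toy NumPy sketch below illustrates the general idea of scoring an observation against registered embeddings; the score fusion, threshold, and embedding sizes are hypothetical and not taken from the paper.

```python
# A minimal sketch of matching against pre-registered voiceprint and face
# embeddings to decide whether a detected source is the target speaker.
# Fusion rule, threshold, and dimensions are illustrative assumptions.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def is_target(voice_emb, face_emb, reg_voice, reg_face, threshold=0.7):
    # Average the two modality similarities and threshold the fused score.
    score = 0.5 * cosine(voice_emb, reg_voice) + 0.5 * cosine(face_emb, reg_face)
    return score >= threshold, score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reg_voice, reg_face = rng.normal(size=192), rng.normal(size=512)
    # An observation close to the registered speaker (small perturbation).
    obs_voice = reg_voice + 0.1 * rng.normal(size=192)
    obs_face = reg_face + 0.1 * rng.normal(size=512)
    print(is_target(obs_voice, obs_face, reg_voice, reg_face))
```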

13 pages, 1733 KiB  
Article
Familiarity of Background Music Modulates the Cortical Tracking of Target Speech at the “Cocktail Party”
by Jane A. Brown and Gavin M. Bidelman
Brain Sci. 2022, 12(10), 1320; https://doi.org/10.3390/brainsci12101320 - 29 Sep 2022
Cited by 13 | Viewed by 3346
Abstract
The “cocktail party” problem—how a listener perceives speech in noisy environments—is typically studied using speech (multi-talker babble) or noise maskers. However, realistic cocktail party scenarios often include background music (e.g., coffee shops, concerts). Studies investigating music’s effects on concurrent speech perception have predominantly used highly controlled synthetic music or shaped noise, which do not reflect naturalistic listening environments. Behaviorally, familiar background music and songs with vocals/lyrics inhibit concurrent speech recognition. Here, we investigated the neural bases of these effects. While recording multichannel EEG, participants listened to an audiobook while popular songs (or silence) played in the background at a 0 dB signal-to-noise ratio. Songs were either familiar or unfamiliar to listeners and featured either vocals or isolated instrumentals from the original audio recordings. Comprehension questions probed task engagement. We used temporal response functions (TRFs) to isolate cortical tracking to the target speech envelope and analyzed neural responses around 100 ms (i.e., auditory N1 wave). We found that speech comprehension was, expectedly, impaired during background music compared to silence. Target speech tracking was further hindered by the presence of vocals. When masked by familiar music, response latencies to speech were less susceptible to informational masking, suggesting concurrent neural tracking of speech was easier during music known to the listener. These differential effects of music familiarity were further exacerbated in listeners with less musical ability. Our neuroimaging results and their dependence on listening skills are consistent with early attentional-gain mechanisms where familiar music is easier to tune out (listeners already know the song’s expectancies) and thus can allocate fewer attentional resources to the background music to better monitor concurrent speech material.
(This article belongs to the Section Behavioral Neuroscience)
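The study quantifies cortical tracking of the speech envelope with temporal response functions (TRFs). The NumPy sketch below shows a minimal forward TRF estimate via ridge regression on a lagged envelope design matrix; the lag window, sampling rate, and regularization value are illustrative and not the study's actual analysis parameters.

```python
# A compact sketch of forward TRF estimation: ridge regression from a lagged
# speech-envelope design matrix to a single EEG channel. The study used the
# standard multichannel TRF framework, not this exact script.
import numpy as np

def estimate_trf(envelope, eeg, fs=128, t_min=-0.05, t_max=0.35, ridge=1.0):
    lags = np.arange(int(t_min * fs), int(t_max * fs) + 1)
    # Design matrix: each column is the envelope shifted by one lag.
    X = np.column_stack([np.roll(envelope, lag) for lag in lags])
    # Ridge-regularized least squares: w = (X'X + aI)^-1 X'y
    w = np.linalg.solve(X.T @ X + ridge * np.eye(len(lags)), X.T @ eeg)
    return lags / fs, w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    fs, n = 128, 128 * 60                      # one minute at 128 Hz
    envelope = np.abs(rng.normal(size=n))      # stand-in speech envelope
    # Synthetic EEG: envelope delayed by ~100 ms (near the N1 latency) plus noise.
    eeg = 0.8 * np.roll(envelope, int(0.1 * fs)) + rng.normal(size=n)
    times, trf = estimate_trf(envelope, eeg)
    print(f"peak TRF lag: {times[np.argmax(trf)] * 1000:.0f} ms")  # close to 100 ms
```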

22 pages, 13778 KiB  
Article
An Advanced Phase Synchronization Scheme Based on Coherent Integration and Waveform Diversity for Bistatic SAR
by Da Liang, Heng Zhang, Yonghua Cai, Kaiyu Liu and Ke Zhang
Remote Sens. 2021, 13(5), 981; https://doi.org/10.3390/rs13050981 - 5 Mar 2021
Cited by 13 | Viewed by 3229
Abstract
In a bistatic synthetic aperture radar (BiSAR) system, the deviation between the two oscillators on different platforms causes an additional modulation of the BiSAR echoes. Therefore, phase synchronization is one of the key issues that must be addressed in a BiSAR system. The oscillator phase error model and the principle of phase synchronization are first described. Waveform diversity technology has been widely used in many fields, for example, in hearing aid devices and in recognizing the auditory input source in the cocktail party problem. Inspired by this, an advanced phase synchronization scheme based on coherent integration and waveform diversity is proposed. The synchronization signal and the radar signal are orthogonal and can therefore be separated using the waveform diversity technique. After the synchronization signal is extracted, the phase synchronization accuracy is further improved by coherent integration. The transmission of synchronization signals between the two synchronization antennas is analyzed, followed by a theoretical error analysis. The processing used to separate the echo signal and the synchronization signal is then described in detail. Simulation experiments show that the phase synchronization accuracy can reach 1 degree, which verifies the effectiveness of the proposed scheme.
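The proposed scheme improves phase-estimation accuracy by coherently integrating the extracted synchronization pulses. The toy NumPy sketch below demonstrates why this helps: averaging noisy complex pulses before taking the angle reduces the phase error roughly by the square root of the pulse count. The pulse count, noise level, and true phase are illustrative, not parameters from the paper.

```python
# Toy demonstration of coherent integration for phase estimation.
import numpy as np

rng = np.random.default_rng(2)
true_phase = np.deg2rad(30.0)            # hypothetical oscillator phase offset
n_pulses, noise_std = 64, 0.5

# Received synchronization pulses: a unit phasor at the true phase plus complex noise.
pulses = np.exp(1j * true_phase) + noise_std * (
    rng.normal(size=n_pulses) + 1j * rng.normal(size=n_pulses)
)

single_err = np.rad2deg(np.abs(np.angle(pulses[0]) - true_phase))
coherent = pulses.mean()                  # coherent integration across pulses
integrated_err = np.rad2deg(np.abs(np.angle(coherent) - true_phase))

print(f"single-pulse phase error:   {single_err:.2f} deg")
print(f"after coherent integration: {integrated_err:.2f} deg")
```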