Search Results (6)

Search Parameters:
Keywords = cocktail party problem

16 pages, 7008 KiB  
Article
Improving Top-Down Attention Network in Speech Separation by Employing Hand-Crafted Filterbank and Parameter-Sharing Transformer
by Aye Nyein Aung and Jeih-weih Hung
Electronics 2024, 13(21), 4174; https://doi.org/10.3390/electronics13214174 - 24 Oct 2024
Viewed by 1249
Abstract
The “cocktail party problem”, the challenge of isolating individual speech signals from a noisy mixture, has traditionally been addressed using statistical methods. However, deep neural networks (DNNs), with their ability to learn complex patterns, have emerged as superior solutions. DNNs excel at capturing intricate relationships between mixed audio signals and their respective speech sources, enabling them to effectively separate overlapping speech signals in challenging acoustic environments. Recent advances in speech separation systems have drawn inspiration from the brain’s hierarchical sensory information processing, incorporating top-down attention mechanisms. The top-down attention network (TDANet) employs an encoder–decoder architecture with top-down attention to enhance feature modulation and separation performance. By leveraging attention signals from multi-scale input features, TDANet effectively modifies features across different scales using a global attention (GA) module in the encoder–decoder design. Local attention (LA) layers then convert these modulated signals into high-resolution auditory characteristics. In this study, we propose two key modifications to TDANet. First, we substitute the fully trainable convolutional encoder with a deterministic hand-crafted multi-phase gammatone filterbank (MP-GTF), which mimics human hearing. Experimental results demonstrated that this substitution yielded comparable or even slightly superior performance to the original TDANet with a trainable encoder. Second, we replace the single multi-head self-attention (MHSA) layer in the global attention module with a transformer encoder block consisting of multiple MHSA layers. To optimize GPU memory utilization, we introduce a parameter sharing mechanism, dubbed “Reverse Cycle”, across layers in the transformer-based encoder. Our experimental findings indicated that these proposed modifications enabled TDANet to achieve competitive separation performance, rivaling state-of-the-art techniques, while maintaining superior computational efficiency.
(This article belongs to the Special Issue Natural Language Processing Method: Deep Learning and Deep Semantics)
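The abstract describes sharing multi-head self-attention parameters across transformer layers to reduce GPU memory use. As a rough illustration of cross-layer weight sharing, the PyTorch sketch below reuses a small set of encoder layers in a mirrored order; the "reverse cycle" pattern, layer count, and dimensions are assumptions for illustration, not the paper's actual configuration.

```python
# A minimal PyTorch sketch of cross-layer parameter sharing in a transformer
# encoder. The "reverse cycle" reuse order (0,1,2 then 2,1,0) is an assumption
# for illustration; the paper's actual sharing scheme may differ.
import torch
import torch.nn as nn

class SharedTransformerEncoder(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_unique=3):
        super().__init__()
        # Only `n_unique` layers hold parameters; the remaining depths reuse them.
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads,
                                       dim_feedforward=4 * d_model,
                                       batch_first=True)
            for _ in range(n_unique)
        )
        # Hypothetical "reverse cycle": forward pass order 0,1,2,2,1,0.
        self.order = list(range(n_unique)) + list(reversed(range(n_unique)))

    def forward(self, x):
        for idx in self.order:
            x = self.layers[idx](x)   # the same weights are applied at mirrored depths
        return x

if __name__ == "__main__":
    enc = SharedTransformerEncoder()
    feats = torch.randn(2, 50, 128)           # (batch, time frames, channels)
    print(enc(feats).shape)                    # torch.Size([2, 50, 128])
    n_params = sum(p.numel() for p in enc.parameters())
    print(f"parameters: {n_params}")           # roughly 3 layers' worth, not 6
```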

16 pages, 1437 KiB  
Article
Effective Monoaural Speech Separation through Convolutional Top-Down Multi-View Network
by Aye Nyein Aung, Che-Wei Liao and Jeih-Weih Hung
Future Internet 2024, 16(5), 151; https://doi.org/10.3390/fi16050151 - 28 Apr 2024
Cited by 2 | Viewed by 1842
Abstract
Speech separation, sometimes known as the “cocktail party problem”, is the process of separating individual speech signals from an audio mixture that includes ambient noise and several speakers. The goal is to extract the target speech from this complicated sound scene and either make it easier to understand or improve its quality for subsequent processing. Speech separation of overlapping audio is important for many speech-processing tasks, including natural language processing, automatic speech recognition, and intelligent personal assistants. New speech separation algorithms are often built on a deep neural network (DNN) structure, which seeks to learn the complex relationship between the speech mixture and a specific speech source of interest. DNN-based speech separation algorithms outperform conventional statistics-based methods, although they typically require substantial computation and/or larger model sizes. This study presents a new end-to-end speech separation network called ESC-MASD-Net (effective speaker separation through convolutional multi-view attention and SuDoRM-RF network), which has fewer model parameters than state-of-the-art speech separation architectures. The network is partly inspired by the SuDoRM-RF++ network, which uses multiple time-resolution features with downsampling and resampling for effective speech separation. ESC-MASD-Net incorporates multi-view attention and residual conformer modules into SuDoRM-RF++. Additionally, the U-Convolutional block in ESC-MASD-Net is refined with a conformer layer. Experiments conducted on the WHAM! dataset show that ESC-MASD-Net significantly outperforms SuDoRM-RF++ on the SI-SDRi metric, and the conformer layer further improves its performance.
(This article belongs to the Special Issue AI and Security in 5G Cooperative Cognitive Radio Networks)
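The abstract reports results in terms of SI-SDRi. For readers unfamiliar with the metric, the following NumPy sketch implements the standard scale-invariant SDR formula (SI-SDRi is simply the SI-SDR of the estimate minus that of the unprocessed mixture); it is a generic reference implementation, not code from the paper, and the toy signals are purely illustrative.

```python
# Scale-invariant SDR and its improvement over the mixture (SI-SDRi), in dB.
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio in dB."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))

def si_sdr_improvement(estimate, reference, mixture):
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

if __name__ == "__main__":
    t = np.linspace(0, 1, 8000)
    clean = np.sin(2 * np.pi * 220 * t)        # target speaker (toy signal)
    interferer = np.sin(2 * np.pi * 333 * t)   # competing source
    mixture = clean + interferer
    estimate = clean + 0.1 * interferer        # imperfect separation result
    print(f"SI-SDRi: {si_sdr_improvement(estimate, clean, mixture):.2f} dB")
```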

17 pages, 2715 KiB  
Article
A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model
by Guizhu Li, Min Fu, Mengnan Sun, Xuefeng Liu and Bing Zheng
Sensors 2023, 23(21), 8770; https://doi.org/10.3390/s23218770 - 27 Oct 2023
Viewed by 1656
Abstract
The cocktail party problem can be addressed more effectively by leveraging the speaker’s visual and audio information. This paper proposes a method that improves audio separation using two visual cues: facial features and lip movement. First, residual connections are introduced in the audio separation module to extract detailed features. Second, because the video stream contains information beyond the face that has minimal correlation with the audio, an attention mechanism is employed in the face module to focus on the crucial information. Finally, the loss function incorporates audio-visual similarity to fully exploit the relationship between the audio and visual modalities. Experimental results on the public VoxCeleb2 dataset show that the proposed model significantly improves SDR, PESQ, and STOI, with a gain of about 4 dB in SDR.
(This article belongs to the Section Sensing and Imaging)
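The abstract mentions a loss function that incorporates audio-visual similarity alongside the separation objective. The PyTorch sketch below shows one plausible way to combine the two terms; the L1 separation loss, the cosine-similarity term, and the weight lambda_av are illustrative assumptions rather than the paper's exact formulation.

```python
# A hedged sketch of a combined objective: a separation loss plus an
# audio-visual similarity term. The specific form is an illustrative
# assumption, not the paper's loss function.
import torch
import torch.nn.functional as F

def audio_visual_loss(est_audio, ref_audio, audio_emb, visual_emb, lambda_av=0.1):
    # Separation term: L1 distance between estimated and reference waveforms.
    sep_loss = F.l1_loss(est_audio, ref_audio)
    # Similarity term: encourage audio and visual embeddings of the same
    # speaker to align (higher cosine similarity -> lower loss).
    av_sim = F.cosine_similarity(audio_emb, visual_emb, dim=-1).mean()
    return sep_loss + lambda_av * (1.0 - av_sim)

if __name__ == "__main__":
    est = torch.randn(4, 16000)     # batch of estimated waveforms
    ref = torch.randn(4, 16000)     # reference (clean) waveforms
    a_emb = torch.randn(4, 256)     # pooled audio embeddings (hypothetical size)
    v_emb = torch.randn(4, 256)     # pooled visual embeddings
    print(audio_visual_loss(est, ref, a_emb, v_emb))
```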

14 pages, 2596 KiB  
Article
Audio–Visual Sound Source Localization and Tracking Based on Mobile Robot for The Cocktail Party Problem
by Zhanbo Shi, Lin Zhang and Dongqing Wang
Appl. Sci. 2023, 13(10), 6056; https://doi.org/10.3390/app13106056 - 15 May 2023
Cited by 9 | Viewed by 4242
Abstract
Locating the sound source is one of the most important capabilities of robot audition. In recent years, single-source localization techniques have matured considerably. However, localizing and tracking specific sound sources in multi-source scenarios, known as the cocktail party problem, remains unresolved. To address this challenge, this paper proposes a system for dynamically localizing and tracking sound sources based on audio–visual information that can be deployed on a mobile robot. The system first locates specific targets using pre-registered voiceprint and face features. Subsequently, guided by the motion module, the robot moves to track the target while keeping away from other sound sources in the surroundings, which helps it gather clearer audio data of the target and perform downstream tasks better. The system's effectiveness has been verified in extensive real-world experiments, showing a 20% improvement in the success rate of specific speaker localization and a 14% reduction in speech recognition word error rate compared with its counterparts.
(This article belongs to the Special Issue Advances in Speech and Language Processing)
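The abstract states that targets are located by matching observations against pre-registered voiceprint and face features. The toy NumPy sketch below illustrates the general idea of scoring an observation against registered embeddings; the score fusion, threshold, and embedding sizes are hypothetical and not taken from the paper.

```python
# A minimal sketch of matching against pre-registered voiceprint and face
# embeddings to decide whether a detected source is the target speaker.
# Fusion rule, threshold, and dimensions are illustrative assumptions.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def is_target(voice_emb, face_emb, reg_voice, reg_face, threshold=0.7):
    # Average the two modality similarities and threshold the fused score.
    score = 0.5 * cosine(voice_emb, reg_voice) + 0.5 * cosine(face_emb, reg_face)
    return score >= threshold, score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reg_voice, reg_face = rng.normal(size=192), rng.normal(size=512)
    # An observation close to the registered speaker (small perturbation).
    obs_voice = reg_voice + 0.1 * rng.normal(size=192)
    obs_face = reg_face + 0.1 * rng.normal(size=512)
    print(is_target(obs_voice, obs_face, reg_voice, reg_face))
```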

13 pages, 1733 KiB  
Article
Familiarity of Background Music Modulates the Cortical Tracking of Target Speech at the “Cocktail Party”
by Jane A. Brown and Gavin M. Bidelman
Brain Sci. 2022, 12(10), 1320; https://doi.org/10.3390/brainsci12101320 - 29 Sep 2022
Cited by 13 | Viewed by 3346
Abstract
The “cocktail party” problem—how a listener perceives speech in noisy environments—is typically studied using speech (multi-talker babble) or noise maskers. However, realistic cocktail party scenarios often include background music (e.g., coffee shops, concerts). Studies investigating music’s effects on concurrent speech perception have predominantly used highly controlled synthetic music or shaped noise, which do not reflect naturalistic listening environments. Behaviorally, familiar background music and songs with vocals/lyrics inhibit concurrent speech recognition. Here, we investigated the neural bases of these effects. While recording multichannel EEG, participants listened to an audiobook while popular songs (or silence) played in the background at a 0 dB signal-to-noise ratio. Songs were either familiar or unfamiliar to listeners and featured either vocals or isolated instrumentals from the original audio recordings. Comprehension questions probed task engagement. We used temporal response functions (TRFs) to isolate cortical tracking to the target speech envelope and analyzed neural responses around 100 ms (i.e., auditory N1 wave). We found that speech comprehension was, expectedly, impaired during background music compared to silence. Target speech tracking was further hindered by the presence of vocals. When masked by familiar music, response latencies to speech were less susceptible to informational masking, suggesting concurrent neural tracking of speech was easier during music known to the listener. These differential effects of music familiarity were further exacerbated in listeners with less musical ability. Our neuroimaging results and their dependence on listening skills are consistent with early attentional-gain mechanisms where familiar music is easier to tune out (listeners already know the song’s expectancies) and thus can allocate fewer attentional resources to the background music to better monitor concurrent speech material.
(This article belongs to the Section Behavioral Neuroscience)
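The study quantifies cortical tracking of the speech envelope with temporal response functions (TRFs). The NumPy sketch below shows a minimal forward TRF estimate via ridge regression on a lagged envelope design matrix; the lag window, sampling rate, and regularization value are illustrative and not the study's actual analysis parameters.

```python
# A compact sketch of forward TRF estimation: ridge regression from a lagged
# speech-envelope design matrix to a single EEG channel. The study used the
# standard multichannel TRF framework, not this exact script.
import numpy as np

def estimate_trf(envelope, eeg, fs=128, t_min=-0.05, t_max=0.35, ridge=1.0):
    lags = np.arange(int(t_min * fs), int(t_max * fs) + 1)
    # Design matrix: each column is the envelope shifted by one lag.
    X = np.column_stack([np.roll(envelope, lag) for lag in lags])
    # Ridge-regularized least squares: w = (X'X + aI)^-1 X'y
    w = np.linalg.solve(X.T @ X + ridge * np.eye(len(lags)), X.T @ eeg)
    return lags / fs, w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    fs, n = 128, 128 * 60                      # one minute at 128 Hz
    envelope = np.abs(rng.normal(size=n))      # stand-in speech envelope
    # Synthetic EEG: envelope delayed by ~100 ms (near the N1 latency) plus noise.
    eeg = 0.8 * np.roll(envelope, int(0.1 * fs)) + rng.normal(size=n)
    times, trf = estimate_trf(envelope, eeg)
    print(f"peak TRF lag: {times[np.argmax(trf)] * 1000:.0f} ms")  # close to 100 ms
```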

22 pages, 13778 KiB  
Article
An Advanced Phase Synchronization Scheme Based on Coherent Integration and Waveform Diversity for Bistatic SAR
by Da Liang, Heng Zhang, Yonghua Cai, Kaiyu Liu and Ke Zhang
Remote Sens. 2021, 13(5), 981; https://doi.org/10.3390/rs13050981 - 5 Mar 2021
Cited by 13 | Viewed by 3229
Abstract
In a bistatic synthetic aperture radar (BiSAR) system, the deviation between the two oscillators on different platforms causes an additional modulation of the BiSAR echoes. Therefore, phase synchronization is one of the key issues that must be addressed in a BiSAR system. The oscillator phase error model and the principle of phase synchronization are first described. Waveform diversity technology has been widely used in many fields, for example, in hearing aid devices and in recognizing the auditory input source in the cocktail party problem. Inspired by this, an advanced phase synchronization scheme based on coherent integration and waveform diversity is proposed. The synchronization signal and the radar signal are orthogonal and can therefore be separated using the waveform diversity technique. After the synchronization signal is extracted, the phase synchronization accuracy is further improved by coherent integration. The transmission of synchronization signals between the two synchronization antennas is analyzed, followed by a theoretical error analysis. The processing used to separate the echo signal and the synchronization signal is then described in detail. Simulation experiments show that the phase synchronization accuracy can reach 1 degree, which verifies the effectiveness of the proposed scheme.
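The proposed scheme improves phase-estimation accuracy by coherently integrating the extracted synchronization pulses. The toy NumPy sketch below demonstrates why this helps: averaging noisy complex pulses before taking the angle reduces the phase error roughly by the square root of the pulse count. The pulse count, noise level, and true phase are illustrative, not parameters from the paper.

```python
# Toy demonstration of coherent integration for phase estimation.
import numpy as np

rng = np.random.default_rng(2)
true_phase = np.deg2rad(30.0)            # hypothetical oscillator phase offset
n_pulses, noise_std = 64, 0.5

# Received synchronization pulses: a unit phasor at the true phase plus complex noise.
pulses = np.exp(1j * true_phase) + noise_std * (
    rng.normal(size=n_pulses) + 1j * rng.normal(size=n_pulses)
)

single_err = np.rad2deg(np.abs(np.angle(pulses[0]) - true_phase))
coherent = pulses.mean()                  # coherent integration across pulses
integrated_err = np.rad2deg(np.abs(np.angle(coherent) - true_phase))

print(f"single-pulse phase error:   {single_err:.2f} deg")
print(f"after coherent integration: {integrated_err:.2f} deg")
```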