Deep Learning Based Speech Enhancement Technology

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: 31 August 2024 | Viewed by 1383

Special Issue Editor

School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK
Interests: signal processing; machine learning; robot perception

Special Issue Information

Dear Colleagues,

Speech enhancement aims to improve the quality of speech degraded by environmental noise using signal-processing techniques, and it underpins many applications such as voice communication, hearing aids, speech recognition and human–robot interaction. In recent years, research in speech enhancement has advanced significantly with deep learning and artificial intelligence techniques. When sufficient training data are available, deep neural networks can learn to predict the clean speech from the noisy signal, achieving promising results in non-stationary and highly noisy acoustic environments. For this reason, deep learning-based speech enhancement has been investigated intensively and has become a focal point in the field of speech processing. Recent methods aim to solve speech enhancement in extremely challenging environments, develop new deep architectures, improve the generality and explainability of deep models, incorporate deep learning into multi-channel signal processing, and extend enhancement to multi-modal settings. This Special Issue aims to accelerate research progress by reporting the latest theoretical and practical advances in applying deep learning to speech enhancement, and by discussing emerging problems, creative solutions and novel insights in the field. It will mainly focus on (but is not limited to) the following deep learning-related topics:

  • single-channel speech enhancement;
  • multi-channel speech enhancement;
  • multi-modal speech enhancement;
  • explainable speech enhancement;
  • novel application of speech enhancement.
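As context for the single-channel topic above: many deep learning enhancers are trained to predict a time-frequency mask that is applied to the noisy spectrogram to suppress noise-dominated bins. Below is a minimal NumPy sketch of the ideal ratio mask (a common oracle training target) and the masking step; the toy magnitudes and function names are illustrative assumptions, not drawn from this call:

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """Oracle time-frequency mask used as a training target in many
    mask-based enhancement systems: |S| / (|S| + |N|) per bin."""
    return clean_mag / (clean_mag + noise_mag + eps)

def apply_mask(noisy_mag, mask):
    """Enhancement step: scale each noisy magnitude bin by the mask
    (in a real system the mask comes from a trained DNN)."""
    return noisy_mag * mask

# Toy spectrogram magnitudes (frequency bins x frames)
clean = np.array([[1.0, 0.5], [0.2, 0.8]])
noise = np.array([[0.5, 0.5], [0.8, 0.2]])
noisy = clean + noise  # magnitudes are not strictly additive; toy assumption

mask = ideal_ratio_mask(clean, noise)
enhanced = apply_mask(noisy, mask)
```

In a deployed system the mask is produced by the trained network from noisy features alone; the oracle mask is only computable during training, where the clean and noise signals are known separately.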

Dr. Lin Wang
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, use the online submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for the submission of manuscripts are available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss francs). Submitted papers should be well formatted and written in good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • deep learning
  • explainable AI
  • microphone array
  • multi-modal speech processing
  • noise reduction
  • speech enhancement

Published Papers (1 paper)


Research

13 pages, 456 KiB  
Article
Robust Detection of Background Acoustic Scene in the Presence of Foreground Speech
by Siyuan Song, Yanjue Song and Nilesh Madhu
Appl. Sci. 2024, 14(2), 609; https://doi.org/10.3390/app14020609 - 10 Jan 2024
Viewed by 656
Abstract
The characterising sound required for an Acoustic Scene Classification (ASC) system is contained in the ambient signal. In practice, however, this signal is often distorted by, e.g., the foreground speech of speakers in the surroundings. Previously, based on the iVector framework, we proposed different strategies to improve the classification accuracy when foreground speech is present. In this paper, we extend these methods to deep-learning (DL)-based ASC systems to improve their robustness to foreground speech. ResNet models are proposed as the baseline, in combination with multi-condition training at different signal-to-background ratios (SBRs). For further robustness, we first investigate noise-floor-based Mel-FilterBank Energies (NF-MFBE) as the input feature of the ResNet model. Next, speech presence information obtained from a speech enhancement (SE) system is incorporated within the ASC framework. As the speech presence information is time-frequency specific, it allows the network to learn to distinguish better between background-signal regions and foreground speech. While the proposed modifications improve the performance of ASC systems when foreground speech is dominant, performance is slightly worse in scenarios with low-level or absent foreground speech. Therefore, as a final consideration, ensemble methods are introduced to integrate the classification scores of the different models in a weighted manner. The experimental study systematically validates the contribution of each proposed modification, and it is shown that, with the proposed input features and meta-learner, the final system improves classification accuracy at all tested SBRs. At an SBR of 20 dB in particular, absolute improvements of up to 9% are obtained.
(This article belongs to the Special Issue Deep Learning Based Speech Enhancement Technology)
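The noise-floor-based feature in this abstract can be pictured with a minimum-statistics-style sketch: track the per-bin spectral minimum over recent frames (so brief foreground speech bursts are excluded) and compute band log-energies of that floor. This is a hedged illustration of the general idea only; the window length, the crude contiguous bands standing in for Mel filters, and all names are assumptions, not the authors' exact NF-MFBE definition:

```python
import numpy as np

def noise_floor(power_spec, win=8):
    """Minimum-statistics-style noise-floor estimate: for each frequency
    bin, take the minimum power over a sliding window of past frames.
    (A hypothetical stand-in for the noise-floor tracking in the paper.)"""
    n_bins, n_frames = power_spec.shape
    floor = np.empty_like(power_spec)
    for t in range(n_frames):
        lo = max(0, t - win + 1)
        floor[:, t] = power_spec[:, lo:t + 1].min(axis=1)
    return floor

def band_log_energies(power_spec, n_bands=4, eps=1e-10):
    """Log energies over contiguous frequency bands, a crude stand-in
    for Mel filterbank energies (a real system would use triangular
    Mel-spaced filters)."""
    bands = np.array_split(power_spec, n_bands, axis=0)
    return np.log(np.stack([b.sum(axis=0) for b in bands]) + eps)

# Toy power spectrogram: steady background plus a brief speech burst
bg = np.full((16, 20), 0.1)
spec = bg.copy()
spec[:, 8:12] += 2.0  # foreground speech frames
nf_mfbe = band_log_energies(noise_floor(spec))
```

Because the toy speech burst is shorter than the minimum-tracking window, the estimated floor stays at the background level, so the resulting features are largely unaffected by the foreground speech, which is the intuition behind using such features for scene classification.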
