Deep Learning for Applications in Acoustics: Modeling, Synthesis, and Listening

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: closed (31 July 2020) | Viewed by 53535

Special Issue Editors


Dr. Leonardo Gabrielli
Guest Editor
Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy

Dr. George Fazekas
Guest Editor
Queen Mary University of London, London, UK

Dr. Juhan Nam
Guest Editor
Korea Advanced Institute of Science and Technology, Korea

Special Issue Information

Dear Colleagues,

The recent introduction of Deep Learning has led to a vast array of breakthroughs in many fields of science and engineering. The data-driven approach has attracted the attention of research communities and has often been successful in yielding solutions to very complex classification and regression problems.

In the fields of audio analysis, audio processing, and acoustic modelling, Deep Learning was initially adopted by borrowing methods from the image processing and computer vision field, and has since developed creative and innovative solutions to suit the domain-specific needs of acoustic research. In this process, researchers face two big challenges: learning meaningful spatio-temporal representations of audio signals and making sense of the black-box nature of neural networks, i.e., extracting knowledge that is useful for scientific advancement.

In this Special Issue, we welcome the submission of papers dealing with novel computational methods involving the modelling, parametrization, and knowledge extraction of acoustic data. Topics of interest include, among others:

  • Applications of Deep Learning to sound synthesis
  • Control and estimation problems in physical modeling
  • Intelligent music production and novel digital audio effects
  • Representation learning and/or transfer of musical composition and performance characteristics, including timbre, style, and playing technique
  • Analysis and modelling of acoustic phenomena, including musical acoustics, speech signals, room acoustics, and environmental, ecological, medical, and machine sounds
  • Machine listening and perception models inspired by human hearing
  • Application of Deep Learning to wave propagation problems in fluids and solids

We aim to foster good research practices in Deep Learning. Considering current scientific and ethical concerns with Deep Learning, including reproducibility and explainability, we strongly support works that are based on open datasets and source code, works that follow a rigorous scientific method, and works that provide evidence and explanations for the observed phenomena.

Dr. Leonardo Gabrielli
Dr. George Fazekas
Dr. Juhan Nam
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website and then going to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Deep learning
  • Sound synthesis
  • Machine listening
  • Audio signal processing
  • Sound event detection
  • Acoustic modelling
  • Digital audio effects
  • Audio style transfer

Published Papers (11 papers)


Editorial

Jump to: Research, Review

4 pages, 164 KiB  
Editorial
Special Issue on Deep Learning for Applications in Acoustics: Modeling, Synthesis, and Listening
by Leonardo Gabrielli, György Fazekas and Juhan Nam
Appl. Sci. 2021, 11(2), 473; https://doi.org/10.3390/app11020473 - 06 Jan 2021
Cited by 2 | Viewed by 2197
Abstract
The recent introduction of Deep Learning has led to a vast array of breakthroughs in many fields of science and engineering [...] Full article

Research

Jump to: Editorial, Review

16 pages, 2685 KiB  
Article
Synthesis of Normal Heart Sounds Using Generative Adversarial Networks and Empirical Wavelet Transform
by Pedro Narváez and Winston S. Percybrooks
Appl. Sci. 2020, 10(19), 7003; https://doi.org/10.3390/app10197003 - 08 Oct 2020
Cited by 13 | Viewed by 3726
Abstract
Currently, there are many works in the literature focused on the analysis of heart sounds, specifically on the development of intelligent systems for the classification of normal and abnormal heart sounds. However, the available heart sound databases are not yet large enough to train generalized machine learning models. Therefore, there is interest in the development of algorithms capable of generating heart sounds that could augment current databases. In this article, we propose a model based on generative adversarial networks (GANs) to generate normal synthetic heart sounds. Additionally, a denoising algorithm is implemented using the empirical wavelet transform (EWT), allowing a decrease in the number of epochs and the computational cost that the GAN model requires. A distortion metric (mel-cepstral distortion) was used to objectively assess the quality of the synthetic heart sounds. The proposed method compared favorably with a state-of-the-art mathematical model based on the morphology of the phonocardiography (PCG) signal. Additionally, several state-of-the-art heart sound classification models were used to test their performance when the GAN-generated synthetic signals were used as the test dataset. In this experiment, good accuracy results were obtained with most of the implemented models, suggesting that the GAN-generated sounds correctly capture the characteristics of natural heart sounds.
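
As an illustration of the evaluation metric mentioned in this abstract, the following is a minimal Python sketch of a mel-cepstral distortion computation between a natural and a synthetic phonocardiogram. It uses librosa MFCCs as a stand-in for the authors' cepstral features; the file names and parameter choices are assumptions, not the paper's exact implementation.

```python
# Minimal sketch of mel-cepstral distortion (MCD) between two signals.
# librosa MFCCs stand in for the paper's cepstral features.
import numpy as np
import librosa

def mel_cepstral_distortion(ref, syn, sr, n_mfcc=13):
    """Frame-averaged MCD (dB) between two signals at the same sampling rate."""
    c_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)
    c_syn = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc)
    n = min(c_ref.shape[1], c_syn.shape[1])          # align frame counts
    diff = c_ref[1:, :n] - c_syn[1:, :n]              # drop the energy term c0
    return np.mean(10.0 / np.log(10) * np.sqrt(2.0 * np.sum(diff ** 2, axis=0)))

# Usage with hypothetical file names:
# ref, sr = librosa.load("natural_pcg.wav", sr=None)
# syn, _  = librosa.load("gan_pcg.wav", sr=sr)
# print(mel_cepstral_distortion(ref, syn, sr))
```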

24 pages, 2907 KiB  
Article
BassNet: A Variational Gated Autoencoder for Conditional Generation of Bass Guitar Tracks with Learned Interactive Control
by Maarten Grachten, Stefan Lattner and Emmanuel Deruty
Appl. Sci. 2020, 10(18), 6627; https://doi.org/10.3390/app10186627 - 22 Sep 2020
Cited by 4 | Viewed by 4798
Abstract
Deep learning has given AI-based methods for music creation a boost over the past years. An important challenge in this field is to balance user control and autonomy in music generation systems. In this work, we present BassNet, a deep learning model for generating bass guitar tracks based on musical source material. An innovative aspect of our work is that the model is trained to learn a temporally stable, two-dimensional latent space variable that offers interactive user control. We empirically show that the model can disentangle bass patterns that require sensitivity to harmony, instrument timbre, and rhythm. An ablation study reveals that this capability is due to the temporal stability constraint on latent space trajectories during training. We also demonstrate that models trained on pop/rock music learn a latent space that offers control over the diatonic characteristics of the output, among other things. Lastly, we present and discuss generated bass tracks for three different music fragments. The work presented here is a step toward the integration of AI-based technology into the workflow of musical content creators.
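
To make the notion of a temporal stability constraint concrete, here is a minimal PyTorch sketch of one way such a penalty on latent trajectories could be written. The two-dimensional latent matches the abstract; everything else (tensor shapes, weighting, the rest of the model) is an assumption, not BassNet's actual training code.

```python
# Sketch of a temporal stability penalty on a 2-D latent trajectory:
# consecutive latent frames are penalised for moving too far apart.
import torch
import torch.nn.functional as F

def temporal_stability_penalty(z: torch.Tensor) -> torch.Tensor:
    """Mean squared difference between consecutive latent frames.
    z is assumed to have shape (batch, time, 2)."""
    return F.mse_loss(z[:, 1:, :], z[:, :-1, :])

# Toy check: a smooth trajectory yields a smaller penalty than a noisy one.
smooth = torch.cumsum(0.01 * torch.randn(1, 100, 2), dim=1)
noisy = torch.randn(1, 100, 2)
print(temporal_stability_penalty(smooth), temporal_stability_penalty(noisy))
```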

21 pages, 2050 KiB  
Article
Assistive Model to Generate Chord Progressions Using Genetic Programming with Artificial Immune Properties
by María Navarro-Cáceres, Javier Félix Merchán Sánchez-Jara, Valderi Reis Quietinho Leithardt and Raúl García-Ovejero
Appl. Sci. 2020, 10(17), 6039; https://doi.org/10.3390/app10176039 - 31 Aug 2020
Cited by 1 | Viewed by 2975
Abstract
In Western tonal music, tension in chord progressions plays an important role in defining the path that a musical composition should follow. The creation of chord progressions that reflect such tension profiles can be challenging for novice composers, as it depends on many subjective factors and is also regulated by multiple theoretical principles. This work presents ChordAIS-Gen, a tool that assists users in generating chord progressions that comply with a given tension profile. We propose an objective measure capable of capturing the tension profile of a chord progression according to different tonal music parameters, namely consonance, hierarchical tension, voice leading, and perceptual distance. This measure is optimized by a genetic programming algorithm combined with an artificial immune system called Opt-aiNet. Opt-aiNet is capable of finding multiple optima in parallel, resulting in multiple candidate solutions for the next chord in a sequence. To validate the objective function, we performed a listening test to evaluate the perceptual quality of the candidate solutions proposed by our system. Most listeners rated the chord progressions proposed by ChordAIS-Gen as better candidates than the discarded progressions. Thus, we propose using the objective values as a proxy for the perceptual evaluation of chord progressions, and we compare the performance of ChordAIS-Gen with other chord progression generators.
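
The sketch below illustrates, in hypothetical form, the kind of weighted multi-term tension objective the abstract describes: a candidate chord's fitness is its deviation from a target tension value, where tension aggregates consonance, hierarchical tension, voice leading, and perceptual distance. The term functions and weights are placeholders, not the measures defined in the paper.

```python
# Hypothetical weighted tension objective for evaluating a candidate chord.
from typing import Callable, Sequence

def tension(chord, context,
            terms: Sequence[Callable], weights: Sequence[float]) -> float:
    """Weighted sum of tension-related terms for a chord in its context."""
    return sum(w * t(chord, context) for w, t in zip(weights, terms))

def fitness(chord, context, target_tension, terms, weights) -> float:
    """Lower is better: deviation from the desired tension profile value."""
    return abs(tension(chord, context, terms, weights) - target_tension)

# Toy usage with dummy terms (each returning a value in [0, 1]):
terms = [lambda c, ctx: 0.2, lambda c, ctx: 0.5]
print(fitness(chord=None, context=None, target_tension=0.4,
              terms=terms, weights=[0.5, 0.5]))
```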

24 pages, 5190 KiB  
Article
A Comparison of Human against Machine-Classification of Spatial Audio Scenes in Binaural Recordings of Music
by Sławomir K. Zieliński, Hyunkook Lee, Paweł Antoniuk and Oskar Dadan
Appl. Sci. 2020, 10(17), 5956; https://doi.org/10.3390/app10175956 - 28 Aug 2020
Cited by 9 | Viewed by 3930
Abstract
The purpose of this paper is to compare the performance of human listeners against selected machine learning algorithms in the task of classifying spatial audio scenes in binaural recordings of music under practical conditions. Three scenes were subject to classification: (1) a music ensemble (a group of musical sources) located in the front, (2) a music ensemble located at the back, and (3) a music ensemble distributed around the listener. In the listening test, undertaken remotely over the Internet, human listeners reached a classification accuracy of 42.5%. For the listeners who passed the post-screening test, the accuracy was greater, approaching 60%. The same classification task was also undertaken automatically using four machine learning algorithms: a convolutional neural network, support vector machines, an extreme gradient boosting framework, and logistic regression. The machine learning algorithms substantially outperformed the human listeners, with the classification accuracy reaching 84% when tested under binaural-room-impulse-response (BRIR) matched conditions. However, when the algorithms were tested under the BRIR mismatched scenario, the accuracy obtained was comparable to that of the listeners who passed the post-screening test, implying that the machine learning algorithms' ability to perform in unknown electro-acoustic conditions still needs to be improved.
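
For readers unfamiliar with the non-neural classifiers in this comparison, the following scikit-learn sketch shows a support vector machine classifying three spatial scene labels from pre-computed binaural features. The random feature matrix is a placeholder for the cues extracted from BRIR-convolved music excerpts; it is not the paper's feature set.

```python
# Sketch: SVM classification of three spatial scenes (front, back, around)
# from placeholder binaural features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))        # 300 excerpts, 40 binaural features each
y = rng.integers(0, 3, size=300)      # 0 = front, 1 = back, 2 = around

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))   # ~chance level on random features
```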

24 pages, 2773 KiB  
Article
Low-Order Spherical Harmonic HRTF Restoration Using a Neural Network Approach
by Benjamin Tsui, William A. P. Smith and Gavin Kearney
Appl. Sci. 2020, 10(17), 5764; https://doi.org/10.3390/app10175764 - 20 Aug 2020
Cited by 3 | Viewed by 2652
Abstract
Spherical harmonic (SH) interpolation is a commonly used method to spatially up-sample sparse head-related transfer function (HRTF) datasets to denser HRTF datasets. However, depending on the number of sparse HRTF measurements and the SH order, this process can introduce distortions into the high-frequency representation of the HRTFs. This paper investigates whether it is possible to restore some of the distorted high-frequency HRTF components using machine learning algorithms. A combination of convolutional auto-encoder (CAE) and denoising auto-encoder (DAE) models is proposed to restore the high-frequency distortion in SH-interpolated HRTFs. Results were evaluated using both perceptual spectral difference (PSD) and localisation prediction models, both of which demonstrated significant improvement after the restoration process.
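
As background to the abstract, the sketch below shows a bare-bones spherical harmonic interpolation: SH coefficients are fitted to sparse measurement directions by least squares and then evaluated on a denser grid. The real-valued toy magnitudes on the horizontal plane and the chosen SH order are illustrative assumptions, not the paper's HRTF processing.

```python
# Sketch: least-squares SH fit on sparse directions, evaluation on a dense grid.
import numpy as np
from scipy.special import sph_harm

def sh_matrix(order, azi, col):
    """Complex SH basis evaluated at azimuth/colatitude angles (radians)."""
    cols = [sph_harm(m, n, azi, col)
            for n in range(order + 1) for m in range(-n, n + 1)]
    return np.stack(cols, axis=1)

order = 2
azi_sparse = np.linspace(0, 2 * np.pi, 12, endpoint=False)
col_sparse = np.full(12, np.pi / 2)            # measurements on the horizon
h_sparse = np.cos(2 * azi_sparse)              # toy "HRTF" magnitude data

Y = sh_matrix(order, azi_sparse, col_sparse)
coeffs, *_ = np.linalg.lstsq(Y, h_sparse.astype(complex), rcond=None)

azi_dense = np.linspace(0, 2 * np.pi, 72, endpoint=False)
h_dense = (sh_matrix(order, azi_dense, np.full(72, np.pi / 2)) @ coeffs).real
print(h_dense.shape)                           # (72,) up-sampled directions
```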

22 pages, 2689 KiB  
Article
Bioacoustic Classification of Antillean Manatee Vocalization Spectrograms Using Deep Convolutional Neural Networks
by Fernando Merchan, Ariel Guerra, Héctor Poveda, Héctor M. Guzmán and Javier E. Sanchez-Galan
Appl. Sci. 2020, 10(9), 3286; https://doi.org/10.3390/app10093286 - 08 May 2020
Cited by 11 | Viewed by 3632
Abstract
We evaluated the potential of using convolutional neural networks for classifying spectrograms of Antillean manatee (Trichechus manatus manatus) vocalizations. Spectrograms using binary, linear, and logarithmic amplitude formats were considered. Two deep convolutional neural network (DCNN) architectures were tested: linear (fixed filter size) and pyramidal (incremental filter size). Six experiments were devised to test the accuracy obtained for each combination of spectrogram representation and architecture. Results show that binary spectrograms with both linear and pyramidal architectures with dropout provide a classification rate of 94–99% on the training set and 92–98% on the testing set, respectively. The pyramidal network has shorter training and inference times. Results from the convolutional neural networks (CNN) are substantially better than those of a signal processing fast Fourier transform (FFT)-based harmonic search approach in terms of accuracy and F1 score. Taken together, these results demonstrate the validity of using spectrograms and DCNNs for manatee vocalization classification. These results can be used to improve future software and hardware implementations for the estimation of the manatee population in Panama.
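
The following PyTorch sketch gives a rough idea of a "pyramidal" convolutional classifier for fixed-size vocalization spectrograms, where the number of filters grows with depth. Layer widths, kernel sizes, dropout rate, and input shape are assumptions for illustration, not the architectures evaluated in the paper.

```python
# Sketch of a pyramidal CNN for binary classification of spectrograms.
import torch
import torch.nn as nn

class PyramidalCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Dropout(0.5),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes),
        )

    def forward(self, x):              # x: (batch, 1, freq_bins, time_frames)
        return self.classifier(self.features(x))

model = PyramidalCNN()
logits = model(torch.randn(4, 1, 128, 128))    # four dummy spectrograms
print(logits.shape)                             # torch.Size([4, 2])
```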

21 pages, 4045 KiB  
Article
Designing Audio Equalization Filters by Deep Neural Networks
by Giovanni Pepe, Leonardo Gabrielli, Stefano Squartini and Luca Cattani
Appl. Sci. 2020, 10(7), 2483; https://doi.org/10.3390/app10072483 - 04 Apr 2020
Cited by 16 | Viewed by 5090
Abstract
Audio equalization is an active research topic that aims at improving the audio quality of a loudspeaker system by correcting the overall frequency response using linear filters. The estimation of their coefficients is not an easy task, especially in binaural and multipoint scenarios, due to the contribution of multiple impulse responses to each listening point. This paper presents a deep learning approach for tuning filter coefficients employing three different neural network architectures: the Multilayer Perceptron, the Convolutional Neural Network, and the Convolutional Autoencoder. Suitable loss functions are proposed for each architecture and are formulated in terms of the spectral Euclidean distance. The experiments were conducted in an automotive scenario, considering several loudspeakers and microphones. The obtained results show that deep learning techniques give superior performance compared to baseline methods, achieving an almost flat magnitude frequency response.
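
To illustrate the kind of loss the abstract refers to, here is a minimal PyTorch sketch of a spectral Euclidean distance between the magnitude response of an FIR equalization filter cascaded with a measured impulse response and a flat target. The single-channel setting, filter length, and FFT size are simplifying assumptions rather than the paper's multipoint formulation.

```python
# Sketch: spectral Euclidean distance of an equalized response from a flat target.
import torch

def spectral_euclidean_loss(fir_coeffs, room_ir, n_fft=4096):
    """Distance of the equalized magnitude response from a flat (0 dB) target."""
    h_eq = torch.fft.rfft(fir_coeffs, n=n_fft)
    h_room = torch.fft.rfft(room_ir, n=n_fft)
    mag = torch.abs(h_eq * h_room)      # cascade = multiplication in frequency
    target = torch.ones_like(mag)
    return torch.linalg.vector_norm(mag - target)

fir = torch.cat([torch.ones(1), torch.zeros(255)]).requires_grad_()  # identity init
room_ir = torch.randn(1024) * 0.1       # placeholder measured impulse response
room_ir[0] = 1.0
print(spectral_euclidean_loss(fir, room_ir))   # differentiable w.r.t. fir
```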

22 pages, 2090 KiB  
Article
An Analysis of Rhythmic Patterns with Unsupervised Learning
by Matevž Pesek, Aleš Leonardis and Matija Marolt
Appl. Sci. 2020, 10(1), 178; https://doi.org/10.3390/app10010178 - 25 Dec 2019
Cited by 7 | Viewed by 6138
Abstract
This paper presents a model capable of learning the rhythmic characteristics of a music signal through unsupervised learning. The model learns a multi-layer hierarchy of rhythmic patterns ranging from simple structures on lower layers to more complex patterns on higher layers. The learned hierarchy is fully transparent, which enables observation and explanation of the structure of the learned patterns. The model employs a tempo-invariant encoding of patterns and can thus learn and perform inference on tempo-varying and noisy input data. We demonstrate the model's capability of learning distinctive rhythmic structures of different music genres using unsupervised learning. To test its robustness, we show how the model can efficiently extract rhythmic structures from songs with changing time signatures and from live recordings. Additionally, the model's time complexity is empirically tested to show its usability for analysis-related applications.
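
As a simple illustration of tempo-invariant rhythm encoding (not the hierarchical model proposed in the paper), the numpy sketch below maps onset times to positions on a fixed grid of beat subdivisions, so that the same rhythm played at different tempi yields the same pattern.

```python
# Sketch: tempo-invariant encoding of a one-bar rhythmic pattern.
import numpy as np

def tempo_invariant_pattern(onset_times, beat_period, subdivisions=16):
    """Quantize onsets to a fixed grid of subdivisions of a 4-beat bar."""
    phases = (np.asarray(onset_times) / beat_period) % 4.0   # position in beats
    bins = np.floor(phases / 4.0 * subdivisions).astype(int)
    pattern = np.zeros(subdivisions)
    pattern[bins] = 1.0
    return pattern

# The same rhythm at two tempi maps to the same pattern:
slow = tempo_invariant_pattern([0.0, 0.5, 1.0, 1.75], beat_period=0.5)
fast = tempo_invariant_pattern([0.0, 0.25, 0.5, 0.875], beat_period=0.25)
print(np.array_equal(slow, fast))    # True
```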

13 pages, 2806 KiB  
Article
Noise-Robust Voice Conversion Using High-Quefrency Boosting via Sub-Band Cepstrum Conversion and Fusion
by Xiaokong Miao, Meng Sun, Xiongwei Zhang and Yimin Wang
Appl. Sci. 2020, 10(1), 151; https://doi.org/10.3390/app10010151 - 23 Dec 2019
Cited by 10 | Viewed by 3413
Abstract
This paper presents a noise-robust voice conversion method with high-quefrency boosting via sub-band cepstrum conversion and fusion, based on bidirectional long short-term memory (BLSTM) neural networks that convert the vocal tract parameters of a source speaker into those of a target speaker. With the implementation of state-of-the-art machine learning methods, voice conversion has achieved good performance given abundant clean training data. However, the quality and similarity of the converted voice are significantly degraded compared to those of a natural target voice due to various factors, such as limited training data and noisy input speech from the source speaker. To address the problem of noisy input speech, an architecture of voice conversion with statistical filtering and sub-band cepstrum conversion and fusion is introduced. The impact of noise on the converted voice is reduced by the accurate reconstruction of the sub-band cepstrum and the subsequent statistical filtering. By normalizing the mean and variance of the converted cepstrum to those of the target cepstrum in the training phase, a cepstrum filter was constructed to further improve the quality of the converted voice. The experimental results showed that the proposed method significantly improved the naturalness and similarity of the converted voice compared to the baselines, even with noisy inputs from the source speakers.
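
The central mapping step described in the abstract can be sketched as a bidirectional LSTM that converts source-speaker cepstral (sub-band) features into target-speaker features, as in the minimal PyTorch example below. Feature dimensions and layer sizes are assumptions, and the statistical filtering and fusion stages are omitted.

```python
# Sketch: BLSTM mapping from source-speaker cepstral frames to target-speaker frames.
import torch
import torch.nn as nn

class BLSTMConverter(nn.Module):
    def __init__(self, feat_dim=24, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, feat_dim)   # 2x for both directions

    def forward(self, source_cepstrum):               # (batch, frames, feat_dim)
        hidden_states, _ = self.blstm(source_cepstrum)
        return self.proj(hidden_states)                # converted cepstrum

model = BLSTMConverter()
converted = model(torch.randn(2, 200, 24))            # two utterances, 200 frames
print(converted.shape)                                 # torch.Size([2, 200, 24])
```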

Review

Jump to: Editorial, Research

16 pages, 231 KiB  
Review
A Review of Deep Learning Based Methods for Acoustic Scene Classification
by Jakob Abeßer
Appl. Sci. 2020, 10(6), 2020; https://doi.org/10.3390/app10062020 - 16 Mar 2020
Cited by 97 | Viewed by 11841
Abstract
The number of publications on acoustic scene classification (ASC) in environmental audio recordings has constantly increased over the last few years. This growth was mainly stimulated by the annual Detection and Classification of Acoustic Scenes and Events (DCASE) competition, first held in 2013. All competitions so far have involved one or multiple ASC tasks. With a focus on deep learning based ASC algorithms, this article summarizes and groups existing approaches for data preparation, i.e., feature representations, feature pre-processing, and data augmentation, and for data modeling, i.e., neural network architectures and learning paradigms. Finally, the paper discusses current algorithmic limitations and open challenges in order to preview possible future developments towards the real-life application of ASC systems.
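
As a concrete example of the data-preparation stage surveyed in this review, the sketch below computes a log-mel spectrogram with librosa, the feature representation most commonly fed to ASC networks. Parameter values and the file name are illustrative assumptions.

```python
# Sketch: log-mel spectrogram extraction, a typical ASC front end.
import numpy as np
import librosa

def log_mel(path, sr=44100, n_fft=2048, hop_length=1024, n_mels=64):
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)       # (n_mels, frames) in dB

# features = log_mel("street_traffic_scene.wav")      # hypothetical recording
# The features would then be standardized, possibly augmented (e.g., mixup or
# SpecAugment), and passed to a neural network classifier.
```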
