Special Issue "Machine Learning Applied to Music/Audio Signal Processing"

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Circuit and Signal Processing".

Deadline for manuscript submissions: 31 August 2021.

Special Issue Editors

Dr. Alexander Lerch
Guest Editor
Center for Music Technology, Georgia Institute of Technology, Atlanta, GA 30332, USA
Interests: audio content analysis; music information retrieval; semantic audio; audio signal processing; music generation
Dr. Peter Knees
Guest Editor
Faculty of Informatics, Institute of Information Systems Engineering, TU Wien Informatics, Favoritenstraße 9-11, 1040 Vienna, Austria
Interests: information retrieval; user interfaces; music information retrieval; recommender systems; artificial intelligence; machine learning

Special Issue Information

Dear Colleagues,

The applications of audio and music processing range from music discovery and recommendation systems, through speech enhancement, audio event detection, and music transcription, to creative applications such as sound synthesis and morphing.

The last decade has seen a paradigm shift from expert-designed algorithms to data-driven approaches. Machine learning approaches, and Deep Neural Networks specifically, have been shown to outperform traditional approaches on a large variety of tasks including audio classification, source separation, enhancement, and content analysis. With data-driven approaches, however, came a set of new challenges. Two of these challenges are training data and interpretability. As supervised machine learning approaches increase in complexity, the growing need for annotated training data often cannot be met by the data available. The lack of understanding of how data are modeled by neural networks can lead to unexpected results and open up vulnerabilities to adversarial attacks.

The main aim of this Special Issue is to seek high-quality submissions that present novel data-driven methods for audio/music signal processing and analysis and address the main challenges of applying machine learning to audio signals. Within the general area of audio and music information retrieval as well as audio and music processing, the topics of interest include, but are not limited to, the following:

  • unsupervised and semi-supervised systems for audio/music processing and analysis
  • machine learning methods for raw audio signal analysis and transformation
  • approaches to understanding and controlling the behavior of audio processing systems such as visualization, auralization, or regularization methods
  • generative systems for sound synthesis and transformation
  • adversarial attacks and the identification of 'deepfakes' in audio and music
  • audio and music style transfer methods
  • audio recording and music production parameter estimation
  • data collection methods, active learning, and interactive machine learning for data-driven approaches

Dr. Peter Knees
Dr. Alexander Lerch
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • music information retrieval
  • machine learning for audio
  • intelligent audio signal processing
  • audio analysis and transformation

Published Papers (12 papers)


Research

Article
User-Driven Fine-Tuning for Beat Tracking
Electronics 2021, 10(13), 1518; https://doi.org/10.3390/electronics10131518 - 23 Jun 2021
Viewed by 243
Abstract
The extraction of the beat from musical audio signals represents a foundational task in the field of music information retrieval. While great advances in performance have been achieved due to the use of deep neural networks, significant shortcomings still remain. In particular, performance is generally much lower on musical content that differs from that which is contained in existing annotated datasets used for neural network training, as well as in the presence of challenging musical conditions such as rubato. In this paper, we positioned our approach to beat tracking from a real-world perspective where an end-user targets very high accuracy on specific music pieces and for which the current state of the art is not effective. To this end, we explored the use of targeted fine-tuning of a state-of-the-art deep neural network based on a very limited temporal region of annotated beat locations. We demonstrated the success of our approach via improved performance across existing annotated datasets and a new annotation-correction approach for evaluation. Furthermore, we highlighted the ability of content-specific fine-tuning to learn both what is and what is not the beat in challenging musical conditions.
(This article belongs to the Special Issue Machine Learning Applied to Music/Audio Signal Processing)

Article
Analyzing and Visualizing Deep Neural Networks for Speech Recognition with Saliency-Adjusted Neuron Activation Profiles
Electronics 2021, 10(11), 1350; https://doi.org/10.3390/electronics10111350 - 05 Jun 2021
Viewed by 516
Abstract
Deep Learning-based Automatic Speech Recognition (ASR) models are very successful, but hard to interpret. To gain a better understanding of how Artificial Neural Networks (ANNs) accomplish their tasks, several introspection methods have been proposed. However, established introspection techniques are mostly designed for computer vision tasks and rely on the data being visually interpretable, which limits their usefulness for understanding speech recognition models. To overcome this limitation, we developed a novel neuroscience-inspired technique for visualizing and understanding ANNs, called Saliency-Adjusted Neuron Activation Profiles (SNAPs). SNAPs are a flexible framework for analyzing and visualizing Deep Neural Networks that does not depend on visually interpretable data. In this work, we demonstrate how to utilize SNAPs for understanding fully convolutional ASR models. This includes visualizing acoustic concepts learned by the model and the comparative analysis of their representations in the model layers.

Article
Stochastic Restoration of Heavily Compressed Musical Audio Using Generative Adversarial Networks
Electronics 2021, 10(11), 1349; https://doi.org/10.3390/electronics10111349 - 05 Jun 2021
Viewed by 602
Abstract
Lossy audio codecs compress (and decompress) digital audio streams by removing information that tends to be inaudible in human perception. Under high compression rates, such codecs may introduce a variety of impairments in the audio signal. Many works have tackled the problem of audio enhancement and compression artifact removal using deep-learning techniques. However, only a few works tackle the restoration of heavily compressed audio signals in the musical domain. In such a scenario, there is no unique solution for the restoration of the original signal. Therefore, in this study, we test a stochastic generator of a Generative Adversarial Network (GAN) architecture for this task. Such a stochastic generator, conditioned on highly compressed musical audio signals, could one day generate outputs indistinguishable from high-quality releases. Therefore, the present study may yield insights into more efficient musical data storage and transmission. We train stochastic and deterministic generators on MP3-compressed audio signals at 16, 32, and 64 kbit/s. We perform an extensive evaluation of the different experiments utilizing objective metrics and listening tests. We find that the models can improve the quality of the audio signals over the MP3 versions for 16 and 32 kbit/s and that the stochastic generators are capable of generating outputs that are closer to the original signals than those of the deterministic generators.

Article
Singing Voice Detection in Opera Recordings: A Case Study on Robustness and Generalization
Electronics 2021, 10(10), 1214; https://doi.org/10.3390/electronics10101214 - 20 May 2021
Viewed by 338
Abstract
Automatically detecting the presence of singing in music audio recordings is a central task within music information retrieval. While modern machine-learning systems produce high-quality results on this task, the reported experiments are usually limited to popular music and the trained systems often overfit to confounding factors. In this paper, we aim to gain a deeper understanding of such machine-learning methods and investigate their robustness in a challenging opera scenario. To this end, we compare two state-of-the-art methods for singing voice detection based on supervised learning: a traditional approach relying on hand-crafted features with a random forest classifier, and a deep-learning approach relying on convolutional neural networks. To evaluate these algorithms, we make use of a cross-version dataset comprising 16 recorded performances (versions) of Richard Wagner’s four-opera cycle Der Ring des Nibelungen. This scenario allows us to systematically investigate generalization to unseen versions, musical works, or both. In particular, we study the trained systems’ robustness depending on the acoustic and musical variety, as well as the overall size of the training dataset. Our experiments show that both systems can robustly detect singing voice in opera recordings even when trained on relatively small datasets with little variety.

Article
Informing Piano Multi-Pitch Estimation with Inferred Local Polyphony Based on Convolutional Neural Networks
Electronics 2021, 10(7), 851; https://doi.org/10.3390/electronics10070851 - 02 Apr 2021
Viewed by 593
Abstract
In this work, we propose considering information about the degree of polyphony for multi-pitch estimation (MPE) in piano music recordings. To that aim, we propose a method for local polyphony estimation (LPE), which is based on convolutional neural networks (CNNs) trained in a supervised fashion to explicitly predict the degree of polyphony. We investigate two feature representations as inputs to our method, in particular, the Constant-Q Transform (CQT) and its recent extension Folded-CQT (F-CQT). To evaluate the performance of our method, we conduct a series of experiments on real and synthetic piano recordings based on the MIDI Aligned Piano Sounds (MAPS) and the Saarland Music Data (SMD) datasets. We compare our approaches with a state-of-the-art piano transcription method by informing said method with the LPE knowledge in a postprocessing stage. The experimental results suggest that using explicit LPE information can refine MPE predictions. Furthermore, it is shown that, on average, the CQT representation is preferred over F-CQT for LPE.

Article
An Interpretable Deep Learning Model for Automatic Sound Classification
Electronics 2021, 10(7), 850; https://doi.org/10.3390/electronics10070850 - 02 Apr 2021
Cited by 1 | Viewed by 1055
Abstract
Deep learning models have improved cutting-edge technologies in many research areas, but their black-box structure makes it difficult to understand their inner workings and the rationale behind their predictions. This may lead to unintended effects, such as being susceptible to adversarial attacks or the reinforcement of biases. There is still a lack of research in the audio domain, despite the increasing interest in developing deep learning models that provide explanations of their decisions. To reduce this gap, we propose a novel interpretable deep learning model for automatic sound classification, which explains its predictions based on the similarity of the input to a set of learned prototypes in a latent space. We leverage domain knowledge by designing a frequency-dependent similarity measure and by considering different time-frequency resolutions in the feature space. The proposed model achieves results that are comparable to those of state-of-the-art methods in three different sound classification tasks involving speech, music, and environmental audio. In addition, we present two automatic methods to prune the proposed model that exploit its interpretability. Our system is open source and it is accompanied by a web application for the manual editing of the model, which allows for a human-in-the-loop debugging approach.

Article
Investigating the Effects of Training Set Synthesis for Audio Segmentation of Radio Broadcast
Electronics 2021, 10(7), 827; https://doi.org/10.3390/electronics10070827 - 31 Mar 2021
Viewed by 837
Abstract
Music and speech detection provides valuable information regarding the nature of content in broadcast audio. It helps detect acoustic regions that contain speech, voice over music, only music, or silence. In recent years, there have been developments in machine learning algorithms to accomplish this task. However, broadcast audio is generally well-mixed and copyrighted, which makes it challenging to share across research groups. In this study, we address the challenges encountered in automatically synthesising data that resembles a radio broadcast. Firstly, we compare state-of-the-art neural network architectures such as CNN, GRU, LSTM, TCN, and CRNN. Secondly, we investigate how audio ducking of background music impacts the precision and recall of the machine learning algorithm. Thirdly, we examine how the quantity of synthetic training data impacts the results. Finally, we evaluate the effectiveness of synthesised, real-world, and combined approaches for training models, to understand if the synthetic data presents any additional value. Amongst the network architectures, CRNN was the best performing network. Results also show that the minimum level of audio ducking preferred by the machine learning algorithm was similar to that of human listeners. After testing our model on in-house and public datasets, we observe that our proposed synthesis technique outperforms real-world data in some cases and serves as a promising alternative.

Article
A Comparison of Deep Learning Methods for Timbre Analysis in Polyphonic Automatic Music Transcription
Electronics 2021, 10(7), 810; https://doi.org/10.3390/electronics10070810 - 29 Mar 2021
Viewed by 694
Abstract
Automatic music transcription (AMT) is a critical problem in the field of music information retrieval (MIR). When AMT is approached with deep neural networks, the variety of timbres of different instruments can be an issue that has not yet been studied in depth. The goal of this work is to address AMT in two steps. A first approach, based on the CREPE neural network, analyzes how timbre affects monophonic transcription. A second approach, based on the Deep Salience model, which performs polyphonic transcription based on the Constant-Q Transform, aims to improve the results by performing polyphonic music transcription with different timbres. The results of the first method show that the timbre and envelope of the onsets have a high impact on the AMT results. The second method shows that the developed model is less dependent on the strength of the onsets than other state-of-the-art models that deal with AMT on piano sounds, such as Google Magenta Onsets and Frames (OaF). Our polyphonic transcription model for non-piano instruments outperforms the state-of-the-art model; for bass instruments, for example, it has an F-score of 0.9516 versus 0.7102. In our latest experiment, we also show how adding an onset detector to our model can improve on the results given in this work.

Article
Jazz Bass Transcription Using a U-Net Architecture
Electronics 2021, 10(6), 670; https://doi.org/10.3390/electronics10060670 - 12 Mar 2021
Viewed by 598
Abstract
In this paper, we adapt a recently proposed U-net deep neural network architecture from melody to bass transcription. We investigate pitch shifting and random equalization as data augmentation techniques. In a parameter importance study, we examine the influence of the skip connection strategy between the encoder and decoder layers, the data augmentation strategy, and the overall model capacity on the system’s performance. Using a training set that covers various music genres and a validation set that includes jazz ensemble recordings, we obtain the best transcription performance for a downscaled version of the reference algorithm combined with skip connections that transfer intermediate activations between the encoder and decoder. The U-net-based method outperforms previous knowledge-driven and data-driven bass transcription algorithms by around five percentage points in overall accuracy. In addition to a pitch estimation improvement, the voicing estimation performance is clearly enhanced.

Article
Vocal Melody Extraction via HRNet-Based Singing Voice Separation and Encoder-Decoder-Based F0 Estimation
Electronics 2021, 10(3), 298; https://doi.org/10.3390/electronics10030298 - 26 Jan 2021
Viewed by 555
Abstract
Vocal melody extraction is an important and challenging task in music information retrieval. One main difficulty is that, most of the time, various instruments and singing voices are mixed according to harmonic structure, making it hard to identify the fundamental frequency (F0) of a singing voice. Therefore, reducing the interference of accompaniment is beneficial to pitch estimation of the singing voice. In this paper, we first adopted a high-resolution network (HRNet) to separate vocals from polyphonic music, then designed an encoder-decoder network to estimate the vocal F0 values. Experimental results demonstrate the effectiveness of the HRNet-based singing voice separation method in reducing the interference of accompaniment on the extraction of vocal melody, and show that the proposed vocal melody extraction (VME) system outperforms other state-of-the-art algorithms in most cases.

Article
Sigmoidal NMFD: Convolutional NMF with Saturating Activations for Drum Mixture Decomposition
Electronics 2021, 10(3), 284; https://doi.org/10.3390/electronics10030284 - 25 Jan 2021
Cited by 1 | Viewed by 558
Abstract
In many types of music, percussion plays an essential role in establishing the rhythm and the groove of the music. Algorithms that can decompose the percussive signal into its constituent components would therefore be very useful, as they would enable many analytical and creative applications. This paper describes a method for the unsupervised decomposition of percussive recordings, building on the non-negative matrix factor deconvolution (NMFD) algorithm. Given a percussive music recording, NMFD discovers a dictionary of time-varying spectral templates and corresponding activation functions, representing its constituent sounds and their positions in the mix. We observe, however, that the activation functions discovered using NMFD do not show the expected impulse-like behavior for percussive instruments. We therefore enforce this behavior by specifying that the activations should take on binary values: either an instrument is hit, or it is not. To this end, we rewrite the activations as the output of a sigmoidal function, multiplied by a per-component amplitude factor. We furthermore define a regularization term that biases the decomposition to solutions with saturated activations, leading to the desired binary behavior. We evaluate several optimization strategies and techniques that are designed to avoid poor local minima. We show that incentivizing the activations to be binary indeed leads to the desired impulse-like behavior, and that the resulting components are better separated, leading to more interpretable decompositions.

Article
Research on Singing Voice Detection Based on a Long-Term Recurrent Convolutional Network with Vocal Separation and Temporal Smoothing
Electronics 2020, 9(9), 1458; https://doi.org/10.3390/electronics9091458 - 07 Sep 2020
Cited by 2 | Viewed by 944
Abstract
Singing voice detection or vocal detection is a classification task that determines whether a given audio segment contains singing voices. This task plays a very important role in vocal-related music information retrieval tasks, such as singer identification. Although humans can easily distinguish between singing and nonsinging parts, it is still very difficult for machines to do so. Most existing methods focus on audio feature engineering with classifiers, which rely on the experience of the algorithm designer. In recent years, deep learning has been widely used in computer hearing. To extract essential features that reflect the audio content and characterize the vocal context in the time domain, this study adopted a long-term recurrent convolutional network (LRCN) to realize vocal detection. The convolutional layer in the LRCN performs feature extraction, and the long short-term memory (LSTM) layer learns the temporal relationships. Preprocessing by singing voice and accompaniment separation and postprocessing by time-domain smoothing were combined to form a complete system. Experiments on five public datasets investigated the impact of different features for fusion, frame size, and block size on LRCN temporal relationship learning, as well as the effects of preprocessing and postprocessing on performance. The results confirm that the proposed singing voice detection algorithm reached the state-of-the-art level on public datasets.
