Blind Source Separation in Polyphonic Music Recordings Using Deep Neural Networks Trained via Policy Gradients

We propose a method for the blind separation of sounds of musical instruments in audio signals. We describe the individual tones via a parametric model, training a dictionary to capture the relative amplitudes of the harmonics. The model parameters are predicted via a U-Net, which is a type of deep neural network. The network is trained without ground truth information, based on the difference between the model prediction and the individual time frames of the short-time Fourier transform. Since some of the model parameters do not yield a useful backpropagation gradient, we model them stochastically and employ the policy gradient instead. To provide phase information and account for inaccuracies in the dictionary-based representation, we also let the network output a direct prediction, which we then use to resynthesize the audio signals for the individual instruments. Due to the flexibility of the neural network, inharmonicity can be incorporated seamlessly and no preprocessing of the input spectra is required. Our algorithm yields high-quality separation results with particularly low interference on a variety of different audio samples, both acoustic and synthetic, provided that the sample contains enough data for the training and that the spectral characteristics of the musical instruments are sufficiently stable to be approximated by the dictionary.


Introduction
We address the problem of unmixing the contributions of multiple different musical instruments from a single-channel audio recording.We assume that each instrument only plays a single musical tone at a time and that the sound of the instruments follows a stationary tone model aimed at woodwind, brass, and string instruments.
Since we perform blind separation, we do not make any prior assumptions specific to the sounds of the individual instruments, but we distinguish them based on the proximity to the entries of a dictionary which we learn in the process.
For the time-frequency representation of the audio signals, we use the sampled complex-valued output of the short-time Fourier transform, which can be interpreted as the analysis coefficients of a Gabor frame.This representation has the advantage of being perfectly linear and easy to project back to a time-domain signal, but it is not pitch-invariant; that is, the distance of the frequency axis corresponding to a certain musical interval varies based on the pitch of the tones.
The problem of identifying the pitch of the tones is non-convex on a global scale and possesses a large number of local minima.Therefore, general numerical optimization methods are not appropriate.Instead, we predict the parameters via a U-Net [1], which is a type of deep neural network.For the problematic parameters like pitch, we use policy gradients for training, which is a technique originating from deep reinforcement learning [cf.2].arXiv:2107.04235v2[eess.AS] 9 Aug 2021

Related Work
The audio source separation problem can be formulated in a variety of settings.(See [3][4][5] for a thorough overview.)For the purpose of this work, we only regard the case that the input signal is single-channel and the separation is therefore always underdetermined.Different algorithms can be used if multiple channels are available (typically corresponding to microphones simultaneously recording the audio scene).Also, we always assume melodic instruments rather than speech or percussive instruments.While different kinds of prior information can be considered (such as specific characteristics of the sounds of the instruments, training data, or the musical score), we concentrate on the blind case with no prior information and instead rely on a learned parametric model for the sounds of the individual instruments.
Many algorithms for this problem are based on non-negative matrix factorization (NMF) of the spectrogram.In the simplest form, each tone of an instrument at a particular pitch has its own representation as a dictionary atom [6,7].To make the representation of the sound of a particular instrument applicable at arbitrary pitch (pitch-invariance), one often employs tensor factorization [cf. 5].In this case, the use of a log-frequency spectrogram (such as the constant-Q transform or the mel spectrogram) can be helpful, since it is also pitch-invariant in the sense that changing the fundamental frequency of a tone merely causes a shift in the representation [8][9][10][11].
A separation approach that is mathematically equivalent to NMF is probabilistic latent component analysis (PLCA) [12,13] which also exists in variants that use a pitch-invariant model on top of a log-frequency spectrogram [14,15].The next step in abstraction is to model the individual harmonics separately while enforcing sparsity in the spectral representation [16].
Decreasing the variability, e.g., the number of parameters, in a model is generally beneficial since it reduces the risk of overfitting.However, we can go one step further by employing an explicit physical model for the tones of the instruments involving a minimum number of parameters.This has the additional advantage that while the previously discussed approaches with a log-frequency spectrogram can only be pitch-invariant if the instruments are tuned to the same log-frequency scale (equal-temperament tuning [cf.17] 1 ), a physical model can be evaluated at any fundamental frequencies and sampled arbitrarily.This is crucial when dealing with acoustic instruments that either deliberately deviate from equal temperament or might simply be slightly out of tune.
With a continuous model, it is thus possible to use a pitch-invariant representation on a linear-frequency spectrogram.However, the challenge then is to identify the fundamental frequencies on a continuous domain.Duan et al. [18] use a peak detection and clustering algorithm to reduce the problem to a combinatorial one that can be approached via appropriate heuristics.Hennequin et al. [19], in a polyphonic single-instrument setting, consider the fundamental frequency as an optimizable parameter and use an NMF-type update rule, but they remark that this is only possible on a local scale due to the high number of local minima.
Even with a physical model, it turns out that the log-frequency spectrogram can still be helpful: Schulze and King [20] use its pitch-invariance property to obtain the approximate fundamental frequency of a tone via simple cross-correlation.After that, numerical optimization is performed to improve the estimate on a local scale.The optimization procedure also incorporates inharmonicity, which violates pitch-invariance, and variable width of the peaks in the spectrum which can occur, for instance, at tone boundaries.However, as explained there, using a log-frequency spectrogram inevitably results in a loss in frequency resolution which in this case needs to be mitigated via heavy preprocessing.Also, phase information is lost completely.Therefore, if given the choice, we argue that a linear-frequency representation should be preferred.
Due to the general success of deep neural networks, it is not surprising that they have also been applied to audio source separation problems.In fact, when it comes to supervised separation (with labeled training data), they dominate the state of the art [21][22][23][24][25].While supervised training is the "classical" way in which neural networks are used, it was demonstrated by Ulyanov el al. with the deep image prior (DIP) approach [26] that the structure of (convolutional) neural networks is inherently useful for representing natural images.This technique was used for image decomposition by Gandelsman et al. via the double-DIP algorithm [27].Given this success, it was natural to also apply this method to audio data, leading to the deep audio prior approach by Tian et al. [28].Based on this, Narayanaswamy et al. [29] used generative adversarial networks (GANs) trained on unlabeled training data as priors, further improving the quality of the output signals.
The problem with all the previously presented separation algorithms based on unsupervised training of neural networks is that they make the assumption that the separated signals are stochastically independent.This case is often referred to as the cocktail party problem, but it is different from polyphonic music, in which the tones are usually both rhythmically and harmonically aligned.Therefore, rather than relying on general stastistical properties of the signals, we use deep neural networks in conjunction with a parametric model.
Policy gradients were pioneered within reinforcement learning with the REINFORCE algorithm [30], which is designed to train a neural network to predict a discrete variable.While reinforcement learning has progressed towards actor-critic methods [cf.2] and, famously, the AlphaGo Zero [31], AlphaZero [32], and MuZero [33] algorithms based on Monte Carlo tree search (MCTS), we stay relatively close to the original approach, but we extend the formulation by adding deterministic values, combining policy gradients with backpropagation gradients.

Tone Model
An idealized model for the tones of woodwind, brass, and string instruments is that of the wave equation, which is a hyperbolic 2nd-order partial differential equation.However, for certain string instruments, the stiffness in the strings is non-negligible, and this leads to the introduction of 4th-order terms which cause inharmonicity [cf. 34].The corresponding solution consists of real-valued sinusoids, which consequently also appear in the audio signal.However, due to we can also express them as complex exponentials.Since we do not need the negative exponential, our model of a tone of a musical instrument is as follows: with where h = 1, . . ., N har are the harmonics (with N har ∈ N), a h ∈ C is the complex amplitude, f h is the frequency for the specific harmonic, f • 1 > 0 is the fundamental frequency of the tone, and b ≥ 0 is the inharmonicity.
For illustration, an artificial application of the tone model is provided in Figure 1.While the inharmonicity is exaggerated in comparison to real acoustic pianos, the increase in distance between the harmonics in the frequency domain is clearly visible.In the time domain, this has the effect that the overall signal is no longer periodic, even though the signals stemming from the individual harmonics are.

Time-Frequency Representation
Due to the sinusoidal nature of the tone model (2), it is advantageous to consider the signal in the frequency domain rather than the time domain.However, in reality, music is not stationary over longer periods of time; tones start, end, and change in volume or frequency.Therefore, we only consider the frequency spectrum of short excerpts in time, which is called a time-frequency representation.We compute the short-time Fourier transform where t, f ∈ R correspond to the time and frequency axes and w ∈ L 2 (R) is the real-valued analysis window [cf.35].
We sample the representation as: with the time and frequency constants α, β > 0. For αβ < 1 and with the Gaussian window a so-called Gabor frame is formed, meaning that any function X ∈ L 2 (R) is uniquely determined by its respective set of analysis coefficients Z[k, l], k, l ∈ Z [cf . 35].In practice, we only consider a finite number of indices k = 1, . . ., n len (for time) and l = 0, . . ., n spc − 1 (for frequency) with n len , n spc ∈ N, where we again neglect negative frequencies.If X is real-valued, we have For the tone model (2), the STFT yields: with ζσ = 1/(2π). 3When sampling according to (5), we obtain: Thus, for each harmonic h, the phase is constant for a fixed time index k.
Our default choice, assuming a sampling frequency of f s = 48 kHz, is: The value of ζ is short enough to capture rhythm, while the quantity 1/(2πζ) ≈ 7.46 Hz, that will become important later, is well below the fundamental frequencies considered.
Our choice of α ensures that Gaussian windows spaced by α overlap narrowly.

Dictionary Representation
In order to differentiate between instruments in a music recording, we make the simplifying assumption that the tones for each instrument η = 1, . . ., N ins (where N ins is the total number of instruments) follow a characteristic pattern, namely that we can express the amplitudes of the harmonics as: where a ≥ 0 is the global amplitude of the tone, D ∈ [0, 1] N har ×N ins is the dictionary containing the relative amplitudes for the harmonics, and φh ∈ [−π, π) is the phase angle for the respective harmonic.
In a realistic music recording, the tones of the different instruments overlap and their parameters change over time.Thus, to construct an appropriate time-frequency model, we must equip the parameters with indices relating to the tones j = 1, . . ., m (where m ∈ N is the total number of simultaneously played tones) and the time frames k = 1, . . ., n len .We define the tone-wise and global model spectrograms, respectively, as: incorporating the phase shift via ϕ j,h,k := φj,h,k + 2π f j,h,k αk, and: For signals consisting only of sinusoids, the spectrum is modeled precisely by (8), where the standard deviation σ j,k of the Gaussians is given as σ j,k = 1/(2πζ).However, especially at the beginning and end of a tone, boundary effects can occur, leading to the spectrum being better approximated by different values for σ j,k .Therefore we include those as free parameters.

Distance Function
The parametrized model z[k, l] from (11) should match the time-frequency representation Z[k, l] from (5) as closely as possible.To formalize this, we need to define a distance function.While the 2 distance would be simple and "canonical," the problem is that it overemphasizes the correctness of the high-volume parts of the representation rather than the structural similarity.A common alternative is the β-divergence [cf.36] (which is a generalization of the 2 distance, the Kullback-Leibler divergence, and the Itakura-Saito divergence), but this leads to problems with unexplained noise in the spectrum.Instead, we use the distance measure introduced in [20]; for a given time frame k, we set y = z[k, •] and Y = Z[k, •] and define: The q exponent has the purpose of lifting the low-volume parts of the representation in order to increase their relevance.The canonical choice is q = 1/2, since this is the lowest value to keep the expression convex in y.The value for δ can be low, as it is merely there to ensure differentiability at y[l] = 0. Unlike the β-divergence, the distance function in (12) is symmetric, but it is still not a metric in the mathematical sense.

Model Fitting
Even though (12) is convex in y for q = 1/2, the spectrum y itself is not point-wise globally convex in all the parameters appearing in (11).Therefore, conventional optimization methods based on gradient descent are not a good choice for minimizing d q,abs 2,δ (Y, y).Instead, we use a deep neural network to predict the tone parameters.While some of these parameters (the deterministic parameters) can be trained normally via backpropagation, we treat the "problematic" parameters as stochastic parameters which are trained via policy gradients [30].
The neural network is applied in such the way that it predicts the parameters for one tone at a time.After each prediction, the spectrum for that tone is computed and provided back to the network for the following steps.

Parameter Representation
We first have to decide which parameters are stochastic and which ones are deterministic.The fundamental frequency parameter f • j,1,k is clearly one of those in which y is not convex; however, it does have a useful gradient on a local scale.We therefore split this parameter into two parameters as f , where the values for ν j,k ∈ N are discrete and those for νj,k ∈ R are continuous.We treat ν j,k as a stochastic parameter and νj,k as a deterministic one.
If we limit each instrument to exactly one tone (therefore, m = N ins ), the instrument parameter η j,k is not required mathematically.However, it makes sense to include it from a practical algorithmic perspective, since this allows for a network architecture that sequentially extracts one tone after another while freely choosing the extraction order (see Section 3.2.2).As η j,k is discrete, it is impossible to obtain a gradient.Thus, we model it as a stochastic parameter following a categorical distribution.
For σ j,k , while it is possible to find examples in which the distance d q,abs 2,δ (Y, y) is not convex with respect to it, the gradient around the theoretical value is usually good, so we can treat it as a deterministic parameter.
Probability density functions of gamma distributions for different parameter choices.For α Γ ≥ 1, the function has the mode at (α Γ − 1)/β Γ , while for α Γ < 1, the function tends to ∞ at zero.In our proposed method, the network selects a distribution shape for each inharmonicity coefficient b j by its outputs α Γ j , β Γ j .
The inharmonicity parameter b j,k is also continuous and has a good gradient around the optimum, but depending on the characteristics of the instrument, there can exist local optima.Therefore we do not rely on backpropagation and instead treat it as a stochastic parameter following a gamma distribution (see Figure 2).The reason why we chose a gamma distribution is that it is non-negative and it can have two qualitatively different shapes: Either it tends towards ∞ at zero (which is useful to model tones without inharmonicity), or it is bell-shaped around a finite maximum (to model tones with inharmonicity).Approaching infinity, it always decays exponentially, and in the edge case between the two shapes, it matches an exponential distribution.
In the tone amplitudes a j,k , the problem is convex, and therefore they are treated as deterministic parameters.In total, the stochastic parameters for each tone are s,j,k = (ν j,k , η j,k , b j,k ), and the deterministic parameters are d,j,k = (a j,k , νj,k , σ j,k ).When it is clear which time frame we are considering, we can drop the dependency on k and summarize:

Policy Gradients
We use a neural network with network parameters θ both to give the policy π θ ( s |Y), which is the (discrete or continuous) probability density for the stochastic parameters s given the input spectrum Y and also to compute the deterministic parameters d from the input and the stochastic parameters.With this, we can express the loss function as: For the expected loss, we follow the usual computation [cf.2] but also apply the product rule for the deterministic parameters: where we refer to the first term in the sum as the policy gradient and on the second term as the backpropagation gradient.The applicability of the Leibniz integral rule can be shown under realistic conditions.The total dimensionality of s is too high to represent π θ ( s |Y) as a whole.Therefore, we decompose the log-probability density into: In practice, we first sample the stochastic parameters s,1 for the first tone, then obtain the deterministic parameters d,1 , and from these compute a model spectrum y 1 .For each additional tone j = 2, . . ., m, we gain s,j and d,j depending on the previous spectra y 1 , . . ., y j−1 , which encapsulate all the relevant information about the parameters of the previous tones.
The parameters ν j , η j are sampled via a joint categorical distribution, making sure that no instrument plays more than one tone.For each possible value of (ν j , η j ), the network gives a value for the deterministic parameters d,j as well as for the parameters α Γ j , β Γ j for the gamma distribution for b j .

Phase Prediction
Of the tone parameters for (10), we still need the phase angles ϕ j,h .Canonically, we could represent them as a vector that is output by the network for each possible choice of ν j , η j , but this would lead to high dimensionality.Instead, we let the network emit a single artificial spectrum v j ∈ C n spc for each possible instrument choice η j .This is used as the right-hand side of a least-squares problem for determining the coefficients c j,h ∈ C: from which we extract the phase angles as ϕ j,h = arg c j,h .We apply some 2 regularization to improve the condition number of the system.For c j,h = 0, the phase would be ill-defined, but since the magnitudes of these coefficients are stabilized via the objectives introduced in Section 3.4, we do not realistically expect this case to happen.With this approach, the frequency dimension of the network output is used differently than for the other tone parameters.Instead of providing an output for each possible tone choice ν j , the network computes a single spectrum, which then determines the phases of all harmonics.We expect the spatial structure of the U-Net architecture with respect to the frequency dimension to be beneficial for this task.While c j,h is not explicitly given as a conditional value depending on s , the computation of c j,h in (17) does depend on all the stochastic parameters (even b j ), and since typically n spc > N har , the network has some freedom to output v j such that c j,h take different values depending on the other parameters.
Since v j is the right-hand side of a linear least-squares system, optimizing y with respect to v j is convex.Therefore, training is done deterministically via backpropagation through the pseudo-inverse.Also, we include the additional gradient that occurs with respect to the left-hand-side parameters that appear inside the exponential. 4n the case where the positions of the peaks from the model perfectly match those from the spectrum Y without any overlap or additive noise, the choice v j = Y gives the ideal phase values ϕ j,h for all j = 1, . . ., m and h = 1, . . ., N har .While this exact case is not realistic (not least since Gaussians have unbounded support), a network with skip connections can quickly learn to predict good approximate phase values.

Complex Objectives
So far, we have only used the parameters c j,h from ( 17) in order to determine the phase values ϕ j,h .However, we can also use them for a different purpose: While the dictionary representation ( 10) is necessary in order to distinguish the instruments, it is never fully accurate since the relation of the amplitudes of the harmonics can vary slightly even for the same instrument.The typical remedy for this is spectral masking, but as explained in [20], this process does not properly deal with interference.Instead, we create a direct prediction, in which we replace the tone amplitudes and dictionary entries in (11) with the parameters c j,h,k (with time-frame dependency added back in): For a fixed k, we set The distance function from (12) only considers the absolute value and ignores the phase entirely.This is not necessarily a problem, but knowing the exact phase angles would be useful for resynthesis, and it makes more sense as a training objective.Therefore, we introduce a modified distance function that respects the phase: To avoid division by zero, we add a tiny positive constant to the denominators.
We can now compare y to Y (as before), y dir to Y, and y to y dir , and we can choose between the distance functions d q,abs 2,δ and d q,rad 2,δ as defined in (12), (19).While it would be ideal to have the phases matching in all spectra, there are some caveats:

•
It takes a number of training iterations for v j to give a useful value.In the meantime, the training of the other parameters can go in a bad direction.

•
If the discrepancy between y and y dir is high and there is a lot of overlap between the peaks (typically from different tones), the optimal phase values for y and y dir may be significantly different.An example for this is displayed in Figure 3: The two peaks (red and blue) each have different phases, but by design, those are identical between the predictions.However, since the dictionary prediction is less flexible, its amplitude magnitudes of the harmonics often do not accurately match the input spectrum Y, which shifts the phase in the overlapping region.Thus, attempting to minimize both  8)).The left plots show a part of a direct prediction y dir , which is assumed to equal the true spectrum Y for this example, and the right plots show a dictionary-based prediction y with deviating amplitudes.Due to the different amplitudes, also the phases mix differently, leading to a high value of d q,rad 2,δ (Y, y).The phases could be optimized for y (by increasing the phase for peak 1 and/or peak 2), but this would lead to suboptimal phases in y dir .In contrast, the used loss d q,abs 2,δ (Y, y) does not depend on the phase.In the direct prediction (left) the separation of the two instruments is different from the dictionary-based prediction y (right).For this example, we assume the dictionary-based separation to be correct, so ideally the direct separation would be the same.However, the total spectrum (Sum) of the incorrect separation equals the true total spectrum and thus also achieves the optimal loss value d q,rad 2,δ (Y, y dir ).This motivates to regularize the individual tones y dir j of the direct prediction using y j .The imaginary parts of the spectra are assumed to be all zero for this example.Therefore, we continue to compare y and Y via d q,abs 2,δ (without the phase), and we use d q,rad 2,δ for the comparison between y dir and Y, since y dir is the spectrum that we will end up using for resynthesis, so we aim for the phase to be correct.Between y and y dir , we can compare tone-wise, but since some discrepancy is to be expected, we only associate it with a small penalty; the purpose is to regularize y dir for the case that the peaks of different tones overlap so much that the c j,h are not unique (see Figure 4).For this task, we employ the loss terms d q,rad 2,δ (y dir j , y j ) to compare the tone spectra y dir j := z dir j [k, •] and y j := z j [k, •] in both magnitude and phase.However, since the individual peaks making up y dir j and y j necessarily have the same phases, the difference between d q,rad 2,δ and d q,abs 2,δ only matters if there is significant overlap between the peaks within the same tone, which is the case at very low fundamental frequencies.
The loss functions d q,rad 2,δ and d q,abs 2,δ are both based on the 2 loss, so they do not induce sparsity.Thus, if there is linear dependency in the dictionary or, more likely, there is a discrepancy between the dictionary model for an instrument and an actual tone played by that instrument, then one single tone played by an instrument may get identified as multiple ones, either with the same fundamental frequency or with overlapping harmonics.Thus, we introduce an additional sparsity parameter u j ∈ {0, 1} that indicates whether a certain tone is present at all, and we henceforth include it in the set of stochastic parameters s,j .Whenever u j = 0, we discard the tone in the sparse prediction given by and discount part of the loss.In terms of architecture, the parameter u j is modeled as a Bernoulli distribution. With •], and y dir = z dir [k, •] again, we define the loss as: choosing λ = 0.9.This value is somewhat arbitrary, so we do not use it to enforce sparsity directly in y.Instead, we compute the loss in the first term based on y spr so that the parameters are compatible with a sparse solution and do not rely on redundant tones.However, additional tones can still appear in y dir to reduce the distance to Y and, by extension, also in y to reduce the distance to y dir .
We give both distances to Y the same loss coefficients while the regularization is supposed to be small.In practice, we find that µ 1 = 10, µ 2 = 10, µ 3 = 1 is a good choice.

Sampling for Gradient Estimation
In order to apply a gradient descent method, we need to compute the expectation in (15).However, the set of possible parameters s is much too large to do so analytically, so we have to estimate it instead.For this, we use: with 1 s , . . ., S s ∼ π θ ( s |Y), where S ∈ N is the number of samples and C(θ, D, Y) is the baseline [cf.2].The baseline plays a crucial role in reducing the variance of the gradient, and a common choice is: As long as the baseline is independent of the samples 1 s , . . ., S s , the estimator ( 22) is unbiased.However, the easiest way to estimate the baseline is via: which obviously depends on the samples and therefore introduces a bias factor of (S − 1)/S for the policy gradient.If desired, one could correct for this bias by dividing by this factor, but it is generally advisable to be conservative about the policy gradient, so we choose to leave it uncorrected.However, the value for S should be chosen large enough for (24) to be a good estimator.
During the training, it is expected that the policy will become increasingly deterministic, so less exploration will be performed.The problem with this is that multiple parameters need to be trained, and it is easy for the training process to get stuck in local minima.Therefore, we should encourage the exploration of more parameter choices even when the policy has stabilized.Inspired by the algorithm [31], we thus define: The r j parameter can be chosen separately for each tone; with R = (r 1 , . . ., r m ), we further define: We set S = 3 m and let R 1 , . . ., R S be all the combinations of m elements out of the set {1, 0.1, 0.01}, which turn out to be reasonable magnitudes.Additionally, we do not only accept the bias which causes an underestimation of the policy gradient, but we also artificially scale it down by a factor of 10.Since the policy gradient and the backpropagation gradient model different parameters, this is conceptually not a problem; however, if the gradient is scaled too much, the interdependencies of the values in the network can then cause instabilities in the output of the stochastic parameters.Our modified gradient estimator is: where each tone parameter set i s,j is sampled according to the value of r j inside R i .This includes the sparsity parameter u j , but the inharmonicity b j is exempt from this modification altogether and always sampled according to π θ .
Since the way of sampling also affects the empirical baseline Ĉ(θ, D, Y), it is no longer an estimator for C(θ, D, Y) but simply the mean loss among the samples.We observe this to be a good stabilizer for the gradient.Via our choice of R 1 , . . ., R S we make sure that one sample is drawn with (r 1 , . . ., r m ) = (1, . . ., 1) from the original distribution π θ .Also, even with increased exploration in some of the tones, there is always a combination with r j = 1 for the other tones.Therefore, rather than destabilizing all tones at the same time, part of the exploration is only performed selectively on specific tones with the other ones still sampled via π θ .
We also need to train the dictionary D; since the representation of D is not conditional but simply a variable of size N har × N ins , exploration is not desired.In fact, even though the policy gradient on the dictionary exists for all tones but the first (via the dependency of y j on D), we ignore it to increase training stability.However, the variables i s were sampled according to π R i θ rather than π θ , so to get a stable estimate for D, we "undo" this logits ν j , η j Sampling architecture modification by multiplying with the ratio between the probability densities.We estimate the gradient in D via: where π R i θ is defined according to (26).This method is called weighted importance sampling [cf.2].

Network Architecture
We use a U-Net architecture with 7 downsampling/upsampling steps; strides of 4, 4, 4, 4, 4, 3, 2; and 80, 160, . . ., 560 one-dimensional filters with a size of 5. Finally, we add two more convolutional layers with 80 filters of sizes 3, 1, respectively, before the linear output layer.All of the hidden layers have ReLU activation.The first hidden layer is a CoordConv layer [37], which means that it is supplied with a linear range from 0.01 to 0 as an additional input channel.These design parameters were obtained via manual experimentation.
For the first tone, the network is supplied with the input spectrum Y and the absolute value spectrum |Y|.From this, the spectra y 1 and y dir 1 are computed.For each following tone j = 2, . . ., m, the network receives the residuals Y − y 1 − . . .− y j−1 and Y − y dir 1 − . . .− y dir j−1 along with their absolute values as well as the computed tone spectra y 1 , . . ., y j−1 and y dir 1 , . . ., y dir j−1 .From these values, it then yields the spectra y j and y dir j .For those input components of the network that do not yet receive a spectrum for a particular tone, a constant 0 vector is given as input to the network instead.The data flow is illustrated in Figure 5.
Since the amplitudes (a j ) are supposed to be non-negative, we apply the absolute value function to the respective output components of the network.The widths (σ j ) are kept positive via softplus, and they are clipped such that the value does not get too close to 0. For the continuous frequency offsets ( νj ), a tanh function is used to keep them inside the interval (−5, 5).The positive parameters (α Γ j , β Γ j ) are obtained after applying the exponential function, and the probabilities for the Bernoulli distribution for the sparsity parameters (u j ) are mapped into the interval (0, 1) via a sigmoid function.The joint categorical distribution for the discrete frequencies (ν j ) and instrument indices (η j ) is given in vectorial form as non-normalized log-probabilities, so we apply the softmax mapping in order to obtain a valid discrete distribution.Doing this, we have to make sure that each instrument can only play one tone at a time by excluding that instruments that have already been assigned a tone from the sampling.For each parameter output by the network, we add a trainable scaling layer.
In order to protect against potentially degenerate network output or gradients in the case of zeros in the input vector, a minimal amount of Gaussian noise is always added to Y prior to prediction, on a level that is negligable for any normal audio signals.
The dictionary entries are supposed to be elements of the interval [0, 1].Non-negativity is usually satisfied automatically due to a j ≥ 0, and for the upper bound, we add the loss 1 2 to the training of the dictionary.

Training
The network weights are Glorot-initialized and the biases are initially set to 0. The initial values for the instruments in the dictionary are exponentially decaying sequences along the harmonics: We partition the spectra Z[k, •], k = 1, . . ., n len , into random batches of size 6.We then train on each batch with the AdaMax algorithm [38].For each epoch, new random batches are assigned.For the dictionary, we also use AdaMax, but with a reduced learning rate of 10 −4 .Also, analogously to [20], for the denominator in AdaMax, we consider the maximum over all the harmonics of the particular instrument when training D. The entire procedure is outlined in Algorithm 1.

Resynthesis
After training, we apply the network once again on all the time frames Z[k, •], k = 1, . . ., m.To prevent randomness in the output, rather than sampling according to π θ ( s |Y), we use the mode of π θ (ν j , η j |Y) and then, with ν j , η j fixed, the modes of π θ (b j |Y, ν j , η j ) and of π θ (u j |Y, ν j , η j ).
We project the thereby obtained time-frequency coefficients z dir j [k, l] back into realvalued time-domain signals for each instrument via [cf.35,39]: where is the synthesis window.If the support of w is cut to [−1/(2β), 1/(2β)], then it is also the Gabor canonical dual window of w, which is the miminum-L 2 -norm window to invert the transformation from X to Z according to (5).
For all the samples with 2 instruments, we train for 100 000 iterations (independently of the size of one epoch), but we always use the result after 70 000 iterations (early stopping), since it appears that this usually gives better results.We conjecture that the partially trained network itself provides a good regularization to which separations are realistic, while simply minimizing the loss itself can lead to degenerate results.Regularization via network architecture in unsupervised learning has been prominently pioneered via the deep image prior approach [26].
For assessing separation quality, we use the SDR (signal-to-distortion ratio), which measures the overall similarity between the original and the resynthesized signal, the SIR (signal-to-interference ratio), which gives the interference from the other instrument tracks in the considered signal, and the SAR (signal-to-artifacts ratio), which disregards interference and compares the given signal to all the original ones [40].These are wellestablished figures in blind separation from the BSS Eval software package [41]. 5For all of them, a higher number corresponds to better quality.
When training non-trivial neural networks, it is always, to varying extent, a matter of chance if the optimization process will converge to a good value.We therefore train an ensemble of neural networks by running the process with 6 different random seeds (affecting the network initialization, the batch partitioning, and the random sampling of the stochastic parameters) 6 and choose the best one in terms of mean SDR over the instruments.In a realistic blind scenario without ground truth, this figure would not be available, but we deem it acceptable to let the user choose (for instance, by listening comparison) between a small number of different output results.Due to the aforementioned regularization by architecture, the value of the loss function is not a reliable indicator of separation quality, but it could be integrated into an end-user interface.
Similarly, the results taken from [20] are always the best-case training outcomes out of 10 runs.By contrast, the algorithm from [18] does not rely on randomness but instead on the hand-optimized hyperparameters.

Mozart's Duo for Two Instruments
Like in [20], we use the 8th piece from the 12 Basset Horn Duos by Wolfgang A. Mozart (K.487) in an arrangement by Alberto Gomez Gomez for two recorders 7 as an example piece.In one sample, it is played with recorder and violin, and in the second sample, it is played with clarinet and piano.All the instruments are acoustic.
In Table 1, we compare the performance of our algorithm to that of the other two on both samples.The sample with recorder and violin is comparatively "easy."While our proposed algorithm universally gives the best SIR values (indicating low interference between the instrument tracks), the algorithm from [20] outperforms it for the recorder track in terms of SDR and SAR.Possible explanations include a more effective optimization process of the respective objective function, but it could also be a result of the different data representation used in that method (namely the special log-frequency spectrogram) or a regularizing effect of the sparse pursuit algorithm.In preliminary experiments, we found that in this particular sample, a lower value for µ 1 can increase separation quality (putting more emphasis on the correctness of the direct prediction rather than the dictionary-based prediction), but requiring the user to guess the difficulty beforehand violates our notion of blind separation.
The first few seconds of the separation result are displayed in spectrogram form in Figure 6.Overall, the direct predictions are very accurate, but like it was observed in [20], the last tone visible in Figure 6b jumps up a fifth, which it does not do in the ground truth.This is because the recorder tone is actually one octave above the violin tone, and the overlap is virtually perfect.In such cases, since the dictionary model is never fully accurate, it may happen that an incorrect solution actually yields a lower loss.
By comparison, the sample with clarinet and piano is rather "hard," and our algorithm clearly shows superior performance, especially in the separation quality of the piano track.An interesting observation in Figure 7a is that while the separation performance first reaches a plateau around 10 dB (in the mean), it then declines to around 8 dB.However, as shown in Figure 7b, the values of both d q,abs 2,δ (Y, y spr ) and d q,rad 2,δ (Y, y dir ) decrease, indicating this is not an optimization failure.At the same time, the value of the regularization loss (which, with µ 3 = 1, has a much lower weight than the others) increases.Therefore, this particular sample could potentially benefit from more reguralization.
In Figure 8, the learning curves for both samples over all the respective runs are displayed.It is obvious that the results for the sample with recorder and violin (Figure 8a) are more consistent than those for the sample with clarinet and piano (Figure 8b).Also, we can see that in the former sample, separation quality generally deteriorates after too many iterations while in the latter, some runs only achieve peak performance near the end of the training.Thus, different samples can benefit from taking the results at different points in the training process.Some curves in Figure 8b terminate early, which was due to numerical failures in the training process.One difficulty with the piano as an instrument is that it exhibits significant inharmonicity [cf. 34].While the algorithm from [20] has been shown to correctly identify the inharmonicity parameter on the isolated piano track, it relies on cross-correlation without inharmonicity for tone detection.By contrast, the algorithm presented here is based on a neural network, so correctly dealing with inharmonicity at the input stage is merely a matter of training.Since inharmonicity mostly affects higher harmonics which have low volume, the ability to represent it in the output stage does not show up as much in the 2 -based SDR/SAR/SIR figures.However, due to the lifting property of d q,rad 2,δ and d q,abs 2,δ (with q = 0.5 < 1), it does influence the losses.
To illustrate the effect, we also ran our algorithm on the isolated piano track, once with and once without the inharmonicity parameter in the model, with 4 distinct random seeds each.The results for the random seeds of 11 are displayed in Figure 9.While the difference appears small on a global scale, it is consistent.

URMP
The URMP dataset [42] consists of samples with two or more acoustic instruments.It was not created for blind separation, so the samples are generally too "hard" to be used in that context.Nevertheless, in [20], a subset of potentially appropriate samples was determined, and we compare the performance of our new algorithm on these samples.
As can be seen in Table 2, the results vary widely.For the first and the fourth sample, we have to declare a failure compared to the other two algorithms.On the second and the third sample, however, our algorithm is universally dominant.The good performance on the sample with trumpet and saxophone is especially surprising, since we deemed this a very challenging sample due to the similarity of the sounds of the instruments.

Oracle Dictionary
In order to investigate the failure of the separation method on the first and the fourth sample, we first train oracle dictionaries by providing the algorithm with the ground-truth individual tracks for the respective instruments.We then supply the separation procedures with these dictionaries as initial values.In one instance, we keep the dictionaries fixed throughout the training, and in another one, we train then at the normal rate, starting from oracle dictionaries.For each separation, we use 4 different random seeds.The results are displayed in Table 3.While we usually use the direct prediction z dir [k, l] for resynthesis, we here also include the resynthesis based on the dictionary prediction z[k, l] for analysis.
With the fixed oracle dictionary, the results are generally much better than with the normal training in Table 2.However, when allowing training from the oracle dictionary, the separated flute and the cello tracks become unacceptably bad again.When using the dictionary prediction for resynthesis, all the results are of very poor quality, indicating that the dictionary model ( 10) is not an accurate representation of the spectral characteristics of the tones.
Upon manual inspection of the flute track, we noticed that it contains a number of tones which are "half-overblown," such that the spectra of both the higher octave and of the lower octave are present.This does not represent the normal spectral characterics of the flute sound, so the oracle dictionary contains a "compromise," while the trained dictionary fails to represent these half-overblown tones.In the cello track, there is no obvious technical peculiarity, but the tones are simply very diverse, involving different open strings and also different articulation between the tones, so even the training of the oracle dictionary is problematic.
Generally, for instruments with inconsistent spectral characterics, the method from [20] may be at an advantage since it prunes and randomly reinitializes parts of the dictionary in regular intervals; so, given enough tries, it can reach an appropriate dictionary by chance, even if it is potentially suboptimal with respect to the loss function.In [18], a number of original samples are used.As we mentioned, their algorithm has the unique ability of separating a representable instrument track out of a mixture with a non-representable residual, which can, for instance, be a singing voice.Since our algorithm is not designed for such signals, we selected the samples for which all the instruments can be represented.In total, these are one sample with acoustic oboe and euphonium, one sample with synthetic piccolo and organ, and a third sample with a synthetic oboe track added to the previous sample.
The problems with these samples are that they have a different sampling frequency ( f s = 22.05 kHz) and they are also very short.Whereas in [20], the signals were converted to a different sampling frequency as a preprocessing step in order to reduce the loss of resolution due to smoothing, we do not have this problem here.While we keep the value of ζ from (9) constant in terms of absolute units, the relation to the sampling frequency consequently changes to ζ f s = 470.4.The frequency constant changes proportionately with the sampling frequency to β = 1.794 433 593 75 Hz.Since the samples are short, we choose an even smaller time constant compared to (9) by setting α = 16/ f s ≈ 0.73 ms in the acoustic sample and α = 128/ f s ≈ 5.80 ms in the synthetic ones, each with α = α/4 for training.The results are displayed in Table 4.
While the results for the acoustic sample are better than in the original publication, they are still not nearly as good as those in [20].Our explanation is that while the time resolution of the spectrogram is almost as high as that of the time-domain signal (α = 4/ f s ), there is still just not enough data in the sample to train the neural network, and thus hand-crafted methods are at an advantage.Table 3. Separation with an oracle dictionary on a selection of samples from the URMP [42] dataset.The "Fix" column indicates whether the dictionary is kept constant during the separation, and the "Pred."column specifies whether the direct or the dictionary prediction is used.Best numbers are marked when they also exceed the performance from Table 2 By contrast, our method delivers very good results with synthetic instruments, clearly and universally outperforming the other methods on the sample with two instruments and providing the best average performance on the sample with three instruments.

Conclusion
We have developed a blind source separation method that unmixes the contributions of different instruments in a polyphonic music recording via a parametric model and a dictionary.The model parameters are predicted by a deep convolutional neural network, and with respect to those that do not possess a useful backpropagation gradient, we use the policy gradient instead.
Unlike other algorithms, ours operates directly on the complex output of the STFT, which is linear and preserves the phase.Rather than using spectral masking, we let the network give a direct prediction for the complex amplitudes of the harmonics.
In general, the algorithm exhibits very good performance on a variety of samples.It is especially dominant in terms of SIR, which is relevant since eliminating cross-talk is the main objective in separation.We attribute this to the use of a complex-valued direct prediction for the individual instruments that can properly handle interference between the instrument tones in the spectral domain.Such interference is particularly prevalent in synthetic samples, on which the performance of our algorithm surpasses that of the competing methods.It also clearly outperforms the other methods on the sample with the acoustic piano, which poses the challenge of detecting tones with inharmonicity.
As is usual with blind separation, however, problems arise when the structural assumptions are not satisfied.In two samples from the URMP database, there was one instrument each whose sound in the recording varied too much to be accurately represented by the dictionary.While we found that using oracle dictionaries, satisfactory separation can be achieved, these dictionaries are not attained as the result of training, even when they are supplied as the initial value.
Due to the use of neural networks, our approach is very flexible with respect to the choice of the loss function.In the spirit of blind separation, we chose the weights for the respective distances such that they constitute a reasonable comprise for all the samples on which we tested them, but this choice is not necessarily optimal for the individual samples.While a linear combination is the most straight-forward way to account for all the considered distances, a non-linear mapping could potentially be better.Also, even though we have not found any distance functions yielding better performance than the ones we use, more experiments could be conducted.
With our approach, we hope to provide a blueprint for the combined use of backpropagation gradients and policy gradients in the application of neural networks on non-convex parameter identification problems.

Figure 1 .
Figure 1.Illustrative example for the signal model with a fundamental frequency of f • 1 = 440 Hz and an inharmonicity parameter of b = 10 −2

Figure 3 .
Figure 3. Example showing the interference of two overlapping peaks.Each peak models the contribution of one harmonic of a tone to the spectrum (cf.(8)).The left plots show a part of a direct prediction y dir , which is assumed to equal the true spectrum Y for this example, and the right plots show a dictionary-based prediction y with deviating amplitudes.Due to the different amplitudes, also the phases mix differently, leading to a high value of d

Figure 4 .
Figure 4.Example showing non-uniqueness of the tone separation in the direct prediction.In the direct prediction (left) the separation of the two instruments is different from the dictionary-based prediction y (right).For this example, we assume the dictionary-based separation to be correct, so ideally the direct separation would be the same.However, the total spectrum (Sum) of the incorrect separation equals the true total spectrum and thus also achieves the optimal loss value d (Y, y dir ) and d q,rad 2,δ (Y, y) would lead to a conflict regarding the choice of common phase values.

Figure 6 .Figure 7 .Figure 8 .Figure 9 .
Figure 6.Excerpt of the separation result for the piece by Mozart, played on recorder and violin.Displayed are the original STFT magnitude spectrogram as well as the direct predictions for each instrument.In the highlighted section, the last tone is supposed to be a constant octave interval between the violin and the recorder, but the prediction for the recorder contains an erroneous jump.The color axes of the plots are normalized individually to a dynamic range of 100 dB.

Table 1 .
Comparison of the separation algorithms on the samples based on the piece by Mozart.Best numbers are marked.

Table 2 .
[42]arison of the separation algorithms on a selection of samples from the URMP[42]dataset.Best numbers are marked. .

Table 4 .
[18]arison of the separation algorithms on the data by[18].Instruments labeled as "s." are synthetic, those labeled as "a." are acoustic.Best numbers are marked.