Sigmoidal NMFD: Convolutional NMF with Saturating Activations for Drum Mixture Decomposition

Abstract: In many types of music, percussion plays an essential role in establishing the rhythm and the groove of the music. Algorithms that can decompose the percussive signal into its constituent components would therefore be very useful, as they would enable many analytical and creative applications. This paper describes a method for the unsupervised decomposition of percussive recordings, building on the non-negative matrix factor deconvolution (NMFD) algorithm. Given a percussive music recording, NMFD discovers a dictionary of time-varying spectral templates and corresponding activation functions, representing its constituent sounds and their positions in the mix. We observe, however, that the activation functions discovered using NMFD do not show the expected impulse-like behavior for percussive instruments. We therefore enforce this behavior by specifying that the activations should take on binary values: either an instrument is hit, or it is not. To this end, we rewrite the activations as the output of a sigmoidal function, multiplied with a per-component amplitude factor. We furthermore define a regularization term that biases the decomposition to solutions with saturated activations, leading to the desired binary behavior. We evaluate several optimization strategies and techniques that are designed to avoid poor local minima. We show that incentivizing the activations to be binary indeed leads to the desired impulse-like behavior, and that the resulting components are better separated, leading to more interpretable decompositions.


Introduction

Drum Mixture Decomposition
In this paper, we consider the task of automatic unsupervised drum mixture decomposition, in which a percussive recording is decomposed into its constituent parts, while transcribing the onset locations of those instruments. A hypothetical example of such a decomposition is shown in Figure 1.

The drum mixture decomposition problem described above is closely related to the problem of automatic drum transcription (ADT). ADT aims to detect and classify drum sound events within a music recording, resulting in a list of onset locations for each transcribed instrument. Wu et al. [3] give a comprehensive overview of the state of the art in ADT, and perform an in-depth comparison of these methods. They identify two classes of "activation-based" methods that currently dominate the state of the art: on the one hand, neural-network-based systems using Recurrent Neural Network [4,5] or Convolutional Neural Network [6] architectures, and on the other hand, methods based on non-negative matrix factorization (NMF) [3,7]. According to their analysis, neural-network-based approaches outperform NMF-based methods in terms of transcription accuracy when a large and diverse training dataset with high-quality annotations is available. Such a dataset is not always available, however, and in its absence, NMF-based approaches offer a good alternative, as they do not require a data-hungry training procedure and still provide adequate performance if appropriately initialized. They are also more robust for drum mixtures with previously unseen instruments [3]. Unsupervised transcription systems, such as the ones based on NMF, can furthermore be used to improve supervised approaches by leveraging them in semi-supervised learning schemes such as student-teacher learning [8].

In this work, we build upon the non-negative matrix factor deconvolution (NMFD) algorithm, an extension of NMF that explicitly models sounds with a temporal structure [9]. We use NMFD not only to discover the onset locations within the mixture; at the same time, it discovers a "template" of the constituent sounds that are responsible for each set of onsets. We furthermore use NMFD in an unsupervised fashion, i.e., after a best-effort initialization with templates of common percussive sounds, we allow the model to freely optimize the discovered templates to the target mixture without imposing any constraints on the sonic characteristics of the instruments that we expect to find. This is in contrast with most existing ADT work, where often a predefined and fixed set of percussive instruments is considered. Therefore, we use the term automatic drum mixture decomposition in order to distinguish this use case from ADT, which is concerned with discovering the onset locations of an often predefined and fixed set of percussive instruments. In the next section, we describe the NMFD algorithm and present an overview of related work using NMFD for drum mixture transcription and decomposition.

Non-Negative Matrix Factor Deconvolution for Drum Mixture Decomposition
The non-negative matrix factor deconvolution (NMFD) algorithm [9] can be used for the drum mixture decomposition problem introduced in Section 1.1. It decomposes a non-negative matrix $X \in \mathbb{R}^{N \times T}_{\geq 0}$ with $N$ frequency bins and $T$ time frames into a dictionary of $K$ time-varying spectral templates $W^{(k)} \in \mathbb{R}^{N \times L}_{\geq 0}$, each $L$ time frames long, and an activation matrix $H \in \mathbb{R}^{K \times T}_{\geq 0}$. The matrix $X$ is modeled as the convolution of the templates with the activation matrix:

$$\hat{X}_{n,t} = \sum_{k=1}^{K} \sum_{\tau=0}^{L-1} W^{(k)}_{n,\tau} H_{k,t-\tau}, \qquad (1)$$

where $H_{k,t-\tau}$ is zero when $t < \tau$. $W^{(k)}$ and $H$ are updated iteratively using multiplicative updates in order to minimize a divergence measure $\mathcal{L}(X, \hat{X})$, typically the least squares loss, Kullback-Leibler (KL) divergence or Itakura-Saito divergence [10]. There also exists a two-dimensional variant of NMFD [11].
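To make the convolutive model concrete, the following is a minimal NumPy sketch of the reconstruction step only. The function name `nmfd_reconstruct` and the array layout (templates stacked as a `(K, N, L)` array) are assumptions made for this illustration; they do not come from the original NMFD implementation.

```python
import numpy as np

def nmfd_reconstruct(W, H):
    """Approximate X from templates W (K, N, L) and activations H (K, T)."""
    K, N, L = W.shape
    T = H.shape[1]
    X_hat = np.zeros((N, T))
    for tau in range(min(L, T)):
        # Frame t receives W[k, :, tau] * H[k, t - tau]; terms with t < tau are dropped.
        X_hat[:, tau:] += W[:, :, tau].T @ H[:, :T - tau]
    return X_hat

# Toy example: one template of two frames, hit at frames 1 and 4.
W = np.array([[[1.0, 0.5], [0.2, 0.1]]])   # K=1, N=2, L=2
H = np.zeros((1, 6))
H[0, [1, 4]] = 1.0
X_hat = nmfd_reconstruct(W, H)
```

In this toy example, each impulse in `H` pastes a copy of the template into `X_hat`, starting at the frame of the impulse.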
The templates $W^{(k)}$ can be interpreted as short spectrograms of length $L$ that model the constituent sounds of the mixture. The corresponding activation curves $H_k$, i.e., the rows of $H$, describe where in the recording these sounds occur. In order for this interpretation to make sense, each sound should repeat itself almost unaltered throughout the recording, so that the templates captured by $W^{(k)}$ can be "copied and pasted" at locations specified by $H_k$. This is a reasonable assumption for percussive instruments: hits on the same instrument will all sound approximately the same and will decay approximately equally fast, provided that the playing technique is consistent.
NMFD has already been applied successfully for automated drum transcription and drum separation tasks [1,10,12-15]. For example, Laroche et al. [13] use a combination of NMF and NMFD in order to perform harmonic-percussive sound separation, modeling the non-percussive sounds using NMF and the percussive sounds using NMFD with predefined and fixed templates $W^{(k)}$. The percussive template dictionary is constructed by hand prior to decomposition, and the separated harmonic and percussive audio are obtained by means of Wiener filtering [16]. Lindsay-Smith et al. [12] investigate the use of sparsity constraints on the activations $H$ in order to obtain impulse-like onsets. Ueda et al. [14] rewrite NMFD for drum transcription within a Bayesian framework, and impose a constraint on the time-quantized score $S$, which is derived from the activations $H$.
The aforementioned works apply NMFD to discover the activations $H$, for the purpose of ADT or audio source separation. In other works, NMFD has also been applied to capture the constituent drum samples in a recording as faithfully as possible in $W$; this is typically done in a score-informed setting, wherein the exact onsets of each instrument (and consequently $H$) are assumed to be known. NMFD is used as such in Dittmar and Müller [15], where the extracted percussive sounds are subsequently used to validate a transient restoration technique that is applied when converting a spectral representation back to an audible waveform. In Dittmar and Müller [1], the authors apply NMFD to estimate the drum sounds in the Amen Break, a well-known drum solo recording, in a score-informed setting. They observe that the unconstrained application of NMFD can lead to cross-talk artifacts, and they therefore propose two extensions to purify the extracted templates. Vande Veire et al. [17] apply NMFD in an uninformed setting, and they use an ad hoc modification of the update procedure for the templates $W^{(k)}$ in order to ensure that only a single drum hit is captured per template when using a long template length $L$.
While the cited works illustrate the effectiveness of NMFD for drum mixture transcription and (score-informed) decomposition, they also share the shortcoming that NMFD is usually applied in a constrained setting. When NMFD is applied for ADT, i.e., to discover the activations $H$, then the templates $W$ are usually predefined and kept fixed during optimization. This limits the application to drum mixtures where a reasonably accurate approximation of the constituent drum sounds is known in advance. On the other hand, when NMFD is applied to discover the constituent sounds, i.e., the templates $W$, then this is done in a score-informed setting, where $H$ is assumed to be known in advance. With the works in Ueda et al. [14] and Vande Veire et al. [17] as exceptions, we note the absence in the literature of a successful application of NMFD where both $W$ and $H$ are optimized jointly. (Note that there are examples in the literature of the application of (regular) NMF for automatic drum transcription with a joint optimization of the (one-dimensional) templates $W$ and the activations $H$ [7,18].) Such a joint optimization could be useful, though, as it would make it possible to decompose a drum mixture for which neither the exact onsets $H$ nor a sufficient approximation of $W$ are available in advance.
In this paper, we therefore investigate the application of NMFD to jointly decompose a drum mixture into its templates $W$ and their activations $H$. However, we note from previous work on NMFD for drum mixture decomposition [1,12,17] that applying NMFD without constraints can lead to undesired artifacts or decompositions. Consequently, additional measures are required to guide the optimization to the desired, musically valid, and informative solution. In this work, we therefore enforce impulse-like behavior on the activations $H$, as detailed in Section 1.3.

Motivation for a Sigmoidal Model for the Activations
Percussive instruments are hit very briefly by some percussion mallet or beater; this implies that the discovered activations should be impulse-like, as the produced sound results from an excitation that itself is an impulse. This is not enforced by the original NMFD model, however; consequently, when applied to drum mixtures, we observe that NMFD often does not lead to the expected impulse-like activations, as shown in Figure 2. This example illustrates that NMFD discovers activations with a sharp initial peak when a percussive hit occurs in the mixture, succeeded by a pseudo-exponentially decaying "tail" of small activation values. There are also small activations throughout each activation curve that do not succeed a larger peak, which makes the decomposition hard to interpret: do these activations correspond to a detected drum hit, or not? To address these shortcomings, additional constraints are needed to guide the decomposition process. One approach would be to enforce an L1 sparsity constraint on the activations $H$ [12,19]. This encourages the algorithm to "move" as much information as possible from the activations $H$ to the templates $W$, which would lead to sparser and potentially more impulse-like activations. This approach has a drawback, however: it also penalizes correct activations. This biases the model to capture sequences of successive drum strokes within a single template, in order to keep the activations as sparse as possible [12,17].
In this paper, we use an alternative approach in order to achieve the desired impulse-like behavior: we enforce that the activations take binary values, i.e., either an instrument is hit, which yields an activation value of 1, or an instrument is not hit, which yields an activation value of 0. The relation between this constraint and the desired impulse-like behavior becomes clear when considering that enforcing such binary activations rules out the aforementioned "tails" and unclear activations: either an instrument is hit, or it is not, and values between 0 (not hit) and 1 (hit) are discouraged. An advantage of this constraint is that it does not penalize legitimate peaks, as opposed to a sparsity constraint.
To achieve the proposed binary activations, we redefine the activations in the model as the logistic function of a logit-activations matrix, and we impose a regularization term that pushes the activations towards the saturating regions of the logistic curve during optimization. As such, non-binary activation values are discouraged, and the activation values will be pushed to either 0 or 1 as much as possible. Of course, different sources can be present in the mix at different volumes: this is modeled by multiplying the binary activations with a per-component amplitude factor. A log-power spectrogram representation is used, as this further reduces the impact of velocity differences and emphasizes the binary behavior of the onsets. Note that we choose to maintain a continuous transition between the two saturated states, instead of opting for a fully discrete quantization of the activation values: this ensures that the model and objective function remain differentiable so that the optimization procedure is tractable, and additionally allows some flexibility in activation values. Through evaluation on a public dataset, we show that these adaptations lead to decompositions with the desired impulse-like activations, and we illustrate by means of an example that this can make the obtained unsupervised decompositions more interpretable.

Contributions
The main contributions of this paper are the following.

• We reformulate the activations in the NMFD model as the product of a per-component amplitude factor, representing the relative volume of each component, with the time-varying activations for each component.
• These time-varying activations are defined as the output of a saturating sigmoidal function, and we propose a novel regularization term that, combined with these saturating activations, leads to binary activations. We show that in the context of automatic drum mixture decomposition, the activations are not only binary, but also become impulse-like as a consequence of this method.
• We propose different strategies and techniques to optimize the proposed model, and we rigorously evaluate their efficacy in minimizing the overall objective function for the decomposition.
• We propose metrics to evaluate the unsupervised decomposition of drum mixtures. With these, we show that the proposed algorithm achieves more impulse-like activations compared to unconstrained NMFD and sparse NMFD, making it better suited to the properties of percussive mixtures, while yielding a good decomposition and spectrogram reconstruction quality.

Structure of This Paper
The remainder of this paper is structured as follows. Section 2 introduces the modified NMFD algorithm and the procedure that is used to optimize this model. Section 3 describes the baseline models, dataset, metrics, and experimental details. Section 4 discusses the experimental results, and Section 5 concludes the paper and outlines directions for further research.

Sigmoidal NMFD Model
The logistic function $\sigma(\cdot)$ is defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}. \qquad (2)$$

For large positive values of the input, $\sigma(x)$ saturates to 1, and for large negative values of the input, it saturates to 0. If $y = \sigma(x)$, then $x$ is also called the logit of $y$.
We rewrite the NMFD model using saturating activations:

$$\hat{X}_{n,t} = \sum_{k=1}^{K} \sigma(a_k) \sum_{\tau=0}^{L-1} W^{(k)}_{n,\tau} \, \sigma(G_{k,t-\tau}), \qquad (3)$$

with $X \in \mathbb{R}^{N \times T}_{\geq 0}$, $W^{(k)} \in \mathbb{R}^{N \times L}_{\geq 0}$, $G \in \mathbb{R}^{K \times T}$ and $a \in \mathbb{R}^{K}$. $X$ and $W^{(k)}$ are scaled to maximum amplitude 1. By comparing Equations (1) and (3), we see that for the sigmoidal model, $H_{k,t} = \sigma(a_k)\sigma(G_{k,t})$. The sigmoidal activations $\sigma(G_{k,t})$ capture the onsets for each component $k$, while the amplitudes $\sigma(a_k)$ capture the relative volume of each component, as different components $W^{(k)}$ and $W^{(l)}$ can be present at different volumes in the mix. Note that the logit-activations $G_{k,t}$ and the logit-amplitudes $a_k$ can take negative values.
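The sigmoidal model can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the names `sigmoid` and `sigmoidal_nmfd_reconstruct` and the `(K, N, L)` template layout are assumptions, and terms with $t < \tau$ are dropped, mirroring the convention stated for the original NMFD model.

```python
import numpy as np

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoidal_nmfd_reconstruct(W, G, a):
    """Reconstruct the spectrogram from templates W (K, N, L),
    logit-activations G (K, T) and logit-amplitudes a (K,)."""
    K, N, L = W.shape
    T = G.shape[1]
    # Effective activations: per-component amplitude times saturating activation.
    H = sigmoid(a)[:, None] * sigmoid(G)
    X_hat = np.zeros((N, T))
    for tau in range(min(L, T)):
        X_hat[:, tau:] += W[:, :, tau].T @ H[:, :T - tau]
    return X_hat

# One flat, single-frame template; the component is "on" only at frame 1.
W = np.ones((1, 2, 1))                  # K=1, N=2, L=1
G = np.array([[-20.0, 20.0, -20.0]])    # saturated logits
a = np.array([0.0])                     # sigmoid(0) = 0.5
X_hat = sigmoidal_nmfd_reconstruct(W, G, a)
```

With saturated logits, the reconstruction is close to the amplitude 0.5 at the active frame and close to (but never exactly) zero elsewhere.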

Objective Function
The main objective of the decomposition is to minimize the divergence between the input spectrogram and the approximation. In this paper, we use the KL divergence:

$$\mathcal{L}_{KL}(X, \hat{X}) = \sum_{n,t} \left( X_{n,t} \log \frac{X_{n,t}}{\hat{X}_{n,t}} - X_{n,t} + \hat{X}_{n,t} \right). \qquad (4)$$

We furthermore want the activations to be binary in nature: for each $t$, the template $W^{(k)}$ should either be fully active, $\sigma(G_{k,t}) \approx 1$, or not active at all, $\sigma(G_{k,t}) \approx 0$. (Note that the logistic function $\sigma(x)$ is never exactly equal to either 0 or 1 for real values of the logit $x$; this is only the case in the limit for $x \to -\infty$ and for $x \to +\infty$, respectively. Therefore, it would be more correct to say that the activations take approximately binary values.) In other words, $\sigma(G_{k,t})$ must be in a saturating region of $\sigma$ for all $k$ and $t$. To achieve this, we define an additional regularization term $\mathcal{L}_G$ with a center logit $\mu_k$ (Equations (5) and (6)). Here, $\sigma^{-1}$ is the inverse of the logistic function, and $\alpha_k = 0.5$. This regularization term encourages all logit activations $G_{k,t}$ of the $k$-th component to lie as far away as possible from the logit $\mu_k$ of the center activation value $\sigma(\mu_k)$, and $\mathcal{L}_G(G)$ is minimal when all activations saturate to either 0 or 1.
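The sigmoidal NMFD model, Equation (3), thus optimizes the following objective function:

$$\mathcal{L}_{\text{tot}} = \mathcal{L}_{KL}(X, \hat{X}) + \gamma \, \mathcal{L}_G(G). \qquad (7)$$

The hyperparameter $\gamma$ weighs the relative importance of the regularization term $\mathcal{L}_G$ with respect to the spectrogram reconstruction objective $\mathcal{L}_{KL}$; in this paper, we set $\gamma = 1$.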
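As an illustration of the reconstruction term of this objective, the sketch below evaluates a generalized KL divergence between a spectrogram and its approximation. It assumes the common unnormalized form $\sum_{n,t} (X \log(X/\hat{X}) - X + \hat{X})$; the small constant `eps` is an assumption added for numerical safety, and the regularization term is deliberately left out of this sketch.

```python
import numpy as np

def kl_divergence(X, X_hat, eps=1e-12):
    """Generalized KL divergence between non-negative arrays X and X_hat."""
    X = np.asarray(X, dtype=float)
    X_hat = np.asarray(X_hat, dtype=float)
    # eps guards against log(0) and division by zero for silent bins.
    return float(np.sum(X * np.log((X + eps) / (X_hat + eps)) - X + X_hat))

X = np.array([[0.5, 1.0]])
loss_same = kl_divergence(X, X)          # perfect reconstruction
loss_double = kl_divergence(X, 2.0 * X)  # reconstruction that is twice too loud
```

The divergence vanishes for a perfect reconstruction and is positive otherwise; for `X_hat = 2 X` it equals `X.sum() * (1 - log 2)`.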

Optimization Procedure

Optimization Procedure Overview
Like the original NMFD algorithm [9], the model parameters $W^{(k)}_{n,\tau}$, $G_{k,t}$ and $a_k$ are optimized by means of an iterative optimization procedure in order to minimize the loss $\mathcal{L}_{\text{tot}}$. First, the parameters are initialized as explained in Section 3.2. Then, $G$, $W$ and $a$ are updated iteratively as follows:

1. Calculate $\hat{X}$ using the most recent estimates for $W$, $G$ and $a$, as in Equation (3).
2. Update $G$, $W$ and $a$ using Equations (9), (12) and (13), respectively.

Repeat these steps until convergence.

Additive Gradient-Descent Update for G
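$\mathcal{L}_{\text{tot}}$ is minimized with respect to $G$ using gradient descent:

$$G_{k,t} \leftarrow G_{k,t} - \eta_G \left( \frac{\partial \mathcal{L}_{KL}}{\partial G_{k,t}} + \gamma \frac{\partial \mathcal{L}_G}{\partial G_{k,t}} \right). \qquad (9)$$

The learning rate $\eta_G$ is a hyperparameter of the optimization procedure. The expanded forms of the partial derivatives in Equation (9), i.e., Equations (10) and (11), are derived in Appendix A.1. Note that we regard $\mu_k$ as a constant in the derivation of Equation (11). We do so in order to avoid instabilities in the updates for the ultimate values of the activations; see Appendix A.1 for details.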
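The reconstruction part of this gradient follows from the chain rule and can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes the sigmoidal model $\hat{X}_{n,t} = \sum_k \sigma(a_k) \sum_\tau W^{(k)}_{n,\tau} \sigma(G_{k,t-\tau})$ and the unnormalized generalized KL divergence, and it covers only the KL term (i.e., $\gamma = 0$, as in the warm-up stage). All function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruct(W, G, a):
    """Sigmoidal NMFD reconstruction; W is (K, N, L), G is (K, T), a is (K,)."""
    K, N, L = W.shape
    T = G.shape[1]
    H = sigmoid(a)[:, None] * sigmoid(G)
    X_hat = np.zeros((N, T))
    for tau in range(min(L, T)):
        X_hat[:, tau:] += W[:, :, tau].T @ H[:, :T - tau]
    return X_hat

def kl_divergence(X, X_hat):
    return float(np.sum(X * np.log(X / X_hat) - X + X_hat))

def kl_grad_G(X, W, G, a):
    """Gradient of the KL term with respect to the logit-activations G."""
    K, N, L = W.shape
    T = G.shape[1]
    R = 1.0 - X / reconstruct(W, G, a)          # dL_KL / dX_hat
    back = np.zeros((K, T))
    for tau in range(min(L, T)):
        # G[k, t] influences X_hat[:, t + tau] through W[k, :, tau].
        back[:, :T - tau] += W[:, :, tau] @ R[:, tau:]
    S = sigmoid(G)
    return sigmoid(a)[:, None] * S * (1.0 - S) * back

rng = np.random.default_rng(0)
K, N, L, T = 2, 3, 4, 12
W = rng.uniform(0.1, 1.0, (K, N, L))
G = rng.uniform(-2.0, 2.0, (K, T))
a = np.full(K, 2.0)
X = rng.uniform(0.1, 1.0, (N, T))
grad = kl_grad_G(X, W, G, a)
G_new = G - 0.5 * grad   # one plain descent step with eta_G = 0.5 and gamma = 0
```

The factor `S * (1 - S)` is the derivative of the logistic function; it vanishes in the saturating regions, which is why saturated activations change only slowly under this update.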

Multiplicative Update for W
$W$ is optimized using a multiplicative update rule, which ensures that $W$ remains non-negative. Its derivation from Equation (7) is analogous to that in Schmidt and Mørup [11]; see Appendix A.2. This gives the update rule of Equation (12). After each update, $W^{(k)}$ is scaled to maximum amplitude 1 for each $k$.
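For intuition about how a multiplicative template update works, the sketch below implements the plain gradient-ratio update for the KL divergence, followed by the rescaling to maximum amplitude 1. This is explicitly not a reproduction of the paper's Equation (12), which follows Schmidt and Mørup [11] and may contain additional terms; the function name `update_W`, the `(K, N, L)` layout, and the constant `eps` are assumptions of this illustration. For the sigmoidal model, `H` would hold $\sigma(a_k)\sigma(G_{k,t})$.

```python
import numpy as np

def update_W(X, W, H, eps=1e-12):
    """One gradient-ratio multiplicative update of W (K, N, L), given H (K, T)."""
    K, N, L = W.shape
    T = H.shape[1]
    X_hat = np.zeros((N, T))
    for tau in range(min(L, T)):
        X_hat[:, tau:] += W[:, :, tau].T @ H[:, :T - tau]
    ratio = X / (X_hat + eps)
    W_new = W.copy()
    for tau in range(min(L, T)):
        H_shift = H[:, :T - tau]                      # (K, T - tau)
        num = H_shift @ ratio[:, tau:].T              # sum_t ratio[n, t] * H[k, t - tau]
        den = H_shift.sum(axis=1, keepdims=True)      # sum_t H[k, t - tau]
        W_new[:, :, tau] = W[:, :, tau] * num / (den + eps)
    # Rescale every template to maximum amplitude 1.
    W_new /= W_new.max(axis=(1, 2), keepdims=True) + eps
    return W_new

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 1.0, (3, 20))
W = rng.uniform(0.1, 1.0, (2, 3, 4))
H = rng.uniform(0.0, 1.0, (2, 20))
W_new = update_W(X, W, H)
```

Because every factor in the update is non-negative, the templates can never become negative, which is the property the multiplicative form is chosen for.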

Additive Gradient-Descent Update for a
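$\mathcal{L}_{\text{tot}}$ is minimized with respect to $a$ using gradient descent:

$$a_k \leftarrow a_k - \eta_a \frac{\partial \mathcal{L}_{\text{tot}}}{\partial a_k}, \qquad (13)$$

where $\eta_a$ is the learning rate; the expression for $\partial \mathcal{L}_{\text{tot}} / \partial a_k$ is derived in Appendix A.3.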
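As a hedged worked step (not a reproduction of the appendix), this gradient can be obtained with the chain rule. The derivation assumes that the regularization term depends on $G$ only, as its notation $\mathcal{L}_G(G)$ suggests, that the model is $\hat{X}_{n,t} = \sum_k \sigma(a_k) \sum_\tau W^{(k)}_{n,\tau} \sigma(G_{k,t-\tau})$, and that the KL divergence has the unnormalized form $\sum_{n,t} (X_{n,t} \log (X_{n,t}/\hat{X}_{n,t}) - X_{n,t} + \hat{X}_{n,t})$. Using $\sigma'(x) = \sigma(x)(1 - \sigma(x))$:

```latex
\begin{align}
\frac{\partial \mathcal{L}_{KL}}{\partial \hat{X}_{n,t}}
  &= 1 - \frac{X_{n,t}}{\hat{X}_{n,t}}, \\
\frac{\partial \hat{X}_{n,t}}{\partial a_k}
  &= \sigma(a_k)\bigl(1 - \sigma(a_k)\bigr)
     \sum_{\tau=0}^{L-1} W^{(k)}_{n,\tau}\, \sigma(G_{k,t-\tau}), \\
\frac{\partial \mathcal{L}_{\text{tot}}}{\partial a_k}
  &= \sigma(a_k)\bigl(1 - \sigma(a_k)\bigr)
     \sum_{n,t} \Bigl(1 - \frac{X_{n,t}}{\hat{X}_{n,t}}\Bigr)
     \sum_{\tau=0}^{L-1} W^{(k)}_{n,\tau}\, \sigma(G_{k,t-\tau}).
\end{align}
```

Under these assumptions, the amplitude of a component decreases when that component, on average, overshoots the target spectrogram, and increases when it undershoots.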

Optimization Strategies to Escape Local Minima
The model parameters are updated by iteratively applying Equations (9), (12), and (13). This is, however, a delicate task, as the optimization is prone to converge to poor local minima due to the regularization term $\mathcal{L}_G$. In the update for $G$, this regularization pushes $G_{k,t}$ away from $\mu_k$; see Equation (11). If the algorithm has not converged yet, then this prevents new peaks from growing or existing peaks from shrinking, even if this would eventually lead to a better optimum. Imposing $\mathcal{L}_G$ too strongly or too early during optimization could therefore hinder convergence to a better local minimum.
We therefore propose and evaluate three optimization strategies that could help to find better local minima, as detailed below. In general, the optimization happens in different stages: an unconstrained warm-up stage, an explore-and-converge stage, and an ultimate finalization stage.
The goal of the unconstrained warm-up stage is to make the algorithm converge from its initialization (see Section 3.2) to a rough approximation of the spectrogram and a first estimate of the activation functions. During this stage, $\gamma = 0$, so that in this initial exploration, the activations are free to converge to the values that best approximate the spectrogram. We furthermore set $\eta_G = 0.5$ and $\eta_a = 0.02$. The warm-up stage is 30 iterations long. Empirically, this leads to good results in our experiments.
After the warm-up stage comes the explore-and-converge stage, wherein the proposed saturation regularization is applied and where an optimal solution is sought for the decomposition problem. In this paper, we consider the following strategies to execute this stage:

• Optimization strategy 0: straightforward optimization. In this strategy, $\mathcal{L}_G$ is applied with $\gamma = 1.0$ at each iteration, and $\mu_k$ is calculated as in Equation (6) with $\alpha_k = 0.5$.
• Optimization strategy 1: staged application of $\mathcal{L}_G$. In this strategy, we periodically enable and disable $\mathcal{L}_G$ by alternating between "saturation sub-stages" and "fine-tuning sub-stages", which each last several iterations. During a saturation sub-stage, $\gamma = 1$, so that the activations are pushed towards saturation. During a fine-tuning sub-stage, $\gamma = 0$, so that the model has time to make peaks grow or shrink against the direction imposed by $\mathcal{L}_G$, in order to escape poor local minima.
• Optimization strategy 2: moving $\mu_k$ throughout optimization. In this strategy, we impose $\mathcal{L}_G$ at each iteration, i.e., $\gamma = 1.0$ for each update. In order to avoid squashing small peaks too early and additionally provide an incentive to escape local minima, we move around the "center point" $\mu_k$ of $\mathcal{L}_G$ (Equation (6)) by changing $\alpha_k$ in each iteration. More specifically, for each update of $G$ and component $k$, we set $\alpha_k$ to a random value drawn from a uniform distribution over the interval (0.05, 0.25). We hypothesize that setting $\alpha_k$ to a relatively low value ($\alpha_k < 0.5$) helps to boost relatively small peaks, and that randomly sampling $\alpha_k$ could help to escape local optima.
• Optimization strategy 3: combine strategy 1 and strategy 2. This strategy combines the two aforementioned strategies: $\mathcal{L}_G$ is enabled and disabled alternatingly, and when it is applied, $\mu_k$ is moved around by sampling $\alpha_k$ from a uniform distribution over (0.05, 0.25) for each update of $G$.
During the explore-and-converge stage, we set $\eta_G = 0.2$, as we find that this leads to a good convergence in our experiments. The learning rate $\eta_a$ for the amplitudes $a$ remains unchanged, i.e., $\eta_a = 0.02$. We perform 180 iterations in total during this stage. For strategy 1 and strategy 3, each sub-stage is 30 iterations long, so that the explore-and-converge stage consists of three repetitions of alternating saturation and fine-tuning sub-stages.
The finalization stage concludes the optimization process. It consists of a final 30 iterations, in which we set $\gamma = 1$, $\eta_G = 0.1$, $\eta_a = 0.02$, and $\alpha_k = 0.5$. This allows the algorithm to converge to the final solution.
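The three stages and four strategies can be summarized as a per-iteration hyperparameter schedule. The generator below is a sketch under stated assumptions: the function name and dictionary keys are hypothetical, each 60-iteration cycle of strategies 1 and 3 is assumed to start with the saturation sub-stage, and $\alpha_k = 0.5$ is assumed wherever the text does not specify another value (including strategy 1 and the warm-up stage, where $\gamma = 0$ makes it irrelevant).

```python
import numpy as np

def hyperparameter_schedule(strategy, K, rng):
    """Yield one dict of hyperparameters per iteration (240 in total)."""
    fixed_alpha = np.full(K, 0.5)
    # Warm-up stage: 30 iterations without saturation regularization.
    for _ in range(30):
        yield dict(gamma=0.0, eta_G=0.5, eta_a=0.02, alpha=fixed_alpha)
    # Explore-and-converge stage: 180 iterations.
    for i in range(180):
        staged = strategy in (1, 3)
        # Assumption: sub-stages of 30 iterations, saturation sub-stage first.
        saturating = (not staged) or (i // 30) % 2 == 0
        if strategy in (2, 3):
            alpha = rng.uniform(0.05, 0.25, size=K)   # resampled for every update
        else:
            alpha = fixed_alpha
        yield dict(gamma=1.0 if saturating else 0.0,
                   eta_G=0.2, eta_a=0.02, alpha=alpha)
    # Finalization stage: 30 iterations with the regularization enabled.
    for _ in range(30):
        yield dict(gamma=1.0, eta_G=0.1, eta_a=0.02, alpha=fixed_alpha)

schedule = list(hyperparameter_schedule(strategy=3, K=4, rng=np.random.default_rng(0)))
```

Writing the schedule out this way makes it easy to verify that every strategy runs for the same total number of iterations, so that the strategies can be compared on equal footing.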
In our experiments, we found that normalizing the gradients of $G$ and $a$ in Equations (9) and (13) to a maximum amplitude of 1 for each component is important to ensure that the activations in each activation curve $G_k$ grow equally quickly. Otherwise, one component could grow much quicker than the others and start to dominate the decomposition, often resulting in a poor local optimum of $\mathcal{L}_{\text{tot}}$ where only one component is active (also see Section 4.2).
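One possible reading of this normalization for the gradient of $G$ is to divide each row (one row per component) by its largest absolute value. The sketch below illustrates that reading only; the function name and the constant `eps` are assumptions, and the exact normalization used in the experiments may differ in detail.

```python
import numpy as np

def normalize_rows(grad_G, eps=1e-12):
    """Scale each component's gradient so that its largest magnitude is 1."""
    grad_G = np.asarray(grad_G, dtype=float)
    peak = np.max(np.abs(grad_G), axis=1, keepdims=True)
    return grad_G / (peak + eps)

grad = np.array([[2.0, -4.0],
                 [0.1, 0.05]])
normalized = normalize_rows(grad)
```

After normalization, the largest step taken in each activation curve is the same for every component, regardless of how loud that component is in the mixture.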

Baseline Models
We consider two baseline models to compare with. The first baseline model uses the original NMFD formulation from Equation (1) without additional constraints. The second baseline adds an L1 sparsity constraint with weighting factor $\lambda$ to the objective function, Equation (4):

$$\mathcal{L}_{L1} = \mathcal{L}_{KL}(X, \hat{X}) + \lambda \lVert H \rVert_1.$$

In our experiments, we consider sparsity weights of $\lambda = 1.0$ (strong sparsity constraint), $\lambda = 0.1$ (medium sparsity constraint), and $\lambda = 0.01$ (weak sparsity constraint).
For both the unconstrained NMFD baseline and the L1-constrained baselines, we use the update rules from Schmidt and Mørup [11], as we observed numerical instabilities for the original NMFD update rules from the work in Smaragdis [9] when $W^{(k)}$ contains columns that are much smaller in amplitude than other columns, i.e., when the sample captured by $W^{(k)}$ is silent at certain points. Similar observations were made in Lindsay-Smith et al. [12]. As for the sigmoidal model, 240 update iterations are performed. We furthermore evaluate two optimization strategies for the sparse baselines. In the first, L1 regularization is applied throughout the entire optimization, starting from the first iteration. In the second, the L1 regularization is disabled for the first 30 iterations, i.e., $\lambda$ is set to 0. The reason for this second optimization strategy is that applying the L1 regularization too early might hinder proper convergence. This allows us to evaluate whether the baselines would benefit from an "unconstrained warm-up stage" as used for the sigmoidal model.
For all baselines, the spectral templates $W^{(k)}$ are initialized in the same way as for the sigmoidal model (see Section 3.2), and are scaled to a maximum amplitude of 1 after each update, as in the sigmoidal model. The activations $H$ are initialized with random values drawn from a uniform distribution over $(0, 10^{-3})$.

Model Initialization
L is set to 50, which at a sample rate of 44,100 Hz and short-time Fourier transform (STFT) hop size of 256 corresponds to a template length of 290 ms.We set the number of components K to the number of percussive instruments in the mixture, which we assume is known in advance.
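The conversion from STFT frames to milliseconds used here (and for the peak-picking windows later on) is simple arithmetic; a small helper makes it explicit. The helper name `frames_to_ms` is an assumption of this sketch.

```python
SAMPLE_RATE = 44100   # Hz
HOP_SIZE = 256        # samples between successive STFT frames

def frames_to_ms(n_frames, hop_size=HOP_SIZE, sample_rate=SAMPLE_RATE):
    """Duration in milliseconds spanned by n_frames STFT hops."""
    return 1000.0 * n_frames * hop_size / sample_rate

template_ms = frames_to_ms(50)   # 50 * 256 / 44100 s, i.e., about 290 ms
```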
The templates $W^{(k)}$ are initialized using an averaged spectrogram template of drum hits of four common drum instruments: kick drum, snare drum, hi-hat, and crash cymbal. These average templates are created using a small dataset of individual drum hits [20], by averaging the aligned spectra of the single-hit samples of the desired instrument type. The first four components are initialized with a kick, hi-hat, snare, and crash template; if $K > 4$, then the excess components are initialized by alternating between the hi-hat template and the snare drum template. Each $W^{(k)}$ is also rescaled to a maximum amplitude of 1.
For the sigmoidal model, the logit-activations $G_{k,t}$ are initialized with random values drawn from a uniform distribution over the interval $(-5, -4)$ so that $\sigma(G_{k,t}) \in (0.0067, 0.018)$. The logit-amplitudes $a_k$ are initialized to 2, so that $\sigma(a_k) \approx 0.88$.
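The initialization of the logit parameters can be sketched as follows; the function name is hypothetical, and the template initialization from averaged drum hits is not covered because it depends on the external sample dataset.

```python
import numpy as np

def init_sigmoidal_parameters(K, T, rng):
    """Initial logit-activations G (K, T) and logit-amplitudes a (K,)."""
    G = rng.uniform(-5.0, -4.0, size=(K, T))   # almost-silent activations
    a = np.full(K, 2.0)                        # amplitudes close to, but below, 1
    return G, a

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

G, a = init_sigmoidal_parameters(K=4, T=100, rng=np.random.default_rng(0))
initial_activations = sigmoid(G)
initial_amplitudes = sigmoid(a)
```

Starting from small but non-zero activations leaves every frame free to grow into a peak during the warm-up stage.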

Dataset
The algorithm is evaluated on the ENST dataset [21], which contains annotated recordings of percussion-only pieces performed by three drummers on three different drum kits using a variety of beaters (sticks, rods, mallets, and brushes). We only use the wet mix of the "phrase" recordings, i.e., short drum sequences in various popular styles. There are 135 phrases in total, varying in tempo (labeled "slow", "medium", and "fast" in the dataset), and complexity ("simple", i.e., straight and without ornaments, and "complex", i.e., with fill-ins and ornaments). We modify the recordings slightly by cutting away the last hit of each recording. The motivation for this is that the last hit often "rings out", i.e., it has a long decay, which cannot be appropriately modeled in the NMFD paradigm, given the fixed and limited template length $L$. (Note that, while hits on the same component as the last hit should have the same decay time, we do not observe any issues with modeling these hits throughout the mixture. We note two reasons for this. First, when an instrument with a long decay is hit, it is typically hit again before the first hit can fully "ring out", i.e., its decay is "cut short" by the second hit. Second, the low-volume "tail" of the decay is often masked by the sound of subsequent hits on other components. These two effects effectively reduce the template length $L$ that is required to model the sounds in the mixture, except for the last hit where these effects do not apply. An alternative solution would be to increase $L$ in order to fully capture these hits with a long decay within a single template. We choose not to do so, however, as this makes the optimization of the NMFD algorithm hard and prone to error, as discussed in Vande Veire et al. [17].)

Spectrogram Representation
For the experiments in this paper, we use a custom Mel-frequency-scale log-power spectrogram representation. This spectral representation is designed to reduce the computation time needed to obtain the decomposition, while maintaining a sufficiently fine resolution in order to distinguish all relevant sounds in the mixture.
The audio sample rate is 44,100 Hz. First, the STFT power spectrogram $X = |\mathrm{STFT}(y)|^2$ of the audio $y$ is calculated using a frame length of 2048, a hop size of 256, and a Hann window over the frames. Then, the values of the STFT spectrogram are rescaled to the range (0, 1), and they are summed along the frequency axis over $N$ adjacent, non-overlapping frequency bands. The boundary frequencies of these bands are spaced according to a Mel scale between 0 Hz and 11,025 Hz. Finally, a small value $\epsilon_{\text{pre-dB}}$ is added to the power spectrogram, which is then converted to a decibel scale and scaled to the range $(\epsilon_{\text{post-dB}}, 1)$.
Using the Mel scale makes the low to mid frequencies much more prominent in the resulting spectrogram as compared to a linear-scale spectrogram, where a higher proportion of the bins would be allocated to higher frequencies. Using non-overlapping windows ensures that each STFT bin only contributes to one Mel-scale bin, so that the resulting spectrogram is less blurred. We find that these properties help to better distinguish between different instruments, especially those with a prominent presence in the low and mid frequencies (kick drum, snare drum, toms, bongos, etc.). A small number of bins $N$ significantly reduces the time needed to decompose the spectrogram. We find that a limited number of bins is sufficient for a good decomposition and choose $N = 25$. Adding a small value $\epsilon_{\text{pre-dB}}$ before the decibel transformation masks low-valued noise, so that the resulting dB-scale spectrogram optimally uses its value range to differentiate between relevant power differences, making it clearer and easier to decompose. Scaling the spectrogram values to values in $(\epsilon_{\text{post-dB}}, 1)$ after transforming it to a dB scale is required for the sigmoidal model to be able to approximate all spectrogram values, while a minimum value of $\epsilon_{\text{post-dB}}$ is used to avoid numerical instabilities in various computations. We set $\epsilon_{\text{pre-dB}} = 10^{-7}$ and $\epsilon_{\text{post-dB}} = 10^{-9}$.
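The processing chain described above can be sketched with NumPy alone. Several details are assumptions of this sketch rather than facts from the text: the Mel formula (the common $2595 \log_{10}(1 + f/700)$ variant), the absence of padding at the signal edges, rescaling by the global maximum, and min-max scaling as the final mapping to $(\epsilon_{\text{post-dB}}, 1)$. The function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_log_power_spectrogram(y, sr=44100, n_fft=2048, hop=256, n_bands=25,
                              f_max=11025.0, eps_pre=1e-7, eps_post=1e-9):
    """Mel-banded log-power spectrogram of shape (n_bands, n_frames)."""
    # STFT power spectrogram with a Hann window (no padding at the edges).
    frames = np.lib.stride_tricks.sliding_window_view(y, n_fft)[::hop]
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)).T ** 2
    power = power / power.max()                    # rescale to at most 1
    # Assign every STFT bin to exactly one Mel-spaced, non-overlapping band.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(f_max), n_bands + 1))
    bin_freqs = np.arange(n_fft // 2 + 1) * sr / n_fft
    band = np.searchsorted(edges, bin_freqs, side="right") - 1
    mel = np.stack([power[band == b].sum(axis=0) for b in range(n_bands)])
    # Decibel conversion, then min-max scaling to the range (eps_post, 1).
    db = 10.0 * np.log10(mel + eps_pre)
    scaled = (db - db.min()) / (db.max() - db.min())
    return eps_post + (1.0 - eps_post) * scaled

# One second of white noise as a stand-in for a drum recording.
y = np.random.default_rng(0).standard_normal(44100)
S = mel_log_power_spectrogram(y)
```

With only 25 bands, the matrix to decompose has far fewer rows than the 1025-bin STFT, which is where the reduction in computation time comes from.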

Evaluation Metrics
In our evaluation, we wish to quantify the quality of unsupervised decompositions of drum mixtures, and compare these quantities for the outcome of the sigmoidal model and of the baseline models. Note that the unsupervised nature of the decomposition makes it more difficult to rely on ground-truth transcriptions of the music, as there is no guarantee that the extracted components would match one-to-one with instruments in the music. Turning NMFD into a transcription algorithm would require incorporating some supervision mechanism that guides the components $W^{(k)}$ to the desired musical interpretation, which is beyond the scope of this paper. Therefore, alternative evaluation metrics need to be used that quantify the quality or "usefulness" of such unsupervised decompositions.
We consider a decomposition to be of good quality if it meets a number of criteria; the metrics to quantitatively assess these criteria are described in the following subsections. Some metrics are calculated over the activations $H_{k,t}$; when evaluated for the sigmoidal model, $H_{k,t}$ should be substituted by $\sigma(G_{k,t})$ in these metrics.

Spectrogram Reconstruction Quality
The spectrogram reconstruction quality is measured using the mean absolute error (MAE) between the target spectrogram and its reconstruction:

$$\mathrm{MAE} = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \left| X_{n,t} - \hat{X}_{n,t} \right|.$$

As all spectrogram values are scaled to the range $(\epsilon_{\text{post-dB}}, 1)$, MAE values of different spectrograms can be compared.

Overall Onset Coverage
We measure whether each onset in the drum mixture is accounted for by the decomposition, without considering instrument information. First, peak picking is performed on each row of H. A value $H_{k,t}$ at an offset t is considered to be a peak if it satisfies three conditions [22]:
1. $H_{k,t} = \max(H_{k,t-\tau_{\max}:t+\tau_{\max}})$,
2. $H_{k,t} \geq \operatorname{mean}(H_{k,t-\tau_{\text{avg}}:t+\tau_{\text{avg}}}) + \theta_{\text{thr}} \max_t(H_{k,t})$,
3. $t - t_{\text{prev}} > \tau_{\text{wait}}$,
where $t_{\text{prev}}$ is the offset of the last peak detected before t and where the hyperparameters are set as $\tau_{\max} = 5$ (corresponding to 29 ms) and $\tau_{\text{avg}} = \tau_{\text{wait}} = 10$ (58 ms). We vary the value of the peak picking threshold $\theta_{\text{thr}}$ within the range (0.1, 0.9) in order to evaluate its influence on the metric proposed below.
The detected peaks are then shifted by the "template offset" $\tau^{(k)}_{\text{off}}$, which is calculated as the smallest value of τ for which the envelope of $W^{(k)}$, i.e., $\sum_n W^{(k)}_{n,\tau}$, is larger than the average envelope value:

$$\tau^{(k)}_{\text{off}} = \min \Big\{ \tau : \sum_n W^{(k)}_{n,\tau} > \operatorname{mean}_{\tau'} \Big( \sum_n W^{(k)}_{n,\tau'} \Big) \Big\}.$$

This is necessary as the percussive hit modeled by $W^{(k)}$ might be shifted by some offset $\tau^{(k)}_{\text{off}}$ in the template. These peaks are then compared with the ground-truth annotations. A peak in the decomposition is considered a true positive if there is a ground-truth onset of any instrument within the tolerance interval of 29 ms around that peak; otherwise, it is a false positive. Ground-truth annotations for which there is no peak detected within the tolerance interval around it are false negatives. The precision, recall, and F-measure are calculated using these true positive, false positive, and false negative counts.
Note that this metric allows a ground-truth onset to be "covered" by multiple activation peaks and vice versa, and that peaks from any component can match with onsets from any instrument. We do not attempt to match components with specific instruments in the ground-truth annotations, as this is a difficult task that is prone to error and ambiguity, and we consider this beyond the scope of the unsupervised decomposition that is considered in this paper.
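The template offset and the instrument-agnostic matching can be sketched as follows. This is an illustrative sketch, not the evaluation code itself: the function names are hypothetical, peak and onset times are assumed to be given in seconds (after shifting the peaks by the template offset), and precision and recall are computed as the fraction of matched peaks and the fraction of covered ground-truth onsets, respectively.

```python
import numpy as np

def template_offset(W_k):
    """Smallest lag at which the envelope of a template exceeds its mean.

    W_k has shape (num_bins, num_lags); the envelope sums over the bins.
    """
    envelope = np.asarray(W_k, dtype=float).sum(axis=0)
    above = np.flatnonzero(envelope > envelope.mean())
    return int(above[0]) if above.size else 0

def onset_coverage(peak_times, onset_times, tolerance=0.029):
    """Precision, recall and F-measure with many-to-many matching."""
    peaks = np.asarray(peak_times, dtype=float)
    onsets = np.asarray(onset_times, dtype=float)
    if peaks.size and onsets.size:
        # close[i, j] is True if peak i lies within the tolerance of onset j.
        close = np.abs(peaks[:, None] - onsets[None, :]) <= tolerance
        matched_peaks = int(close.any(axis=1).sum())
        covered_onsets = int(close.any(axis=0).sum())
    else:
        matched_peaks = covered_onsets = 0
    precision = matched_peaks / peaks.size if peaks.size else 0.0
    recall = covered_onsets / onsets.size if onsets.size else 0.0
    denom = precision + recall
    f_measure = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f_measure
```

Because no component-to-instrument assignment is made, a single onset may be covered by peaks from several components without being penalized.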

Activation Curve Similarity
This metric quantifies how different the activations from each component are from the activations of any other component in the decomposition. We consider a decomposition to be of higher quality if the different activation curves are disentangled, i.e., they activate often at distinct times in the mixture. Each component then models drum hits that are not modeled by other components. On the other hand, a high similarity between activation curves indicates that multiple components often contribute to the same onsets, so that it could be difficult to figure out the relationship between instruments in the mixture and components in the decomposition.
Note that we expect the activations in the decomposition to have at least a low amount of similarity, as the onsets in different rhythmic instruments are often correlated and will coincide at least sometimes. However, an exceedingly high similarity value would be unexpected, as we expect distinct instruments to have at least some degree of uniqueness to their activations, and it is this undesired behavior that we wish to detect by using this metric.
To quantify activation curve similarity for a given activation matrix H, each activation curve is first smoothed and made non-zero using a running mean operation:

$$\bar{H}_{k,t} = \epsilon_H + \frac{1}{2M+1} \sum_{m=-M}^{M} H_{k,t+m}. \qquad (18)$$

We set M to 5, corresponding to a symmetric tolerance interval of 29 ms around each t. This smoothing makes it possible to better compare two activation curves that capture the same onsets but that are slightly shifted with respect to each other. Then, the cosine similarity is calculated between every pair of rows in $\bar{H}$. The small value $\epsilon_H = 10^{-52}$ ensures that comparing one row with an all-zero row of H still results in a meaningful metric value; for example, comparing two all-zero rows of H should result in a similarity value of 1. After calculating the pairwise similarity of all rows, we consider the minimum, mean and maximum similarity between any pair of rows to quantify the amount of differentiation between the activations for each decomposition.
A high value for the maximum similarity indicates that there are at least some components that detect more or less the same hits in the mixture, which is undesirable.
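The similarity computation can be sketched as follows. This is a minimal illustration under stated assumptions: the small constant is added after a zero-padded running mean, the curves contain at least 2M + 1 frames, and the function names are hypothetical.

```python
import numpy as np

def smooth_rows(H, M=5, eps_h=1e-52):
    """Running mean over 2M+1 frames per row, made non-zero by eps_h.

    Assumption: zero padding at the boundaries; each row needs at least
    2M+1 frames for np.convolve(mode="same") to preserve its length.
    """
    H = np.asarray(H, dtype=float)
    kernel = np.ones(2 * M + 1) / (2 * M + 1)
    return np.stack([np.convolve(row, kernel, mode="same") for row in H]) + eps_h

def activation_similarity(H, M=5, eps_h=1e-52):
    """Minimum, mean and maximum pairwise cosine similarity of the rows of H."""
    Hs = smooth_rows(H, M, eps_h)
    unit = Hs / np.linalg.norm(Hs, axis=1, keepdims=True)
    S = unit @ unit.T
    # Only consider each unordered pair of distinct rows once.
    pairs = S[np.triu_indices(len(Hs), k=1)]
    return float(pairs.min()), float(pairs.mean()), float(pairs.max())
```

With this construction, two components that fire on the same frames have a similarity close to 1, two components with disjoint activations have a similarity close to 0, and two all-zero curves have a similarity of 1.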

Peakedness Measure
This metric quantifies to what extent a decomposition is impulse-like, by comparing the original activation curve with a processed version in which peaks are accentuated and small values are removed. We define the half-wave rectification operation HWR(x)[t], which removes the values of x that are smaller than $\bar{x}$, the smoothed version of x (see Equation (18)). We furthermore define the compansion (compression-expansion) operation $\mathrm{comp}_\kappa(x)[t]$ with exponent κ, which rescales the values of x relative to the maximum value of x. If κ > 1, then $\mathrm{comp}_\kappa(x)$ makes relatively small values even smaller compared to the maximum value of x, accentuating large values. If κ < 1, then $\mathrm{comp}_\kappa(x)$ makes relatively small values larger. We then calculate the peak-accentuated version of $H_k$ as

$$H^{(\text{peaks})}_k = \mathrm{comp}_{1/\kappa}\big(\mathrm{HWR}(\mathrm{comp}_\kappa(H_k))\big),$$

with κ = 3. This operation should be understood as follows. First, the inner compansion accentuates the highest peaks in $H_k$, while making smaller peaks even smaller. The HWR operation then removes values that are smaller than the running mean around it, which further accentuates peaks and removes low-valued noise. The outer compansion then restores the peaks to their original relative scale, as long as they were not removed by the HWR operation.
The peakedness of an activation curve $H_{k,t}$ is defined as the ratio

$$\frac{\sum_t H^{(\text{peaks})}_{k,t}}{\sum_t H_{k,t}}.$$

If this ratio is close to 1, then $H_{k,t}$ changed very little by the impulse-accentuating operation, i.e., it was already quite impulse-like itself. If the ratio is lower, however, then around the peaks in $H_{k,t}$ there must be low values that are removed by the HWR operation when calculating $H^{(\text{peaks})}_{k,t}$, meaning that the activations are less impulse-like.
We report the average of the peakedness values of all activation curves of H.
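The following sketch illustrates the peakedness computation for a single activation curve. The concrete forms used here are assumptions that are consistent with the description above but are not confirmed by it: HWR(x)[t] = max(x[t] − x̄[t], 0), comp_κ(x)[t] = max(x) · (x[t] / max(x))^κ, and a running mean with edge padding. All function names are hypothetical.

```python
import numpy as np

def running_mean(x, M=5):
    """Running mean over 2M+1 frames, with edge padding (assumption)."""
    padded = np.pad(x, M, mode="edge")
    kernel = np.ones(2 * M + 1) / (2 * M + 1)
    return np.convolve(padded, kernel, mode="valid")

def hwr(x, M=5):
    """Assumed half-wave rectification: keep only the excess over the running mean."""
    return np.maximum(x - running_mean(x, M), 0.0)

def comp(x, kappa):
    """Assumed compansion: exponentiate values relative to the maximum of x."""
    peak = x.max()
    if peak <= 0:
        return np.zeros_like(x)
    return peak * (x / peak) ** kappa

def peakedness(h, kappa=3.0, M=5):
    """Ratio of the summed peak-accentuated curve to the summed original curve."""
    h = np.asarray(h, dtype=float)
    total = h.sum()
    if total <= 0:
        return 0.0
    h_peaks = comp(hwr(comp(h, kappa), M), 1.0 / kappa)
    return float(h_peaks.sum() / total)
```

Under these assumptions, an isolated impulse yields a peakedness of 10/11 for M = 5, whereas a constant curve yields a value of approximately 0.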

Implementation Details
Our NMFD implementation is loosely based on the NMFD implementation from López-Serrano et al. [23]. All code for this paper is made available on a public online repository [24].

Results
This section presents the evaluation of our method. Sections 4.1 and 4.2 present the evaluation on the ENST dataset. Section 4.3 presents a case study to visually illustrate the effects of our method.
The purpose of our evaluation is twofold. The first objective is to compare the performance of the sigmoidal model and of the baselines in terms of the proposed evaluation metrics. This comparison is presented in Section 4.1 and Table 1. In this evaluation, the sigmoidal NMFD model is optimized with a simplified and straightforward optimization strategy, i.e., optimization strategy 0 with a constant learning rate $\eta_G$. Comparing with this simplified algorithm helps us understand to what extent the observed improvements are caused by the proposed model itself, rather than by certain elements in the optimization strategy (also see Section 4.2). For completeness, Table 1 also reports the results for the sigmoidal model optimized with the best performing optimization strategy, i.e., strategy 2 with γ set to 0.1 during the explore-and-converge stage.
The second objective of the evaluation is to provide an in-depth analysis of the additional gains that can be achieved by using more advanced optimization strategies. This analysis is provided in Section 4.2. We furthermore perform an ablation study to quantify the impact of several techniques we use in the optimization of our model. From this evaluation, we conclude that more advanced optimization strategies and techniques help to achieve better local minima of the objective function $L_{\text{tot}}$.

Evaluation on the ENST dataset
Table 1 shows the results of the evaluation on the ENST dataset. As discussed in Section 3.1, the baselines are evaluated once with and once without an "unconstrained warm-up stage". We found that performing a warm-up stage for optimizing the sparse baselines leads to virtually the same results as not using that technique, i.e., the outcome in terms of the metrics reported in Table 1 is exactly or almost exactly the same. For the sake of conciseness, we therefore omit the results for the sparse baselines with a warm-up stage from Table 1. We conclude that the sparse baselines are not hindered in their convergence by applying regularization from the beginning of the optimization. The conclusions drawn in the following evaluation are therefore valid for all sparse baselines, regardless of whether or not an unconstrained warm-up stage has been applied.

Table 1. Comparison of the performance of the NMFD baseline, the sparse NMFD baselines, and the proposed sigmoidal NMFD model on the evaluation metrics. For each metric, the mean value over all 135 phrases is shown (standard deviation between parentheses). The results for the peakedness metric for the strong sparsity baseline (*) are computed after discarding any all-zero activation curves in H. For each metric, the best value in each column is shown in bold.

Regarding the spectrogram reconstruction quality, the average MAE is low for most models, i.e., all spectrograms are approximated well. For the high sparsity baseline, the mean MAE is approximately twice as high as for the other models, suggesting that the L1 regularization in this baseline is too strong and leads to a worse spectrogram approximation. On average, the approximations by the medium sparsity baseline are comparable to those by the sigmoidal model in terms of MAE, and slightly worse than those by the unconstrained NMFD algorithm. This result is expected, as the unconstrained model optimizes only the reconstruction loss $L_{\text{KL}}$, while the other models have to take additional constraints into account.
In terms of onset coverage, all algorithms achieve a similar F-measure, with slight differences in precision and recall. The baseline NMFD model and the weak sparsity model give a better recall, while the sigmoidal model and the high sparsity baseline lead to an improved precision. The sigmoidal model and high sparsity baseline thus yield fewer false positives at the expense of missing more ground-truth hits. (The precision is equal to the ratio of the number of peaks in the activations that "match" a ground-truth hit over the total number of detected peaks. Therefore, an improved precision means that a higher proportion of peaks in the activations correspond with a ground-truth hit. The recall is equal to the ratio of the number of ground-truth hits that were detected, i.e., that have a "match" in the activations, over the total number of ground-truth hits. For the sigmoidal model and high sparsity baseline, the recall is lower than for the other baselines, but the precision is higher: hence, on average, fewer ground-truth hits are detected, i.e., a lower recall, but the peaks that are detected in the activations are more likely to correspond with a ground-truth hit, i.e., a higher precision.) Based on the low-threshold onset coverage metrics, all algorithms seem to perform approximately equally well in detecting the onsets in the mixture.
This conclusion changes when a high threshold is used for peak picking. Table 1 shows the F-measure when the peak-picking threshold is changed to $\theta_{\text{thr}} = 0.5$, i.e., in each activation curve, a peak is only considered if it is at least half as high as the largest value in the curve. In this case, the F-measure drops for all models; however, the decrease is much more severe for the baseline models, whereas the performance of the sigmoidal model remains relatively stable. The decrease in performance is most pronounced for the strong sparsity baseline. In other words, the activations discovered by sigmoidal NMFD are the least sensitive to the specific choice of peak picking threshold, which is an indirect indication that the activations are approximately equally high, i.e., they exhibit binary behavior. In the other baselines, there is more variation in peak height within each activation curve, and this increases with increasing L1 sparsity. For completeness, Figure 3 shows the evolution of the F-measure as a function of the peak picking threshold $\theta_{\text{thr}}$. This again shows that the performance decrease for increasing $\theta_{\text{thr}}$ is more severe for the baselines with a stronger sparsity term, whereas the sigmoidal model maintains a much more stable performance for increasing $\theta_{\text{thr}}$.
In terms of activation curve similarity, both the unconstrained NMFD baseline and the low sparsity baseline (λ = 0.01) have an average minimum and mean similarity that is considerably higher than that of the other models. A non-zero minimum similarity is not necessarily undesired: percussive events of different instruments in the same recording are often correlated, so some similarity is to be expected. However, too much similarity might indicate the undesired result that the discovered components all represent parts of the same percussive onsets, leading to an entangled decomposition that is hard to interpret. Visual inspection of the decompositions (see Section 4.3) will indeed show that this is the case for the unconstrained and low sparsity baselines.
When the L1 sparsity is too high, we observe a substantial increase in mean and maximum activation similarity. This is because in this case many activation curves are effectively "disabled" by becoming identically zero, so that only a fraction of the allocated number of activation curves is effectively used to capture onsets. All the "disabled" activation curves of course show a high similarity between each other.
The best performing baseline is the medium sparsity baseline (λ = 0.1). This baseline has a slightly lower average minimum and mean activation curve similarity than the sigmoidal model. However, as mentioned in Section 3.5.3, a low value of the similarity metric is to be expected, and it is hard to compare values of the metric when both comparands are reasonably low; therefore, we do not draw any conclusions from this observation. Nevertheless, as with the other baselines, this baseline also shows a considerably higher average maximum activation similarity compared to the sigmoidal model, indicating that it creates decompositions that often contain at least some components that are highly correlated.
We conclude that the proposed sigmoidal model outperforms all baselines in terms of average maximum activation curve similarity, indicating that it on average makes better use of the allocated "capacity", i.e., the number of components K, to model distinct sound events in the mixture. This means that the decompositions from the sigmoidal model are more disentangled and therefore more likely to be interpretable.
Finally, the results for the peakedness metric show that the proposed approach indeed yields much more peaked activations than the non-regularized NMFD and sparse NMFD. A perhaps surprising result is that enforcing L1 sparsity does not lead to a notable increase in peakedness.
We conclude that the proposed sigmoidal model yields decompositions where the activations are considerably more impulse-like than those of the considered baselines, which is the desired outcome of the proposed approach. By design, the activation peaks are furthermore more uniform in height, which makes performing peak picking on the obtained activations less sensitive to the specific choice of the peak picking threshold. From the activation curve similarity, we conclude that the components are better disentangled, while the MAE and onset coverage metrics show that the algorithm maintains a good spectrogram reconstruction and onset detection quality. Both of these conclusions hold when the model is optimized with the simplified optimization strategy, as well as when a more advanced optimization strategy is used, suggesting that the improvements are not caused by the particular optimization strategy but rather by the model itself. However, applying a more advanced optimization strategy does help the model to achieve slightly better local minima of the loss function $L_{\text{tot}}$, as will be shown in Section 4.2.
In Section 4.3, we show by example that these results can improve the interpretability of the decomposition.

Evaluation of the Optimization Strategies and Techniques
In this section, we evaluate the efficacy of the optimization strategies proposed in Section 2.3.5. The goal of this analysis is to evaluate to what extent more advanced optimization schemes lead to a better minimization of the loss $L_{\text{tot}}$, compared to a more straightforward optimization strategy. More specifically, we consider the following settings:
• strategy 0, i.e., straightforward optimization with γ = 1.0 and "static" $\mu_k$;
• strategy 1, i.e., a staged application of $L_G$;
• strategy 2, i.e., setting $\mu_k$ to a random and relatively small value for each update of G;
• strategy 3, i.e., the combination of strategy 1 and strategy 2;
• each of the above, but with γ = 0.1 during the explore-and-converge stage, in order to evaluate the effect of applying the regularization less strongly during the exploration stage (for strategies 1 and 3, γ remains 0 during the fine-tuning sub-stages).
Note that the performance of these strategies will still be evaluated with the original formulation of $L_G$, i.e., with γ = 1.0. We furthermore perform ablation experiments in order to assess the importance of
• the component-wise normalization of the gradients of G and a when performing the updates;
• the unconstrained warm-up stage, i.e., performing a few iterations of unconstrained optimization before $L_G$ is applied; and
• the step-wise adaptation of the learning rate $\eta_G$ throughout the optimization procedure.
We evaluate each strategy by its ability to minimize the objective function $L_{\text{tot}}$. To do so, we calculate the loss per timestep $L_{\text{tot}}/T$ for each decomposed spectrogram, and then report the average loss per timestep over all 135 examples in the ENST dataset. Dividing the loss of each decomposed spectrogram by the number of timeframes T of that spectrogram results in a quantification of the decomposition loss that is insensitive to the duration of the decomposed drum recording, so that we can appropriately average over all examples in the dataset, which contains recordings of varying lengths. (A recording that is twice as long as another, but that is decomposed equally well, is expected to have a loss $L_{\text{tot}}$ that is twice as high as that of the shorter recording, as $L_{\text{tot}}$ scales linearly with the length of the decomposed mixture if a constant decomposition quality is assumed.) We also report on the different metrics defined in Section 3.5.
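As a small illustration of this normalization, the sketch below computes the mean and standard deviation of the loss per timestep over a set of decompositions. The function name and the example values are hypothetical and are only meant to show that the duration of a recording cancels out.

```python
import numpy as np

def loss_per_timestep_stats(total_losses, num_frames):
    """Mean and standard deviation of L_tot / T over a set of decompositions."""
    per_step = np.asarray(total_losses, dtype=float) / np.asarray(num_frames, dtype=float)
    return float(per_step.mean()), float(per_step.std())

# Hypothetical example: the second recording is twice as long and has twice
# the total loss, so both contribute the same loss per timestep.
mean_loss, std_loss = loss_per_timestep_stats([100.0, 200.0], [400, 800])
```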
The results of this evaluation are shown in Table 2. We observe that both strategy 1 and strategy 2 are effective by themselves, as both techniques lead to a lower average loss per timestep than the straightforward optimization of $L_{\text{tot}}$, i.e., strategy 0 with γ = 1.0. Strategy 2 is most effective, as it obtains the lowest loss on average. It also leads to a lower standard deviation of the average loss, implying that it finds better local minima more consistently. Furthermore, combining strategy 1 and 2, i.e., strategy 3, results in a further decrease of the total loss only when γ = 1.0, but not for γ = 0.1. This means that alternatingly enabling and disabling $L_G$ does not necessarily offer an additional advantage compared to only moving around $\mu_k$ throughout optimization.
Setting γ = 0.1 during the explore-and-converge stage does not lead to a consistent improvement. For strategy 0 and strategy 2, it seems to lead to slightly better results. For strategy 1, it does not seem to make a difference, i.e., using both γ = 1.0 and γ = 0.1 leads to the same average loss per timestep. For strategy 3, using γ = 0.1 even leads to a slight increase in average loss per timestep. The best results in terms of the average loss per timestep are obtained for strategy 2 with γ = 0.1 during the explore-and-converge stage, closely followed by strategy 3 with γ = 1.0. In terms of the metrics from Section 3.5, all variants seem to perform comparably well, and better than the baseline models, see Table 1. On average, the strategies with γ = 1.0 often lead to better separated components as indicated by the mean and maximum activation similarity, and also a slightly higher peakedness, which might be a desirable property of the decomposition.
We repeat the experiment for the best performing setting in terms of average loss per timestep, i.e., strategy 2 with γ = 0.1, but without the component-wise normalization of the gradients of G and a when performing the updates. This leads to the highest mean and standard deviation for the loss per timestep, indicating that normalizing the gradients component-wise indeed makes the algorithm's performance more consistent and reliable.
We furthermore perform an ablation study in order to assess the importance of the unconstrained warm-up stage at the beginning of the optimization. To do so, we repeat our experiments, but with $L_G$ enforced during the first 30 iterations, i.e., with γ = 0.1 or γ = 1.0 (depending on the particular experiment) instead of γ = 0. The results are reported in Table 2 for the best performing original setting, i.e., strategy 2 with γ = 0.1, as well as for the simplest optimization strategy, i.e., strategy 0; the results of the other strategies are omitted in Table 2 for conciseness, but are described in the following discussion.
For the strategies with γ = 1.0, not performing an unconstrained warm-up leads to considerably worse results. For strategies 0 and 2, there is a severe increase in the mean loss per timestep (from 0.26 to 5.90 for strategy 0, from 0.23 to 3.97 for strategy 2). For strategies 1 and 3, there is also a considerable increase in mean loss per timestep, but it is not as severe as for the other two strategies (from 0.24 to 0.73 for strategy 1, from 0.21 to 0.34 for strategy 3); periodically disabling the regularization term during the explore-and-converge stage, a technique that is used in both strategy 1 and strategy 3, seems to help to recover from the poor initial convergence due to applying $L_G$ too early in the optimization process.
Table 2. Optimization strategy evaluation results: comparison of the metrics evaluated on the outcome of each optimization strategy. For each metric, the mean value over all 135 phrases is shown (standard deviation between parentheses). For each metric, the best value in each column is shown in bold.

For the strategies with γ = 0.1, the mean loss per timestep also increases, although not as drastically as with γ = 1.0. More specifically, in this case, not performing an initial convergence stage leads to a mean loss per timestep of 0.46, 0.35, 0.23, and 0.24 for strategies 0, 1, 2, and 3 respectively, compared to a mean loss per timestep of 0.24, 0.24, 0.20, and 0.22 originally. We suspect that setting γ relatively low at the beginning of the optimization and during the explore-and-converge stage allows the algorithm to still converge to a reasonable approximation of the spectrogram before $L_G$ is applied with γ = 1.0 in the finalization stage, which leads to better results compared to setting γ = 1.0 throughout the entire optimization process.
We conclude that an unconstrained warm-up stage is essential for a proper optimization of the sigmoidal model if the regularization strength is relatively large. If $L_G$ is applied less strongly during the earlier iterations of the optimization, then it still is beneficial to perform a warm-up stage, although the performance decrease when not doing so is not as severe, and with more advanced optimization techniques (e.g., strategy 2 or 3) the results become comparable with those for the algorithms with an initial convergence stage. Note that these observations contrast with the conclusion for the sparse baselines, which do not seem to benefit from using a similar unconstrained warm-up stage, as evaluated in Section 4.1.
Finally, we perform an ablation study in order to better understand the impact of fine-tuning the learning rate $\eta_G$ of the logit-activations throughout the optimization procedure. As discussed in Section 2.3.5, $\eta_G$ is set to 0.5 in the warm-up stage, then decreased to 0.2 for the explore-and-converge stage, and is finally set to 0.1 for the finalization stage.
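The step-wise learning rate can be summarized as a simple schedule, sketched below. The three values of $\eta_G$ and the 30 warm-up iterations are taken from the text; the function name and the length of the explore-and-converge stage (`explore_iters`) are hypothetical placeholders, as the stage boundaries are not restated here.

```python
def eta_g_schedule(iteration, warmup_iters=30, explore_iters=200):
    """Step-wise learning rate for the logit-activations G.

    explore_iters is a hypothetical placeholder for the length of the
    explore-and-converge stage; only the three rates come from the text.
    """
    if iteration < warmup_iters:
        return 0.5   # unconstrained warm-up stage
    if iteration < warmup_iters + explore_iters:
        return 0.2   # explore-and-converge stage
    return 0.1       # finalization stage
```

The constant-rate ablation corresponds to replacing this schedule by a fixed value for all iterations.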
In this ablation test, $\eta_G$ is set to 0.2 throughout the entire optimization procedure. This is done for strategy 0 with γ = 1.0 and for strategy 2 with γ = 0.1. The former experiment yields an evaluation of the sigmoidal algorithm optimized in the most straightforward way, i.e., without varying learning rates and with the simplest optimization strategy, i.e., strategy 0. Note that this is the simplified algorithm with which the baselines are compared in Section 4.1. The latter experiment shows the impact of keeping $\eta_G$ constant on the best performing model in terms of average loss per timestep. The results are reported in Table 2.
In short, we find that using a more fine-tuned optimization scheme for G is effective, as it leads to a slightly lower mean loss per timestep (0.28 without tuning vs. 0.26 with tuning for strategy 0 with γ = 1.0, and 0.21 vs. 0.20 for strategy 2 with γ = 0.1). We repeated this experiment with the learning rate $\eta_G$ set to the smaller constant value of 0.1, which performed consistently worse than setting $\eta_G$ to a constant value of 0.2 (average loss per timestep 0.26 for $\eta_G = 0.1$ vs. 0.21 for $\eta_G = 0.2$ for strategy 2 with γ = 0.1; average loss per timestep 3.41 for $\eta_G = 0.1$ vs. 0.28 for $\eta_G = 0.2$ for strategy 0 with γ = 1.0). This shows that, when $\eta_G$ is kept constant, it is important to choose an appropriate value for $\eta_G$ to ensure a proper convergence of the optimization process.

Example Decomposition
Figures 2 and 4-6 show the decomposition of an example drum loop using, respectively, unconstrained NMFD, sparse NMFD with λ = 0.1, sparse NMFD with λ = 1.0, and sigmoidal NMFD (more examples of decompositions are provided as Supplementary Material to this paper). We do not show the decomposition using the weak sparsity baseline, λ = 0.01, as the results are almost identical to those by the unconstrained model. Note that all decompositions reconstruct the spectrogram approximately equally well, except the reconstruction with high sparsity (Figure 5).
As mentioned in the introduction (Section 1.3), the activations discovered by unregularized NMFD (Figure 2) have two undesirable properties. The first is that the activations are rather "smeared out", with a sharp initial onset followed by a slowly decaying amplitude. Some small activations are not even preceded by a sharp initial onset, making it hard to determine whether they correspond to a detected drum hit or not. This does not correspond with the expected impulse-like nature of activations of percussive instruments.
The second problem is that the activation curves are highly correlated, so that many drum hits are modeled by a mixture of all of the components. This makes it difficult to interpret the resulting decomposition.
When considering the decompositions by sparse NMFD (Figures 4 and 5), it becomes clear that imposing L1 sparsity does not lead to significantly more impulse-like activations. Even more troublesome is that imposing more sparsity by increasing λ actually hinders a good decomposition: Figure 5 illustrates that a high λ causes all but one of the components to become inactive, i.e., their activations are identically zero, in order to minimize the (extreme) sparsity constraint as much as possible. This effect is unfortunately also sometimes observed even for reasonable values of λ (Figure 4), so that the effective capacity of the NMFD model is reduced in order to minimize the sparsity constraint.
The aforementioned problems are solved by using the sigmoidal NMFD model (Figure 6). The activations are highly peaked, and each component models distinct parts of the spectrogram. The allocated capacity, i.e., the number of components K, is used effectively, and it is now very clear where specific sounds are repeated in the mixture. It is worth noting that the proposed regularization term $L_G$ only provides a direct bias towards binary "on-off" behavior, and that the impulse-like behavior emerges spontaneously when this bias is applied to the decomposition of a percussive mixture. In this example, the components even lend themselves to a musically meaningful interpretation: the first component captures the low end of the kick drum, the second captures the mid and high end of the kick drum and snare drum, the third component captures hi-hat hits, and the fourth component models the snare drum hits. Note that the second component thus contributes to both the kick drum and the snare drum; unfortunately, some entanglement between components is always possible in an unsupervised decomposition. Nevertheless, the decomposition by the sigmoidal model yields much more interpretable results, with activation curves that show the expected impulse-like behavior.

Conclusions
In this paper, we have approached NMFD as an unsupervised decomposition algorithm for percussive music mixtures. Such an unsupervised decomposition is valuable in application scenarios where the exact instruments in the mixture are unknown, or to bootstrap semi-supervised learning approaches such as the one in Wu and Lerch [8]. We investigated an adapted NMFD model where the activations are biased to be binary in nature, by defining them as the output of a sigmoidal function and by applying a regularization term to push their values to saturation. We observe that this results in activations that are highly impulse-like, which corresponds to the expected properties of percussive activations, and we have shown that the proposed approach is more effective at obtaining such impulse-like behavior than a sparse NMFD baseline using an L1 sparsity constraint. By means of a case study, we illustrated the potential of our approach to yield more interpretable decompositions.
Regarding future work, we remark that our method, like the original NMFD algorithm, is unsupervised, so that the optimization procedure is free to adapt the templates $W^{(k)}$ without considering their musical validity. Even in an informed setting, where each $W^{(k)}$ is initialized with a template of the desired instrument, there is no guarantee that it will converge to a solution where the components map to individual instruments. This issue could be addressed by adding some kind of supervision to the NMFD framework; this could be a supervised learning algorithm that imposes certain musical constraints learned from data, or an interface where a user can guide the decomposition interactively. A related direction for further research would be to investigate other and more informed initialization strategies for the templates $W^{(k)}$, and to research how the initialization of the templates impacts the outcome of the optimization process. A second limitation is that this work assumes that the number of components K is known in advance. A next step could therefore be to reliably estimate this number of components prior to decomposition, or to use an iterative decomposition strategy, where K is increased progressively until the full mixture has been decomposed. Finally, we propose that the idea of combining a regularization term that encourages diverging activation values with saturating activations could be incorporated in other models and use cases where binary activations are desired, for example, in the context of music transcription beyond percussive recordings, or even for sound event detection in general.
Note that the left-hand term in Equation (A6) is always negative when $G_{k,t} > \mu_k$ and always positive when $G_{k,t} < \mu_k$. In the gradient-descent updates, Equation (9), this will always cause $G_{k,t}$ to grow away from $\mu_k$, i.e., it pushes $G_{k,t}$ towards saturation.
However, when $G_{k,t} = \min_t(G_{k,t})$ or $G_{k,t} = \max_t(G_{k,t})$, the right-hand term in Equation (A6) can have the opposite sign of the left-hand term, potentially canceling it out or even updating $G_{k,t}$ in the other direction, i.e., away from saturation. $L_G$ is then minimized not by pushing the activations towards saturation, i.e., moving $G_{k,t}$ away from $\mu_k$, but instead by moving $\mu_k$ away from $G_{k,t}$. This might lead to unstable updates where the ultimate values of $G_{k,t}$ are hindered from growing to saturation, an undesired effect which we wish to prevent. Therefore, we regard $\mu_k$ as a constant when applying the updates, i.e., we ignore the right-hand term in Equation (A6). This gives the expression from Equation (11) for the derivative of $L_G$ with respect to $G_{k,t}$. The expression is exact for all $G_{k,t}$, except when $G_{k,t} = \max_t(G_{k,t})$ or $G_{k,t} = \min_t(G_{k,t})$.
Appendix A.2. Multiplicative Update for W
The derivation of the multiplicative updates for $W^{(k)}_{n,\tau}$ is analogous to the derivation in Schmidt and Mørup [11] and Lee and Seung [25]. Consider the gradient-descent updates for $W^{(k)}_{n,\tau}$ that minimize $L_{\text{tot}}$ with learning rate η: as in Lee and Seung [25] and Schmidt and Mørup [11] so that the first term in Equation (A10) cancels out. This gives The derivative of L tot (X,

Figure 1. Conceptual illustration of the drum mixture decomposition problem: a percussive mixture (a) is decomposed into prototypical samples of the used instruments (b) and the corresponding onsets (c).

Figure 2. Decomposition of a percussive recording using non-negative matrix factor deconvolution (NMFD). The activations are not impulse-like, and contain noisy regions where it is difficult to detect individual drum hits.

Figure 3. Onset coverage F-measure as a function of the peak picking threshold $\theta_{\text{thr}}$.

Figure 4. Decomposition of a drum loop using sparse NMFD, λ = 0.1. Although slightly more peaked than the activations for unconstrained NMFD (Figure 2), the activations do not show impulse-like behavior, and still contain noisy regions where it is difficult to detect individual drum hit onsets. The third component has become "inactive" in order to minimize the sparsity constraint.

Figure 5. Decomposition of a drum loop using sparse NMFD, λ = 1.0. The decomposition fails because the sparsity constraint is too strong, so that only one component remains active.

Figure 6. Decomposition of a drum loop using sigmoidal NMFD. The activations show impulse-like behavior, and each component captures different parts of the spectrogram, leading to an interpretable decomposition. The dashed line indicates the amplitude $a_k$ for each component.
X, G) with respect to W