Adaptive Noise Reduction for Sound Event Detection Using Subband-Weighted NMF

## Abstract

## 1. Introduction

- WNMF, rather than standard NMF, is applied for audio source separation. This introduces control over individual frequencies and time frames of the input mixture signal. Such control helps to emphasize the components that matter most for distinguishing the target sound events from noise, such as the critical subbands of the target sounds, and thus improves the separation quality.
- Noise estimation results from the noise dictionary learning step are exploited in developing both the frequency weights and temporal weights. This produces noise-adapted weights so as to fit the WNMF decomposition to time-varying background noise.

## 2. NMF and Weighted NMF

#### 2.1. NMF

NMF approximates a non-negative matrix **V** by the product of a dictionary matrix $W\in {\mathbb{R}}_{+}^{F\times R}$ and an activation matrix $H\in {\mathbb{R}}_{+}^{R\times T}$, that is, $V\approx WH$. Supposing that **V** represents the magnitude spectrogram of an audio signal with F frequency bins and T time frames, the columns of **W** can be considered as a set of R spectral bases, and the corresponding time-varying gains are stored in the rows of **H**.

To estimate **W** and **H**, an optimization problem is formulated by minimizing the reconstruction error between the input matrix and its approximation under the non-negativity constraint. In the resulting multiplicative update rules, **1** is an F × T matrix with all elements equal to 1, and the superscript T denotes matrix transposition. Once **W** and **H** are initialized with random non-negative values, the multiplicative update rules preserve their non-negativity during iteration.
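The update scheme described above can be sketched in NumPy as follows. This is a minimal illustration that assumes the generalized Kullback-Leibler divergence as the reconstruction error, which is consistent with the all-ones matrix **1** mentioned above but is not confirmed by the surrounding text. The function names `kl_divergence` and `nmf_kl`, the iteration count, and the small constant `eps` are illustrative choices, not part of the original method.

```python
import numpy as np

def kl_divergence(V, V_hat, eps=1e-12):
    """Generalized KL divergence D(V || V_hat), summed over all entries."""
    return float(np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat))

def nmf_kl(V, R, n_iter=200, eps=1e-12, seed=0):
    """Factorize a non-negative F x T matrix V as W @ H with R bases.

    Assumption: KL-divergence cost with standard multiplicative updates.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, R)) + eps   # random non-negative initialization
    H = rng.random((R, T)) + eps
    ones = np.ones((F, T))         # the all-ones matrix "1"
    for _ in range(n_iter):
        # Each factor is multiplied by a non-negative ratio,
        # so non-negativity is preserved at every iteration.
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
    return W, H
```

Because every update multiplies the current factor by a ratio of non-negative quantities, no projection step is needed to keep **W** and **H** non-negative.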

When a target sound event and background noise are mixed in **V**, we have $V\approx {V}_{s}+{V}_{n}$. In this study, we use the subscript s to indicate the target event class and n to indicate the noise. Supposing that prior information on both sound classes is available, as in a supervised case, an event dictionary and a noise dictionary can be trained in advance via standard NMF, denoted by ${W}_{s}\in {\mathbb{R}}_{+}^{F\times {R}_{s}}$ and ${W}_{n}\in {\mathbb{R}}_{+}^{F\times {R}_{n}}$, where ${R}_{c}$ is the number of bases for each sound source c = s or n. The NMF decomposition for the source separation takes the following form [8]:

#### 2.2. Weighted NMF

When **G** is a matrix with all elements equal to 1, Equation (8) is identical to standard NMF. WNMF can be used to emphasize the relative importance of the different components in **V**.
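To make the role of the weight matrix concrete, the sketch below estimates activations for a fixed dictionary under a weighted cost. It assumes that the weighted cost is an entry-wise weighted generalized KL divergence, $\sum_{f,t} G(f,t)\, d(V(f,t) \mid (WH)(f,t))$, which may differ from the exact form of Equation (8). The names `weighted_kl` and `wnmf_activations` are hypothetical.

```python
import numpy as np

def weighted_kl(V, V_hat, G, eps=1e-12):
    """Entry-wise weighted generalized KL divergence (assumed cost)."""
    return float(np.sum(G * (V * np.log((V + eps) / (V_hat + eps)) - V + V_hat)))

def wnmf_activations(V, W, G, n_iter=200, eps=1e-12, seed=0):
    """Estimate H >= 0 for a fixed dictionary W under weights G.

    V, G: F x T non-negative arrays; W: F x R non-negative dictionary.
    With G equal to all ones this reduces to the unweighted KL update.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        ratio = G * V / (W @ H + eps)          # weighted data-to-model ratio
        H *= (W.T @ ratio) / (W.T @ G + eps)   # multiplicative update
    return H
```

For supervised separation, the fixed dictionary could be the concatenation `W = np.hstack([W_s, W_n])`, after which the first `R_s` rows of the returned matrix play the role of ${H}_{s}$ and the remaining rows the role of ${H}_{n}$. Entries of **V** with larger weights contribute more to the cost and are therefore fitted more closely.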

## 3. Proposed Method

#### 3.1. Noise Dictionary Learning by Robust NMF

Given an input spectrogram **V**, robust NMF decomposes it into the following form:

The sparsity of **S** is measured by its $L_1$-norm, and the parameter λ controls the weight of sparsity in the cost function. To estimate the matrices, multiplicative update rules are derived, as follows:

**S** represents the foreground events of the input and may include other salient, undesirable sound events in the background, so it is not suitable for event detection. The procedure of noise dictionary learning by robust NMF is outlined in Algorithm 1.

**Algorithm 1.** Noise dictionary learning by robust NMF

Input: spectrogram of an input signal **V**, the number of noise bases ${R}_{n}$, sparsity parameter $\lambda $

Output: estimated noise dictionary ${W}_{n}$ and spectrogram ${L}_{n}$

| Step | Operation |
|---|---|
| 1 | Initialize ${W}_{n}$, ${H}_{n}$, and **S** with random non-negative values |
| 2 | repeat |
| 3 | update ${W}_{n}$, ${H}_{n}$, and **S** using Equations (14)–(16) |
| 4 | until convergence |
| 5 | Compute ${L}_{n}={W}_{n}{H}_{n}$ |
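The steps of Algorithm 1 can be sketched as below. Since Equations (14)–(16) are not reproduced here, the updates in this sketch are an assumption: they are the multiplicative rules obtained for the cost $D_{KL}(V \mid W_n H_n + S) + \lambda \|S\|_1$, and may differ from the rules actually used. The names `robust_cost` and `robust_nmf_kl` are hypothetical, and a fixed iteration count stands in for the convergence check.

```python
import numpy as np

def robust_cost(V, L, S, lam, eps=1e-12):
    """Assumed cost: generalized KL(V || L + S) plus lam times the L1-norm of S."""
    M = L + S
    return float(np.sum(V * np.log((V + eps) / (M + eps)) - V + M) + lam * np.sum(S))

def robust_nmf_kl(V, R_n, lam, n_iter=200, eps=1e-12, seed=0):
    """Decompose V into a low-rank part W @ H and a sparse part S."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    # Step 1: random non-negative initialization
    W = rng.random((F, R_n)) + eps
    H = rng.random((R_n, T)) + eps
    S = rng.random((F, T)) + eps
    ones = np.ones((F, T))
    # Steps 2-4: iterate (fixed count used here instead of a convergence test)
    for _ in range(n_iter):
        W *= ((V / (W @ H + S + eps)) @ H.T) / (ones @ H.T + eps)
        H *= (W.T @ (V / (W @ H + S + eps))) / (W.T @ ones + eps)
        S *= (V / (W @ H + S + eps)) / (1.0 + lam)  # larger lam shrinks S more
    # Step 5: low-rank noise spectrogram estimate
    L = W @ H
    return W, L, S
```

In this sketch the sparsity parameter enters only through the denominator of the update for **S**, so a larger λ pushes more of the input energy into the low-rank noise estimate.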

#### 3.2. Source Separation by Supervised and Weighted NMF

#### 3.2.1. Frequency Weighting Based on Subband Importance

#### 3.2.2. Temporal Weighting Based on Event Presence Probability

#### 3.2.3. Combined Time-Frequency Weighting

**Algorithm 2.** Source separation by supervised and weighted NMF

Input: spectrogram of an input noisy signal **V**, training spectrogram for the target event class ${V}_{s}^{train}$ and the event dictionary ${W}_{s}$, estimated noise dictionary ${W}_{n}$ and spectrogram ${L}_{n}$, parameters ${T}_{0}$, ${r}_{min}$, ${r}_{max}$, and theTypeOfWeighting

Output: activations ${H}_{s}$ and ${H}_{n}$

| Step | Operation |
|---|---|
| 1 | switch theTypeOfWeighting do |
| 2 | case frequency_weighting |
| 3 | calculate frequency weights using Equations (17)–(19), and set $G(f,t)={g}_{freq}(f,t)$ |
| 4 | case temporal_weighting |
| 5 | calculate temporal weights using Equations (17)–(22), and set $G(f,t)={g}_{temp}(t)$ |
| 6 | case time_frequency_weighting |
| 7 | calculate time-frequency weights using Equations (17)–(23), and set $G(f,t)={g}_{freq+temp}(f,t)$ |
| 8 | otherwise |
| 9 | $G(f,t)=1,\ \forall f,t$ |
| 10 | endsw |
| 11 | Initialize ${H}_{s}$ and ${H}_{n}$ with random non-negative values |
| 12 | repeat |
| 13 | update ${H}_{s}$ and ${H}_{n}$ using Equation (10) |
| 14 | until convergence |
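The switch in steps 1 to 10 of Algorithm 2 amounts to choosing an F × T weight matrix. The sketch below illustrates only that selection and the shape handling. It assumes that the weights themselves have already been computed elsewhere, because Equations (17)–(23) are not reproduced here. The function `select_weights` and its argument names are hypothetical.

```python
import numpy as np

def select_weights(kind, shape, g_freq=None, g_temp=None, g_tf=None):
    """Return the F x T weight matrix G for the requested weighting type.

    g_freq: precomputed F x T frequency (subband) weights.
    g_temp: precomputed length-T temporal weights, shared by all frequencies.
    g_tf:   precomputed F x T combined time-frequency weights.
    Any other value of `kind` yields all-ones weights (no weighting).
    """
    F, T = shape
    supplied = {
        "frequency_weighting": g_freq,
        "temporal_weighting": g_temp,
        "time_frequency_weighting": g_tf,
    }
    if kind not in supplied:
        return np.ones((F, T))  # the "otherwise" branch: G(f, t) = 1
    weights = supplied[kind]
    if weights is None:
        raise ValueError(f"weights for {kind!r} were not supplied")
    G = np.asarray(weights, dtype=float)
    if kind == "temporal_weighting":
        G = np.tile(G.reshape(1, -1), (F, 1))  # repeat g_temp(t) over f
    if G.shape != (F, T):
        raise ValueError(f"expected weights of shape {(F, T)}, got {G.shape}")
    return G
```

The all-ones fallback corresponds to the unweighted case, in which the decomposition reduces to standard supervised NMF.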

#### 3.3. Event Detection

## 4. Experimental Results

#### 4.1. Dataset and Metric

- TP: a detected event whose temporal duration overlaps with that of an event in the reference, under the condition that the output onset is within the range of 500 ms of the actual onset;
- FP: a detected event that has no correspondence to any events in the reference under the onset condition;
- FN: an event in the reference that has no correspondence to any events in the system output under the onset condition.
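The three counts above can be illustrated with a short sketch. It assumes that events are given as `(onset, offset)` pairs in seconds, that each reference event can be matched at most once, and that matching is done greedily in order of detected onset; none of these details are specified in the definitions above. The F-score is computed with the standard formula $F = 2TP/(2TP+FP+FN)$. The function names are hypothetical.

```python
def match_events(reference, detected, tolerance=0.5):
    """Count TP, FP, and FN for lists of (onset, offset) events in seconds.

    A detection is a TP if it overlaps an unmatched reference event and
    its onset lies within `tolerance` seconds of that event's onset.
    """
    unmatched = list(reference)
    tp = 0
    for d_on, d_off in sorted(detected):
        for ref in unmatched:
            r_on, r_off = ref
            overlaps = d_on < r_off and r_on < d_off
            if overlaps and abs(d_on - r_on) <= tolerance:
                unmatched.remove(ref)  # each reference event is used once
                tp += 1
                break
    fp = len(detected) - tp   # detections with no matching reference event
    fn = len(unmatched)       # reference events that were never matched
    return tp, fp, fn

def f_score(tp, fp, fn):
    """Standard F-score from the three counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

For example, with reference events `[(1.0, 2.0), (5.0, 6.0)]` and detections `[(1.2, 2.1), (3.0, 3.5), (5.8, 6.2)]`, only the first detection satisfies both the overlap and the 500 ms onset condition. The third overlaps a reference event but starts 0.8 s late, so it counts as an FP and the second reference event counts as an FN.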

#### 4.2. Parameter Selection

The dictionary sizes ${R}_{s}=32$ and ${R}_{n}=32$ were good choices, offering strong detection performance at an acceptable computational load.

#### 4.3. Detection Results and Comparative Analysis

## 5. Conclusions

**Figure 1.** Framework of the proposed sound event detection method based on non-negative matrix factorization (NMF) [14].

**Figure 2.** A practical example of calculating subband weights. (**a**) Spectrogram of a baby cry event and its spectral template; (**b**) an example of the estimated noise spectrogram and the noise template for a specific frame (the pictured template is calculated within the frames from 6 s to 10 s, as marked by the dashed box); (**c**) subband weights for that frame; (**d**) subband weight matrix for all frames.

**Figure 3.** A practical example of calculating temporal weights as well as time-frequency weights. (**a**) Spectrogram and the energy curve of the input noisy signal (frames where the baby cry event is active are marked with *); (**b**) spectrogram and the energy curve of the filtered signal; (**c**) energy increase curve after filtering and the corresponding temporal weights; (**d**) time-frequency weights that combine temporal weights and the subband weights in Figure 2d.

**Figure 5.** F-score results for three event classes under different values of the sparsity parameter λ. The results are obtained on the development dataset by the supervised NMF method without weighting [14].

**Figure 6.** Detection results of the proposed weighted methods compared to two baseline approaches. The test noisy signal is shown in Figure 3a. (**a**) Results of the semi-supervised NMF approach; (**b**) results of the supervised NMF approach with noise dictionary learning, but without weighting. Results of the proposed supervised and weighted NMF approach with (**c**) frequency weighting, (**d**) temporal weighting, and (**e**) time-frequency weighting.

**Figure 7.** Performance comparison of the proposed method with other methods submitted to Task 2 of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE 2017) challenge.

**Table 1.** Error rate (ER) and F-score (F) results of the proposed method for three event classes on the evaluation dataset.

| Method | Baby Cry ER | Baby Cry F (%) | Glass Break ER | Glass Break F (%) | Gunshot ER | Gunshot F (%) | Average ER | Average F (%) |
|---|---|---|---|---|---|---|---|---|
| Proposed supervised NMF + combined weighting | 0.10 | 94.8 | 0.06 | 96.9 | 0.46 | 76.2 | 0.21 | 89.3 |
| Proposed supervised NMF + frequency weighting | 0.11 | 94.0 | 0.13 | 93.7 | 0.51 | 74.0 | 0.25 | 87.2 |
| Proposed supervised NMF + temporal weighting | 0.14 | 92.4 | 0.12 | 94.3 | 0.52 | 73.3 | 0.26 | 86.7 |
| Proposed supervised NMF, no weighting [14] | 0.17 | 91.4 | 0.22 | 89.1 | 0.55 | 72.0 | 0.31 | 84.2 |
| Semi-supervised NMF | 0.29 | 84.9 | 0.36 | 81.3 | 0.65 | 60.7 | 0.43 | 75.6 |
| Subband filtering [26] | 0.62 | 66.4 | 0.25 | 86.7 | 0.54 | 67.5 | 0.47 | 73.5 |

