Data-Dependent Feature Extraction Method Based on Non-Negative Matrix Factorization for Weakly Supervised Domestic Sound Event Detection

: In this paper, feature extraction methods are developed based on the non-negative matrix factorization (NMF) algorithm to be applied in weakly supervised sound event detection. Recently, the development of various features and systems have been attempted to tackle the problems of acoustic scene classiﬁcation and sound event detection. However, most of these systems use data-independent spectral features, e.g., Mel-spectrogram, log-Mel-spectrum, and gammatone ﬁlterbank. Some data-dependent feature extraction methods, including the NMF-based methods, recently demonstrated the potential to tackle the problems mentioned above for long-term acoustic signals. In this paper, we further develop the recently proposed NMF-based feature extraction method to enable its application in weakly supervised sound event detection. To achieve this goal, we develop a strategy for training the frequency basis matrix using a heterogeneous database consisting of strongly-and weakly-labeled data. Moreover, we develop a non-iterative version of the NMF-based feature extraction method so that the proposed feature extraction method can be applied as a part of the model structure similar to the modern “on-the-ﬂy” transform method for the Mel-spectrogram. To detect the sound events, the temporal basis is calculated using the NMF method and then used as a feature for the mean-teacher-model-based classiﬁer. The results are improved for the event-wise post-processing method. To evaluate the proposed system, simulations of the weakly supervised sound event detection were conducted using the Detection and Classiﬁcation of Acoustic Scenes and Events 2020 Task 4 database. The results reveal that the proposed system has F1-score performance comparable with the Mel-spectrogram and gammatonegram and exhibits 3–5% better performance than the log-Mel-spectrum and constant-Q transform.


Introduction
More and more studies have been targeting machine learning and artificial intelligence recently. Machine recognition of environments and sound events using acoustic signals have been of particular interest to researchers [1][2][3][4]. There are two main tasks related to the automatic recognition of acoustic signals: acoustic scene classification (ASC) and sound event detection (SED). These tasks are often not clearly distinguished. ASC mainly focuses on the recognition of long clips, e.g., a 10-s clip, to classify the whole acoustic environment [5], whereas SED tends to analyze short sound events, e.g., dog barking or alarm ringing, to determine their types and obtain onset/offset information [6].
The extraction of proper features from the acoustic signals is the first and important step in SED using machine learning algorithms. The most common acoustic features for ASC and SED are Mel-frequency cepstral coefficients (MFCC) [1,7,8] and Mel-frequency spectrum [9][10][11]. These are the variations of frequency-domain features used for a characterization of the human hearing system [12]. Mel-frequency-based features have been successfully used in speech signal processing systems employing machine learning techniques. The features exhibit better performance in ASC and SED compared with other existing ones. However, their performance is limited by the less-structured acoustic environmental signals compared with speech signals [5]. Therefore, several alternative features, including a computer-vision-inspired feature [13] and statistical characteristics [14], have been investigated. Recently, several features inspired by the characteristics of the human hearing system, e.g., Mel-frequency discrete wavelet coefficients [15][16][17], gammatonegram [18], and gammatone-frequency cepstral coefficients [19], have been investigated. The vast majority of the developed features, including the MFCC and psycho-acoustic-based features, are data-independent feature extraction methods because the feature extraction processes are consistent regardless of the data characteristics in the given problems.
Recently, several data-dependent feature extraction methods, including principal component analysis (PCA) [20] and non-negative matrix factorization (NMF) [21], have also been developed. The NMF method has been widely employed to analyze and extract the signal characteristics in the recent acoustic signal processing fields, including music information retrieval [22][23][24] and speech signal processing [25][26][27].
As the NMF method can extract the common property of the given signals, datadependent feature extraction methods based on NMF have been developed [5,28,29]. The method developed in [5] is a supervised task-driven dictionary learning (TDL) with a limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm [30]. However, feature extraction using the TDL method is strongly related to the classification step. Thus, it is difficult to apply the TDL feature extraction method to other techniques, such as the recently suggested convolutional neural network-based classifiers. Unsupervised NMF-based feature extraction methods were also developed [28,29]. The method in [29] capitalized on the convolutional NMF and K-means clustering to deal with the uncategorized dataset. The disadvantages of the unsupervised NMF-based feature extraction method are that the data matrix to be handled is too large and the computational cost is quite high. To overcome these disadvantages, Lee and Pang developed a supervised feature extraction method for the monitoring domestic activity problem [31]. The algorithm in [31] exhibited a performance comparable to the state-of-the-art features. However, this study was limited to monitoring domestic activities, which included the activity class recognition problem (without onset/offset information) only.
In this paper, we develop and analyze a data-dependent feature extraction method for weakly supervised domestic SED. To achieve this goal, we start with the NMF-based feature extraction method [31] for the monitoring domestic activity problem. Unfortunately, this method [31] cannot be directly applied to our problem, because the configuration of the training data is different. Therefore, we develop and analyze several strategies to utilize the data for feature extraction. In addition, we consider the recent trend of making the feature extraction step a part of the neural network or developing "on-the-fly" feature extraction systems [32,33] in the acoustic signal processing. The conventional NMF-based feature extraction methods [5,31] cannot be applied to "on-the-fly" systems because they require dozens or hundreds of iterations. To overcome this problem, we develop a matrix multiplication-based feature extraction method without any iteration.

Problem Description
As mentioned in the Introduction, we aim to develop a data-dependent feature extraction method for weakly supervised domestic SED system. The target system and problem are designed in accordance with the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Task 4 [34]. The target system aims to detect sound events with class, onset time, and offset time labels through a weakly-supervised training. In this training, some of the dataset annotations are omitted. More specifically, the dataset may be categorized into three types: strongly-labeled, weakly-labeled, and unlabeled data. The strongly-labeled data provide all the required information: sound class, onset time, and offset time annotations. The weakly-labeled data provide only part of the information, for example the sound class label only. The unlabeled data do not provide any information, only waveforms. A schematic diagram of the target system is presented in Figure 1.
Our goal is not to design the system itself but to develop a data-dependent feature extraction method for the system. Thus, we focus on the non-negative matrix decomposition technique, which is a sparse representation tool for acoustic signal data.

Non-Negative Matrix Factorization
The NMF method, which was developed by Lee and Seung [21,35], is employed to decompose a non-negative matrix V ∈ R + K×N into two matrices, W ∈ R + K×R and H ∈ R + R×N , that satisfy the following relation [35]: The matrices W and H can be estimated via alternating optimization of the two equations as [35,36] where D(V|WH) denotes the distance function between V and WH. The distance functions can be optimized by the multiplicative update rule as [35] when the Kullback-Leibler divergence (KLD) is used as the distance function, where 1 K×N denotes a K × N matrix of ones, and when the Euclidean distance is used, where ⊗ and the fraction denote the Hadamard (element-wise) product and division, respectively. The NMF method has been widely employed in the recent acoustic signal processing studies to analyze a music signal into several musical notes or to reduce noise signals from the speech signals. Such interest can be attributed to the fact that the frequency basis matrix, W, and the temporal basis matrix, H, represent the frequency characteristics and temporal envelop of acoustic signal components, respectively. More specifically, speech denoizing methods divide the bases into two classes, speech and noise, and remove the noise bases after calculating the temporal bases of each class, as shown in Figure 2.

Strategy for the Frequency Basis Learning
Similar to the previous studies [5,31], the temporal basis matrix H is used as a feature. To extract the feature matrix, the frequency basis matrix needs to be estimated in advance. Inspired by speech denoizing [27] and reverberation suppression [37] techniques based on NMF [27], the frequency basis matrix of each class is independently estimated from the spectrogram set of the class, as presented in Figure 3a. In this method, the frequency basis matrix is divided into C groups as [31] Each group denoted by W c is a K × R C matrix, where R C denotes the basis number of each class. The frequency and temporal basis matrices can be estimated by optimizing various cost functions. KLD is one of the common choices in the acoustic signal processing field. Therefore, the frequency and temporal basis matrices are iteratively updated by which is similar to Equations (4) and (5), where V c and H c denote K × N c data matrix and R C × N c temporal basis matrix, respectively, and N c stands for the total number of frames in the data matrix of the cth class.
Since weakly-supervised SED has a heterogeneous database, different learning strategies for each type of data need to be developed. In the case of strongly-labeled data, which contains the class and temporal information, the data matrix is a set of audio clips with a cut-off part, as presented in Figure 3b. As can be seen in the figure, there is a risk of data mixing from different classes. For example, if we assume that we want to generate the data matrix V for Class A, there may be various combinations, e.g., A-B, A-C, and A-D. However, even in this case, the data from Class A is expected to be dominant. Thus, the frequency matrix W can be expected to represent the characteristics of Class A if we choose a sufficiently small rank for matrix W.  For the weakly-labeled data that do not contain an onset and offset information, the temporal information cannot be utilized. Therefore, we compose the data matrix V by just cascading the clips, as shown in Figure 3c. Unlike the case of the strongly-labeled data, where only parts of the event overlap, in the weakly-labeled data, waveforms from different classes are completely mixed when two or more classes are in one clip. Therefore, it is expected that the training of matrix W is more affected by the unwanted class interference problem compared to strongly-labeled data. Therefore, we evaluate the effect of the unwanted class interference problem on the weakly-labeled data.

Iterative and Non-Iterative Feature Extraction Methods
Once the frequency basis matrix W is trained, the feature matrix H clip can be extracted from the data matrix V clip of an audio clip via the iteration of [31] with minimization of the KLD (as Equation (5)).
In this paper, we also develop a non-iterative feature extraction method. To achieve this goal, we develop a closed-form solution for H clip which makes the gradient of the divergence function with regard to H clip zero (that is, ∇ H clip D V clip |WH clip = 0). In addition, a solution with simple operations, e.g. matrix multiplication, is a preferred one because it is easy to implement.
The NMF-based methods for acoustic signal processing mainly use three types of distance functions: KLD, Euclidean distance, and Itakura-Saito divergence (ISD). The three distance functions are in the β-divergence family, which is expressed by [38]: where x and y are arbitrary values and β is a constant to adjust the cost function among the Euclidean (β = 2), KLD (β = 1), and ISD (β = 0). In our problem, x and y are elements of the matrices V clip and WH clip , respectively, and the gradient of the beta-divergence with regard to H clip is [38] where A .
[n] denotes element-wise nth power of matrix A. Because Equation (13) includes matrix multiplications, element-wise products and power, and the relationship between these operations is also complex. Therefore, it is not easy to find a solution that makes Equation (13) zero, and it is even more difficult to find a solution consisting of simple matrix multiplications. One easy way is to make the term of the left of the Hadamard product, (WH) .[β−2] , equal to one. In this case, β becomes 2.0, and the divergence function becomes the Euclidean distance. The Euclidean distance function can be defined using the Frobenius norm as and the gradient of the cost function with respect to H clip is Since the cost function in Equation (14) is convex, the optimal solution of the cost function makes the gradient zero. Therefore, the optimal solution of H clip needs to satisfy the relationship: and therefore where W † denotes the Moore-Penrose pseudo-inverse of the matrix W. Since the frequency basis matrix W is pre-trained and fixed, W † is also defined a priori. Thus, feature extraction can be performed via a simple production using Equation (17). The pseudo-inverse can be implemented via the singular value decomposition [39]. If we assume that the matrix W is decomposed as where U and V denote K × K and R × R matrices, respectively, and Σ denotes the K × R diagonal matrix whose elements are singular values of W, then the pseudo-inverse W † can be calculated as where Σ † is a matrix formed from Σ by taking the element-wise inverse of the non-zero elements and shaping the matrix as R × K. In practice, the small singular values can be ignored as where [Σ] k,r denotes the (k, r)th element of the matrix Σ and δ is an arbitrary threshold.

Classifier
To develop a feature extraction method and evaluate the performance of our method, a conventional convolutional recurrent neural network (CRNN) with a mean-teacher model is adopted [40][41][42]. Figure 4 presents the details of the network structure of the CRNN classifier. The convolutional layers and the corresponding max-pooling layers are designed so that the n f eature axis of the output of the CNN layers is equal to unity. The RNN layers and the fully connected layer make the frame-wise probability of the classes, which is denoted by "Time Stamps" in Figure 4. To train the classifier using the weakly-labeled data, then it also generates the "Clip Class" output, which denotes the existing classes in a clip. The element of the clip class output C is calculated by weighted average as where N out denotes the frame number of the classifier output of a sound clip.
The mean-teacher model is adopted to train the classifier using both the labeled and unlabeled data, similar to state-of-the-art weakly-supervised SED systems [40,41]. The network parameters of the student model, θ, are optimized to minimize the cost function as where J class (θ) and J consist (θ) denote the classification cost and the consistency cost, respectively, as presented in Figure 5. β is the weight for the consistency cost. The parameters of the teacher model, θ teacher , are fixed during the training procedure and updated at the end of the training batch as where α denotes the exponential weighting factor of the moving average.

Post-Processing
The output of the sound event detector is a frame-wise probability of each class. Generally, the classifier output is post-processed with a certain threshold to determine whether the class is activated or not. The magnitude of the threshold value influences the classification performance. For example, a high threshold value can reduce false-positive errors but sometimes underestimates the length of the event, and vice versa. In this paper, the classifier output is first processed with a frame-wise threshold and then with an eventwise threshold to eliminate false-positive errors. The threshold value of the event-wise post-processing is set to be larger than that of the frame-wise post-processing.
When we refer to the classifier output as a matrix P of N out × C, the frame-wise binary matrix B f rame is calculated as where C denotes the number of classes and τ f rame is the threshold for the frame-wise post-processing. Let us define χ(i, c) as the ith event set of class c as where n onset (i, c) and n o f f set (i, c) are the frame numbers of the ith onset and offset of class c, respectively. The result of the event-wise thresholding, B event , is obtained as where (n , c) ∈ χ means that (n , c) belongs to χ , and χ denotes a certain event set that contains (n, c) as an element. Figure 6 presents a flow chart and an illustrative example of the proposed event-wise post-processing.

Evaluation Settings
To evaluate the proposed system for domestic SED, numerical simulations were performed using the DESED dataset of the DCASE 2020 Task 4 database, which is an audio dataset for weakly supervised sound event detection in domestic environments. The database consists of real-recorded subsets of Audioset [43] and Freesound [44]. Some of the data are synthesized using the SINS database [45] as background sounds.
There are 10 event classes: speech, dog, cat, alarm bell, dishes, frying, blender, running water, vacuum cleaner, and electric shaver. Each audio clip is a 10-s wav file with 44.1 (for the weakly-labeled and unlabeled data) or 16 kHz (for strongly-labeled data) sampling frequency. All audio files were resampled to a 16 kHz sampling frequency. The database consists of strongly-labeled data (2045 files, with event classes and time stamp information), weakly-labeled data (1578 files, with event classes information only), and unlabeled data (14412 files, without any annotation). The detailed information of the database configuration can be found in [34].
The audio clips were short-time-Fourier-transformed by a 1024-sample Hanning window with 75% overlap to match the number of input frames in the classifier for the baseline of the DCASE 2020 Task 4 [40]. To train the NMF frequency basis matrix, 200 iterations were performed to calculate both W c and H c matrices (Equations (9) and (10)) for each class. The number of the basis for each class was set to 13 (11 for the "vacuum cleaner" class) so that the dimension of the extracted feature was 128. In the case of iterative feature extraction, 50 iterations were performed for each clip to calculate H clip (Equation (11)). The singular value threshold (δ in Equation (20)) was set to be a proportional to the maximum singular value as where γ denotes a proportionality constant. The classifier was trained using Adam optimizer, and the learning rate was exponentially ramped up during 50 epochs to the maximum learning rate of 0.001. The classification and the consistency costs were categorical cross-entropy and mean-squared error, respectively, and the weight for the consistency costs, β, was set to 2.0. The classifier was trained using the training dataset of the DCASE database for 200 epochs and evaluated using the validation dataset. The batch size was set to 24, and the moving-average weighting factor, α, of the mean-teacher model was 0.999.
The performances of the two types of post-processing systems, with and without the event-wise post-processing, were evaluated. The post-processing system without the eventwise post-processing consisted of only the frame-wise thresholding, and the threshold value was set to 0.5. The system with the event-wise post-processing consisted of the frame-wise and event-wise thresholding. The event threshold value was set to 0.8 as it exhibited good overall performance, while the frame thresholds were set to the optimal values that exhibited the best performance among {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7} for each class.
In this evaluation, the event-based F1-score was utilized as the performance measure, which is defined by a geometric mean of precision P and recall R as where precision and recall are defined as and R = n TP n TP + n FN , where n FP , n FN , and n TP denote the numbers of false answers (false positives), missing answers (false negatives), and correct answers (true positives). The answers were considered as "correct answer" if the onset error was smaller than 0.2 s, and the offset error was smaller than 0.2 s (or 20 % of the event duration). The criteria to determine the correct answers were set to be consistent with the baseline of the DCASE 2020 Task 4 [34]. Moreover, the performance of each class was micro-averaged, which aggregated the contributions of all the test samples regardless of the class, and macro-averaged, which averages the individual performance of each class. Table 1 compares the proposed and conventional features that are commonly used in ASC and SED applications. The abbreviations MelSpec, Log-Mel, GAM, and CQT denote the Mel-spectrogram, log-Mel spectrum, gammatonegram, and constant-Q transform (CQT), respectively. The frequency basis matrix used to extract the NMF features in Table 1 was trained using strongly-labeled data only as it exhibited good overall performance. The effect of the data on the frequency basis training will be considered in the next subsection. The dimensions of the extracted features were all set to 128, similar to that in the Delphin-Poulet system [40], to compare the performances of the features without changing the structure of the classifier. The CQT was performed under 16 bins per octave and 32.7 Hz lower bound frequency. In total, 128 CQT bins (=8 octaves) corresponded to a frequency range of less than approximately 8 kHz. The proposed system was also compared with the Cornell system [46] including the log-Mel features with data augmentation, per-channel energy normalization (PCEN), mean-teacher CRNN, and post-processing with hidden Markov model (HMM), which was submitted to DCASE 2020 Task 4. The details of the subsystems can be found in [46]. The training parameters (e.g., the weight for the consistency costs, number of epochs, batch size, and the moving-average weighting factor of the meanteacher model) were set to the same values of the proposed system. Because the Cornell system has its own post-processing consisting of HMM and random forest optimization, the proposed event-wise post-processing was not applied to the Cornell system. The performances of the NMF methods are comparable to those of the Mel-spectrogram and gammatonegram, the best among the conventional features, and are better than those of the log-Mel spectrum and CQT in all the performance measures. The comparison between the micro-averaged performance of the NMF and that of the Mel-spectrogram revealed that the NMF methods exhibited slightly better performances without the eventwise post-processing and slightly worse performances with the event-wise post-processing. The gammatonegram had the best performance in the macro-averaged F1-score among the features but demonstrated low performance in the micro-averaged F1-score. Moreover, the features extracted using the iterative and non-iterative NMF methods exhibited similar performances. Thus, the non-iterative NMF method can be considered as an alternative to the iterative NMF method. The micro-averaged and macro-averaged F1-scores of the proposed system are similar to and slightly less than that of the Cornell system, respectively. We think that the difference of the performance was caused by additional sub-systems in the Cornell system, such as PCEN, data augmentation, and random forest optimization of HMM post-processing. Table 2 presents the class-wise F1-score performances. The gray background denotes the class where the NMF method exhibits better performance than that in the Melspectrogram, and the bold-face number stands for the best performance in each class. A scan be seen in the table, the NMF methods are advantageous for classes with harmonic structures (speech, dog, cat, and alarm bell) and disadvantageous for noise-like classes (electric shaver, blender, running water, and vacuum cleaner). The class-wise performances of the iterative and non-iterative NMFs are different. The iterative NMF has better performances for five classes (electric shaver, speech, frying, blender, and vacuum cleaner), similar performances for two classes (cat and dog), and worse performances for three classes (dishes, running water, and alarm bell). The last three classes (dishes, running water, and alarm bell) contained several impulsive sounds. Thus, the non-iterative method appeared to be useful for the impulsive sounds. However, we cannot easily come to this conclusion, as the frying sounds also contain impulsive ones. There are two main differences between the iterative and non-iterative NMFs: divergence function and optimization method. The iterative NMF uses KLD and multiplicative update, and the non-iterative NMF employs Euclidean and closed-form solution. To analyze the effect on the class-wise performance, we compare the performances of the NMFs with an iterative NMF with Euclidean (for convenience, we call it NMF-EUC here). The iterative NMF and NMF-EUC have the same parameters and settings except for the divergence function. Figure 7 shows the comparison results. The iterative NMF is superior to the NMF-EUC in the classes of electric shaver, speech, frying, and blender, where the iterative NMF is superior to the non-iterative NMF. In other words, the tendencies in performances are similar for the NMF-EUC and non-iterative NMF. However, the difference in the performance of non-iterative NMF compared to iterative NMF is greater than that of NMF-EUC and iterative NMF. Therefore, it seems that the difference in the optimization method had a greater effect on the performance than that in the divergence function in this experiment.

Effect of the Training Data on the Frequency Basis Learning
To analyze the effect of the training data on the frequency basis matrices in the NMF methods, the performances of the NMF methods were compared with the different frequency basis matrices from various parts of the database. Table 3 presents the F1-score results of different frequency basis matrices with the event-wise post-processing. STR, WEAK, and WEAK(U) denote the frequency basis matrices trained by the strongly-labeled data, weakly-labeled data, and weakly-labeled unitary label data, which consist of singleclass clips, respectively.
As presented in Table 3, the strongly-labeled data exhibited relatively good overall performance. The weakly-labeled data demonstrated good micro-averaged performance for iterative NMF but degraded performances for the other criteria, including non-iterative NMF. Using both the strongly-and weakly-labeled unitary label data (STR + WEAK(U)) also had good macro-averaged performance for non-iterative NMF, but it also showed degraded performances for other criteria. Conversely, the performance of the weakly-labeled unitary label data (WEAK(U)) did not exceed that of the weakly-labeled data (WEAK) that has interference classes. For instance, Class 2 data interferes with the training of Class 1 frequency basis matrix (Figure 3c). Therefore, we suggest that the class interference problem does not significantly affect the classification performance.

Thresholding Singular Values for Calculating the Pseudo-Inverse Matrix
To extract the proposed feature instantaneously, the Moore-Penrose pseudo-inverse of the frequency basis matrix needs to be calculated in advance using Equation (17). As follows from Equations (19) and (20), the pseudo-inverse matrix can be calculated via singular value decomposition with thresholding of the small singular values. The threshold, which is one of the design parameters, is related to stability and sparsity of the pseudo-inverse matrix. Thus, it may affect the performance of the extracted features. Therefore, we evaluated the effect of the threshold on the classification performance. Table 4 presents the classification performances with various threshold values. As follows from Equation (27), δ denotes the ratio of the threshold value to the maximum singular value. Among the test values, γ = 0.01 exhibits the best performance, while γ = 0.005 and γ = 0.05 exhibit comparable performances. However, γ = 0.001 and γ = 0.1 result in significantly degraded performances. Therefore, γ values between 0.005 and 0.05 are good choices for our system.

Conclusions
In this paper, two NMF-based feature extraction methods are proposed for weakly supervised SED. Inspired by the NMF applications in the acoustic signal processing systems capable of analyzing the frequency characteristics of the acoustic signals, the proposed methods were designed to extract features from heterogeneous database for weakly supervised SED. To generate the frequency basis matrix, the class-wise data matrices were composed from strongly-and weakly-annotated data. The class-wise frequency basis matrices were estimated using the NMF algorithm with the KLD, and then cascaded to compose the whole frequency basis matrix. In the iterative feature extraction method, the temporal basis matrix was calculated via iterations of the NMF equations using the whole frequency basis matrix. Moreover, we developed a non-iterative feature extraction method using a least-squares solution of the NMF problem. The classifier was constructed based on the mean-teacher model for the proposed features and enhanced by the proposed event-wise post-processing method.
To evaluate the proposed data-dependent feature extraction methods, simulations of weakly supervised SED were performed using the DCASE 2020 Task 4 database. The proposed methods were compared with the conventional features, e.g., Mel-spectrogram, log-Mel spectrum, gammatonegram, and CQT. Although the proposed features did not outperform other features, they yielded results comparable to those of the Mel-spectrogram and gammatonegram, which are state-of-the-art features. Moreover, they demonstrated 3-5% better F1-score performance than the log-Mel-spectrum and CQT.