Proceeding Paper

On the Joint Use of NMF and Classification for Overlapping Acoustic Event Detection †

by Panagiotis Giannoulis 1,2, Gerasimos Potamianos 2,3,* and Petros Maragos 1,2
1 School of ECE, National Technical University of Athens, 15773 Athens, Greece
2 Athena Research and Innovation Center, 15125 Maroussi, Greece
3 Department of ECE, University of Thessaly, 38221 Volos, Greece
* Author to whom correspondence should be addressed.
† Presented at the International Workshop on Computational Intelligence for Multimedia Understanding (IWCIM), Kos Island, Greece, 2 September 2017.
Proceedings 2018, 2(2), 90; https://doi.org/10.3390/proceedings2020090
Published: 9 January 2018

Abstract

In this paper, we investigate the performance of classifier-based non-negative matrix factorization (NMF) methods for detecting overlapping acoustic events. We provide evidence that the performance of classifier-based NMF systems deteriorates significantly in overlapped scenarios when mixed observations are unavailable during training. To address this, we propose a K-means based method for the artificial generation of mixed data. The Mixture of Local Dictionaries (MLD) method is employed to build the NMF dictionary using both the isolated and the artificially mixed data. Finally, an SVM classifier is trained for each of the isolated and mixed event classes, using the corresponding MLD-NMF activations from the training set. The proposed system, tested in two experiments with (a) synthetic and (b) real events, outperforms the state-of-the-art classifier-based NMF system in the overlapped scenarios.

1. Introduction

Acoustic event detection (AED) is a major part of the computational auditory analysis field, aiming to detect the time boundaries of meaningful sound events. With audio being a crucial modality in multimodal content, most common applications of AED include smart home environments, surveillance and security [1,2], as well as multimedia database retrieval.
Several methods have been developed for AED in recent years. In the case of isolated AED, traditional methods based on hidden Markov models (HMMs) in conjunction with conventional features (e.g., MFCCs) show satisfactory performance [3,4]. For the more challenging overlapped scenario, approaches include temporally-constrained probabilistic analysis models [5], generalized Hough-transform based systems [6], HMM-based systems with multiple-path Viterbi decoding [7], non-negative matrix factorization [8], and multi-label deep neural networks. In particular, the latter have shown good performance by modeling overlapping events in a natural way [9,10].
NMF-based approaches constitute a popular choice for AED, especially in overlapping scenarios, due to their natural relation to the source separation task and their ability to detect multiple events occurring simultaneously. NMF-related methods can be divided into those that exploit the NMF activations directly to perform event detection [8,11], and those that employ a classifier trained on these activations [12,13]. Based on the fact that NMF-based approaches can benefit from the creation of a Mixture of Local Dictionaries (MLD) [14], the authors of [15] propose a classifier-based NMF system using MLDs for improved detection performance.
In this paper we investigate the performance of state-of-the-art NMF approaches under overlapped conditions. We provide evidence that the performance of existing classifier-based NMF methods degrades significantly in overlapped scenarios, mainly because the training phase considers activations only from isolated data. To alleviate this problem, we propose generating mixed observations from the available isolated ones and incorporating them in the training data. For the artificial mixing procedure, we use a K-means based method for each pair of events. The MLD dictionary is built using the new training set, and SVM classifiers are trained for each of the isolated and mixed events using the corresponding activations. Our method is tested in two experiments using (a) synthetic and (b) real event instances and shows significant improvement over the state-of-the-art classifier-based method in the overlapping scenarios.
The remainder of the paper is organized as follows: Section 2 presents and discusses the drawbacks of the two NMF-based alternatives that are compared with our system; Section 3 describes the artificial generation of mixed data and the outline of the proposed method; Section 4 reviews the experimental framework and reports our results; and, finally, Section 5 concludes the paper.

2. Existing NMF-Based Methods for AED

We briefly present two popular methods for NMF-based AED. The first, sparse-NMF with thresholding, is the simplest one and can be considered as the baseline. The second is a classifier-based MLD-NMF method presented in [13,15]. We also discuss the drawbacks of these two methods for isolated/overlapped acoustic event detection.

2.1. Sparse-NMF Approach

The application of sparse-NMF for isolated and overlapping AED is based on the idea of linear decomposition of events into spectral atoms. Given non-negative features with approximate linearity (e.g., filterbank energies), a test event will be decomposed into atoms of observed event(s).
NMF is a linear non-negative approximate factorization of the observed feature matrix, and it is formulated as follows: Given a non-negative matrix $V \in \mathbb{R}_{\geq 0}^{M \times N}$, the goal is to approximate $V$ with the product $V \approx W H$, where $W \in \mathbb{R}_{\geq 0}^{M \times R}$ denotes the non-negative dictionary matrix, and $H \in \mathbb{R}_{\geq 0}^{R \times N}$ represents the non-negative activation matrix. Minimization of a suitable error cost function $D(V \| W H)$ results in iterative estimation of $W$ and $H$ [16].
For detection, assuming a given dictionary $W$ that contains atoms of the various classes of interest, the estimated $H$ provides activations of each class through time. It is shown that sparse-NMF, which imposes sparsity on the matrix $H$, performs better for the detection task. Sparse-NMF minimizes the following objective: $D(V \| W H) + \lambda \|H\|_1$, with $D$ denoting the generalized KL-divergence between $V$ and $W H$, and parameter $\lambda$ controlling the trade-off between sparseness of $H$ and accurate reconstruction of $V$.
This method is used in this paper as a baseline. To build the dictionary, a sufficient number of atoms is extracted for each class of interest from training data consisting of isolated event instances, and the atoms of all classes are stored together in the total dictionary matrix $W$. Then, in the detection step, a simple thresholding on the activations of matrix $H$ decides on the presence of each event in each frame.
We note two main disadvantages of this traditional method. The first is that the threshold-based decision in the detection step is not the best choice in terms of robustness. The second, and more important, is that, as pointed out in [14], the convex cones created by the bases of the sub-dictionaries of the different classes may often overlap with each other. This means that new observations that fall in the overlapped regions can be reconstructed in many different ways (unstable activations), which can result in classification failures (e.g., false alarms).
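The baseline described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the standard multiplicative update for KL-divergence NMF with an L1 penalty and a fixed dictionary, and the function names, the toy dictionary, the value of `lam`, and the detection threshold are all hypothetical choices.

```python
import numpy as np

def sparse_nmf_activations(V, W, lam=0.1, n_iter=200, eps=1e-12, seed=0):
    """Estimate H >= 0 that minimizes D(V || W H) + lam * ||H||_1 with W fixed.

    Uses the multiplicative update for the generalized KL divergence
    (an assumption; the paper does not spell out its update rule).
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    col_sums = W.sum(axis=0)[:, None]          # W^T 1, shape (R, 1)
    for _ in range(n_iter):
        ratio = V / (W @ H + eps)              # element-wise V / (W H)
        H *= (W.T @ ratio) / (col_sums + lam)  # lam shrinks the activations
    return H

def detect(H, atom_labels, n_classes, threshold):
    """Sum the activations of each class's atoms and threshold per frame."""
    class_act = np.zeros((n_classes, H.shape[1]))
    for atom, label in enumerate(atom_labels):
        class_act[label] += H[atom]
    return class_act > threshold

# Toy example: one atom per class, three frames (class 0, class 1, both).
W = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
V = W @ np.array([[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]])
H = sparse_nmf_activations(V, W, lam=0.1)
active = detect(H, atom_labels=[0, 1], n_classes=2, threshold=0.5)
print(active)
```

In this toy case the atoms do not overlap, so the activations are stable. The instability discussed above would appear once atoms of different classes span overlapping cones.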

2.2. SVM-Based NMF Approach with MLD Dictionary

This method essentially corresponds to the core system of the works in [13,15]. This system attempts to overcome the drawbacks of the aforementioned traditional sparse-NMF method by employing an MLD dictionary framework and an SVM classifier for the final detection step. The MLD-based dictionary generation eliminates overlaps between convex cones, and produces more stable activations which are used for the training of robust SVM classifiers. As shown in the flow diagram in Figure 1 (black schemes), the method consists of two main parts: dictionary learning and classifier training.

2.2.1. Dictionary Learning

In dictionary learning, the feature matrix $V$ containing all training data is decomposed into an initial basis matrix $W_0$ by basic unsupervised NMF. Next, by applying K-means to $W_0$, $G$ centroids $\mu^{(g)}$ are obtained, with $g \in \{1, \dots, G\}$ denoting the centroid’s index. The final MLD dictionary $W$ consists of $G$ sub-groups (of $K_g$ bases each) which model acoustic atoms, $W = [W^{(1)} \dots W^{(G)}]$. The MLD dictionary is learned by minimizing the following objective:
$$D(V \| W H) + \eta \sum_{g} D(\mu^{(g)} \| W^{(g)}) + \lambda \sum_{t} \Omega(h_t)$$
where $h_t$ denotes the column vector of $H$ at time frame $t$. The second term is a constraint which makes the bases of each sub-group similar to $\mu^{(g)}$, so that the resulting convex cones are compact. The third term preserves group-sparsity in the solution.

2.2.2. Classifier Training

For each class considered, an activation matrix $H_i$ is extracted from its corresponding training spectrogram $V_i$ by MLD-based NMF with the global dictionary $W$. Then the column vectors $h_t^{(i)}$ of $H_i$ at each time frame $t$ are used as feature vectors to train a linear SVM classifier. A multi-class SVM is trained using the one-against-all approach.
This method seems to solve the problems of the traditional sparse-NMF approach in the isolated AED case. However, we note one possible drawback in the case of overlapping scenarios: the classifiers are trained for each class of interest using only its corresponding isolated data. This makes the classifiers vulnerable in the presence of unseen mixed data. An observation of a mixed event containing classes $i$ and $j$ will not necessarily be classified correctly by both the classifier of the $i$-th and that of the $j$-th event.

3. Proposed Method

Our method attempts to solve the deficiency of the previous method in overlapped scenarios, by considering mixed data in the training and testing stages. The block-diagram of the proposed method is depicted in Figure 1 (black and blue schemes).

3.1. Dictionary Learning

Our goal is to include mixed data in the dictionary learning procedure. Since a sufficient amount of mixed data is difficult to obtain, we propose a method for the artificial generation of mixed data. Assuming linearity of features, the method acts in the feature domain rather than the signal domain. The basic idea is shown in Figure 2. In order to create representative observations of the mixed data, we combine (sum) representative observations from each of the two events considered.
Given a number of centroids $C$ and a percentage $\alpha$, we first perform K-means clustering with $C$ clusters in the feature space of each event. Then $\alpha$% of the samples of each cluster are selected. Finally, we consider all the combinations (additions) between the selected samples of the two classes.
After mixed data generation, both isolated and mixed data are used as input for the MLD dictionary learning procedure. In this way, bases created in the final dictionary may correspond to overlapped events too.
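The mixing procedure of this subsection can be illustrated with the following NumPy sketch. It is only one possible reading of the description: the text does not say how the α% of samples are picked inside a cluster, so random selection (with at least one sample per non-empty cluster) is assumed here, and the plain Lloyd's K-means, the function names, and the toy data are hypothetical.

```python
import numpy as np

def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's K-means; returns a cluster index for every row of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):  # keep the old centroid if a cluster empties
                centroids[k] = X[labels == k].mean(axis=0)
    return labels

def select_representatives(X, n_clusters, alpha, seed=0):
    """Pick alpha percent of the samples from each K-means cluster of X.

    Random selection inside a cluster is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    labels = kmeans(X, n_clusters, seed=seed)
    picked = []
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        if members.size == 0:
            continue
        n_pick = max(1, int(round(alpha / 100.0 * members.size)))
        picked.append(rng.choice(members, size=n_pick, replace=False))
    return X[np.concatenate(picked)]

def mix_pair(X_a, X_b, n_clusters=3, alpha=20.0, seed=0):
    """Add every selected sample of event A to every selected sample of event B."""
    reps_a = select_representatives(X_a, n_clusters, alpha, seed)
    reps_b = select_representatives(X_b, n_clusters, alpha, seed + 1)
    mixed = reps_a[:, None, :] + reps_b[None, :, :]  # all pairwise sums
    return mixed.reshape(-1, X_a.shape[1])

# Toy data: rows are feature vectors (frames) of two isolated events.
rng = np.random.default_rng(0)
X_a = rng.random((40, 6))
X_b = rng.random((30, 6)) + 1.0
mixed = mix_pair(X_a, X_b, n_clusters=3, alpha=20.0)
print(mixed.shape)
```

Because all pairwise sums are formed, the number of mixed samples grows with the product of the two selections, which is why α directly controls how much mixed data enters training.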

3.2. Classifier Training

In the classifier training stage, instead of training $N$ classifiers ($N$ is the number of events), we train $N + \binom{N}{2}$: one for each isolated event and one for each pair of events. Also, since we model all the possible events (isolated and mixed), we train linear probabilistic SVMs, and in the testing stage we choose the event with the highest score for each frame.
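As a small worked example of the class inventory and the per-frame decision, the sketch below enumerates the isolated and pairwise mixed classes for the five office events used later in the experiments and picks the highest-scoring class. The helper names and the score vector are hypothetical; in the actual system the scores would come from the probabilistic SVMs.

```python
from itertools import combinations
from math import comb

def build_class_list(events):
    """Isolated classes followed by one mixed class per unordered pair of events."""
    isolated = [(e,) for e in events]
    mixed = list(combinations(events, 2))
    return isolated + mixed

def decode_frame(scores, classes):
    """Return the set of events in the class with the highest score."""
    best = max(range(len(classes)), key=lambda i: scores[i])
    return set(classes[best])

events = ["drawer", "phone", "keys", "speech", "doorslam"]
classes = build_class_list(events)
assert len(classes) == len(events) + comb(len(events), 2)  # 5 + 10 = 15

# Hypothetical scores: the mixed class (phone, keys) wins for this frame.
scores = [0.0] * len(classes)
scores[classes.index(("phone", "keys"))] = 0.9
print(decode_frame(scores, classes))
```

A winning mixed class is reported as both of its constituent events being active in that frame.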

4. Experiments

4.1. Datasets and Experimental Framework

We perform our experiments on two datasets, one containing synthetic events and the other real events. In the case of the synthetic event dataset, we generated artificial spectral patches for 5 synthetic events, while in the real event case, we extracted spectral patches from 5 real events contained in the database designed for Task 2 of the DCASE’16 challenge (office-related events: drawer, phone, keys, speech, doorslam).
In both datasets, the performance of the different methods is evaluated in both isolated and overlapped scenarios. In the isolated case, testing sequences of isolated spectral patches are created, whereas in the overlapped case, sequences of mixed spectral patches are generated. A mixed spectral patch results from the superposition of two isolated spectral patches from the corresponding testing dataset. Regarding the spectral patch extraction, in the case of synthetic events, we generate 5 × 5 spectral patches with the following procedure: the spectral patches of each event are characterized by a particular pattern whose structure varies slightly across instances (see Figure 3). To introduce variability, each time some of the active “tiles” of the pattern can be missing (up to 5), while the active “tiles” take random positive values in the [0.5, 1] interval. Random noise is also added after the generation of each spectral patch. In the case of real events, spectral patches have dimension 100 × 10 and are composed of 100 Mel-filterbank energies in 100 msec intervals (10 frames).
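The synthetic patch generation can be sketched as follows. Only the 5 × 5 size, the "up to 5 missing tiles" rule, and the [0.5, 1] value range come from the text; the cross-shaped pattern, the noise distribution and its level (`noise_std`), and the function name are assumptions made for illustration.

```python
import numpy as np

def synth_patch(pattern, rng, max_missing=5, noise_std=0.05):
    """Draw one spectral patch from a binary pattern of active tiles."""
    patch = np.zeros(pattern.shape)
    active = np.flatnonzero(pattern)
    # Drop between 0 and max_missing tiles, but always keep at least one.
    n_missing = rng.integers(0, min(max_missing, active.size - 1) + 1)
    keep = rng.choice(active, size=active.size - n_missing, replace=False)
    patch.flat[keep] = rng.uniform(0.5, 1.0, size=keep.size)
    # Assumed noise model: small non-negative noise on every tile.
    noise = np.abs(rng.normal(0.0, noise_std, size=pattern.shape))
    return patch + noise

# Hypothetical event pattern: a cross with 9 active tiles.
pattern = np.zeros((5, 5))
pattern[2, :] = 1
pattern[:, 2] = 1

rng = np.random.default_rng(0)
patch = synth_patch(pattern, rng)
print(np.round(patch, 2))
```

An overlapped test patch would then be the sum of two such patches drawn from different event patterns, following the superposition rule stated above.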
Finally, regarding the partition into training and testing sets, in the real event case, we partitioned the training data of the DCASE’16 challenge so that 80% of the event recordings are used for training and the remaining 20% for testing. In the synthetic event case, we generated a small number of instances per event (30) for building the training set. For both databases, the testing sequences contain 1000 spectral patches for both isolated and overlapped scenarios. We should note that, given the way we build our synthetic testing sequences, when overlap occurs, it spans the whole duration of the spectral patches involved. In this way, our problem can also be considered as classification of spectral patches of acoustic events with temporal information.

4.2. Results

In Table 1 and Table 2, the comparative results for the three different methods are presented in terms of Fscore, for both isolated and overlapped scenarios and under two different experimental setups, for the two event datasets. In the first setup (Local opt.), the various parameters of the methods are optimized for each scenario separately, while in the second (Global opt.) optimization is performed only once for the whole testing procedure. Note that “Local opt.” therefore assumes prior knowledge of whether overlap is present.
From Table 1 we can draw three major conclusions. First of all, our proposed method clearly outperforms the state-of-the-art SVM&MLD-NMF based method in the overlapping scenarios, in both the “local” and “global” setups, achieving 77.16% and 61.18% relative error reductions respectively. In fact, the performance of the SVM&MLD-NMF method degrades significantly in the presence of mixed events. Next, we can observe that the performance of the baseline sparse-NMF approach is stable across the different scenarios and setups, and that it also achieves the best average Fscore in the “global” optimization setup. This suggests that for quite simple and discriminable events this baseline is a good option for both isolated and overlapped scenarios. Finally, our proposed method is the one most strongly affected by using global optimization instead of the local one. It seems that the parameter α, which controls the amount of mixed data included in the training phase, has a strong influence on the behavior of our method.
In Table 2, the corresponding results for the real-event scenario are presented. Similarly to the synthetic case, we again notice the large drop in the performance of the SVM&MLD-NMF method when moving from the isolated to the overlapped scenario, as well as the superiority of the proposed method in the overlap case (33.29% and 18.57% relative error reduction in the “local” and “global” setups respectively). The baseline sparse-NMF method again shows stable performance across the different scenarios. However, as expected, in this more challenging case of real events, both the SVM&MLD-NMF and the proposed methods perform significantly better than the baseline in the isolated scenario. Finally, as before, among the three methods, our approach is affected the most by the switch from the “local” to the “global” optimization setup.
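The relative error reductions quoted above can be reproduced from the overlapped-scenario Fscores in Tables 1 and 2, assuming (as the numbers suggest) that the error is taken as 100 minus the Fscore. The function name below is hypothetical.

```python
def relative_error_reduction(f_reference, f_proposed):
    """Relative reduction (%) of the error 100 - Fscore w.r.t. the reference system."""
    err_ref = 100.0 - f_reference
    err_prop = 100.0 - f_proposed
    return 100.0 * (err_ref - err_prop) / err_ref

# Overlapped scenario, SVM&MLD-NMF versus the proposed method.
synthetic_local = relative_error_reduction(77.23, 94.80)
synthetic_global = relative_error_reduction(77.23, 91.16)
real_local = relative_error_reduction(61.76, 74.49)
real_global = relative_error_reduction(61.76, 68.86)

print(round(synthetic_local, 2))   # 77.16
print(round(synthetic_global, 2))  # 61.18
print(round(real_local, 2))        # 33.29
print(round(real_global, 2))       # 18.57
```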
Summarizing the results, we can claim that the classifier-based SVM&MLD-NMF approach outperforms the baseline sparse-NMF based one in the isolated event scenario. This is important, as the isolated scenario is by far the most frequent under realistic conditions. However, under the more challenging overlapping conditions, the performance of the existing method deteriorates. Our proposed method, by incorporating mixed data in the training phase, succeeds in increasing the performance significantly under overlapped conditions, and also provides better results on average than the existing classifier-based method. However, there is one drawback: our method is strongly affected by the amount of mixed data employed for training. This is also depicted in Figure 4, where the performance of the proposed method is shown for the real events dataset, for both the isolated and overlapped cases, as the mixing parameter α increases. As α increases, performance increases in the overlapping case but, at the same time, decreases (at a higher rate) in the isolated case. With knowledge of the expected degree of overlap in the dataset, an optimal value of α could be chosen.

5. Conclusions

In this paper we investigated the performance of state-of-the-art NMF approaches for overlapping acoustic event detection. We provided evidence that the performance of the existing method degrades under highly overlapped conditions, and we proposed a new method that alleviates this problem by employing a module for the artificial generation of mixed data, which are included in the training phase. Probabilistic SVMs are also employed in the final classification step using all available classes (isolated and mixed).
Results obtained in experiments with synthetic and real events were promising, outperforming the existing method in overlapping scenarios while also preserving good performance in the isolated ones.
In future work, the design of a module able to identify the presence or absence of overlap will be investigated, in order to increase the robustness of our system. Alternative methods for the artificial generation of mixed data will also be considered.

Acknowledgments

This work has been partially funded by the BabyRobot project, supported by the EU Horizon 2020 Programme under grant 687831.

References

  1. Clavel, C.; Ehrette, T.; Richard, G. Events detection for an audio-based surveillance system. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Amsterdam, The Netherlands, 6 July 2005; pp. 1306–1309.
  2. Atrey, P.K.; Maddage, N.C.; Kankanhalli, M.S. Audio based event detection for multimedia surveillance. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toulouse, France, 14–19 May 2006; Volume 5.
  3. Giannoulis, P.; Potamianos, G.; Katsamanis, A.; Maragos, P. Multi-microphone fusion for detection of speech and acoustic events in smart spaces. In Proceedings of the 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, 1–5 September 2014; pp. 2375–2379.
  4. Zhou, X.; Zhuang, X.; Liu, M.; Tang, H.; Hasegawa-Johnson, M.; Huang, T. HMM-based acoustic event detection with adaboost feature selection. In Multimodal Technologies for Perception of Humans; Springer: Berlin/Heidelberg, Germany, 2008; pp. 345–353.
  5. Benetos, E.; Lagrange, M.; Plumbley, M.; Mark, D. Detection of overlapping acoustic events using a temporally-constrained probabilistic model. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 6450–6454.
  6. Dennis, J.; Tran, H.D.; Chng, E.S. Overlapping sound event recognition using local spectrogram features and the generalised Hough transform. Pattern Recognit. Lett. 2013, 34, 1085–1093.
  7. Diment, A.; Heittola, T.; Virtanen, T. Sound event detection for office live and office synthetic AASP challenge. In Proceedings of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (WASPAA), New Paltz, NY, USA, 20–23 October 2013.
  8. Gemmeke, J.F.; Vuegen, L.; Karsmakers, P.; Vanrumste, B.; van hamme, H. An exemplar-based NMF approach to audio event detection. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 20–23 October 2013; pp. 1–4.
  9. Choi, I.; Kwon, K.; Bae, S.H.; Kim, N.S. DNN-based sound event detection with exemplar-based approach for noise reduction. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), Budapest, Hungary, 3 September 2016; pp. 16–19.
  10. Cakir, E.; Heittola, T.; Huttunen, H.; Virtanen, T. Polyphonic sound event detection using multi label deep neural networks. In Proceedings of the International Joint Conference on Neural networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–7.
  11. Giannoulis, P.; Potamianos, G.; Maragos, P.; Katsamanis, A. Improved dictionary selection and detection schemes in sparse-CNMF-based overlapping acoustic event detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), Budapest, Hungary, 3 September 2016; pp. 25–29.
  12. Cotton, C.V.; Ellis, D.P.W. Spectral vs. spectro-temporal features for acoustic event detection. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 16–19 October 2011; pp. 69–72.
  13. Komatsu, T.; Senda, Y.; Kondo, R. Acoustic event detection based on non-negative matrix factorization with mixtures of local dictionaries and activation aggregation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 2259–2263.
  14. Kim, M.; Smaragdis, P. Mixtures of local dictionaries for unsupervised speech enhancement. IEEE Signal Process. Lett. 2015, 22, 293–297.
  15. Komatsu, T.; Toizumi, T.; Kondo, R.; Senda, Y. Acoustic event detection method using semi-supervised non-negative matrix factorization with a mixture of local dictionaries. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), Budapest, Hungary, 3 September 2016; pp. 45–49.
  16. Lee, D.D.; Seung, H.S. Algorithms for Non-Negative Matrix Factorization. Available online: http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization (accessed on 13 December 2017).
Figure 1. Block-diagram of the proposed AED method.
Figure 2. Generation of mixed data (green) from a pair of isolated events (blue and red). Toy example, with two features “x1” and “x2”.
Figure 3. Different instances for each of the 5 synthetic events. Horizontal axis corresponds to time and vertical to frequency.
Figure 4. Performance of the proposed method in both Isolated and Overlapped scenarios as the percentage α of mixing increases.
Table 1. Performance of the different systems for the synthetic data scenario in terms of Fscore (%).
| Method | Local opt. Isol | Local opt. Overl | Local opt. Avg | Global opt. Isol | Global opt. Overl | Global opt. Avg |
|---|---|---|---|---|---|---|
| sparse-NMF | 95.10 | 95.82 | 95.46 | 95.21 | 93.53 | 94.37 |
| SVM&MLD-NMF | 96.78 | 77.23 | 87.00 | 94.39 | 77.23 | 85.81 |
| Proposed | 96.42 | 94.80 | 95.61 | 92.30 | 91.16 | 91.73 |
Table 2. Performance of the different systems for the real data scenario in terms of Fscore (%).
| Method | Local opt. Isol | Local opt. Overl | Local opt. Avg | Global opt. Isol | Global opt. Overl | Global opt. Avg |
|---|---|---|---|---|---|---|
| sparse-NMF | 78.36 | 78.54 | 78.45 | 75.49 | 77.52 | 76.51 |
| SVM&MLD-NMF | 85.83 | 61.76 | 73.79 | 83.96 | 61.76 | 72.86 |
| Proposed | 85.79 | 74.49 | 80.14 | 82.00 | 68.86 | 75.43 |
