Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection
Abstract
1. Introduction
2. Related Work
3. Method
3.1. DenseNet
3.2. Squeeze-and-Excitation
3.3. Global Pooling for Aggregation
3.4. Structured Prediction for Accurate Event Localization
3.4.1. RNN-Based Structured Prediction
3.4.2. CRF Post-Processing
4. Experiments
4.1. Dataset
4.2. Metrics
4.3. Feature Extraction
4.4. DSNet and DSNet-RNN Structures
4.5. Baseline CNN Structure
4.6. Training and Evaluation
5. Results and Discussion
5.1. Audio Tagging
5.2. Event Detection with Localization
5.3. Comparison with the DCASE 2017 Task 4 Results
6. Conclusions
Author Contributions
Acknowledgments
Conflicts of Interest
References
Layers | Output Size | DSNet | DSNet-RNN |
---|---|---|---|
Convolution | 800 × 128 × 32 | [3 × 3, 32 conv] | |
Dense block | 800 × 128 × 32 | [3 × 3, 16 conv] × 4; [1 × 1, 32 conv] | |
SE block | 800 × 128 × 32 | bottleneck size 8 | |
Max-pooling | 800 × 64 × 32 | 1 × 2 max pool | |
Dense block | 800 × 64 × 48 | [3 × 3, 16 conv] × 4; [1 × 1, 48 conv] | |
SE block | 800 × 64 × 48 | bottleneck size 12 | |
Max-pooling | 400 × 32 × 48 | 2 × 2 max pool | |
Dense block | 400 × 32 × 64 | [3 × 3, 16 conv] × 4; [1 × 1, 64 conv] | |
SE block | 400 × 32 × 64 | bottleneck size 16 | |
Max-pooling | 200 × 16 × 64 | 2 × 2 max pool | |
Dense block | 200 × 16 × 64 | [3 × 3, 16 conv] × 4; [1 × 1, 64 conv] | |
SE block | 200 × 16 × 64 | bottleneck size 16 | |
Max-pooling | 100 × 8 × 64 | 2 × 2 max pool | |
Reshape | 100 × 512 | 100 × 8 × 64 to 100 × 512 | |
Segment-level prediction | 100 × 17 | 256 dense (ReLU) → 17 dense (sigmoid) | 128 Bi-GRU → 17 dense (sigmoid) |
Clip-level prediction | 17 | global LSE pooling | |
Parameters | - | 0.32 M | 0.69 M |

Rows with an empty DSNet-RNN cell are shared by both models; the two networks differ only in the segment-level prediction layer.
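The dense and SE blocks in the table can be sketched as follows. This is a minimal tf.keras illustration reconstructed only from the table (growth rate 16, four 3 × 3 convolutions per dense block, a 1 × 1 transition, and an SE bottleneck); the ReLU activations, same-padding, and the single-channel 800 × 128 input are assumptions rather than details confirmed by the table.

```python
# Minimal tf.keras sketch of one dense block followed by an SE block.
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=16, out_channels=32):
    """[3 x 3, growth_rate conv] x num_layers, then a [1 x 1, out_channels] transition."""
    for _ in range(num_layers):
        y = layers.Conv2D(growth_rate, 3, padding="same", activation="relu")(x)
        x = layers.Concatenate()([x, y])                  # dense connectivity
    return layers.Conv2D(out_channels, 1, padding="same", activation="relu")(x)

def se_block(x, bottleneck):
    """Squeeze-and-excitation: global average pooling -> bottleneck -> channel gates."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                # squeeze
    s = layers.Dense(bottleneck, activation="relu")(s)    # bottleneck
    s = layers.Dense(channels, activation="sigmoid")(s)   # excitation weights
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                      # channel-wise rescaling

# Example: the first DSNet stage from the table (800 x 128 x 32, then 800 x 64 x 32).
inputs = layers.Input(shape=(800, 128, 1))
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = dense_block(x, out_channels=32)
x = se_block(x, bottleneck=8)
x = layers.MaxPooling2D(pool_size=(1, 2))(x)
```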
Layers | Output Size | CNN |
---|---|---|
Convolution | 800 × 128 × 32 | [3 × 3, 32 conv] × 2 |
Max-pooling | 800 × 64 × 32 | 1 × 2 max pool |
Convolution | 800 × 64 × 32 | [3 × 3, 32 conv] × 2 |
Max-pooling | 400 × 32 × 32 | 2 × 2 max pool |
Convolution | 400 × 32 × 64 | [3 × 3, 64 conv] × 2 |
Max-pooling | 200 × 16 × 64 | 2 × 2 max pool |
Convolution | 200 × 16 × 64 | [3 × 3, 64 conv] × 2 |
Max-pooling | 100 × 8 × 64 | 2 × 2 max pool |
Reshape | 100 × 512 | 100 × 8 × 64 to 100 × 512 |
Segment-level prediction | 100 × 17 | 256 dense (ReLU) → 17 dense (sigmoid)
Clip-level prediction | 17 | global LSE pooling |
Parameters | - | 0.29 M |
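Both architectures aggregate the 100 segment-level probabilities into a single clip-level score per class with global log-sum-exp (LSE) pooling. A commonly used formulation is sketched below; the sharpness parameter `r` and its default value are assumptions, since the tables do not report them.

```python
import numpy as np

def lse_pooling(segment_probs, r=1.0):
    """Global log-sum-exp pooling over time.

    segment_probs: (T, C) per-segment class probabilities.
    r: sharpness; large r approaches max pooling, small r approaches mean pooling.
    Returns (C,) clip-level scores.
    """
    return (1.0 / r) * np.log(np.mean(np.exp(r * segment_probs), axis=0))

# Example: 100 segment-level predictions for 17 classes -> 17 clip-level scores.
clip_scores = lse_pooling(np.random.rand(100, 17), r=1.0)
```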
Model | F1 | Precision | Recall |
---|---|---|---|
CNN | 0.5506 | 0.5667 | 0.5353 |
DSNet | 0.5853 | 0.5822 | 0.5883 |
DSNet-RNN | 0.5839 | 0.5504 | 0.6281 |
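The tagging scores above are precision, recall, and F1 computed from binarized clip-level predictions over the 17 classes. A minimal sketch of such a computation is shown below; the 0.5 decision threshold, the micro averaging, and the random placeholder data are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Placeholder clip-level labels and predictions for 17 classes.
y_true = np.random.randint(0, 2, size=(200, 17))
y_prob = np.random.rand(200, 17)
y_pred = (y_prob >= 0.5).astype(int)   # assumed decision threshold

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
print(f"P={p:.4f}  R={r:.4f}  F1={f1:.4f}")
```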
Class | CNN | DSNet | DSNet-RNN |
---|---|---|---|
Train horn | 0.5273 | 0.4615 | 0.5102 |
Air horn, truck horn | 0.4000 | 0.5455 | 0.5783 |
Car alarm | 0.4267 | 0.4500 | 0.3836 |
Reversing beeps | 0.3373 | 0.3765 | 0.4186 |
Ambulance | 0.5556 | 0.4681 | 0.4854 |
Police car | 0.4906 | 0.5778 | 0.6525 |
Fire engine, fire truck | 0.5606 | 0.6055 | 0.5586 |
Civil defense siren | 0.7704 | 0.8160 | 0.8189 |
Screaming | 0.6833 | 0.7059 | 0.8333 |
Bicycle | 0.4675 | 0.4615 | 0.3294 |
Skateboard | 0.5946 | 0.7627 | 0.6372 |
Car | 0.6266 | 0.6759 | 0.6411 |
Car passing by | 0.2727 | 0.2931 | 0.2468 |
Bus | 0.4238 | 0.4000 | 0.2637 |
Truck | 0.4455 | 0.4541 | 0.4505 |
Motorcycle | 0.5465 | 0.6324 | 0.7009 |
Train | 0.7209 | 0.7883 | 0.7759 |
Model | F1 | Precision | Recall | ER |
---|---|---|---|---|
CNN | 0.4987 | 0.4598 | 0.5447 | 0.7568 |
DSNet | 0.5135 | 0.4746 | 0.5593 | 0.7039 |
DSNet-RNN | 0.5354 | 0.5074 | 0.5667 | 0.6213 |
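Event detection with localization is scored with segment-based metrics, where the error rate (ER) combines per-segment substitutions, deletions, and insertions on top of precision, recall, and F1. The sketch below follows the commonly used segment-based definitions for sound event detection; it is a simplified illustration (no zero-division handling) rather than the official evaluation code.

```python
import numpy as np

def segment_based_metrics(ref, est):
    """Segment-based F1 and error rate (ER) for binary activity matrices.

    ref, est: (num_segments, num_classes) binary arrays of reference and
    estimated activity.
    """
    tp = np.logical_and(ref == 1, est == 1).sum()
    fp_seg = np.logical_and(ref == 0, est == 1).sum(axis=1)   # false positives per segment
    fn_seg = np.logical_and(ref == 1, est == 0).sum(axis=1)   # false negatives per segment

    precision = tp / (tp + fp_seg.sum())
    recall = tp / (tp + fn_seg.sum())
    f1 = 2 * precision * recall / (precision + recall)

    s = np.minimum(fp_seg, fn_seg).sum()        # substitutions
    d = np.maximum(0, fn_seg - fp_seg).sum()    # deletions
    i = np.maximum(0, fp_seg - fn_seg).sum()    # insertions
    er = (s + d + i) / ref.sum()                # normalized by active reference events
    return f1, er
```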
| | F1 | ER |
|---|---|---|
0 | 0.5168 | 0.6564 |
0.005 | 0.5184 | 0.7048 |
0.01 | 0.5354 | 0.6213 |
0.02 | 0.5281 | 0.6867 |
0.05 | 0.5039 | 0.8109 |
Model | F1 (before CRF) | ER (before CRF) | F1 (after CRF) | ER (after CRF) |
---|---|---|---|---|
CNN | 0.4987 | 0.7568 | 0.5195 | 0.6680 |
DSNet | 0.5135 | 0.7039 | 0.5265 | 0.6849 |
DSNet-RNN | 0.5354 | 0.6213 | 0.5432 | 0.6131 |
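The CRF post-processing step smooths each class's segment-level probability track by balancing the per-segment scores against a penalty for switching labels between neighboring segments. The sketch below is a deliberately simplified binary linear-chain Viterbi decoder treating each class independently, not the paper's exact CRF formulation; the switch_penalty value is an assumption.

```python
import numpy as np

def crf_smooth(probs, switch_penalty=2.0):
    """Binary linear-chain Viterbi smoothing of one class's segment probabilities.

    probs: (T,) probabilities that the event is active in each segment.
    switch_penalty: pairwise cost for changing the label between adjacent
                    segments, which favors temporally contiguous events.
    Returns a (T,) array of 0/1 decisions.
    """
    eps = 1e-8
    # Unary log-scores for the inactive (0) and active (1) states.
    unary = np.stack([np.log(1.0 - probs + eps), np.log(probs + eps)], axis=1)
    T = unary.shape[0]

    score = unary[0].copy()                      # best score ending in each state
    backptr = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        new_score = np.empty(2)
        for s in (0, 1):
            # Transition scores from previous states: 0 if unchanged, else a penalty.
            cand = score + np.array([0.0 if p == s else -switch_penalty for p in (0, 1)])
            backptr[t, s] = int(np.argmax(cand))
            new_score[s] = cand[backptr[t, s]] + unary[t, s]
        score = new_score

    # Backtrack the highest-scoring label sequence.
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(T - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path

# Example: smooth the 100 segment-level probabilities of a single class.
smoothed = crf_smooth(np.random.rand(100), switch_penalty=2.0)
```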