Confidence Learning for Semi-Supervised Acoustic Event Detection
Abstract
1. Introduction
- (1) C-SAED co-trains classification and confidence in the first stage with only classification labels, by means of a multi-task model. The experimental results show that the generated confidence effectively measures the correctness of the label.
- (2) Compared with the traditional self-training method, the second stage uses differentiated training rather than a screening strategy, which makes more efficient use of the unlabeled data. Our experiments show that the training effect improves significantly for the same number of iterations.
- (3) C-SAED uses the mean teacher model as the backbone of each stage, which effectively fuses two semi-supervised methods: the consistency principle and pseudo-labels. The ER decreases compared with using the mean teacher alone.
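The first-stage idea in contribution (1), learning a confidence score from classification labels alone, can be sketched as follows. This is only a minimal illustration: it assumes the hint-interpolation loss of the confidence-learning work listed in the references (arXiv:1802.04865), applied here to multi-label sigmoid outputs. The function name `confidence_multitask_loss`, the weight `lam`, and the toy inputs are hypothetical and are not taken from the article.

```python
import numpy as np

def confidence_multitask_loss(p, c, y, lam=0.1, eps=1e-7):
    """Assumed multi-task loss: classification term plus confidence penalty.

    p: predicted class probabilities, shape (N, K)
    c: predicted confidence in (0, 1], shape (N, 1)
    y: binary classification labels, shape (N, K)
    lam: hypothetical weight of the confidence penalty
    """
    # Interpolate the prediction towards the label; the less confident
    # the model is, the more of the label it is allowed to "borrow".
    p_adj = np.clip(c * p + (1.0 - c) * y, eps, 1.0 - eps)
    bce = -(y * np.log(p_adj) + (1.0 - y) * np.log(1.0 - p_adj)).mean()
    # Borrowing is penalised, so confidence stays high where the
    # prediction is already correct.
    penalty = -np.log(np.clip(c, eps, 1.0)).mean()
    return bce + lam * penalty

# Toy clip with two classes: the label says class 0 is active.
y = np.array([[1.0, 0.0]])
wrong = np.array([[0.1, 0.9]])   # prediction disagrees with the label
right = np.array([[0.9, 0.1]])   # prediction agrees with the label
low_c, high_c = np.array([[0.5]]), np.array([[1.0]])
```

Under these assumptions the loss favours low confidence for the wrong prediction and high confidence for the right one, which is the property that lets confidence act as a correctness measure without any extra labels.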
2. Proposed System
2.1. Baseline: Mean Teacher
2.2. C-SAED
2.2.1. Stage One: Multi-Task SAED Model (MT-SAED)
2.2.2. Stage Two: Retraining with Pseudo-Labels and Confidence
2.3. Pooling Functions
3. Experiments and Discussion
Dataset and Metrics
4. Results and Analysis
4.1. Comparison of Posterior Probability and Confidence as Evaluation Criteria
4.2. Comparison with Other Methods
- MT18: the official baseline for DCASE2019 task4, with the mean teacher structure [17].
- Baseline: modified MT18 method with attention pooling.
- MT-SAED: the stage one model of C-SAED with power pooling.
- Prob0.9: only predictions with probability above 0.9 are added to the pseudo-labels; samples are retrained with equal weight [20].
- Prob: all samples are retrained with probabilities as weights.
- Prob0.5: only predictions with probability above 0.5 are added to the pseudo-labels; samples are retrained with confidence as weights.
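The retraining variants in the list above differ mainly in how each pseudo-labeled sample is weighted. The sketch below makes that difference concrete under stated assumptions: the thresholds are taken to apply to the posterior probability (as the names Prob0.9 and Prob0.5 suggest), and the `"confidence"` scheme is a reading of the differentiated training described in the Introduction, where every sample is kept and weighted by its confidence. The function name `retrain_weights` and the toy values are hypothetical.

```python
import numpy as np

def retrain_weights(prob, conf, scheme):
    """Per-sample weights for retraining on pseudo-labels (illustrative only)."""
    prob = np.asarray(prob, dtype=float)
    conf = np.asarray(conf, dtype=float)
    if scheme == "Prob0.9":
        # Screening: keep confident predictions only, all with equal weight.
        return (prob > 0.9).astype(float)
    if scheme == "Prob":
        # No screening: every sample is weighted by its probability.
        return prob.copy()
    if scheme == "Prob0.5":
        # Mild screening on probability, then weighting by confidence.
        return np.where(prob > 0.5, conf, 0.0)
    if scheme == "confidence":
        # Assumed differentiated training: every sample, weighted by confidence.
        return conf.copy()
    raise ValueError(f"unknown scheme: {scheme}")

prob = [0.95, 0.60, 0.30]   # toy posterior probabilities
conf = [0.90, 0.40, 0.80]   # toy confidence scores
weights = {s: retrain_weights(prob, conf, s)
           for s in ("Prob0.9", "Prob", "Prob0.5", "confidence")}
```

With a hard threshold, two of the three toy samples contribute nothing to retraining; with weighting, all of them contribute in proportion to how much they are trusted.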
4.3. The Effect of Hyperparameter
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Bello, J.P.; Silva, C.; Nov, O.; Dubois, R.L.; Arora, A.; Salamon, J.; Mydlarz, C.; Doraiswamy, H. SONYC: A System for Monitoring, Analysis and Mitigation of Urban Noise Pollution. Commun. ACM 2019, 62, 68–77.
- Lostanlen, V.; Salamon, J.; Farnsworth, A.; Kelling, S.; Bello, J.P. Birdvox-Full-Night: A Dataset and Benchmark for Avian Flight Call Detection. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 266–270.
- Crocco, M.; Cristani, M.; Trucco, A.; Murino, V. Audio Surveillance: A Systematic Review. ACM Comput. Surv. 2016, 48, 1–46.
- Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135.
- Foggia, P.; Petkov, N.; Saggese, A.; Strisciuglio, N.; Vento, M. Reliable Detection of Audio Events in Highly Noisy Environments. Pattern Recognit. Lett. 2015, 65, 22–28.
- Heittola, T.; Mesaros, A.; Eronen, A.; Virtanen, T. Context-dependent sound event detection. EURASIP J. Audio Speech Music Process. 2013, 2013, 1–13.
- Cakir, E.; Heittola, T.; Huttunen, H.; Virtanen, T. Polyphonic sound event detection using multi label deep neural networks. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015.
- Parascandolo, G.; Huttunen, H.; Virtanen, T. Recurrent Neural Networks for Polyphonic Sound Event Detection in Real Life Recordings. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Shanghai, China, 20–25 March 2016.
- Phan, H.; Hertel, L.; Maass, M.; Mertins, A. Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks. arXiv 2016, arXiv:1604.06338.
- Çakır, E.; Virtanen, T. End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018.
- Serizel, R.; Turpault, N.; Eghbal-Zadeh, H.; Shah, A.P. Large-scale weakly labeled semi-supervised sound event detection in domestic environments. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), Surrey, UK, 19–20 November 2018; pp. 19–23.
- Turpault, N.; Serizel, R.; Parag Shah, A.; Salamon, J. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York, NY, USA, 25–26 October 2019; pp. 253–257.
- Lin, L.; Wang, X.; Liu, H.; Qian, Y. Guided Learning for Weakly-Labeled Semi-Supervised Sound Event Detection. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 626–630.
- Yan, J.; Song, Y.; Dai, L.; McLoughlin, I. Task-Aware Mean Teacher Method for Large Scale Weakly Labeled Semi-Supervised Sound Event Detection. In Proceedings of the ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 326–330.
- Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv 2017, arXiv:1703.01780.
- Wang, J.; Xia, J.; Yang, Q.; Zhang, Y. Research on Semi-Supervised Sound Event Detection Based on Mean Teacher Models Using ML-LoBCoD-NET. IEEE Access 2020, 8, 38032–38044.
- Lu, J. Mean Teacher Convolution System for Dcase 2018 Task 4. DCASE2018 Challenge. 2018. Available online: http://dcase.community/documents/challenge2018/technical_reports/DCASE2018_Lu_19.pdf (accessed on 14 September 2021).
- Lionel, D.; Cyril, P. Mean Teacher with Data Augmentation for Dcase 2019 Task 4; DCASE2019 Challenge; Orange Labs: Lannion, France, 2019.
- McClosky, D.; Charniak, E.; Johnson, M. Effective Self-training for Parsing. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, New York, NY, USA, 4–9 June 2006; pp. 152–159.
- Liu, Y.L.; Yan, J.; Song, Y. Ustc-Nelslip System for Dcase 2018 Challenge Task 4. DCASE2018 Challenge. 2018. Available online: http://dcase.community/documents/challenge2018/technical_reports/DCASE2018_Liu_69.pdf (accessed on 14 September 2021).
- Elizalde, B.; Shah, A.; Dalmia, S.; Min, H.L.; Lane, I. An Approach for Self-Training Audio Event Detectors Using Web Data. In Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017.
- Kothinti, S.; Imoto, K.; Chakrabarty, D. Joint Acoustic and Class Inference for Weakly Supervised Sound Event Detection. DCASE2018 Challenge. 2018. Available online: http://dcase.community/documents/challenge2018/technical_reports/DCASE2018_Kothinti_90.pdf (accessed on 14 September 2021).
- Guo, C.; Pleiss, G.; Yu, S.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1321–1330.
- DeVries, T.; Taylor, G.W. Learning Confidence for Out-of-Distribution Detection in Neural Networks. arXiv 2018, arXiv:1802.04865.
- Liu, Y.; Chen, H.; Wang, Y.; Zhang, P. Power pooling: An adaptive pooling function for weakly labelled sound event detection. arXiv 2021, arXiv:2010.09985.
- Mesaros, A.; Heittola, T.; Virtanen, T. Metrics for Polyphonic Sound Event Detection. Appl. Sci. 2016, 6, 162.





| Model | | Evaluation 2018 ER | Evaluation 2018 DEL | Evaluation 2018 INS | Validation 2019 ER | Validation 2019 DEL | Validation 2019 INS | Polyset ER | Polyset DEL | Polyset INS |
|---|---|---|---|---|---|---|---|---|---|---|
| MT18 | | 1.65 | 0.78 | 0.87 | 1.56 | 0.76 | 0.80 | - | - | - |
| Baseline | | 1.34 | 0.72 | 0.62 | 1.26 | 0.70 | 0.56 | 1.45 | 0.81 | 0.64 |
| MT-SAED | | 1.13 | 0.69 | 0.44 | 1.07 | 0.67 | 0.40 | 1.26 | 0.75 | 0.51 |
| Retrain | | | | | | | | | | |
| | 0 | 1.10 | 0.68 | 0.42 | 1.04 | 0.67 | 0.37 | 1.18 | 0.75 | 0.43 |
| Prob0.9 | 1 | 3.72 | 0.70 | 3.02 | 3.41 | 0.69 | 2.72 | 3.91 | 0.80 | 3.11 |
| Prob | 0.3 | 1.15 | 0.68 | 0.47 | 1.09 | 0.66 | 0.43 | 1.28 | 0.76 | 0.52 |
| Prob0.5 | 0 | 1.19 | 0.68 | 0.51 | 1.14 | 0.68 | 0.46 | 1.32 | 0.77 | 0.55 |
| C-SAED | 0.3 | 1.09 | 0.70 | 0.39 | 1.03 | 0.68 | 0.35 | 1.23 | 0.78 | 0.45 |
| ( = 0.1) | 0.7 | 1.13 | 0.69 | 0.44 | 1.05 | 0.67 | 0.38 | 1.17 | 0.75 | 0.43 |
| | 1 | 1.06 | 0.68 | 0.38 | 1.02 | 0.67 | 0.35 | 1.19 | 0.74 | 0.45 |
| C-SAED | 0.3 | 1.08 | 0.69 | 0.39 | 1.01 | 0.67 | 0.34 | 1.16 | 0.75 | 0.41 |
| ( = 0.01) | 0.7 | 1.06 | 0.70 | 0.36 | 1.01 | 0.69 | 0.32 | 1.14 | 0.76 | 0.38 |
| | 1 | 1.00 | 0.86 | 0.14 | 0.98 | 0.85 | 0.13 | 1.02 | 0.89 | 0.13 |
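As a reading aid for the table, the reported ER values appear to equal DEL + INS to two decimals, which is consistent with an error rate that sums the deletion and insertion rates with no substitution term. The sketch below checks this on a few Evaluation 2018 rows copied from the table. The helper names `error_rate` and `is_consistent` are hypothetical, and the decomposition is an observation about the numbers rather than a statement made in the text.

```python
# (ER, DEL, INS) on Evaluation 2018, copied from the table above.
eval2018 = {
    "MT18": (1.65, 0.78, 0.87),
    "Baseline": (1.34, 0.72, 0.62),
    "MT-SAED": (1.13, 0.69, 0.44),
    "Prob0.9": (3.72, 0.70, 3.02),
}

def error_rate(deletion_rate, insertion_rate, substitution_rate=0.0):
    """Error rate as the sum of its components (substitution assumed zero)."""
    return substitution_rate + deletion_rate + insertion_rate

def is_consistent(er, deletion_rate, insertion_rate, tol=0.005):
    """True if the reported ER matches DEL + INS within rounding."""
    return abs(error_rate(deletion_rate, insertion_rate) - er) <= tol

checks = {name: is_consistent(*row) for name, row in eval2018.items()}
```

Read this way, the large ER of Prob0.9 comes almost entirely from insertions (3.02), while its deletion rate is close to that of the other systems.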
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Liu, Y.; Chen, H.; Wang, J.; Wang, P.; Zhang, P. Confidence Learning for Semi-Supervised Acoustic Event Detection. Appl. Sci. 2021, 11, 8581. https://doi.org/10.3390/app11188581