Sound Event Detection Employing Segmental Model
Abstract
1. Introduction
2. The Segmental Model for SED
2.1. Feature Extraction
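As a concrete front-end illustration, the sketch below computes log-mel features with librosa. The 16 kHz sampling rate, 64 mel bands, FFT size, and 40 ms hop are assumptions rather than values taken from this work; the hop is chosen so that a 10 s clip yields roughly 250 frames, consistent with the stride-1 configuration reported in the experiments.

```python
import librosa

def log_mel(path, sr=16000, n_mels=64):
    """Log-mel front end (illustrative settings, not the paper's exact ones).

    hop_length=640 at 16 kHz is a 40 ms hop, giving ~250 frames for a
    10 s clip, which matches the stride-1 row of the stride experiment.
    """
    y, _ = librosa.load(path, sr=sr, duration=10.0)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=640, n_mels=n_mels)
    return librosa.power_to_db(mel).T   # (frames, n_mels), ~250 frames
```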
2.2. Feature Encoding
2.3. Segment Embedding and Scoring
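Segmental models in the segmental-RNN family typically encode the frame sequence with a (Bi)LSTM and embed each candidate segment from the encoder states at its boundaries before scoring it against the event classes. The sketch below shows one common realization of that idea in PyTorch; the class name `SegmentScorer`, the boundary-concatenation embedding, and the single linear scoring layer are illustrative assumptions, not necessarily the exact architecture used here.

```python
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    """SRNN-style segment scorer: a BiLSTM encodes the frame-level
    features, and each candidate segment is embedded by concatenating
    the encoder states at its two boundary frames before a linear
    layer maps the embedding to per-class scores."""

    def __init__(self, n_feats, n_hidden, n_classes, n_layers=1):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, n_hidden, num_layers=n_layers,
                               bidirectional=True, batch_first=True)
        self.scorer = nn.Linear(4 * n_hidden, n_classes)

    def forward(self, feats, segments):
        # feats: (1, T, n_feats); segments: list of (start, end) pairs
        # with 0 <= start < end <= T.
        h, _ = self.encoder(feats)                     # (1, T, 2 * n_hidden)
        embs = torch.stack([torch.cat((h[0, s], h[0, e - 1]))
                            for s, e in segments])     # (n_seg, 4 * n_hidden)
        return self.scorer(embs)                       # (n_seg, n_classes)
```

For a 250-frame clip, `SegmentScorer(64, 128, 11)(feats, [(0, 25), (25, 60)])` returns a (2, 11)-shaped score matrix, where the 11 labels could be the 10 event classes plus an optional "SIL" (silence) class.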
2.4. Training
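One standard way to train such a model end to end is a CRF-style marginal log-likelihood, computed with a forward recursion over all possible segmentations. The sketch below assumes a precomputed score tensor `scores[t, d, c]` (score of a segment of duration d+1 frames ending at frame t with class c) and a maximum segment duration; it is a generic segmental-CRF objective, not necessarily the loss used in this work.

```python
import torch

def segmental_nll(scores, gold_segments, max_dur):
    """Negative log-likelihood of the reference segmentation under a
    segmental CRF. scores[t, d, c] scores a segment of duration d+1
    frames ending at frame t (inclusive) with class c; gold_segments
    is a list of (start, end, class) tuples tiling frames [0, T)."""
    T = scores.shape[0]
    # Forward recursion: alpha[t] = logsumexp over all segmentations of [0, t).
    alpha = [torch.tensor(float("-inf"))] * (T + 1)
    alpha[0] = torch.tensor(0.0)
    for t in range(1, T + 1):
        terms = [alpha[t - d] + scores[t - 1, d - 1, :]   # (C,) per duration
                 for d in range(1, min(max_dur, t) + 1)]
        alpha[t] = torch.logsumexp(torch.stack(terms).reshape(-1), dim=0)
    # Score of the reference segmentation.
    gold = sum(scores[e - 1, e - s - 1, c] for s, e, c in gold_segments)
    return alpha[T] - gold    # minimize: -log p(reference | clip)
```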
2.5. Decoding
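At test time, segmental models recover event boundaries directly by searching for the highest-scoring segmentation with dynamic programming (a segmental Viterbi search). The sketch below uses the same `scores[t, d, c]` convention as the training sketch above and is again a generic illustration rather than the exact decoder used here; a "SIL" class can simply be treated as one of the C labels and its segments dropped from the output.

```python
import numpy as np

def segmental_viterbi(scores, max_dur):
    """Return the best-scoring segmentation of T frames as a list of
    (start, end, class) tuples. scores[t, d, c] is the score of a
    segment of duration d+1 frames ending at frame t with class c."""
    T = scores.shape[0]
    best = np.full(T + 1, -np.inf)       # best[t]: best score over [0, t)
    best[0] = 0.0
    back = [None] * (T + 1)              # back[t] = (segment start, class)
    for t in range(1, T + 1):
        for d in range(1, min(max_dur, t) + 1):
            cand = best[t - d] + scores[t - 1, d - 1, :]  # all classes at once
            c = int(np.argmax(cand))
            if cand[c] > best[t]:
                best[t], back[t] = cand[c], (t - d, c)
    segments, t = [], T
    while t > 0:                         # trace the back-pointers
        s, c = back[t]
        segments.append((s, t, c))
        t = s
    return segments[::-1]
```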
3. Experimental Results
3.1. Experimental Conditions
3.2. Evaluation Metrics
- Detection tolerance parameter (dtc): 0.5
- Ground truth intersection parameter (gtc): 0.5
- Cross-trigger tolerance parameter (cttc): 0.3
- Maximum false-positive rate (e_max): 100
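These thresholds determine when a detected event counts as a true positive under PSDS: dtc bounds how much of a detection must overlap same-class ground truths, gtc bounds how much of a ground-truth event must be covered, cttc plays the analogous role for cross-class triggers, and e_max caps the effective false-positive rate over which the ROC is averaged. The sketch below is a deliberately simplified, single-detection illustration of the dtc/gtc gating (the full framework pools all dtc-passing detections per ground truth and also handles cross-triggers); it is not the reference implementation.

```python
def overlap(a, b):
    """Temporal intersection (seconds) of two (onset, offset) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def is_true_positive(det, gts, dtc=0.5, gtc=0.5):
    """Simplified PSDS-style check of one detection `det` against the
    same-class ground-truth intervals `gts` (all (onset, offset) pairs)."""
    # dtc: at least `dtc` of the detection must be covered by ground truths.
    if sum(overlap(det, g) for g in gts) / (det[1] - det[0]) < dtc:
        return False
    # gtc: some ground truth must itself be covered to at least `gtc`.
    return any(overlap(det, g) / (g[1] - g[0]) >= gtc for g in gts)
```

For example, a detection (1.0, 3.0) matched against a single ground truth (1.5, 4.0) passes both criteria: 75% of the detection and 60% of the ground truth overlap.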
3.3. Experimental Results
4. Conclusions
4.1. Discussion
4.2. Future Studies
Funding
Data Availability Statement
Conflicts of Interest
References






| | Training Set | Training Set | Training Set | Validation Set | Evaluation Set |
|---|---|---|---|---|---|
| Label type | Weak | Strong | Unlabeled | Strong | Strong |
| No. of clips | 1578 | 2045 | 14,412 | 1168 | 692 |
| Properties | Real recording | Synthetic | Real recording | Real recording | Real recording |
| Clip length | 10 s | 10 s | 10 s | 10 s | 10 s |

Classes (10, shared by all sets): Speech, Dog, Cat, Alarm bell ringing, Dishes, Frying, Blender, Running water, Vacuum cleaner, Electric shaver/toothbrush.
| # of LSTM Layers | Evaluation Set F-Score (%) | Validation Set F-Score (%) | Training Set F-Score (%) |
|---|---|---|---|
| 1 | 14.87 | 9.23 | 57.07 |
| 2 | 10.49 | 9.26 | 42.44 |
| 3 | 13.91 | 8.83 | 45.55 |
| 4 | 11.43 | 7.92 | 33.65 |
| # of LSTM Layers | Evaluation Set PSDS | Validation Set PSDS | Training Set PSDS |
|---|---|---|---|
| 1 | 0.29 | 0.22 | 0.77 |
| 2 | 0.23 | 0.20 | 0.79 |
| 3 | 0.27 | 0.21 | 0.75 |
| 4 | 0.28 | 0.20 | 0.73 |
| Stride (Resulting # of Frames per 10 s Clip) | Evaluation Set F-Score (PSDS) | Validation Set F-Score (PSDS) | Training Set F-Score (PSDS) |
|---|---|---|---|
| 1 (250) | 4.26% (0.14) | 2.15% (0.12) | 19.37% (0.45) |
| 2 (125) | 12.11% (0.23) | 5.66% (0.19) | 47.21% (0.74) |
| 4 (62) | 14.87% (0.29) | 9.23% (0.22) | 57.07% (0.77) |
| | DCASE 2018 CRNN-Based Baseline | Proposed Segmental Model |
|---|---|---|
| Evaluation Set F-Score (%) | 2.74 | 14.87 |
| | # of LSTM Layers | Evaluation Set F-Score (PSDS) | Validation Set F-Score (PSDS) | Training Set F-Score (PSDS) |
|---|---|---|---|---|
| With "SIL" | 1 | 14.87% (0.29) | 9.23% (0.22) | 57.07% (0.77) |
| With "SIL" | 2 | 10.49% (0.24) | 9.26% (0.20) | 42.44% (0.79) |
| Without "SIL" | 1 | 6.41% (0.23) | 4.28% (0.17) | 7.48% (0.32) |
| Without "SIL" | 2 | 8.03% (0.25) | 5.56% (0.18) | 9.53% (0.38) |
Chung, Y.-J. Sound Event Detection Employing Segmental Model. Mathematics 2025, 13, 3948. https://doi.org/10.3390/math13243948
