Article

Sound Event Detection Employing Segmental Model

Department of Electronics, Keimyung University, Daegu 42601, Republic of Korea
Mathematics 2025, 13(24), 3948; https://doi.org/10.3390/math13243948
Submission received: 5 November 2025 / Revised: 4 December 2025 / Accepted: 9 December 2025 / Published: 11 December 2025

Abstract

Segmental models compute likelihood scores in segment units instead of frame units to recognize sequence data. Motivated by promising results in speech recognition and natural language processing, we apply segmental models to sound event detection for the first time and verify their effectiveness compared with conventional frame-based approaches. The proposed model processes variable-length segments of sound signals by encoding feature vectors using deep learning techniques. These encoded vectors are subsequently embedded to derive representative values for each segment, which are then scored to identify the best matches for each input sound signal. Owing to the inherent variation in the lengths and types of input sound signals, segmental models incur high computational and memory costs. To address this issue, a simple segment-scoring function with efficient computation and memory usage is employed in our end-to-end model. We use the marginal log loss as the cost function while training the segmental model, which eliminates the reliance on strong labels for sound events. Experiments performed on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge dataset reveal that the proposed method achieves a better F-score in sound event detection compared with conventional convolutional recurrent neural network-based models.

1. Introduction

Machine learning-based approaches have been proposed to automatically extract information from environmental sounds. In this context, the detection and classification of acoustic scenes and events (DCASE) competition has been organized annually since 2013, addressing several key topics such as sound event detection (SED). SED involves the identification of sound signals as well as the onset and offset times of sound events [1]. Its applications include audio surveillance [2,3], urban sound analysis [4], information retrieval from multimedia content [5], healthcare monitoring [6], bird call detection [7], pathological voice detection [8] and infant cry classification [9].
Existing SED studies have primarily focused on deep neural network (DNN)-based approaches owing to their state-of-the-art performance in various artificial intelligence tasks, including computer vision [10], speech recognition [11], machine translation [12,13,14,15], and speaker identification [16]. In particular, convolutional recurrent neural networks (CRNNs), which combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been demonstrated to outperform conventional machine learning methods in SED [13]. CRNNs are now considered representative deep neural network architectures for SED and are widely used in diverse applications [17,18].
Despite the recent successes of DNNs in SED, there is considerable scope for improvement to ensure reliable SED in real-world situations. A key factor contributing to the performance degradation of DNNs in SED is the unavailability of adequate strongly labeled audio data for training the audio classifier. Strong labels specify the correct onset and offset times of each sound event in conjunction with its type. In real-world application scenarios, the high cost of obtaining strongly labeled data restricts the amount of such data available for training. In addition, real-world audio data, particularly large-scale audio data available from websites, contain various types of label noise [19], e.g., incorrect labels, where labeling errors result from mistakes made by annotators or ambiguities inherent in the sound signal, and incomplete labels, where the onset and offset times of each sound event are not provided alongside the sound event type. Weakly labeled data, which are widely used for training audio classifiers in DCASE challenges, are the most representative form of incompletely labeled data in this context [1,20].
Extensive research has been conducted to prevent performance degradation caused by incorrectly labeled noisy data. For instance, data cleansing, noise-tolerant training, and label noise modeling have been proposed for this purpose, with considerable success in audio classification as well as speech and image recognition [21,22].
By contrast, incompletely (weakly) labeled noisy data have not been studied as extensively. Techniques such as multiple instance learning and attention-based neural networks have been proposed in this context; however, they have only achieved limited success [23,24]. As weakly labeled data are readily available on a large scale and widely used for audio classification, training methods should be devised for audio classifiers that are capable of overcoming their limitations [20].
Recently, segment-based models have been applied successfully in speech recognition tasks [25,26,27] using words or sub-words as basic computational units. This is in contrast with conventional speech recognizers, which compute likelihood scores at each time frame. As frame-based models compute scores based on fixed input signal lengths, they struggle to incorporate linguistic or contextual information. By contrast, segmental models compute likelihood scores of segments by considering the beginning and end times of each segment relative to its associated word or sub-word unit label. The ability to incorporate various types of information within a segment makes segmental models suitable for applications such as speech recognition and sound signal classification.
Because segmental models identify the optimal boundaries of each segment automatically during training, the timing boundaries of each label need not be included in the training dataset [28,29,30]. This makes segmental models suitable for SED, where only weakly labeled data are available for training. In speech recognition, segmental models have shown only negligible performance improvements over frame-based methods; this is likely attributable to the excessive vocabulary size, which hinders the training of robust segmental models. In addition, the use of complex language models in speech recognizers hinders the efficient implementation of segmental models. Audio classification, by contrast, involves far fewer sound classes than speech recognition and does not require complex language models. Therefore, we believe segmental models are well suited to audio classification and can improve performance over existing methods.
Motivated by these considerations, we propose the use of segmental models to mitigate the degradation of audio classification performance induced by weakly labeled noisy data. To the best of our knowledge, this is the first study to apply segmental models to SED. Segmental models are expected to perform robustly when handling weakly labeled noisy data because they identify optimal segment boundaries between neighboring sound events incrementally during the training process. This characteristic mitigates the need for start and end times of sound events to be included in audio recording databases, enabling the use of large-scale audio data for training.
The remainder of this paper is organized as follows. In Section 2, we introduce the proposed segmental model. In Section 3, we present the experimental results on SED performance. Finally, the conclusions are presented in Section 4.

2. The Segmental Model for SED

The architecture of the proposed segmental SED model is illustrated in Figure 1. It is an end-to-end model from an acoustic signal to a sound label, without requiring any lexicon model. Owing to the benefits of processing a sound signal as a whole rather than decomposing it into sub-sound units, an end-to-end model is suitable for SED. Our model is adapted for SED from a segmental model recently used for speech recognition [26]. Unlike other end-to-end models, it computes sequence probabilities based on segment scores rather than frame scores. Segment scoring is performed using dot products between segment embeddings derived from acoustic features and a weight layer, where each node represents a sound label. This enables the use of a layer for predefined acoustic segment embeddings and sound label embeddings, if they are available.
As depicted in Figure 1, the proposed model can be decomposed into several sub-modules, i.e., feature extraction, feature encoding, segment embedding, and segment scoring. In addition, loss computation and backpropagation are performed during training. While decoding, SED is performed based on the optimal segment boundary decision by applying the Viterbi algorithm to the segmental model. In the following subsections, we discuss each part of the segmental model in detail.

2.1. Feature Extraction

For the training and decoding of the segmental model, the log-mel filterbank (LMFB) is extracted to provide acoustic features to the network, as outlined in Figure 2. The audio signal is sampled at 44.1 kHz, and the short-time Fourier transform (STFT) is computed using a Hamming window of length 1764 samples (40 ms) with an overlap of 882 samples (20 ms). Sixty-four bands of mel-scale filterbank outputs between 0 and 22.05 kHz are obtained from the STFT and log-transformed to yield an equidimensional LMFB for each 20 ms frame. The feature extraction process generates 500 frames with 40 dimensions for the 10 s clips used for training and decoding. After computation, the LMFB is normalized by subtracting the mean and dividing by the standard deviation computed over the entire training dataset. Subsequently, it is used as the input for the segmental model.
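To make the front end concrete, the following is a minimal sketch of the LMFB extraction described above. It assumes the librosa library (the paper does not name its tool), and it uses 40 mel bands to match the 40-dimensional features reported in the rest of the text, even though 64 bands are also mentioned; both choices are assumptions rather than the author's confirmed configuration.

```python
# Hedged sketch of LMFB extraction (librosa assumed; 40 mel bands assumed).
import numpy as np
import librosa

SR = 44100          # 44.1 kHz sampling rate
N_FFT = 1764        # 40 ms Hamming window
HOP = 882           # 20 ms frame shift
N_MELS = 40         # assumption; see the discussion of the LMFB dimension above

def extract_lmfb(wav_path: str) -> np.ndarray:
    """Return a (frames, N_MELS) log-mel filterbank matrix for one clip."""
    y, _ = librosa.load(wav_path, sr=SR, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, hop_length=HOP, win_length=N_FFT,
        window="hamming", n_mels=N_MELS, fmin=0.0, fmax=SR / 2)
    lmfb = np.log(mel + 1e-10)           # log transform
    return lmfb.T                        # about 500 frames for a 10 s clip

def normalize(lmfb: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Mean/variance normalization with statistics from the training set."""
    return (lmfb - mean) / (std + 1e-8)
```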
In speech recognition, which also involves time-series signals as in the case of SED, the first- and second-derivative features are calculated from static features. In this study, we consider the extracted LMFB as a static feature and compute the derivative features as follows for training the segmental model:
$$d_t = \frac{\sum_{k=1}^{K} k\,(o_{t+k} - o_{t-k})}{2 \sum_{k=1}^{K} k^{2}} \quad (1)$$
where $d_t$ denotes the derivative feature at time $t$, $o_t$ denotes the static feature, and $K$ denotes the number of frames preceding and following the $t$-th frame. For the second-derivative feature, the first-derivative feature given by (1) is treated as the static feature. Consequently, we use 120-dimensional ($40 \times 3$) vectors (static LMFB + delta LMFB + delta-delta LMFB) as the input features. In addition, to make the input data more compact, pairs of successive frames are stacked and alternate frames are dropped, yielding 240-dimensional feature vectors as inputs to the segmental model.
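The following is a minimal sketch of the derivative-feature computation in (1) and the frame-pair stacking described above. The window size $K = 2$ is an assumption (a common choice in speech recognition); the paper does not state its value.

```python
# Hedged sketch of delta features (1) and frame-pair stacking (K = 2 assumed).
import numpy as np

def delta(feat: np.ndarray, K: int = 2) -> np.ndarray:
    """Compute derivative features d_t from static features of shape (frames, dims)."""
    padded = np.pad(feat, ((K, K), (0, 0)), mode="edge")
    num = sum(k * (padded[K + k:len(feat) + K + k] - padded[K - k:len(feat) + K - k])
              for k in range(1, K + 1))
    return num / (2 * sum(k * k for k in range(1, K + 1)))

def build_inputs(lmfb: np.ndarray) -> np.ndarray:
    """Static + delta + delta-delta (120 dims), then stack frame pairs (240 dims)."""
    d1 = delta(lmfb)
    d2 = delta(d1)
    x = np.concatenate([lmfb, d1, d2], axis=1)   # (500, 120) for a 10 s clip
    if len(x) % 2:                               # drop a trailing frame if odd
        x = x[:-1]
    return x.reshape(len(x) // 2, -1)            # (250, 240): pairs concatenated
```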

2.2. Feature Encoding

After extracting the feature vectors, a neural network structure is used to encode the feature vectors $x = (x_1, x_2, \ldots, x_T)$, $x_t \in \mathbb{R}^{F}$, into encoding vectors $h = (h_1, h_2, \ldots, h_{\hat{T}})$, $h_t \in \mathbb{R}^{E}$, where $F$ and $E$ denote the dimensions of the extracted feature vectors and encoded vectors, respectively, and $T$ and $\hat{T}$ denote the sequence lengths of the respective vectors. The architecture of the neural network for feature encoding, which makes the encoded features more suitable than the original feature vectors for sound signal classification, is depicted in Figure 3.
As depicted in Figure 3, a sequence of feature vectors $x$ with a length of 250 frames (covering the 10 s clip) is input into the six layers of bidirectional long short-term memory (LSTM) to model the time correlation of the feature vectors. The number of hidden nodes in the LSTM is 512; therefore, the dimension of the output of the bidirectional LSTM is 1024 (2 × 512). Applying a 1D CNN to the output of the bidirectional LSTM reduces the dimension of the vectors to 512 without any loss of time-frame information. The output of the 1D CNN is passed through a rectified linear unit (RELU) layer and then through average pooling (Avg_pooling1d) to halve the length of the time sequence to 125. Avg_pooling1d helps discard irrelevant information in the output of the 1D CNN and reduces the computation time, which is significant for the subsequent processing stages. By controlling the stride in Avg_Pooling1d, we can trade off the performance of the segmental model against the required computation time.
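The sketch below illustrates one possible PyTorch realization of the encoder in Figure 3. The 1D CNN kernel size (1) and the pooling kernel are assumptions; the paper specifies only the resulting dimensions (1024 to 512) and the pooled sequence length.

```python
# Hedged sketch of the feature encoder in Figure 3 (PyTorch assumed).
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    def __init__(self, in_dim=240, hidden=512, out_dim=512,
                 num_lstm_layers=1, pool_stride=2):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=num_lstm_layers,
                             bidirectional=True, batch_first=True)
        self.conv = nn.Conv1d(2 * hidden, out_dim, kernel_size=1)  # 1024 -> 512
        self.relu = nn.ReLU()
        self.pool = nn.AvgPool1d(kernel_size=pool_stride, stride=pool_stride)

    def forward(self, x):                   # x: (batch, T=250, 240)
        h, _ = self.blstm(x)                # (batch, T, 1024)
        h = self.conv(h.transpose(1, 2))    # (batch, 512, T)
        h = self.pool(self.relu(h))         # (batch, 512, T // stride)
        return h.transpose(1, 2)            # (batch, T // stride, 512)

# Example: a batch of 4 stacked feature sequences of 250 frames;
# with pool_stride=4 the encoded length becomes 62 frames, as in Table 4.
enc = FeatureEncoder(pool_stride=4)
h = enc(torch.randn(4, 250, 240))           # h.shape == (4, 62, 512)
```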

2.3. Segment Embedding and Scoring

The segment embedding module extracts embedding vectors that represent segments from the encoding vectors $h = (h_1, h_2, \ldots, h_{\hat{T}})$, $h_t \in \mathbb{R}^{E}$. For example, for an assumed segment $[t, t+s)$, $1 \le t \le \hat{T}$, $1 \le s \le S$, where $S$ denotes the maximum possible segment length, the segment embedding vector $I(h_{t:t+s}) \in \mathbb{R}^{D}$ is derived using a feed-forward neural network (FNN) with a RELU activation function and $h_{t:t+s} \in \mathbb{R}^{s \times E}$ as input.
$$I(h_{t:t+s}) = W_1 P(h_{t:t+s}) + b_1 \quad (2)$$
Here, $h_{t:t+s}$ denotes the set of vectors $h_t, h_{t+1}, \ldots, h_{t+s-1}$; $P(\cdot)$ represents pooling of the vector sequence $h_t, h_{t+1}, \ldots, h_{t+s-1}$; and $W_1$ and $b_1$ represent the weight and bias of the FNN, respectively. Of the several potential options, $P(\cdot)$ is taken to be the simple concatenation of the first and last feature encoding vectors of the segment $h_t, h_{t+1}, \ldots, h_{t+s-1}$, as follows.
$$P(h_{t:t+s}) = [\,h_t : h_{t+s-1}\,] \quad (3)$$
Using the segment embedding vector $I(h_{t:t+s})$ given by (2), the segment score $\omega_{t,s,y}$ for the segment beginning at time $t$ and ending at $(t+s-1)$ with the sound label $y$ is obtained using another FNN, as follows.
$$\big[\omega_{t,s,y}\big]_{1 \le y \le L_y} = W_2\, I(h_{t:t+s}) + b_2, \quad 1 \le t \le \hat{T},\; 1 \le s \le S \quad (4)$$
Here, $L_y$ denotes the total number of sound classes (labels) to be classified by the segmental model, with $W_2 \in \mathbb{R}^{D \times L_y}$ and $b_2 \in \mathbb{R}^{L_y}$. The integrated neural network architecture of the segment embedding and segment scoring modules is illustrated in Figure 4.
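A minimal sketch of the segment embedding (2)-(3) and scoring (4) modules is given below, assuming PyTorch. The embedding dimension $D$ is an assumption; the pooling is the concatenation of the first and last encoding vectors, as in (3). The nested loops are for clarity and are not optimized for memory or speed.

```python
# Hedged sketch of segment embedding and scoring (Figure 4); PyTorch assumed.
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    def __init__(self, enc_dim=512, emb_dim=256, num_labels=10, max_seg=62):
        super().__init__()
        self.max_seg = max_seg
        self.embed = nn.Linear(2 * enc_dim, emb_dim)   # W1, b1 in (2)
        self.score = nn.Linear(emb_dim, num_labels)    # W2, b2 in (4); num_labels = L_y

    def forward(self, h):
        """h: (T_hat, enc_dim) -> scores: (T_hat, max_seg, num_labels).

        scores[t, s-1, y] is the score of the segment starting at 0-indexed
        frame t, of length s, with label y; segments running past T_hat
        are masked with -inf.
        """
        T_hat = h.size(0)
        scores = h.new_full((T_hat, self.max_seg, self.score.out_features),
                            float("-inf"))
        for t in range(T_hat):
            for s in range(1, min(self.max_seg, T_hat - t) + 1):
                pooled = torch.cat([h[t], h[t + s - 1]], dim=-1)   # (3)
                emb = torch.relu(self.embed(pooled))               # (2)
                scores[t, s - 1] = self.score(emb)                 # (4)
        return scores
```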
Figure 5 shows a block diagram of the overall architecture of the segmental model. The original 40-dimensional LMFB feature vectors over 500 frames are passed through the model, which finally outputs time-stamp predictions for the 10 sound classes, assuming a pooling stride of 4.

2.4. Training

During training, the marginal log loss criterion is adopted as the loss function to cope efficiently with real-world environments, where sufficient training data with ground-truth segmentation are typically unavailable. This is expected to compensate for the performance degradation induced by the incomplete-label errors that occur frequently in most public sound datasets collected from real environments.
Given an input training feature vector sequence $x = (x_1, x_2, \ldots, x_T)$ and its corresponding label sequence $Y = (y_1, y_2, \ldots, y_L)$ without time segmentation information, the marginal log loss function $\mathcal{L}(\cdot)$ is defined as follows:
$$\mathcal{L}(Y, x) = -\log p(Y \mid x) = -\log \sum_{z \in \acute{Z}} p(Y, z \mid x) \quad (5)$$
Here, $\acute{Z}$ denotes the set of all possible segmentations of the input $x$ relative to the given label sequence $Y$, and $p(Y, z \mid x)$ denotes the conditional probability of the label sequence $Y$ and a segmentation $z$ given an input sequence $x$.
$p(Y, z \mid x)$ is defined as follows:
$$p(Y, z \mid x) = p(\pi_Y \mid x) = \frac{1}{Z(x)} \exp\big(\omega(x, \pi_Y)\big) \quad (6)$$
$$Z(x) = \sum_{\bar{\pi}} \exp\big(\omega(x, \bar{\pi})\big) \quad (7)$$
In (6), $\pi_Y = (Y, z)$ denotes a segmentation path of $Y$, and $\omega(x, \pi_Y)$ represents the segmentation score of $x = (x_1, x_2, \ldots, x_T)$ for the segmentation path $\pi_Y$. $\omega(x, \pi_Y)$ is obtained by adding the segment scores given by (4) for each segment comprising the path $\pi_Y$. In (7), $\bar{\pi}$ represents any segmentation path, without relying on the label sequence $Y$.
From (5) and (6), we have
$$\mathcal{L}(Y, x) = -\log \sum_{z \in \acute{Z}} p(Y, z \mid x) = -\log \sum_{z \in \acute{Z}} \exp\big(\omega(x, (Y, z))\big) + \log Z(x) = -\log \sum_{\bar{\pi}:\, L(\bar{\pi}) = Y} \exp\big(\omega(x, \bar{\pi})\big) + \log Z(x) \quad (8)$$
By adding the condition $L(\bar{\pi}) = Y$ in (8), we restrict the summation to only those segmentation paths whose label sequence is $Y$, unlike in (7), where no such restriction is imposed on the label sequence.
The cost function given by (8) and its derivative values can be computed efficiently using an iterative forward-backward algorithm [31], which enables the optimization of parameters of the neural networks that comprise the segmental model.
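To illustrate how (8) can be evaluated, the sketch below implements a simplified forward recursion in log space over the score layout produced by the scorer sketch above (scores[t, s-1, y] for a segment starting at 0-indexed frame t with length s and label y). It is an assumption-laden illustration, not the author's exact forward-backward implementation; because it is written with differentiable PyTorch operations, gradients can be obtained with autograd.

```python
# Hedged sketch of the marginal log loss (8) via dynamic programming in log space.
import torch

def marginal_log_loss(scores: torch.Tensor, labels: list) -> torch.Tensor:
    """scores: (T_hat, S, Y); labels: ordered label indices y_1..y_L (no timing)."""
    T_hat, S, _ = scores.shape
    neg_inf = torch.tensor(float("-inf"), device=scores.device)

    # Denominator log Z(x): all segmentations with any label per segment, as in (7).
    alpha = [neg_inf] * (T_hat + 1)
    alpha[0] = torch.zeros((), device=scores.device)
    for t in range(1, T_hat + 1):
        terms = [alpha[t - s] + torch.logsumexp(scores[t - s, s - 1], dim=-1)
                 for s in range(1, min(S, t) + 1)]
        alpha[t] = torch.logsumexp(torch.stack(terms), dim=0)

    # Numerator: segmentations consistent with the given label sequence, as in (8).
    L = len(labels)
    beta = [[neg_inf] * (L + 1) for _ in range(T_hat + 1)]
    beta[0][0] = torch.zeros((), device=scores.device)
    for t in range(1, T_hat + 1):
        for l in range(1, L + 1):
            terms = [beta[t - s][l - 1] + scores[t - s, s - 1, labels[l - 1]]
                     for s in range(1, min(S, t) + 1)]
            beta[t][l] = torch.logsumexp(torch.stack(terms), dim=0)

    return -beta[T_hat][L] + alpha[T_hat]   # -log(numerator) + log Z(x)
```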

2.5. Decoding

The decoding algorithm aims to optimize label segmentation and can be formulated mathematically as follows:
$$\pi^{*} = \underset{\pi}{\arg\max}\; \omega(x, \pi) \quad (9)$$
Because $\pi^{*} = (Y^{*}, z^{*})$ given by (9) yields the highest segmentation score among all possible segmentations for the given SED task, the optimal sound label sequence $Y^{*}$ and its time frame segmentation $z^{*}$ can be obtained from $\pi^{*}$. The maximization in (9) can be computed efficiently via dynamic programming using the Viterbi algorithm.
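The following sketch illustrates Viterbi decoding of (9) over the same score layout; it returns the best label sequence together with the frame boundaries of each segment. This is a simplified illustration of the dynamic program, not the author's exact decoder.

```python
# Hedged sketch of Viterbi decoding for (9) over scores[t, s-1, y].
import torch

def viterbi_decode(scores: torch.Tensor):
    """scores: (T_hat, S, Y) -> list of (start_frame, end_frame, label)."""
    T_hat, S, _ = scores.shape
    best = [float("-inf")] * (T_hat + 1)
    back = [None] * (T_hat + 1)            # back[t] = (s, y) of the last segment
    best[0] = 0.0
    for t in range(1, T_hat + 1):
        for s in range(1, min(S, t) + 1):
            seg_best, y = torch.max(scores[t - s, s - 1], dim=-1)
            cand = best[t - s] + seg_best.item()
            if cand > best[t]:
                best[t], back[t] = cand, (s, y.item())

    # Backtrace from the final frame to recover segments and labels.
    segments, t = [], T_hat
    while t > 0:
        s, y = back[t]
        segments.append((t - s, t, y))     # [start, end) in encoded frames
        t -= s
    return list(reversed(segments))
```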

3. Experimental Results

3.1. Experimental Conditions

The DCASE 2019 Task 4 dataset is used for our experiments [1]. The training set comprises three parts: 2045 synthesized audio clips with strong labels, 1578 real-world audio recording clips with weak labels, and 14,412 real-world audio recording clips without labels. The real-world audio recording clips are obtained from AudioSet [29]. Synthetic clips are generated using clean signals from the Freesound dataset [32] and noise derived from the SINS (Sound Interfacing through the Swarm) database [33]. We also use validation data comprising 1168 real-world audio recording clips. The length of each clip is 10 s, and there are 10 different sound classes, mostly domestic or household sounds. The details of the dataset are presented in Table 1. The experimental results on the validation and evaluation datasets are presented below.
As listed in Table 1, although the original training data comprise weakly labeled, strongly labeled, and unlabeled data, only the strongly labeled data are used in this study because the other two data types (weakly labeled and unlabeled) do not have a label format suitable for the segmental model proposed in this paper. The segmental model requires the labels of all events in the order in which they appear in the sound signal. As the weakly labeled data in DCASE 2019 Task 4 specify only the types of labels present in each audio clip, the full label sequence required for training the segmental model cannot be obtained from them. For example, when an audio clip has the actual sound label sequence "Speech-Dog-Speech-Speech-Dog", the weakly labeled data in DCASE 2019 give the label information only as "Speech-Dog", from which the entire label sequence cannot be recovered. Although we use the strongly labeled data, the time segmentation information in the strongly labeled data is not used in this work because the marginal log loss function used as the training criterion requires only the label sequence. The marginal log loss function could be modified to accommodate weakly labeled data, which is left for future study. Thus, our segmental model is restricted to training data that provide the whole label sequence.
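For illustration, the sketch below shows one way the ordered label sequences (without timing) can be obtained from the strongly labeled metadata, assuming the DCASE-style columns "filename", "onset", "offset", and "event_label". How overlapping events are ordered in the paper is not specified; here they are simply sorted by onset time.

```python
# Hedged sketch: derive ordered label sequences from strong-label metadata.
import pandas as pd

def label_sequences(meta_tsv: str) -> dict:
    """Return {filename: [event_label, ...]} ordered by onset time."""
    df = pd.read_csv(meta_tsv, sep="\t")
    sequences = {}
    for fname, group in df.groupby("filename"):
        ordered = group.sort_values("onset")["event_label"].tolist()
        sequences[fname] = ordered          # onset/offset times are discarded
    return sequences
```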

3.2. Evaluation Metrics

The performance of the segmental model is evaluated in terms of the F-score employing event-based analysis [34], which compares the output of the model with the ground-truth table whenever the output indicates that an event has occurred. Each decision is of one of three types: true positive (TP), false positive (FP), or false negative (FN). A TP indicates that the period of a detected sound event overlaps with that in the ground-truth table; an onset collar of 200 ms and an offset collar of 200 ms or 20% of the event length are permitted in this decision. An FP indicates a model output with no corresponding overlapping entry in the ground-truth table. An FN indicates an event period in the ground-truth table that is not captured by a corresponding output of the model.
The F-score (F) is defined as the harmonic mean of precision (P) and recall (R), which are computed from the TP, FP, and FN counts as follows:
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F = \frac{2PR}{P + R} \quad (10)$$
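The sketch below is a simplified illustration of the event-based matching rule and the F-score in (10), with a 200 ms onset collar and an offset collar of max(200 ms, 20% of the event length). Dedicated toolkits accompanying [34] are commonly used in practice; this is only an illustration of the decision rule, not the official scorer.

```python
# Hedged sketch of event-based matching with collars and the F-score in (10).
def event_based_f_score(detected, reference, t_collar=0.2, pct=0.2):
    """detected/reference: lists of (onset, offset, label) in seconds for one clip."""
    matched_ref = set()
    tp = 0
    for d_on, d_off, d_lab in detected:
        hit = False
        for i, (r_on, r_off, r_lab) in enumerate(reference):
            if i in matched_ref or d_lab != r_lab:
                continue
            off_collar = max(t_collar, pct * (r_off - r_on))
            if abs(d_on - r_on) <= t_collar and abs(d_off - r_off) <= off_collar:
                matched_ref.add(i)
                hit = True
                break
        tp += hit
    fp = len(detected) - tp
    fn = len(reference) - tp
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```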
We also use the polyphonic sound detection score (PSDS) as an additional metric for evaluating the performance of the segmental model; PSDS values are presented in this paper as a contrastive measure to the F-score. PSDS was originally introduced as a new, flexible, and robust formulation of sound event detection evaluation that is closer to the end-user perception of sound events [35]. It discriminates cross-triggers from generic false positives and supports their custom weighting to cope with imbalanced datasets and to help developers identify system weaknesses. While F-scores are computed at a single operating point (decision threshold = 0.5), PSDS values are computed over 50 operating points (linearly distributed from 0.01 to 0.99). The parameters used for PSDS evaluation in this study are (alpha_ct = 0, alpha_st = 0), as listed below; a brief usage sketch follows the list:
  • Detection Tolerance parameter (dtc): 0.5
  • Ground Truth intersection parameter (gtc): 0.5
  • Cross-Trigger Tolerance parameter (cttc): 0.3
  • Maximum False Positive rate (e_max): 100
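The sketch below illustrates how PSDS with the above parameters might be computed using the psds_eval package; the package choice, its API, and the file names are assumptions, since the paper does not name its implementation, so the package documentation should be consulted for the exact interface.

```python
# Hedged sketch of PSDS computation (psds_eval package and file names assumed).
import pandas as pd
from psds_eval import PSDSEval

gt = pd.read_csv("ground_truth.tsv", sep="\t")     # filename, onset, offset, event_label
meta = pd.read_csv("durations.tsv", sep="\t")      # filename, duration

psds_eval = PSDSEval(dtc_threshold=0.5, gtc_threshold=0.5, cttc_threshold=0.3,
                     ground_truth=gt, metadata=meta)

# 50 operating points: one detection table per decision threshold (0.01 to 0.99).
for thr in [0.01 + i * 0.02 for i in range(50)]:
    det = pd.read_csv(f"detections_{thr:.2f}.tsv", sep="\t")   # hypothetical files
    psds_eval.add_operating_point(det)

psds = psds_eval.psds(alpha_ct=0.0, alpha_st=0.0, max_efpr=100)
print("PSDS:", psds.value)
```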

3.3. Experimental Results

The marginal log loss function is used as the training criterion, optimized with the Adam optimizer [36] using a learning rate of 0.001 and a batch size of 32. The StepLR scheduler from PyTorch (v1.11.0) is used to reduce the learning rate during training iterations, and early termination is applied based on the F-score on the validation dataset.
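The following sketch summarizes this training configuration, reusing the marginal_log_loss sketch above as the loss function. The StepLR step_size and gamma, the epoch budget, and the early-stopping patience are assumptions; the paper specifies only the optimizer, learning rate, batch size, StepLR scheduling, and early termination on the validation F-score. Batching of variable-length label sequences is simplified here.

```python
# Hedged sketch of the training loop (PyTorch assumed; hyperparameters partly assumed).
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

def train(model, train_loader, loss_fn, validate_f_score, epochs=100, patience=10):
    optimizer = Adam(model.parameters(), lr=1e-3)
    scheduler = StepLR(optimizer, step_size=10, gamma=0.5)   # assumed values
    best_f, wait = -1.0, 0
    for epoch in range(epochs):
        model.train()
        for features, label_seq in train_loader:   # batch size 32; batching simplified
            loss = loss_fn(model(features), label_seq)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()

        f = validate_f_score(model)                # event-based F-score on validation set
        if f > best_f:
            best_f, wait = f, 0
            torch.save(model.state_dict(), "best_model.pt")
        else:
            wait += 1
            if wait >= patience:                   # early termination
                break
    return best_f
```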
Figure 6 depicts typical examples of the audio clips considered in this study. In Figure 6a, two different types of sound signals appear in sequence, with silent periods between them. As depicted in Figure 6b, a long period of silence is observed prior to the sound signal. These examples highlight the high occurrence of silent periods or other irrelevant sound signals (garbage) at the beginning and end of the audio clip or between pairs of separate sound signals. To address this scenario, we add an “SIL” label at the initial and final parts of the audio clip and between pairs of distinct sound signals. This is expected to improve SED performance significantly. The effect of the added “SIL” label will be explained in detail later in this section.
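A minimal sketch of the "SIL" label insertion is given below. It reflects one simple reading of the rule described above; whether "SIL" is inserted between every pair of events or only where a gap actually exists is not fully specified in the text.

```python
# Hedged sketch of inserting the "SIL" label at the clip boundaries and
# between consecutive sound events.
def insert_sil(label_seq):
    """['Speech', 'Dog'] -> ['SIL', 'Speech', 'SIL', 'Dog', 'SIL']"""
    out = ["SIL"]
    for label in label_seq:
        out += [label, "SIL"]
    return out
```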
Table 2 presents the F-score of the proposed segmental model for varying numbers of bidirectional long short-term memory (LSTM) layers when the stride at Avg_Pooling1d in Figure 3 is 4 and the maximum segment size S in Figure 4 is 62. Performance degrades across all three datasets (evaluation, validation, and training sets) as the number of LSTM layers is increased. This suggests that the amount of training data used in this experiment is insufficient to leverage the potential of deeper architectures; the performance of the proposed segmental model can be assessed more accurately when sufficient training data are available. The best performance is achieved with one LSTM layer. As listed in Table 2, the F-score in this case is 14.87% on the evaluation set, 9.23% on the validation set, and 57.07% on the training set.
Table 3 compares the PSDS values across different numbers of LSTM layers. Unlike the performance measured by the F-score, no significant variation is observed as the number of LSTM layers changes. The F-scores are thus more sensitive to the number of LSTM layers than the PSDS values, which is expected because PSDS is generally more robust than the F-score, reflecting performance from the perspective of human perception.
To observe the effect of pooling in the feature encoding shown in Figure 3, Table 4 shows the performance variation as the stride in Avg_Pooling1d changes, reporting F-scores as well as PSDS values when the maximum segment size S is 62 and the number of LSTM layers is 1. The performance improves as the stride increases from 1 to 4, at which point the length of the encoded sequence becomes equal to the maximum segment size. This result can be interpreted in two ways. First, the superior performance with more pooling may suggest that more training data are needed to fully exploit the advantages of the segmental model. Second, the performance variation with the pooling stride is related to the maximum segment size S. When the stride is 2, the maximum segment size (62 frames) is smaller than the length of the encoded sequence (125 frames), which can negatively affect performance because the segmental model may fail to reliably model long-duration sound classes such as Speech or Electric shaver toothbrush, whose lengths often exceed half the total length of the audio clip.
Table 5 presents a performance comparison between the proposed segmental model and the CRNN-based baseline model [37] of the DCASE 2018 Challenge. To ensure a fair comparison, only strongly labeled data without segmentation information are used to train the CRNN model, and the architecture of the CRNN model as well as its training procedure are adopted from the official website of the DCASE 2018 Challenge. The results listed in Table 5 indicate that the proposed segmental model outperforms the CRNN model. Even though this conclusion is based on a subset of the dataset presented in Table 1, we believe it is still meaningful, as it demonstrates the potential of the segmental model to replace conventional frame-based SED approaches once it is improved further. The poor performance of the CRNN-based baseline in Table 5 can be understood by noting that it achieved an F-score of 14.06% when trained with the weakly labeled data (1578 clips) as well as the unlabeled data (14,412 clips) of the DCASE 2018 challenge dataset [37]. We think that the insufficient amount of strongly labeled training data (2045 clips) contributed significantly to the poor performance of the CRNN model, and this result suggests that the segmental model is more robust than frame-based methods such as the CRNN in SED.
Table 6 presents a performance comparison between segmental models trained with and without the "SIL" label. As expected, a significant performance improvement is attained when the "SIL" label is used. This can be attributed to the effective modeling of the silent periods and non-target garbage signals that occur frequently in the audio clips used for training and testing. The performance could be improved further by specifying the types (labels) of non-target garbage signals in greater detail.

4. Conclusions

4.1. Discussions

Frame-based classification approaches are widely used for SED; however, they require the inclusion of frame-wise segmentation information in the training data, which is rarely available in real-world environments. To address this limitation, we propose a segmental model, wherein the entire sound signal is modeled as a segment rather than as a concatenation of frames, and adopt a marginal log loss function to optimize the parameters of the segmental model without requiring time segmentation information.
The proposed segmental model comprises three primary components that are concatenated sequentially. They involve encoding feature vectors, embedding the encoded vectors to obtain representative values for each segment, and scoring the segments to identify the best matches for each incoming input sound signal. The three primary components are implemented by employing deep neural networks appropriately to construct an end-to-end audio classifier, which contributes to improving SED performance by optimizing the parameters of the segmental model consistently.
Experimental results on the DCASE 2019 Challenge dataset reveal that the proposed segmental model outperforms the CRNN-based baseline audio classifier, demonstrating the future potential of the proposed segmental model in SED, although the available training data were insufficient to fully validate the results.

4.2. Future Studies

Although our study is novel in that it uses segmental models instead of conventional frame-wise classification methods for the first time in SED, it has some constraints that should be overcome in future studies.
First, the proposed segmental model employs the marginal log loss function as the training criterion, which requires label sequence information in the training data. However, the weakly labeled data in the DCASE challenges specify only the types of labels that exist in the audio clips and do not provide the label sequence information. Thus, we had to use the strongly labeled data of the DCASE challenge, without its segmentation information, as the training data, since it was the only data that fit the training criterion. By modifying the marginal log loss function in (8), we believe that the weakly labeled data in the DCASE challenges could be accommodated by the proposed segmental model; this is left as a main topic for future study.
Second, the segment pooling employed in the segment embedding process needs improvement. Although we simply used the concatenation of the first and last encoding vectors for segment pooling, this may not represent the characteristics of the segment as a whole. We need to devise segment pooling methods that represent a segment more fully than the current concatenation method.
Third, owing to the limited training data available, we could not fully evaluate the performance of the proposed segmental model, and comparison with state-of-the-art methods in weakly supervised SED was difficult. We will have to find ways to significantly enrich the training data required for the proposed segmental model, making extensive performance evaluation possible in the near future.

Funding

This research was supported by the BISA Research Grant of Keimyung University in 2023.

Data Availability Statement

The data presented in this study are openly available at https://dcase.community/challenge2019/task-sound-event-detection-in-domestic-environments#download (accessed on 1 June 2023).

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Turpault, N.; Serizel, R.; Salamon, J.; Shah, A.P. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events, New York, NY, USA, 25–26 October 2019. [Google Scholar]
  2. Nandwana, M.K.; Ziaei, A.; Hansen, J. Robust unsupervised detection of human screams in noisy acoustic environments. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 161–165. [Google Scholar]
  3. Crocco, M.; Cristani, M.; Trucco, A.; Murino, V.M. Audio surveillance: A systematic review. ACM Comput. Surv. 2016, 48, 1–46. [Google Scholar] [CrossRef]
  4. Salamon, J.; Bello, J.P. Feature learning with deep scattering for urban sound analysis. In Proceedings of the 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 31 August–4 September 2015; pp. 724–728. [Google Scholar]
  5. Ntalampiras, S.; Potamitis, I.; Fakotakis, N. On acoustic surveillance of hazardous situations. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, 19–24 April 2009; pp. 165–168. [Google Scholar]
  6. Wang, Y.; Neves, L.; Metze, F. Audio-based multimedia event detection using deep recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 2742–2746. [Google Scholar]
  7. Dekkers, G.; Vuegen, L.; Waterschoot, T.; Vanrumste, B.; Karsmakers, P. DCASE 2018 Challenge-Task 5: Monitoring of domestic activities based on multi-channel acoustics. arXiv 2018, arXiv:1807.11246. [Google Scholar]
  8. Renisha, G.; Jayasree, T. Cascaded Feedforward Neural Networks for speaker identification using Perceptual Wavelet based Cepstral Coefficients. J. Intell. Fuzzy Syst. Appl. Eng. Technol. 2019, 37, 1141–1153. [Google Scholar] [CrossRef]
  9. Jayasree, T.; Emerald Shia, S. Combined Signal Processing Based Techniques and Feed Forward Neural Networks for Pathological Voice Detection and Classification. Sound Vib. 2021, 55, 141–161. [Google Scholar] [CrossRef]
  10. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 60, 84–90. [Google Scholar] [CrossRef]
  11. Graves, A.; Mohamed, A.; Hinton, G. Speech recognition with deep recurrent neural Networks. In Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649. [Google Scholar]
  12. Cho, K.; Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar]
  13. Cakir, E.; Parascandolo, G.; Heittola, T.; Huttunen, H.; Virtanen, T. Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1291–1303. [Google Scholar] [CrossRef]
  14. Cakir, E.; Heittola, T.; Huttunen, H.; Virtanen, T. Polyphonic sound event detection using multilabel deep neural networks. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–7. [Google Scholar]
  15. McLoughlin, I.; Zhang, H.; Xie, Z.; Song, Y.; Xiao, W. Robust sound event classification using deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 540–552. [Google Scholar] [CrossRef]
  16. Jayasree, T.; Blessy, S. Infant cry classification via deep learning based Infant cry networks using Discrete Stockwell Transform. Eng. Appl. Artif. Intell. 2025, 160, 112008. [Google Scholar] [CrossRef]
  17. Kwak, J.; Chung, Y. Sound event detection using derivative features in deep neural networks. Appl. Sci. 2020, 10, 4911. [Google Scholar] [CrossRef]
  18. Kim, S.; Chung, Y. Multi-scale Features for Transformer Model to Improve the Performance of Sound Event Detection. Appl. Sci. 2022, 12, 2626. [Google Scholar] [CrossRef]
  19. Fonseca, E.; Pons, J.; Favory, X.; Font, F.; Bogdanov, D.; Ferraro, A.; Oramas, S.; Porter, A.; Serra, X. Freesound datasets: A platform for the creation of open audio datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 23–27 October 2017; pp. 486–493. [Google Scholar]
  20. Frenay, B.; Verleysen, M. Classification in the presence of label noise: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2014, 2, 845–869. [Google Scholar] [CrossRef] [PubMed]
  21. Fonseca, E.; Plakal, M.; Ellis, D.; Font, F.; Favory, X.; Serra, X. Learning sound event classifiers from web audio with noisy labels. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, 12–17 May 2019. [Google Scholar]
  22. Beigman, E.; Klebanov, B. Learning with annotation noise. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, 2–7 August 2009; pp. 280–287. [Google Scholar]
  23. Miyazaki, K.; Komatsu, T.; Hayashi, T.; Watanabe, S.; Toda, T.; Takeda, K. Weakly-supervised sound event detection with self-attention. In Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 66–70. [Google Scholar]
  24. Ruiz-Muñoz, J.; Orozco-Alzate, M.; Castellanos-Dominguez, G. Multiple instance learning-based birdsong classification using unsupervised recording segmentation. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015. [Google Scholar]
  25. Tang, H.; Lu, L.; Kong, L.; Gimpel, K. End-to-end neural segmental models for speech recognition. IEEE J. Sel. Top. Signal Process. 2017, 11, 1254–1264. [Google Scholar] [CrossRef]
  26. Shi, B.; Settle, S.; Livescu, K. Whole-word segmental speech recognition with acoustic word embeddings. In Proceedings of the IEEE Spoken Language Technology Workshop, Shenzhen, China, 19–22 January 2021. [Google Scholar]
  27. Lu, L.; Kong, L.; Dyer, C.; Smith, N.; Renals, S. Segmental Recurrent Neural Networks for End-to-end Speech Recognition. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016. [Google Scholar]
  28. Wang, C.; Wang, Y.; Huang, P.; Mohamed, A.; Zhou, D.; Deng, L. Sequence Modeling via Segmentations. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017. [Google Scholar]
  29. Tang, H.; Wang, W.; Gimpel, K.; Livescu, K. End-to-end training approaches for discriminative segmental models. In Proceedings of the IEEE Spoken Language Technology Workshop, San Diego, CA, USA, 13–16 December 2016. [Google Scholar]
  30. Kong, L.; Dyer, C.; Smith, N. Segmental Recurrent Neural Networks. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  31. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  32. Gemmeke, J.; Ellis, D.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.; Plakal, M.; Ritter, M. Audio set: An ontology and human-labeled dataset for audio events. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar]
  33. Dekkers, G.; Lauwereins, S.; Thoen, B.; Adhana, M.; Brouckxon, H.; Bergh, B.; Waterschoot, T.; Vanrumste, B.; Verhelst, M.; Karsmakers, P. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany, 16 November 2017; pp. 32–36. [Google Scholar]
  34. Mesaros, A.; Heittola, T.; Virtanen, T. Metrics for polyphonic sound event detection. Appl. Sci. 2016, 6, 162. [Google Scholar] [CrossRef]
  35. Bilen, C.; Ferroni, G.; Tuveri, F.; Azcarreta, J.; Krstulovic, S. A Framework for the Robust Evaluation of Sound Event Detection. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020. [Google Scholar]
  36. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  37. Serizel, R.; Turpault, N.; Eghbal-Zadeh, H.; Shah, A.P. Large-scale weakly labeled semi-supervised sound event detection in domestic environments. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), Surrey, UK, 19–20 November 2018. [Google Scholar]
Figure 1. The architecture of the proposed segmental model used for SED.
Figure 2. Feature extraction process of the log-mel filterbank (LMFB).
Figure 3. The architecture of the neural network for feature encoding.
Figure 4. The neural network architecture of the segment embedding and segment scoring modules of the segmental model.
Figure 5. Block diagram of the overall architecture of the segmental model.
Figure 6. Typical examples of audio clips used in the experiment.
Table 1. DCASE 2019 Task 4 dataset used in the experiment. All clips are 10 s long; the 10 classes are Speech, Dog, Cat, Alarm bell ring, Dishes, Frying, Blender, Running water, Vacuum cleaner, and Electric shaver toothbrush.

| Subset | Label type | No. of clips | Properties |
|---|---|---|---|
| Training set | Weak | 1578 | Real recording |
| Training set | Strong | 2045 | Synthetic |
| Training set | Unlabeled | 14,412 | Real recording |
| Validation set | Strong | 1168 | Real recording |
| Evaluation set | Strong | 692 | Real recording |
Table 2. F-scores (%) corresponding to different numbers of LSTM layers (stride at Avg_Pooling1d = 4, maximum segment size = 62).

| # of LSTM layers | Evaluation set | Validation set | Training set |
|---|---|---|---|
| 1 | 14.87 | 9.23 | 57.07 |
| 2 | 10.49 | 9.26 | 42.44 |
| 3 | 13.91 | 8.83 | 45.55 |
| 4 | 11.43 | 7.92 | 33.65 |
Table 3. PSDSs corresponding to different numbers of LSTM layers (stride at Avg_Pooling1d = 4, maximum segment size = 62).

| # of LSTM layers | Evaluation set | Validation set | Training set |
|---|---|---|---|
| 1 | 0.29 | 0.22 | 0.77 |
| 2 | 0.23 | 0.20 | 0.79 |
| 3 | 0.27 | 0.21 | 0.75 |
| 4 | 0.28 | 0.20 | 0.73 |
Table 4. F-scores and PSDSs depending on the stride at Avg_Pooling1d in Figure 3 (maximum segment size = 62, # of LSTM layers = 1). Values are F-score (PSDS).

| Stride (encoded sequence length in frames) | Evaluation set | Validation set | Training set |
|---|---|---|---|
| 1 (250) | 4.26% (0.14) | 2.15% (0.12) | 19.37% (0.45) |
| 2 (125) | 12.11% (0.23) | 5.66% (0.19) | 47.21% (0.74) |
| 4 (62) | 14.87% (0.29) | 9.23% (0.22) | 57.07% (0.77) |
Table 5. Performance comparison between the proposed segmental model (# of LSTM layers = 1, maximum segment size = 62, stride at Avg_Pooling1d = 4) and the CRNN-based baseline model of the DCASE 2018 Challenge. Both models are trained using strongly labeled data without segmentation information.

| Model | F-score (%) on the evaluation set |
|---|---|
| DCASE 2018 CRNN-based baseline | 2.74 |
| Proposed segmental model | 14.87 |
Table 6. Performance comparison between segmental models with and without the "SIL" label (maximum segment size = 62, stride at Avg_Pooling1d = 4). Values are F-score (PSDS).

| Label setup | # of LSTM layers | Evaluation set | Validation set | Training set |
|---|---|---|---|---|
| With "SIL" | 1 | 14.87% (0.29) | 9.23% (0.22) | 57.07% (0.77) |
| With "SIL" | 2 | 10.49% (0.24) | 9.26% (0.20) | 42.44% (0.79) |
| Without "SIL" | 1 | 6.41% (0.23) | 4.28% (0.17) | 7.48% (0.32) |
| Without "SIL" | 2 | 8.03% (0.25) | 5.56% (0.18) | 9.53% (0.38) |
