Abstract
Segmental models compute likelihood scores over segment units instead of frame units when recognizing sequence data. Motivated by promising results in speech recognition and natural language processing, we apply segmental models to sound event detection for the first time and verify their effectiveness against conventional frame-based approaches. The proposed model processes variable-length segments of sound signals by encoding feature vectors with deep learning techniques. The encoded vectors are then embedded to derive a representative value for each segment, and these representations are scored to identify the best match for each input sound signal. Because input sound signals vary in length and type, segmental models incur high computational and memory costs. To address this issue, our end-to-end model employs a simple segment-scoring function with efficient computation and memory usage. We train the segmental model with a marginal log loss as the cost function, which eliminates the reliance on strong labels for sound events. Experiments on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge dataset show that the proposed method achieves a higher F-score in sound event detection than conventional convolutional recurrent neural network-based models.