Article

Sound Event Detection with Perturbed Residual Recurrent Neural Network

School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(18), 3836; https://doi.org/10.3390/electronics12183836
Submission received: 10 August 2023 / Revised: 2 September 2023 / Accepted: 6 September 2023 / Published: 11 September 2023
(This article belongs to the Special Issue Machine Learning in Music/Audio Signal Processing)

Abstract

Sound event detection (SED) is of great practical and research significance owing to its wide range of applications. However, because detection performance relies heavily on the amount of labeled data, real-world scenarios often suffer from a severe shortage of annotations. In this study, an improved mean teacher model is used for semi-supervised SED, and a perturbed residual recurrent neural network (P-RRNN) is proposed as the detection network. The residual structure alleviates network degradation, and pre-training the improved network on the ImageNet dataset allows it to learn representations that benefit event detection, thereby improving SED performance. In the post-processing stage, a customized median filter group with class-specific window lengths is designed to smooth each type of event and reduce the impact of background noise on detection accuracy. Experimental results on the publicly available Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Task 4 dataset demonstrate that the proposed P-RRNN effectively enhances the detection capability of the model. The detection system achieves a Macro Event-based F1 score of 38.8% on the validation set and 40.5% on the evaluation set, indicating that the proposed method can adapt to complex and dynamic SED scenarios.

1. Introduction

In activities involving human interactions, various valuable sounds exist that serve as the primary source for perceiving changes in the surrounding environment and as important channels for facilitating communication. The accurate and effective utilization of sound information can generate significant value for human society. Sound event detection (SED) is an essential form of sound information utilization that aims to detect specific event categories from complex sound signals and locate the temporal boundaries of each event [1]. It is a crucial component in the field of auditory perception. SED detects ongoing events in the surrounding environment through sound, thereby enabling timely actions to improve people’s quality of life and safeguard their safety and property. It has been widely applied in various aspects of human life and production, such as healthcare monitoring [2], smart homes [3,4], and audio surveillance [5,6]. These applications contribute to improving quality of life and enhancing security.
However, SED is susceptible to background noise interference, and there is often a possibility of multiple sound events occurring simultaneously, making it challenging to manually annotate the time boundaries for such tasks. As a result, the creation of suitable SED datasets and the design of high-performance detection systems have been receiving increasing attention from researchers.
To overcome the challenges faced by SED systems and improve the current state-of-the-art techniques, Queen Mary University of London initiated the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge in 2013 [7]. The DCASE challenge provides researchers with a platform for sharing datasets and evaluating their detection systems. It aims to facilitate the exchange of ideas, foster innovation, and address problems in the field, thereby greatly advancing the progress of SED research.
In this article, the DCASE 2019 Challenge Task 4 dataset is used as the development set [8], and a perturbed residual neural network (P-ResNet) is designed based on ResNet34 [9]; it is then combined with a recurrent neural network (RNN) to form the backbone network of the SED system. The parameters of ResNet34 pre-trained on the ImageNet image classification database are used as the starting point for training P-ResNet. Within P-ResNet, multiple frequency-domain sub-sampling, Dropout, and Drop Block [10] structures are introduced. The frequency-domain sub-sampling allows the RNN to receive temporal sequences of audio features, while the Dropout and Drop Block structures add perturbation factors to semi-supervised training, thus improving the ability of the mean teacher model to handle a large amount of unbalanced and unlabeled data. To further improve performance, the window length of the median filter bank is calculated from the strong annotations of the synthetic audio in the development set, and the network output is smoothed with the median filter to make the results more accurate.
The rest of this paper is organized as follows: Section 2 reviews the related literature in the SED domain. Section 3 presents the architecture of the proposed SED system. Section 4 explains the detection network training process. Section 5 describes the dataset and presents the experimental results. Finally, Section 6 concludes the paper.

2. Literature Review

Since 2010, deep learning has developed rapidly. In fields such as speech recognition and text processing, deep learning methods have proven more effective than probabilistic statistical models at handling data with complex structures and high nonlinearity [11,12]. This has demonstrated the ability of deep learning to address many complex problems and has sparked widespread interest among SED researchers [13]. Convolutional neural networks (CNNs) have shown excellent performance in SED tasks [14]. They possess good translational invariance and can effectively learn local features from feature maps. CNNs are robust to the frequency shifts caused by sound source movement, enabling them to detect both single and overlapping events. RNNs can capture the temporal relationships within audio signals, compensating for the lack of temporal modeling in CNNs [15,16]. Convolutional recurrent neural networks (CRNNs) combine the strengths of CNNs and RNNs and achieve impressive results in SED tasks [17].
Currently, with the tremendous improvement in computing power and data resources, deep learning-based methods have become the mainstream approach to SED tasks. For instance, the multi-scale feature fusing networks (MFFNs) [18] method replaces point sampling in dilated convolutions with region sampling; this mixed dilated convolution can better capture the neighboring information of audio and, combined with feature fusion, achieves the SED task. Zhao et al. [19] utilize a CRNN as the detection network for SED systems and employ a differentiable soft median filter. They find that the systems automatically learn weights with varying smoothness through a linear selection layer to achieve adaptive smoothing. Kiyokawa et al. [20] introduce a self-mask module based on residual networks to generate sound event candidate regions that constrain the temporal boundaries, thereby providing a new perspective for the SED task. In [21], an asymmetric focal loss is used to control the focal factors of active and non-active frames separately, offering an effective method with which to address the issue of data imbalance in the SED task.
The Transformer-based architecture provides parallel computing capabilities and is more efficient in temporal modeling compared to traditional sequential models such as RNNs [22]. In [23], log-mel spectrogram information is extracted using a CNN and is stacked into feature vectors as inputs to the Transformer; the token vectors of the Transformer are used to predict weak labels, while the feature vectors are used to predict strong labels. Kim et al. [24] adopt an architecture with Transformer encoders to combine the outputs of different-level encoders and obtain multi-scale features. Their experimental results demonstrate significant improvements in the recognition performance of the SED system.
Utilizing data beyond the labeled dataset can also be beneficial for the SED task. For example, pre-training the model with a large-scale image database can provide better initial parameters. Unlabeled audio data can also be used as training data for semi-supervised learning of the SED detection network, which improves the performance of the SED system. Currently, common semi-supervised learning methods for SED include mean teacher [25], interpolation consistency training [26], and co-training [27].
In summary, in the field of SED, due to the complexity of polyphonic sound structures and varying event durations, designing specific loss functions or applying post-processing techniques to predicted probabilities is a way to improve system recognition rates. Although researchers have successfully applied CRNN models in SED with significant results, they have not ceased exploring other models. For example, techniques such as residual connections and layer normalization have shown greater efficiency in handling long-term dependencies and memory issues. In the future, we can expect to see more applications and improvements of models like Transformers and RNNs in the SED domain.

3. Sound Event Detection System

3.1. Pre-Processing

In the SED in domestic environments task, the training set is divided into three parts. The strong label data contain both event category and occurrence time information, providing rich annotations. The weak label data only include event category information, providing coarse-grained category labels. The unlabeled data have neither category labels nor timestamp information, thus requiring semi-supervised learning or other techniques to make use of them.
During the training and testing phases, the information from strong and weak labels is encoded to represent the presence or absence of events (1 indicating existence, 0 indicating non-existence). The encoding principles for both types are illustrated in Figure 1.
For strong labels, encoding is based on the time boundaries of each event and the correspondence between event categories and positions in a 128 × 10 matrix, with the vertical axis representing time and the horizontal axis representing categories; the time encoding length is rounded up. For weak labels, one-hot encoding is used, resulting in a vector representation.
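As a concrete illustration, the following NumPy sketch shows one way to implement the encoding described above, assuming 128 time frames per 10 s clip and 10 event classes; the helper names, the class ordering, and the exact rounding of onsets and offsets are illustrative rather than taken from the authors' code.

```python
import math
import numpy as np

N_FRAMES, N_CLASSES, CLIP_LEN = 128, 10, 10.0  # time-encoding length, classes, clip length (s)

def encode_strong(events, class_index):
    """events: list of (onset_s, offset_s, class_name) -> 128 x 10 binary matrix."""
    y = np.zeros((N_FRAMES, N_CLASSES), dtype=np.float32)
    for onset, offset, name in events:
        start = int(onset / CLIP_LEN * N_FRAMES)
        stop = math.ceil(offset / CLIP_LEN * N_FRAMES)  # time-encoding length rounded up
        y[start:stop, class_index[name]] = 1.0
    return y

def encode_weak(tags, class_index):
    """tags: list of class names present in the clip -> 10-dim one-hot/multi-hot vector."""
    y = np.zeros(N_CLASSES, dtype=np.float32)
    for name in tags:
        y[class_index[name]] = 1.0
    return y

# usage
classes = ["Alarm", "Dog", "Dishes", "Speech", "Cat",
           "Blender", "Running water", "Frying", "Vacuum cleaner", "Electric"]
idx = {c: i for i, c in enumerate(classes)}
strong = encode_strong([(1.2, 3.7, "Speech"), (0.0, 0.4, "Dog")], idx)
weak = encode_weak(["Speech", "Dog"], idx)
```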

3.2. Perturbed Residual Recurrent Neural Network

The detection network established in this paper, namely the P-RRNN, is depicted in Figure 2. To leverage the ResNet34 model, which was originally trained for image classification, in SED tasks, we propose an optimized structure named P-ResNet. This structure retains the overall framework of the original model while modifying specific layers for better performance. P-ResNet makes the following two improvements over ResNet34:
  • The introduction of MaxPool layer structures into the network architecture facilitates frequency-domain downsampling of sound features. This adjustment not only curtails the complexity of the detection network, but also satisfies the input requirements of RNN for time series processing.
  • The addition of multiple Dropout and Drop Block layers ensures that, even under identical conditions, the outputs of two similar detection networks differ during training. This variation helps enhance the performance of the mean teacher model.
The input of ResNet34 is a 3-channel image; therefore, P-ResNet takes the log-mel spectrogram of the audio as input and replicates it along the channel dimension, increasing the number of channels from 1 to 3. The stacked spectrogram is then processed by a convolution layer (Conv) for initial feature extraction, where the Conv parameters denote the kernel size, stride, and output channels. Afterward, batch normalization (BN) and the ReLU activation function are applied, and the output is fed into the ResNet.
The residual block module (RBM) and downsample block (DSB) are the core components of the ResNet, and both employ residual connections. The MaxPool layer is used for downsampling in the frequency domain, with parameters indicating the kernel size and stride. The Dropout layer randomly drops values at some points in the feature map, with the dropout probability as a parameter. The RBM structure is used to increase the depth of the model. In the diagram, “C” represents the output channels of the identical convolution (IConv) [28], the most frequently used convolutional layer in the P-RRNN model; the number of IConv channels can be chosen, and its kernel size is 3 × 3 with a stride of 1 × 1 and padding of 1 × 1.
The basic block (BB) is a residual connection structure composed of two sets of IConv, BN, and ReLU activation functions. It serves as the fundamental building block of the RBMs; RBM1, RBM2, and RBM4 are composed of 3, 3, and 2 cascaded BBs, respectively. RBM3 differs from the other RBMs in that its basic unit is the drop basic block (DBB), which includes an additional Drop Block layer. In RBM3, the Drop Block layer randomly selects 5 × 5 blocks and sets them to 0 with a probability of 0.2.
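The PyTorch sketch below illustrates the BB and DBB units described above, using the stated 3 × 3 IConv (stride 1, padding 1) and a 5 × 5 Drop Block with probability 0.2 via torchvision's DropBlock2d; the exact placement of the Drop Block inside the block and the number of DBBs in RBM3 are not specified in the text, so these are left as assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DropBlock2d  # available in recent torchvision releases

def iconv(channels):
    """Identical convolution (IConv): 3x3 kernel, stride 1x1, padding 1x1, channels preserved."""
    return nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1, bias=False)

class BasicBlock(nn.Module):
    """BB: two sets of IConv + BN (+ ReLU) with a residual connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1, self.bn1 = iconv(channels), nn.BatchNorm2d(channels)
        self.conv2, self.bn2 = iconv(channels), nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # residual connection

class DropBasicBlock(BasicBlock):
    """DBB: a BB with an additional Drop Block layer (5x5 blocks zeroed with p = 0.2)."""
    def __init__(self, channels):
        super().__init__(channels)
        self.drop_block = DropBlock2d(p=0.2, block_size=5)

    def forward(self, x):
        # applying Drop Block after the residual sum is our assumption
        return self.drop_block(super().forward(x))

def make_rbm(block_cls, channels, n_blocks):
    """Cascade n_blocks residual units, e.g. RBM1 = make_rbm(BasicBlock, 64, 3)."""
    return nn.Sequential(*[block_cls(channels) for _ in range(n_blocks)])
```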
Since the RBM structure uses the same IConv for its residual connections, it cannot increase the number of channels or downsample the feature maps; therefore, a structure similar to the BB, called the DSB, is used (Figure 3). The downsample (DS) structure consists of a convolutional layer and a BN layer; it adjusts the shape of the input to match the output dimensions through the convolutional layer, enabling the residual connection.
In the original ResNet34 model, the DSB structures downsample the time and frequency domains equally, as in DSB2. However, given the characteristics of audio signals, we enhance DSB3 and DSB4 by incorporating MaxPool layers that downsample only the frequency domain, thereby preserving the time resolution. This modification retains more temporal information from the original audio, making the model more suitable for subsequent RNN processing. The combination of global downsampling followed by frequency-only downsampling in these DSB structures balances computational efficiency and representation capability. DSB2 increases the 64 channels of the residual to 128 channels through its first convolutional layer and performs downsampling with a convolutional kernel stride of 2. DSB3 and DSB4 share the same structure; DSB3 increases the output channels to 256, while DSB4 increases them to 512.
The Bidirectional Gated Recurrent Unit (BI-GRU) extracts temporal correlations from the output of P-ResNet and serves as the input to the parallel linear layers. The strong labels and weak labels of the same audio are correlated, and we capture this correlation through the parallel linear layers. Specifically, within the parallel linear layers, FC1 and FC2 are linear layers with the same dimensions, and FC1 and the Sigmoid function are cascaded to produce a 2D matrix with temporal boundaries. The weak predictions are computed using an independent linear layer structure and are multiplied element-wise with the strong predictions, ensuring independence and consistency between the weak and strong predictions. Finally, the weak predictions are summed along the time axis to obtain the weak prediction vector; this parallel linear layer structure considers both strong and weak labels simultaneously and effectively learns their relationship, thereby improving prediction accuracy.
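One possible reading of this parallel head is sketched below in PyTorch; the hidden dimension and the clamping of the time-pooled weak output are our assumptions, since the text does not fix these details.

```python
import torch
import torch.nn as nn

class ParallelHead(nn.Module):
    """Parallel linear layers: frame-level (strong) and clip-level (weak) predictions."""
    def __init__(self, hidden_dim=512, n_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(hidden_dim, n_classes)  # strong branch (FC1)
        self.fc2 = nn.Linear(hidden_dim, n_classes)  # weak branch (FC2), same dimensions as FC1

    def forward(self, h):
        # h: (batch, time, hidden_dim) output of the BI-GRU
        strong = torch.sigmoid(self.fc1(h))               # (batch, time, classes)
        weak_frame = torch.sigmoid(self.fc2(h)) * strong  # element-wise coupling with strong output
        weak = weak_frame.sum(dim=1).clamp(max=1.0)       # sum over time; clamp is our assumption
        return strong, weak

# usage: BI-GRU output of shape (4, 128, 512) -> strong (4, 128, 10), weak (4, 10)
strong, weak = ParallelHead()(torch.randn(4, 128, 512))
```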

3.3. Post-Processing

The post-processing of the weak prediction vector and strong prediction matrix from the detection network requires additional steps. First, a threshold of 0.5 is applied to both the weak and strong prediction probabilities to binarize them. The binarized weak prediction vector is then decoded using one-hot encoding to obtain the weak labels.
To improve the accuracy and stability of the strong prediction results, this paper introduces a median filter with prior information to smooth the strong prediction matrix. The window length of the median filter is determined based on the statistics provided in Table 1, which presents the information about each class of events in the synthetic audio of the DCASE 2019 Task 4 development set with strong annotations.
The “Total number” column gives the total number of occurrences of each sound class, and the “Thres” column gives the duration threshold (in seconds) used to select the window length. The threshold is determined from the distribution of event durations: if the number of events whose duration exceeds a given value accounts for more than 85% of the total events, that value is selected as the threshold. The choice of the 85% threshold was derived from empirical observations and iterative testing. We aimed to identify a threshold that would allow us to detect the majority of sound events without being overly sensitive to brief, incidental sounds. Preliminary tests indicated that lower thresholds tended to incorporate too many short, incidental sounds, whereas higher thresholds neglected some of the longer, more significant sounds. We tested a variety of threshold values and determined that 85% provided an optimal balance between sensitivity and specificity, capturing most of the meaningful sound events while disregarding most of the incidental ones. The “NGT” column gives the number of events exceeding the threshold.
The window length calculation also depends on the length of the label encoding, as shown in Equation (1):
$$window\_n = \begin{cases} \lceil Thres \times TE/AD \rceil + 1, & \text{even number} \\ \lceil Thres \times TE/AD \rceil, & \text{odd number,} \end{cases} \quad (1)$$
where $\lceil \cdot \rceil$ denotes rounding up, $TE$ represents the length of the time encoding, and $AD$ represents the audio duration.
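A minimal NumPy/SciPy sketch of this post-processing step is given below, assuming 128 time frames per clip and the per-class window lengths listed in Table 1; scipy.ndimage.median_filter stands in for whatever median-filter implementation the authors used.

```python
import numpy as np
from scipy.ndimage import median_filter

# Per-class median-filter window lengths (in frames), taken from Table 1.
WINDOWS = {"Alarm": 3, "Dog": 3, "Dishes": 3, "Speech": 5, "Cat": 7,
           "Blender": 9, "Running water": 15, "Frying": 19,
           "Vacuum cleaner": 19, "Electric": 19}

def post_process(strong_prob, class_names):
    """strong_prob: (time, classes) predicted probabilities -> smoothed binary matrix."""
    binarized = (strong_prob > 0.5).astype(np.float32)  # threshold at 0.5
    smoothed = np.empty_like(binarized)
    for c, name in enumerate(class_names):
        # 1-D median filter along the time axis with a class-specific window
        smoothed[:, c] = median_filter(binarized[:, c], size=WINDOWS[name])
    return smoothed

# usage: 128 time frames, 10 classes
classes = list(WINDOWS.keys())
smoothed = post_process(np.random.rand(128, len(classes)), classes)
```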

4. Sound Event Detection System Training Process

In the SED task, it is challenging to obtain large-scale frame-level labels, and the performance of SED systems is often subpar in real-world scenarios. The mean teacher model is an efficient method for utilizing unlabeled data. However, in comparison to individual data augmentation, introducing perturbations at the model level can elevate the complexity and diversity of the detection network, which promotes consistency regularization [29]. Therefore, to make the most of unlabeled data and enhance model performance, we propose a P-RRNN with model perturbation to augment the mean teacher model for the SED task in this study.
Specifically, the improved mean teacher model consists of a student model and a teacher model, forming a semi-supervised learning approach (Figure 4). The training data consist of three parts: unlabeled data, weak labels, and strong labels. The teacher and student models both have the same P-RRNN structure but different model parameters, denoted as θ T and θ S , respectively.
The teacher model does not participate in backpropagation, and its parameters are updated as the exponential moving average (EMA) of the student model parameters, as shown in Equation (2):
$$\theta_{epo}^{T} = \beta\,\theta_{epo}^{T} + (1-\beta)\,\theta_{epo}^{S}. \quad (2)$$
Here, $\beta$ is set to 0.999, and $epo$ represents the training batch.
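A minimal PyTorch sketch of this EMA update is shown below; whether batch-normalization buffers are also averaged is not stated in the paper, so the sketch updates only the parameters.

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, beta=0.999):
    """Equation (2): teacher parameters as the EMA of student parameters.
    Assumes teacher and student share the same P-RRNN architecture."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(beta).add_(p_s, alpha=1.0 - beta)
```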
The proposed improvement in this study builds upon the mean teacher framework by introducing both data perturbation and model perturbation. A consistency loss function is employed to constrain the student and teacher models, encouraging them to produce consistent predictions. Gaussian white noise, denoted as $\eta$, is added to the log-mel spectrogram as data perturbation for the teacher model. Additionally, model perturbation is introduced through the Dropout, Drop Block, and MaxPool operations designed in the P-RRNN. The student model perturbation is denoted as $\eta_{m}^{S}$, and the teacher model perturbation is denoted as $\eta_{m}^{T}$.
The student model participates in backpropagation, and its loss is computed from the mean teacher consistency loss ($L_{C}$) and the error ($L_{CE}$) between the detection results of the student model and the true labels.
Under data perturbation and model perturbation, the mean teacher framework constrains the outputs of the student and teacher models through the consistency loss. Let $x_{i}$ denote the i-th audio sample, and let $\mathrm{MSE}(\cdot)$ represent the mean squared error function.
The strong consistency mean squared error loss ($L_{cs}$) for $N$ audio samples is given by
$$L_{cs} = \mathrm{Sum}\left\{\sum_{i=1}^{N}\left[\mathrm{MSE}\left(Stu_{s}(x_{i} \mid \eta_{m}^{S}),\ T_{s}(x_{i}+\eta \mid \eta_{m}^{T})\right)\right]\right\}, \quad (3)$$
where $\mathrm{Sum}(\cdot)$ is the sum of the matrix elements, $Stu_{s}(\cdot)$ is the strong label prediction of the student model, and $T_{s}(\cdot)$ is the strong label prediction of the teacher model.
The weak consistency mean squared error loss for $N$ audio samples is
$$L_{cw} = \sum_{i=1}^{N}\left\{\mathrm{MSE}\left[Stu_{w}(x_{i} \mid \eta_{m}^{S}),\ T_{w}(x_{i}+\eta \mid \eta_{m}^{T})\right]\right\}, \quad (4)$$
where $Stu_{w}(\cdot)$ is the weak label prediction of the student model and $T_{w}(\cdot)$ is the weak label prediction of the teacher model.
Then, $L_{C}$ can be expressed as
$$L_{C} = L_{cs} + L_{cw}. \quad (5)$$
Any strong label in the dataset includes the event category and its time boundaries. Let there be $N_{s}$ strongly labeled audio clips and $N_{t}$ frames in the time-domain encoding. The vector $y_{ij}$ represents the labeled event categories of frame $j$ of the $i$-th audio clip, and the vector $p_{ij}$ represents the event categories of frame $j$ of the $i$-th audio clip, as predicted by the student model.
Then, the strong label logarithmic cross-entropy loss function is
$$L_{ces} = -\sum_{i=1}^{N_{s}}\sum_{j=1}^{N_{t}}\left[y_{ij}\ln p_{ij} + (1-y_{ij})\ln(1-p_{ij})\right], \quad (6)$$
where $y_{ij}, p_{ij} \in \mathbb{R}^{C \times 1}$ and $C$ represents the total number of event categories.
Given $N_{w}$ weakly labeled audio samples, where the vector $y_{i}$ represents the labeled event categories of the $i$-th audio sample and $p_{i}$ represents the event categories predicted by the student model, the weak label logarithmic cross-entropy loss function is
$$L_{cew} = -\sum_{i=1}^{N_{w}}\left[y_{i}\ln p_{i} + (1-y_{i})\ln(1-p_{i})\right]. \quad (7)$$
Then, $L_{CE}$ can be expressed as
$$L_{CE} = L_{ces} + L_{cew}. \quad (8)$$
All loss values are used for student model backpropagation, where the total loss is
$$L_{total} = L_{CE} + \lambda L_{C}, \quad (9)$$
where $\lambda$ is the hyperparameter used to balance the two loss terms.
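The following sketch assembles the losses of Equations (3)–(9) with standard PyTorch functionals; the reduction mode and the masking of the strongly/weakly labeled subsets within a mixed batch are our assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(stu_strong, stu_weak, tea_strong, tea_weak,
               strong_labels, weak_labels, strong_mask, weak_mask, lam):
    """L_total = L_CE + lambda * L_C.
    stu_* / tea_*: student and teacher predictions of shape (batch, time, classes)
    for the strong outputs and (batch, classes) for the weak outputs.
    *_mask: boolean masks selecting the labeled samples within a mixed batch."""
    # consistency losses, computed on all samples with the teacher detached
    l_cs = F.mse_loss(stu_strong, tea_strong.detach(), reduction="sum")
    l_cw = F.mse_loss(stu_weak, tea_weak.detach(), reduction="sum")
    # supervised binary cross-entropy losses on the labeled samples only
    l_ces = F.binary_cross_entropy(stu_strong[strong_mask], strong_labels, reduction="sum")
    l_cew = F.binary_cross_entropy(stu_weak[weak_mask], weak_labels, reduction="sum")
    return (l_ces + l_cew) + lam * (l_cs + l_cw)
```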

5. Experiment and Analysis

5.1. Dataset and Preprocessing

This study conducted experiments using the DCASE 2019 Challenge Task 4 development set. The dataset consists of sound events from 10 home environments, namely, speech, dog, cat, alarm/bell ringing, dishes, frying, blender, running water, vacuum cleaner, and electric shaver/toothbrush. The development set includes a training set, validation set, and evaluation set. The length of the clips is 10 s, and clips slightly below 10 s are padded with zeros. The collection of the development set consists of two parts.
The first part consists of synthetic audio clips, which are created by synthesizing foreground sounds from the FSD dataset [30] with background sounds from the SINS dataset [31]. A total of 2045 clips with strong labels are used for the training set. The second part consists of real recorded audio clips from the AudioSet [32], including 14,412 unlabeled clips and 1578 weakly labeled clips used for training. Strongly labeled validation and evaluation sets, containing 1168 and 692 clips, respectively, are used to evaluate the detection systems.
The experiments were conducted in Jupyter notebooks using PyTorch (version 1.12) to build the neural network, with Python 3.9. Training was performed on an NVIDIA RTX 3090 Ti GPU with driver version 515.57 and CUDA version 11.7. The Adam optimizer [33] was employed to update the model parameters, with an initial learning rate of 0.01. A cosine annealing schedule dynamically adjusted the learning rate down to a minimum of 0.001, the learning rate was annealed every 4 cycles, and a total of 80 training iterations were performed. Training takes about 160 s per epoch, and the total number of parameters is about twenty-one million. Each batch consisted of 4 strongly labeled samples, 4 weakly labeled samples, and 16 unlabeled samples.
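A sketch of this optimizer and learning-rate schedule is shown below; interpreting "annealed every 4 cycles" as CosineAnnealingLR with T_max = 4 is our assumption, and the model here is a placeholder for the student P-RRNN.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 10)  # placeholder for the student P-RRNN
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=4, eta_min=0.001)

for epoch in range(80):  # 80 training iterations
    # ... forward/backward passes over batches of 4 strong, 4 weak, 16 unlabeled clips ...
    optimizer.step()
    scheduler.step()
```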

5.2. Model Pre-Training

Owing to the limited number of annotated samples in SED tasks, the performance of deep learning models in SED is often restricted. To address this issue, P-ResNet was pre-trained on the large-scale ImageNet-1K dataset [34]. This dataset contains at least 1000 images per category and has been annotated through multiple rounds of verification to reduce the subjective bias of annotators, thereby improving its accuracy.
In this study, to validate the impact of P-ResNet pre-training on the detection system, the training curves of the student model under the influence of pre-training on the DCASE 2019 Task 4 development set are shown in Figure 5.
The upper plot shows the sum of the strong and weak consistency losses between the student and teacher models and the logarithmic cross-entropy loss of the student model’s predicted probabilities on the strongly labeled (synthetic) and weakly labeled training sets, as given in Equation (9). The lower plot shows the evaluation of the detection system using the student model on the test set, which consists of a subset of synthetic audio (79 samples) and weakly labeled audio (103 samples) from the development set. The vertical axis represents the sum of the test set Macro Event-based F1 (MEBF) and the weak label F1 score. The training curves demonstrate that pre-training P-ResNet improves the convergence speed of the network, increases the F1 score on the test set, and reduces the training loss. This indicates that initializing the network with a pre-trained model allows P-ResNet to address issues such as the difficulty of learning shallow parameters and the insufficient utilization of weight parameters.

5.3. Analysis of Results

The experiments compared the detection performance of the pre-trained P-RRNN with that of the baseline system on the validation and evaluation sets of DCASE 2019 Task 4. The baseline system [8] employs a mean teacher model for semi-supervised learning; its detection network is a CRNN composed of a 3-layer convolutional neural network (for feature extraction) and a BI-GRU (for sequence modeling to classify and recognize sound signals). The evaluation metric is MEBF, as specified in the official DCASE guidelines, and the results are shown in Table 2.
From the MEBF scores for the various sound events, it can be observed that, compared to the baseline model, the P-RRNN model significantly improves detection performance for most sound events on both the validation and evaluation sets. In particular, there is a significant improvement for long-duration events such as frying, vacuum cleaner, speech, and electric, while the detection performance for short-duration events such as dog, dishes, and cat also improves noticeably. This result indicates that P-RRNN has stronger modeling capabilities for inter-frame relationships in the samples, and that the perturbation structure helps the model learn information from unlabeled samples. Among events of medium duration, blender and running water show significant performance improvements on the validation set. However, there is a slight decrease in detection performance for the alarm event on the evaluation set, while it improves slightly on the validation set. This may be because short-duration sound signals are easily masked by noise, leading to inaccurate detection.
In this study, the validation set of DCASE 2019 Task 4 was used as the test set, and MEBF scores were compared with those of the detection systems mentioned above under the same training set conditions, as shown in Table 3. Since the performance of a detection system is influenced by various factors, including the hardware environment, software versions, and other details, the baseline model was reproduced using the provided code to obtain the corresponding MEBF; the results of the other models were taken from the corresponding papers. The impact of pre-training on the performance of the detection system was also compared, showing an 8.8% improvement in MEBF when pre-training is used. Our results indicate that when only data perturbation is employed, excluding model perturbation (MP), the MEBF of P-RRNN declines by 5.4%. However, compared to the baseline detection network, P-RRNN exhibits superior detection performance when trained under the same conditions with the mean teacher model. These findings demonstrate that the proposed P-RRNN effectively enhances detection performance.
In general, system performance depends not only on the network architecture but also on factors such as dataset quality, data preprocessing, and model training strategies. Compared with plain CNNs, the P-ResNet based on ResNet34 in this paper improves the expressive capacity and learning depth of the model, extracts more essential features from the samples, and uses the pre-trained model to alleviate insufficient network fitting ability. The improved detection system combining the mean teacher framework and P-RRNN has practical significance for realizing the research goal of creating an intelligent and safe society.

6. Conclusions

In this study, we employed an improved ResNet34 network (P-ResNet) for the SED system; the improvements enhance the flow of sound information through the model, exploit ImageNet pre-training, and introduce perturbation factors into the improved mean teacher model. Experimental results on the DCASE 2019 Task 4 dataset demonstrated that, without pre-training, the detection system achieved a 5% improvement in the MEBF metric on the validation set compared to the baseline system; with pre-training, it achieved a 13.7% improvement.
Overall, the P-RRNN model exhibited higher accuracy and robustness in SED tasks than the baseline system, but it also has limitations and challenges; in particular, the deployment of the detection system in practical applications still needs to be considered. Future work aims to effectively fuse sound features, such as sound spectrogram centroids and scattering transforms, to reduce the complexity of the detection system, using smaller models for multi-event SED while achieving better detection performance.
In addition, the strongly labeled data for model training are generated by combining foreground and background sounds using Scaper. These foreground and background sounds have clear boundaries, so it is worth further investigating domain adaptation techniques in SED to reduce the differences between synthesized audio and natural audio.

Author Contributions

Conceptualization, L.Y.; Methodology, S.Y.; Software, S.Y.; Validation, L.Y. and Y.G.; Formal analysis, S.Y. and Y.G.; Writing—original draft, S.Y.; Writing—review & editing, L.Y. and Y.G.; Supervision, Y.G.; Project administration, S.Y.; Funding acquisition, L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (62161040), the Science and Technology Project of Inner Mongolia Autonomous Region (2021GG0023), the Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region (NJYT22056), the Natural Science Foundation of Inner Mongolia Autonomous Region (2021MS06030), and the Science and Technology Project of Inner Mongolia Autonomous Region (2023YFSW0006).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
P-RRNN     Perturbed Residual Recurrent Neural Network
P-ResNet   Perturbed Residual Neural Network
RNN        Recurrent Neural Network
SED        Sound Event Detection
DCASE      Detection and Classification of Acoustic Scenes and Events
BI-GRU     Bidirectional Gated Recurrent Unit
Conv       Convolution Layer
IConv      Identical Convolution
BN         Batch Normalization
RBM        Residual Block Module
DSB        Downsample Block
BB         Basic Block
DS         Downsample

References

  1. Feroze, K.; Maud, A.R. Sound event detection in real life audio using perceptual linear predictive feature with neural network. In Proceedings of the 2018 15th International Bhurban Conference on Applied Sciences and Technology (IBCAST), Islamabad, Pakistan, 9–13 January 2018; pp. 377–382. [Google Scholar] [CrossRef]
  2. Roy, J.K.; Roy, T.S.; Mukhopadhyay, S.C. Heart Sound: Detection and Analytical Approach Towards Diseases. In Modern Sensing Technologies; Mukhopadhyay, S.C., Jayasundera, K.P., Postolache, O.A., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 103–145. [Google Scholar] [CrossRef]
  3. Pandya, S.; Ghayvat, H. Ambient acoustic event assistive framework for identification, detection, and recognition of unknown acoustic events of a residence. Adv. Eng. Inform. 2021, 47, 101238. [Google Scholar] [CrossRef]
  4. Krstulović, S. Audio Event Recognition in the Smart Home. In Computational Analysis of Sound Scenes and Events; Virtanen, T., Plumbley, M.D., Ellis, D., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 335–371. [Google Scholar] [CrossRef]
  5. Kiromitis, D.I.; Bellos, C.V.; Stefanou, K.A.; Stergios, G.S.; Katsantas, T.; Kontogiannis, S. Bee Sound Detector: An Easy-to-Install, Low-Power, Low-Cost Beehive Conditions Monitoring System. Electronics 2022, 11, 3152. [Google Scholar] [CrossRef]
  6. Gade, R.; Moeslund, T.B. Thermal cameras and applications: A survey. Mach. Vis. Appl. 2014, 25, 245–262. [Google Scholar] [CrossRef]
  7. Giannoulis, D.; Benetos, E.; Stowell, D.; Rossignol, M.; Lagrange, M.; Plumbley, M.D. Detection and classification of acoustic scenes and events: An IEEE AASP challenge. In Proceedings of the 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 20–23 October 2013; pp. 1–4. [Google Scholar] [CrossRef]
  8. Turpault, N.; Serizel, R.; Parag Shah, A.; Salamon, J. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events, New York, NY, USA, 25–26 October 2019. [Google Scholar]
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  10. Ghiasi, G.; Lin, T.Y.; Le, Q.V. DropBlock: A regularization method for convolutional networks. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
  11. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.r.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N.; et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Process. Mag. 2012, 29, 82–97. [Google Scholar] [CrossRef]
  12. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013; Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2013; Volume 26. [Google Scholar]
  13. McLoughlin, I.; Zhang, H.; Xie, Z.; Song, Y.; Xiao, W. Robust Sound Event Classification Using Deep Neural Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 540–552. [Google Scholar] [CrossRef]
  14. McFee, B.; Salamon, J.; Bello, J.P. Adaptive Pooling Operators for Weakly Labeled Sound Event Detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 2180–2193. [Google Scholar] [CrossRef]
  15. Parascandolo, G.; Huttunen, H.; Virtanen, T. Recurrent neural networks for polyphonic sound event detection in real life recordings. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 6440–6444. [Google Scholar] [CrossRef]
  16. Lu, R.; Duan, Z.; Zhang, C. Multi-Scale Recurrent Neural Network for Sound Event Detection. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 131–135. [Google Scholar] [CrossRef]
  17. Cakır, E.; Parascandolo, G.; Heittola, T.; Huttunen, H.; Virtanen, T. Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1291–1303. [Google Scholar] [CrossRef]
  18. Wang, Y.; Zhao, G.; Xiong, K.; Shi, G. MSFF-Net: Multi-scale feature fusing networks with dilated mixed convolution and cascaded parallel framework for sound event detection. Digit. Signal Process. 2022, 122, 103319. [Google Scholar] [CrossRef]
  19. Zhao, F.; Li, R.; Liu, X.; Xu, L. Soft-Median Selection: An adaptive feature smoothening method for sound event detection. Appl. Acoust. 2022, 192, 108715. [Google Scholar] [CrossRef]
  20. Kiyokawa, Y.; Mishima, S.; Toizumi, T.; Sagi, K.; Kondo, R.; Nomura, T. Sound Event Detection with Resnet and Self-Mask Module for Dcase 2019 Task 4; Technical Report; Data Science Research Laboratories, NEC Corporation: Tokyo, Japan, 2019. [Google Scholar]
  21. Imoto, K.; Mishima, S.; Arai, Y.; Kondo, R. Impact of data imbalance caused by inactive frames and difference in sound duration on sound event detection performance. Appl. Acoust. 2022, 196, 108882. [Google Scholar] [CrossRef]
  22. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  23. Miyazaki, K.; Komatsu, T.; Hayashi, T.; Watanabe, S.; Toda, T.; Takeda, K. Weakly-Supervised Sound Event Detection with Self-Attention. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 66–70. [Google Scholar] [CrossRef]
  24. Kim, S.J.; Chung, Y.J. Multi-Scale Features for Transformer Model to Improve the Performance of Sound Event Detection. Appl. Sci. 2022, 12, 2626. [Google Scholar] [CrossRef]
  25. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  26. Verma, V.; Kawaguchi, K.; Lamb, A.; Kannala, J.; Solin, A.; Bengio, Y.; Lopez-Paz, D. Interpolation consistency training for semi-supervised learning. Neural Netw. 2022, 145, 90–106. [Google Scholar] [CrossRef] [PubMed]
  27. Peng, J.; Estrada, G.; Pedersoli, M.; Desrosiers, C. Deep co-training for semi-supervised image segmentation. Pattern Recognit. 2020, 107, 107269. [Google Scholar] [CrossRef]
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 630–645. [Google Scholar]
  29. Zheng, X.; Song, Y.; Yan, J.; Dai, L.R.; McLoughlin, I.; Liu, L. An Effective Perturbation Based Semi-Supervised Learning Method for Sound Event Detection. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 841–845. [Google Scholar] [CrossRef]
  30. Fonseca, E.; Pons Puig, J.; Favory, X.; Font Corbera, F.; Bogdanov, D.; Ferraro, A.; Oramas, S.; Porter, A.; Serra, X. Freesound datasets: A platform for the creation of open audio datasets. In Proceedings of the 18th ISMIR Conference, Suzhou, China, 23–27 October 2017; Hu, X., Cunningham, S.J., Turnbull, D., Duan, Z., Eds.; International Society for Music Information Retrieval: Victoria, BC, Canada, 2017; pp. 486–493. [Google Scholar]
  31. Dekkers, G.; Lauwereins, S.; Thoen, B.; Adhana, M.W.; Brouckxon, H.; Van den Bergh, B.; van Waterschoot, T.; Vanrumste, B.; Verhelst, M.; Karsmakers, P. The SINS database for detection of daily activities in a home environment using an Acoustic Sensor Network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events, Munich, Germany, 16–17 November 2017. [Google Scholar]
  32. Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar] [CrossRef]
  33. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar]
  34. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
Figure 1. Encoding of audio tags.
Figure 2. Perturbed residual recurrent neural network.
Figure 3. DSB structure.
Figure 4. Improved mean teacher model.
Figure 5. Comparison of P-ResNet pre-training to model training curves.
Table 1. Window length of the median filter for each type of sound event.

Target            Total Number    Thres (s)    NGT      window_n
Alarm             755             0.234        755      3
Dog               516             0.234        516      3
Dishes            814             0.234        814      3
Speech            2132            0.391        1868     5
Cat               547             0.547        474      7
Blender           540             0.703        482      9
Running water     157             1.172        134      15
Frying            137             1.484        118      19
Vacuum cleaner    204             1.484        174      19
Electric          230             1.484        204      19
Table 2. MEBF of P-RRNN detection for each sound event in the validation and evaluation sets.

                      Validation               Evaluation
Model                 Baseline    P-RRNN       Baseline    P-RRNN
Alarm                 39.4        41.3         39.0        29.7
Dog                   6.1         21.3         4.8         36.8
Dishes                15.3        20.8         20.4        29.1
Speech                38.3        49.4         36.7        52.4
Cat                   29.4        41.1         54.7        60.1
Blender               20.3        34.5         36.6        36.9
Running water         20.8        35.5         21.4        22.6
Frying                26.1        37.9         38.9        50.7
Vacuum cleaner        32.3        65.3         25.0        47.3
Electric              23.1        40.9         23.7        39.3
Overall               25.1        38.8         30.1        40.5
Table 3. The MEBF results of some of the mentioned SED systems on the validation set of DCASE 2019 Task 4.

Model                       Input Feature    MEBF
Baseline [8]                Log-mel          25.1
P-RRNN                      Log-mel          30.1
P-RRNN + pretrain           Log-mel          33.4
Transformer [20]            Log-mel          34.3
Self-mask + ResNet [13]     Log-mel          36.1
P-RRNN + pretrain + MP      Log-mel          38.8