Sound Event Detection Using Derivative Features in Deep Neural Networks

We propose using derivative features for sound event detection based on deep neural networks. As input to the networks, we used the log-mel filterbank and its first and second derivative features for each frame of the audio signal. Two deep neural networks were used to evaluate the effectiveness of these derivative features. Specifically, a convolutional recurrent neural network (CRNN) was constructed by combining a convolutional neural network and a recurrent neural network (RNN), followed by a feed-forward neural network (FNN) acting as a classification layer. In addition, a mean-teacher model based on an attention CRNN was used. Both models had an average pooling layer at the output so that weakly labeled and unlabeled audio data could be used during model training. Under the various training conditions, which depended on the neural network architecture and training set, the use of derivative features resulted in a consistent performance improvement. Experiments on audio data from the Detection and Classification of Acoustic Scenes and Events 2018 and 2019 challenges indicated that a maximum relative improvement of 16.9% was obtained in terms of the F-score.


Introduction
Humans can obtain information about their surroundings from nearby sounds. Accordingly, sound signal analysis, whereby information may be automatically extracted from audio data, has attracted considerable attention. The Detection and Classification of Acoustic Scenes and Events (DCASE) 2013-2020 challenges have greatly contributed to increasing the interest in this area, and several competition tasks have been defined. Among these tasks, sound event detection (SED) is aimed at identifying both the existence and occurrence times of the various sounds in our daily lives [1]. It has several applications, such as surveillance [2,3], urban sound analysis [4], information retrieval from multimedia content [5], health care monitoring [6], and bird call detection [7].
Recently, deep neural networks (DNNs) have demonstrated superior performance to that of conventional machine learning techniques in image classification [8], speech recognition [9], and machine translation [10]. In [11-13], it was demonstrated that feed-forward neural networks (FNNs) outperformed the traditional Gaussian mixture model and support vector machines in SED. Therefore, current studies on SED primarily focus on DNN-based approaches.
Owing to their fixed interlayer connections, FNNs (which are the basic architecture of DNNs) cannot effectively handle signal distortions in image classification. The same phenomenon may occur in SED, which generally uses a two-dimensional time-frequency spectrogram as input to the FNN. Moreover, FNNs have limitations in modeling the long-term time-correlation of the sound signal samples. Accordingly, FNNs are not widely used in SED.
1024 (64 ms) with an overlap of 360 (41.5 ms) [19]. Sixty-four bands of the mel-scale filterbank outputs from 0 to 16 kHz were obtained using the STFT and then were log-transformed to produce the same dimensional LMFB for each 41.5 ms frame. The feature extraction process generated 240 frames with 64 dimensions for the 10 s clips used for training and testing. After the LMFB was computed, it was normalized by subtracting its mean and dividing by its standard deviation over the entire training data. Subsequently, it was used as input to the CRNN.
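As a quick check of the framing figures above, the following sketch counts the frames in a 10 s clip. It assumes that 1024 and 360 are the window length and overlap in samples, that the sampling rate is 16 kHz (inferred from 1024 samples lasting 64 ms), and that no padding is applied at the clip edges. None of these assumptions is stated explicitly in the text.

```python
# Assumed values: inferred from "1024 (64 ms)" and "360", not stated in the text.
SAMPLE_RATE = 16_000          # Hz, since 1024 samples / 16 kHz = 64 ms
WINDOW = 1024                 # STFT window length in samples
OVERLAP = 360                 # overlap between consecutive windows in samples
HOP = WINDOW - OVERLAP        # frame shift in samples (664)


def num_frames(num_samples, window=WINDOW, hop=HOP):
    """Number of full windows that fit in the signal without padding."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


clip_samples = 10 * SAMPLE_RATE            # 10 s clip
hop_ms = 1000 * HOP / SAMPLE_RATE          # frame shift in milliseconds
frames = num_frames(clip_samples)

print(f"frame shift: {hop_ms} ms, frames per clip: {frames}")
```

Under these assumptions the frame shift is 41.5 ms and a 10 s clip yields 240 frames, which agrees with the figures quoted above.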

Derivative Features
In speech recognition, which is similar to SED in that time-series signals are involved, the first and second derivative features are calculated from the static feature. In this study, we consider the LMFB extracted in Figure 1 as the static feature and compute the derivative features of the LMFB to use them for training the CRNN. The derivative features are computed as follows:

$$d_t = \frac{\sum_{k=1}^{K} k\,(o_{t+k} - o_{t-k})}{2\sum_{k=1}^{K} k^{2}} \qquad (1)$$

where $d_t$ is the derivative feature at time $t$, and $o_t$ is the static feature. $K$ is the number of frames preceding and following the $t$-th frame. When computing the second derivative feature, the computed first derivative feature is considered as the static feature in (1).
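The derivative computation can be illustrated with a minimal sketch. It assumes the regression form commonly used in speech recognition, d_t = Σ_k k (o_{t+k} − o_{t−k}) / (2 Σ_k k²) for k = 1..K. The value K = 2 and the replication of edge frames at the clip boundaries are assumptions made for illustration, and the function name `delta` is hypothetical.

```python
def delta(features, K=2):
    """Regression-based derivative of a (frames x dims) feature list.

    K (frames on each side) and edge-frame replication are assumptions.
    """
    T = len(features)
    denom = 2 * sum(k * k for k in range(1, K + 1))
    out = []
    for t in range(T):
        frame = []
        for d in range(len(features[t])):
            num = 0.0
            for k in range(1, K + 1):
                following = features[min(t + k, T - 1)][d]  # clamp at clip end
                preceding = features[max(t - k, 0)][d]      # clamp at clip start
                num += k * (following - preceding)
            frame.append(num / denom)
        out.append(frame)
    return out


# Toy static feature: a linear ramp in a single dimension.
static = [[float(t)] for t in range(10)]
first = delta(static)        # first derivative of the static feature
second = delta(first)        # second derivative: apply the same rule to `first`

# Three input channels (static, first derivative, second derivative).
channels = [static, first, second]
print(first[4], second[4])
```

For a linear ramp, the first derivative is 1 and the second derivative is 0 away from the clip edges, which is a convenient sanity check.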

Network Architecture
We used two types of deep neural networks to evaluate the effectiveness of the derivative features in SED: a basic CRNN and a mean teacher model using an attention-based CRNN. As these are considered representative deep neural networks for SED, they may be used to confirm the usefulness of the derivative features.

Basic CRNN
The architecture of the basic CRNN is shown in Figure 2. It is similar to other CRNNs commonly used for SED [11], but the output of the network is fed to a global average pooling (GAP) layer for training using weakly labeled and unlabeled data. The GAP is used to compute the clip-level output for each class by time-averaging the frame-level sigmoid output of the classification layer. Three convolution blocks (ConvBlocks) consisting of two-dimensional CNNs, one bidirectional GRU layer, and one classification layer implemented as an FNN are cascaded in series. The GAP layer averages the frame-level output of the classification layer for the 240 frames corresponding to the 10 s audio clip. The GAP layer is not used for strongly labeled data.
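The time-averaging performed by the GAP layer can be sketched as follows. The function name and the toy probabilities are hypothetical. The sketch only shows how frame-level sigmoid outputs are reduced to one clip-level value per class.

```python
def global_average_pooling(frame_probs):
    """Average frame-level class probabilities over time.

    frame_probs: list of T frames, each a list of per-class probabilities.
    Returns one clip-level probability per class.
    """
    T = len(frame_probs)
    num_classes = len(frame_probs[0])
    return [sum(frame[c] for frame in frame_probs) / T for c in range(num_classes)]


# Toy example with 4 frames and 2 classes (the paper uses 240 frames, 10 classes).
frame_probs = [[0.0, 1.0], [1.0, 1.0], [0.5, 1.0], [0.5, 1.0]]
clip_probs = global_average_pooling(frame_probs)
print(clip_probs)
```

The clip-level values can then be compared with weak (clip-level) labels, which is what allows training without frame-level annotations.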
The 64-dimensional log-mel filterbank and its first and second derivative values are used as input to the basic CRNN. They are independently constructed as (240 × 64)-dimensional feature maps. Although the number of weights increases with the additional feature maps due to the first and second derivative features at the input layer of CRNN, this increase is relatively small compared with the total number of weights of the basic CRNN. The number of weights of the proposed model is about 127,000 compared to 126,000 without derivative features at the input layer.
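To see why the additional input channels add so few weights, the sketch below counts the weights of a first 3 × 3 convolutional layer with one and with three input feature maps. The number of filters (64) is a hypothetical value chosen for illustration, because the text does not state it. The result should therefore be read only as an order-of-magnitude check against the roughly 1000 extra weights reported above.

```python
def conv2d_weights(in_channels, out_channels, kernel_h=3, kernel_w=3):
    """Number of kernel weights in a 2-D convolution (biases excluded)."""
    return in_channels * out_channels * kernel_h * kernel_w


NUM_FILTERS = 64  # hypothetical: the filter count is not given in the text

single_channel = conv2d_weights(1, NUM_FILTERS)   # static LMFB only
three_channels = conv2d_weights(3, NUM_FILTERS)   # static + two derivatives
extra = three_channels - single_channel

print(single_channel, three_channels, extra)
```

Only the first convolutional layer depends on the number of input channels, so all later layers keep the same size.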
In ConvBlock, a 3 × 3 convolutional filter is applied to the context window of the input feature map, and batch normalization is used to normalize the filter output to zero mean and unit variance. A rectified linear unit (ReLU) activation function is applied after batch normalization. Nonoverlapping 1 × 4 max pooling is applied only in the frequency domain to reduce the dimensionality of the data and to improve frequency invariance. We preserve the time dimension to make use of the time-correlation information of the sound signal, which will be exploited in the following GRU layer. Dropout is used after max pooling to reduce overfitting in the training.
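The effect of the frequency-only pooling on the feature-map size can be traced with a short sketch. It assumes that each of the three ConvBlocks applies one 1 × 4 max pooling and that the convolutions preserve the map size. The convolution, batch normalization, ReLU, and dropout steps are omitted.

```python
def max_pool_frequency(feature_map, pool=4):
    """Non-overlapping max pooling along the frequency axis only.

    feature_map: list of T frames, each a list of F frequency values.
    The time dimension (number of frames) is left unchanged.
    """
    return [
        [max(frame[i:i + pool]) for i in range(0, len(frame) - pool + 1, pool)]
        for frame in feature_map
    ]


# Toy input with the dimensions used in the paper: 240 frames x 64 bands.
feature_map = [[float(f) for f in range(64)] for _ in range(240)]
shapes = [(len(feature_map), len(feature_map[0]))]
for _ in range(3):  # three ConvBlocks, pooling step only
    feature_map = max_pool_frequency(feature_map)
    shapes.append((len(feature_map), len(feature_map[0])))

print(shapes)
```

Under these assumptions the frequency axis shrinks from 64 to 16, 4, and finally 1, while all 240 frames are preserved for the GRU layer.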
The output of the last ConvBlock is used as an input to the bidirectional GRU, which has 64 units in each direction and feeds its output into the classification layer, which has 10 units corresponding to the sound classes. These units have a sigmoid activation function, the output of which denotes the posterior probability of the classes for each frame of the sound signal.

Mean-Teacher Model
The mean-teacher model in this study is similar to that used as the baseline recognizer for SED in the DCASE 2019 challenge [1]. The architecture of the mean-teacher model is shown in Figure 3. It consists of two CRNNs, the student model on the left and the teacher model on the right. The student model updates its parameters by calculating the classification cost and the consistency cost and by back-propagating the errors using gradient descent. The classification cost is calculated by comparing the output of the student model with the ground truth table using the weakly labeled and strongly labeled audio data, as shown in Figure 3. The consistency cost is calculated by comparing the output of the student model with that of the teacher model by using unlabeled, weakly labeled, and strongly labeled data. The teacher model does not update its parameters by backpropagation, but uses the ensemble moving average weights of the student model [21]. At test time, the teacher model generally produces more accurate output and is used for prediction.
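The teacher update can be sketched as an exponential moving average of the student weights, which is the form used in the original mean-teacher method. The smoothing factor `alpha` and the dictionary representation of the weights are assumptions made for illustration and are not taken from the text.

```python
def update_teacher(teacher, student, alpha=0.999):
    """Move each teacher weight towards the corresponding student weight.

    alpha is a hypothetical smoothing factor; values close to 1 make the
    teacher change slowly. No gradients flow through the teacher.
    """
    for name, student_value in student.items():
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * student_value
    return teacher


# Toy example with a single scalar weight and a large step for visibility.
teacher = {"w": 0.0}
student = {"w": 1.0}
update_teacher(teacher, student, alpha=0.5)
after_one = teacher["w"]
update_teacher(teacher, student, alpha=0.5)
after_two = teacher["w"]
print(after_one, after_two)
```

In training, such an update would be applied after each gradient step of the student model.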
An attention-based CRNN is used for the mean-teacher model. The mechanism is similar to that in [19]. A gated linear unit (GLU) is used in the ConvBlock, and an attention layer is used at the output. The GLU is shown in more detail in Figure 4; it contains a screening module that passes input signals if they are related to the sound event of interest and blocks all other signals.
where T is the total number of time frames in the 10 s audio clip.
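The screening behaviour of the gated linear unit described above can be illustrated with a minimal element-wise sketch. It assumes the usual GLU form, in which a linear branch is multiplied by the sigmoid of a gate branch. In the actual network both branches would be outputs of convolutional layers, and the names used here are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def glu(values, gates):
    """Element-wise gated linear unit: value * sigmoid(gate)."""
    return [v * sigmoid(g) for v, g in zip(values, gates)]


# A strongly positive gate passes the value; a strongly negative gate blocks it.
out = glu([2.0, 2.0], [20.0, -20.0])
print(out)
```

A gate close to 1 lets the corresponding time-frequency unit through almost unchanged, whereas a gate close to 0 suppresses it.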

Database
In this study, we used the training and test data of the DCASE 2018 and 2019 challenges. The training set is a combination of the training data from both challenges and consists of weakly labeled, strongly labeled, and unlabeled data. It is presented in Table 1. The weak label provides sound event information at the clip level without timing information at the frame level. Unlabeled data do not have any label information, but we can obtain label information at the clip level by prediction from the CRNN trained using weakly labeled data. The length of each clip is 10 s, and the numbers of clips for the weakly labeled, strongly labeled, and unlabeled data are 1578, 2045, and 14,412, respectively. There are 10 different sound types, usually domestic or household. The details of the testing data are shown in Table 2. The data are divided into the DCASE 2018 and DCASE 2019 test sets, which contain 208 and 1168 clips, respectively. The test data contain frame-level label information for the evaluation.

Evaluation Metrics
The CRNN computes the posterior probability for each class in every time frame and identifies a sound event when this probability exceeds 0.5. To improve reliability, median filtering is applied to the probabilities across the frames before the final decision.
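The decision rule above can be sketched for a single class as follows. The filter length of 5 and the replication of edge frames are assumptions made for illustration, and the function names are hypothetical.

```python
from statistics import median


def median_filter(probs, length=5):
    """Sliding median over frames; edge frames are replicated (assumption)."""
    half = length // 2
    T = len(probs)
    smoothed = []
    for t in range(T):
        window = [probs[min(max(i, 0), T - 1)] for i in range(t - half, t + half + 1)]
        smoothed.append(median(window))
    return smoothed


def detect(probs, threshold=0.5, length=5):
    """Frame-level decisions: smooth the probabilities, then threshold."""
    return [p > threshold for p in median_filter(probs, length)]


# Frame-level probabilities for one class: an isolated spike, then a real event.
probs = [0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.8, 0.8, 0.8, 0.8]
decisions = detect(probs)
print(decisions)
```

In this toy sequence the isolated spike at the third frame is removed by the median filter, while the sustained activity at the end is kept as a detected event.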
The performance of the CRNN is measured by the F-score and error rate (ER) using an event-based analysis [23], which compares the output of the CRNN with the ground truth table when the output indicates that an event has occurred. The initial decision comprises three different types: true positive (TP), false positive (FP), and false negative (FN). A TP indicates that the period of a detected sound event overlaps with that from the ground truth table. In the decision, a 200 ms onset collar and a 200 ms or 20% of the event length offset collar are allowed. An FP implies that there is no corresponding overlap period in the ground truth table, although the CRNN output indicates an event. An FN implies that there is an event period in the ground truth table, but the CRNN does not produce the corresponding output.
The F-score (F) is computed based on the initial three decisions and is the harmonic average of the precision (P) and recall (R). They are computed as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2PR}{P + R}$$
The error rate is computed as

$$ER = \frac{S + D + I}{N}$$

where N is the total number of sound events active in the ground truth table. Sound events with correct temporal positions but incorrect class labels are counted as substitutions (S), whereas insertions (I) are sound events present in the system output but not in the reference, and deletions (D) are the sound events present in the reference but not in the output [23].
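The two metrics can be computed from event counts as in the following sketch. It assumes the standard definitions P = TP/(TP + FP), R = TP/(TP + FN), F = 2PR/(P + R), and ER = (S + D + I)/N. The counts in the example are invented for illustration and are not results from this study.

```python
def event_based_scores(tp, fp, fn, substitutions, deletions, insertions, n_ref):
    """Return (precision, recall, F-score, error rate) from event counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    error_rate = (substitutions + deletions + insertions) / n_ref
    return precision, recall, f_score, error_rate


# Hypothetical counts: 100 reference events, 30 of them detected correctly.
p, r, f, er = event_based_scores(
    tp=30, fp=20, fn=70, substitutions=10, deletions=60, insertions=10, n_ref=100
)
print(f"P={p:.2f} R={r:.2f} F={f:.2f} ER={er:.2f}")
```

Because insertions are counted in the numerator, the error rate can exceed 1 when a system produces many spurious events.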

Experimental Results
To train the basic CRNN, various combinations of weakly labeled, strongly labeled, and unlabeled data were considered. Specifically, four combinations, [weakly + unlabeled], [weakly + unlabeled + strongly], [strongly], and [weakly + strongly], were chosen. Binary cross-entropy was used as the loss function, and the Adam optimizer with a learning rate of 0.001 was used to train the basic CRNN. We applied early stopping with a minimum of five epochs and a patience of 15 epochs. The hyperparameters in this study are based on the baseline systems announced at the DCASE 2018 and 2019 challenges. The classification results on the DCASE 2018 test set are shown in Table 3, where "Single channel" implies that only the static log-mel filterbank was used as the input of the basic CRNN, and "Three channels" implies that the first and second derivative features were also used as the input. It can be seen that [weakly + unlabeled + strongly] yields the best performance because it has the largest amount of training data. However, the performance is not satisfactory considering that the unlabeled training data constitute 80% of all the training data. This implies that the basic CRNN cannot efficiently use the unlabeled data to update its parameters.
Furthermore, using derivative features in all combinations of the training data results in a consistent performance improvement in terms of the F-score. In the [weakly + unlabeled + strongly] combination, which yields the best results, a relative improvement of 7.2% in the F-score is observed. The average relative improvement of all combinations is 11.6%. However, the performance improvement is not sufficiently large to manifest itself in the ER.
We compared the performance of our system with the DCASE 2018 baseline system [24], which uses the same training and test data as the [weakly + unlabeled] combination in Table 3. Although the baseline system employs a CRNN architecture similar to ours, it achieved an F-score of 14.06% and an ER of 1.54. It has a better F-score and a worse ER than our [weakly + unlabeled] combination. Slightly different hyperparameters and differences in the learning process may be the main reasons. However, by using the derivative features, we obtained better results than the baseline system in both the F-score and ER, as shown in the first row of the table.
The classification results on the DCASE 2019 test set are shown in Table 4. The trend is similar to that in Table 3, although the performance improvement obtained by using the derivative features is smaller on the DCASE 2019 test set. On average, a 5.3% relative improvement in the F-score is attained, and [weakly + unlabeled + strongly] yields a 6% improvement.
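The early-stopping rule used when training the basic CRNN (a minimum of five epochs and a patience of 15 epochs) can be sketched as follows. The monitored quantity (a validation score where higher is better) and the exact interpretation of the two settings are assumptions, because the text gives only their values. The function name and the toy scores are hypothetical.

```python
def early_stopping_epoch(val_scores, min_epochs=5, patience=15):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once at least `min_epochs` epochs have run and the
    validation score has not improved for `patience` consecutive epochs.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        if epoch >= min_epochs and epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_scores), best_epoch


# Toy validation curve: improves for three epochs, then stays below the best.
val_scores = [0.10, 0.20, 0.30] + [0.25] * 30
stop_epoch, best_epoch = early_stopping_epoch(val_scores)
print(stop_epoch, best_epoch)
```

The model saved at `best_epoch` would then be the one used for evaluation, rather than the model at the epoch where training stopped.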
The F-score learning curve during the training of the basic CRNN when only the weakly labeled data are used is shown in Figure 5. Twenty percent of the weakly labeled data were used as validation data. At epoch 45, the optimal performance is attained by early stopping. Further training only increases the F-score of the training data, possibly resulting in overfitting.
[...] Tables 5 and 6, respectively), and the best model on the validation data was accordingly selected for the evaluation of the test data. The Adam optimizer was used for training with a learning rate of 0.0001, and median filtering of length 5 was applied to the output of the classification layer.
The SED results of the mean-teacher model are shown in Table 5. It can be seen that a significant performance improvement is obtained by the mean-teacher model (cf. Table 3). Furthermore, the addition of the derivative features to the static feature resulted in a 3% relative improvement in the F-score on the DCASE 2018 test set. On the DCASE 2019 test set, a 4.4% relative improvement was attained. In addition, a 2.5% relative improvement was observed when the strongly labeled training data set was used as the test data to demonstrate the effect of the derivative features when the difference between the testing data and training data is small. It can be concluded that a consistent performance improvement was attained by using the derivative features in the mean-teacher model regardless of the test set; however, the improvement was smaller than that for the basic CRNN.
We also compared the performance of our system with the baseline system of DCASE 2019 [1]. Although the baseline system uses the same audio data and a similar mean-teacher model, it showed an F-score of 23.7%, which is worse than our single-channel result of 25.95%. As mentioned in the comparison with the DCASE 2018 baseline in Table 3, the performance difference seems to come from details of the implementation. By using the derivative features, we could further increase the F-score of our system to 27.36%, which, as expected, is better than the result of the DCASE 2019 baseline system.
The SED results when the training epochs increased to 200 are shown in Table 6. For the DCASE 2018 test set, a 5% relative improvement was attained by using the derivative features. The results for the DCASE 2019 test set indicate a 7.5% improvement. When strongly labeled training data were used as test data, a 2.9% relative improvement was attained. It can be concluded that an increase in the number of training epochs resulted in an increase in the relative improvement by the derivative features.

Conclusions
Recently, among the approaches for SED, CRNNs have been widely used and have exhibited better performance than other neural networks. In this study, we proposed the use of the first and second delta features of the log-mel filterbank to improve the performance of state-of-the-art CRNNs. We used two types of CRNNs, a basic CRNN and a mean-teacher model based on an attention-based CRNN. We also used various combinations of weakly labeled, strongly labeled, and unlabeled data to train the CRNNs to confirm the effect of the derivative features on SED. Regarding the basic CRNN, a performance improvement was always attained by using the derivative features in the various combinations of the training data. On the DCASE 2018 test set, an 11.6% average relative improvement in the F-score was obtained, and a 5.3% improvement was obtained on the DCASE 2019 test set. Regarding the mean-teacher model, a consistent performance improvement was observed as the number of epochs and the test set type were changed. When 200 epochs were used for training, a 5% relative improvement on the DCASE 2018 test set and a 7.5% improvement on the DCASE 2019 test set were observed.
In this study, we used the derivative features of the log-mel filterbank to improve SED. In the experiments using various combinations of training and test data, we observed a consistent performance improvement in state-of-the-art CRNNs, which, however, was not as significant as in speech recognition. Nevertheless, the results appear to be sufficient to indicate the importance of the derivative features in SED.

Conflicts of Interest:
The authors declare no conflicts of interest.