Joint Optimization of Deep Neural Network-Based Dereverberation and Beamforming for Sound Event Detection in Multi-Channel Environments.

In this paper, we propose the joint optimization of deep neural network (DNN)-supported dereverberation and beamforming for convolutional recurrent neural network (CRNN)-based sound event detection (SED) in multi-channel environments. First, the short-time Fourier transform (STFT) coefficients are calculated from multi-channel audio signals recorded in noisy and reverberant environments and are then enhanced by DNN-supported weighted prediction error (WPE) dereverberation with the estimated masks. Next, the STFT coefficients of the dereverberated multi-channel audio signals are passed to the DNN-supported minimum variance distortionless response (MVDR) beamformer, in which beamforming is carried out with the source and noise masks estimated by the DNN. As a result, single-channel enhanced STFT coefficients are produced at the output and passed to the CRNN-based SED system, and the three modules are jointly trained with a single loss function designed for SED. Furthermore, to ease the difficulty of training a deep learning model for SED caused by the imbalance in the amount of data for each class, the focal loss is used as the loss function. Experimental results show that joint training of DNN-supported dereverberation and beamforming with the SED model under the supervision of the focal loss significantly improves the performance in noisy and reverberant environments.


Introduction
Sound event detection (SED) is defined as the task of detecting the onset and offset times of each sound event in an audio segment. Various sounds constantly occur around us, and SED enables many services, including social care [1], audio surveillance [2,3], drone detection [4], and bird detection [5], by allowing machines to recognize sound events as the human auditory system does. In recent years, the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge and deep learning have significantly accelerated SED research. In the first DCASE challenge, held in 2013, all proposed algorithms were based on shallow learning such as the hidden Markov model (HMM), the support vector machine (SVM), and the Gaussian mixture model (GMM). Only a small number of teams participated in DCASE 2013, and the performance of the systems was not satisfactory [6]. Since the deep neural network (DNN)-based polyphonic SED algorithm was proposed in 2015 [7], deep learning-based SED studies have proliferated along with the DCASE challenges (2016, 2017, 2018, and 2019). In particular, deep learning structures based on the convolutional neural network (CNN) [8,9], recurrent neural network (RNN) [10][11][12], and convolutional recurrent neural network (CRNN) [13] showed state-of-the-art performance, and data augmentation methods were proposed to maximize [...]

[...] beamformed signal is estimated as a result. Finally, the CRNN-based SED assesses the presence or absence of sound events, including the onset and offset detection. Then, all the parts of the system are jointly optimized with the focal loss as the loss function. The details of each part of the proposed system are described in the following subsections.

DNN-Supported WPE Dereverberation
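This subsection explains the DNN-supported WPE dereverberation part in detail, covering both classical WPE dereverberation and the DNN-supported WPE dereverberation reported in [20][21][22][23]. When a signal is observed using D microphones in a noisy and reverberant environment, the observed signal $y_{t,f,d}$ can be represented in the short-time Fourier transform (STFT) domain as follows:

$$y_{t,f,d} = x^{(\mathrm{early})}_{t,f,d} + x^{(\mathrm{late})}_{t,f,d} + n_{t,f,d} \qquad (1)$$

where $x^{(\mathrm{early})}_{t,f,d}$, $x^{(\mathrm{late})}_{t,f,d}$, and $n_{t,f,d}$ denote the source signal convolved with the early part of the room impulse response (RIR), the source signal convolved with the late reflection, and the noise signal, respectively. Furthermore, t is the time frame index, f is the frequency bin index, and d is the microphone channel index. We assume that the first 50 ms after the main peak of the RIR constitutes the early reflection and that the remaining part constitutes the late reflection. The purpose of dereverberation is to subtract the late reflection components from the observed signal as follows:

$$\hat{x}^{(\mathrm{early})}_{t,f,d} = y_{t,f,d} - G^{H}_{f,d}\,\tilde{y}_{t-\Delta,f} \qquad (2)$$

where $G_{f,d}$, $\tilde{y}_{t-\Delta,f}$, and $\Delta$ are the stacked representation of the linear prediction (LP) filter (WPE filter in Figure 1), the stacked representation of the observation, and the delay for LP, respectively. To estimate the early reflection component, the classical WPE algorithm finds the LP filter by maximum likelihood (ML) estimation, for which the WPE assumes that the desired signal follows a zero-mean complex Gaussian distribution with a time-varying variance $\lambda_{t,f}$. The ML optimization problem has no closed-form solution, so an iterative procedure alternates between estimating the time-varying variance $\lambda_{t,f}$ and the filter coefficients $G_{f,d}$ as follows:

$$\lambda_{t,f} = \frac{1}{(\delta + 1 + \delta)\,D} \sum_{\tau = t-\delta}^{t+\delta} \sum_{d=1}^{D} \left| \hat{x}^{(\mathrm{early})}_{\tau,f,d} \right|^{2} \qquad (3)$$

$$R_{f} = \sum_{t} \frac{\tilde{y}_{t-\Delta,f}\,\tilde{y}^{H}_{t-\Delta,f}}{\lambda_{t,f}} \in \mathbb{C}^{DK \times DK} \qquad (4)$$

$$P_{f} = \sum_{t} \frac{\tilde{y}_{t-\Delta,f}\,y^{H}_{t,f}}{\lambda_{t,f}} \in \mathbb{C}^{DK \times D} \qquad (5)$$

$$G_{f} = R_{f}^{-1} P_{f} \in \mathbb{C}^{DK \times D} \qquad (6)$$

where (δ + 1 + δ) is the number of context frames used to improve the variance estimate, $R_f$ is the correlation matrix, $P_f$ is the correlation vector (collected over the D channels into a DK × D matrix), K is the order of the LP filter, and $G_{f,d}$ is the d-th column of $G_f$.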
In the DNN-supported WPE, the time-varying variance is computed with a mask estimated by the DNN, and the resulting dereverberated signal is passed as the input to the DNN-supported MVDR beamformer part. Since the masks are bounded within [0, 1], the DNN is easier to optimize than one that directly predicts the desired power spectrum when the full network is trained jointly [22].
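The alternating procedure of classical WPE can be sketched in NumPy for a single frequency bin. This is a minimal sketch and not the implementation used in the experiments: the function name `wpe_one_bin`, the diagonal loading `eps`, the default context size, the number of iterations, and the synthetic two-microphone signal are all assumptions introduced for illustration. Only the delay and filter order default to the (∆, K) = (3, 10) setting reported in the experimental setup.

```python
import numpy as np

def wpe_one_bin(y, taps=10, delay=3, context=1, iterations=3, eps=1e-10):
    """Classical WPE for one frequency bin.

    y: complex array of shape (T, D), T frames from D microphones.
    Returns the dereverberated frames with the same shape.
    """
    T, D = y.shape
    # Stack the delayed observations y_{t-delay}, ..., y_{t-delay-taps+1}.
    y_tilde = np.zeros((T, D * taps), dtype=complex)
    for k in range(taps):
        shift = delay + k
        if shift < T:
            y_tilde[shift:, k * D:(k + 1) * D] = y[:T - shift]

    x_hat = y.copy()
    window = np.ones(2 * context + 1) / (2 * context + 1)
    for _ in range(iterations):
        # Variance: power averaged over channels and over context frames.
        power = np.mean(np.abs(x_hat) ** 2, axis=1)
        lam = np.maximum(np.convolve(power, window, mode="same"), eps)
        # Variance-weighted correlation matrix R and correlation vectors P.
        weighted = y_tilde / lam[:, None]
        R = weighted.T @ y_tilde.conj()
        P = weighted.T @ y.conj()
        # Filter G = R^{-1} P (tiny diagonal loading is an added assumption).
        G = np.linalg.solve(R + eps * np.eye(D * taps), P)
        # Subtract the predicted late reverberation: x_hat = y - G^H y_tilde.
        x_hat = y - y_tilde @ G.conj()
    return x_hat

# Hypothetical toy scene: a white source plus a synthetic late tail per channel.
rng = np.random.default_rng(0)
T, D = 1000, 2
s = rng.standard_normal(T) + 1j * rng.standard_normal(T)
early = np.stack([s, s], axis=1)
late = np.zeros((T, D), dtype=complex)
for d in range(D):
    for k in range(6):
        shift = 3 + k
        coeff = 0.6 * 0.7 ** k * np.exp(1j * rng.uniform(0, 2 * np.pi))
        late[shift:, d] += coeff * s[:T - shift]
y = early + late

x_hat = wpe_one_bin(y)
error_before = np.sum(np.abs(y - early) ** 2)
error_after = np.sum(np.abs(x_hat - early) ** 2)
```

In the DNN-supported variant, the variance inside the loop would come from the network output instead of from the previous estimate, so the loop would collapse to a single differentiable pass.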

DNN-Supported MVDR Beamformer
Originally, the MVDR beamformer used a steering vector, which depends on the direction of the desired source, to minimize the residual noise while constraining the distortion of the signal. The steering vector can be obtained from an estimate of the direction of arrival (DoA), and the optimal signal is calculated by imposing the maximum beam gain in the steering vector direction and the minimum beam gain in the remaining directions. However, the MVDR beamformer can also be derived from the speech and noise power spectral density (PSD) matrices without the steering vector. According to [31], the enhanced single-channel output $\hat{x}_{t,f}$ can be found by multiplying the gain $H^{H}_{\mathrm{MVDR}}$ (MVDR filter in Figure 1) by the observed multi-channel input signal $y_{t,f}$ as follows:

$$H_{\mathrm{MVDR}} = \frac{\Phi_{nn}^{-1}\,\Phi_{xx}}{\mathrm{tr}\left(\Phi_{nn}^{-1}\,\Phi_{xx}\right)}\,u \qquad (7)$$

$$\hat{x}_{t,f} = H^{H}_{\mathrm{MVDR}}\,y_{t,f} \qquad (8)$$

where $\Phi_{xx}$ and $\Phi_{nn}$ respectively denote the PSD matrices of the source and noise components, and u is a one-hot vector for the reference microphone. In addition, tr denotes the trace of the matrix.
In the DNN-supported MVDR beamformer, similar to the DNN-supported WPE dereverberation, two networks are separately trained to estimate the masks used in calculating the source and noise PSD matrices. The target masks are defined for each time-frequency bin and microphone channel, where v ∈ {x, n} denotes the signal attribute (source or noise) and $\theta_f$ is a predefined decision threshold [19,20,22,23]. The estimated masks are averaged over the microphone channel d. As a result, the PSD matrices of the source and noise are found as follows:

$$\Phi_{vv} = \sum_{t} \hat{M}^{(v)}_{t,f}\,y_{t,f}\,y^{H}_{t,f} \qquad (9)$$

where $\hat{M}^{(v)}_{t,f} \in [0, 1]$ denotes the estimated time-frequency mask calculated by the DNN, which uses the sigmoid as the activation function of the output layer. Finally, the single-channel beamformed STFT coefficients $\hat{x}_{t,f}$ are estimated in the order (9) → (7) → (8). Then, $\hat{x}_{t,f}$ is passed to the SED model for predicting sound events.
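The mask-based MVDR computation can be illustrated with a short NumPy sketch for one frequency bin: the masks weight the outer products of the observations to form the source and noise PSD matrices, the filter is Φnn⁻¹Φxx u divided by the trace of Φnn⁻¹Φxx, and the output is the filter applied to each frame. The function name `mask_based_mvdr`, the unnormalized PSD sums, and the toy scene with oracle binary masks are assumptions made here for illustration. In the proposed system the masks are soft DNN outputs.

```python
import numpy as np

def mask_based_mvdr(y, mask_x, mask_n, ref_mic=0):
    """Mask-based MVDR for one frequency bin.

    y: complex array (T, D); mask_x, mask_n: real arrays (T,) in [0, 1].
    Returns the beamformed frames (T,) and the filter h (D,).
    """
    # Mask-weighted PSD matrices: sum_t M_t * y_t y_t^H.
    phi_xx = np.einsum("t,td,te->de", mask_x, y, y.conj())
    phi_nn = np.einsum("t,td,te->de", mask_n, y, y.conj())
    # numerator = Phi_nn^{-1} Phi_xx, computed without an explicit inverse.
    numerator = np.linalg.solve(phi_nn, phi_xx)
    u = np.zeros(y.shape[1])
    u[ref_mic] = 1.0  # one-hot vector selecting the reference microphone
    h = numerator @ u / np.trace(numerator)
    # x_hat_t = h^H y_t for every frame.
    x_hat = y @ h.conj()
    return x_hat, h

# Hypothetical toy scene: source-only frames followed by noise-only frames,
# so that oracle binary masks separate the two sets exactly.
rng = np.random.default_rng(0)
D, T_src, T_noise = 4, 50, 200
a = np.exp(1j * rng.uniform(0, 2 * np.pi, D))
a[0] = 1.0  # unit gain at the reference microphone
s = rng.standard_normal(T_src) + 1j * rng.standard_normal(T_src)
noise = rng.standard_normal((T_noise, D)) + 1j * rng.standard_normal((T_noise, D))
y = np.concatenate([s[:, None] * a[None, :], noise], axis=0)
mask_x = np.concatenate([np.ones(T_src), np.zeros(T_noise)])
mask_n = 1.0 - mask_x

x_hat, h = mask_based_mvdr(y, mask_x, mask_n)
noise_power_in = np.mean(np.abs(y[T_src:, 0]) ** 2)
noise_power_out = np.mean(np.abs(x_hat[T_src:]) ** 2)
```

In this idealized scene the source PSD matrix has rank one, so the filter passes the source at the reference microphone without distortion while reducing the noise power relative to the reference channel.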

Sound Event Detection
For the SED, the log-Mel filter bank (LMFB) feature is used as the input, which is calculated by multiplying the magnitude spectrum by the Mel filter bank and then taking the logarithm. The input features are normalized using the global mean and variance statistics before being fed to the CRNN-based SED model, which is illustrated in Figure 2. Figure 2a-c show the CRNN-based SED model, the conventional convolutional block of the DCASE 2019 Task 3 baseline [18], and the proposed convolutional block, respectively. Unlike the conventional method, which uses three layers of 3 × 3 convolution filters, the proposed convolutional block consists of two parallel parts inspired by VGGNet [32] and Inception V2 [33]. The first part performs the convolution along the frequency axis only, and the second part performs the convolution along both the time and frequency axes; the outputs of the two parts are then concatenated. For the second part, the 3 × 3 convolution is factorized into a 1 × 3 convolution and a 3 × 1 convolution. Finally, a 1 × 1 convolution is used to reduce the computational cost. The output of the convolutional block is fed to two layers of bi-directional gated recurrent unit (GRU) RNNs. Next, the output of the bi-directional GRU is connected to the fully connected layers and the output layer, which uses the sigmoid function as its activation function so that the output for each class lies between zero and one.

Joint Optimization
This section summarizes how the DNN-supported dereverberation, beamforming, and CRNN-based SED models are organized into a cascaded network. First, when the D-channel audio signal is input, the magnitudes of the STFT coefficients are calculated and then fed to the DNN. The DNN estimates the dereverberation mask, and then the magnitudes of the STFT coefficients of the dereverberated signal can be calculated using Equations (2), (3) and (6). Next, this output is fed into another DNN to estimate the source and noise masks for the neural MVDR beamformer. Using Equations (7)-(9), the magnitudes of the STFT coefficients of the single-channel enhanced signal are obtained. Then, by multiplying these values by the Mel filters, the LMFB is calculated, which serves as the input to the CRNN-based SED model. The whole network is trained with a loss calculated from the label and the SED output. Here, the focal loss is adopted as the loss function to further improve the performance. For SED, equalizing the amount of data in each class is challenging because the audio lengths differ from class to class. The focal loss naturally compensates for this problem during training by assigning a larger loss to examples that the model fails to estimate correctly [29]. The focal loss is defined as follows:

$$\mathrm{FL}(p, gt) = -\,gt\,(1 - p)^{\gamma} \log(p) - (1 - gt)\,p^{\gamma} \log(1 - p)$$

where gt represents the ground truth, p ∈ [0, 1] is the model's estimated probability, and γ denotes the tunable focusing parameter. All of the processes described above are differentiable, so backpropagation with the chain rule is possible. Motivated by this, we perform joint training of the cascaded architecture of the DNN-supported WPE dereverberation, the DNN-supported MVDR beamformer, and the SED network according to the focal loss, as depicted in Figure 2.
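The effect of the focusing parameter can be seen in a minimal plain-Python sketch of the binary focal loss for a single class activity. The function name `binary_focal_loss` and the clipping constant `eps` are assumptions introduced here, and the sketch follows the standard binary form of the focal loss without a class-weighting factor.

```python
import math

def binary_focal_loss(p, gt, gamma=2.0, eps=1e-7):
    """Focal loss for one class activity.

    p: predicted probability in [0, 1]; gt: ground truth, 0 or 1.
    gamma = 0 reduces the loss to the binary cross-entropy.
    """
    p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
    positive_term = -gt * (1.0 - p) ** gamma * math.log(p)
    negative_term = -(1.0 - gt) * p ** gamma * math.log(1.0 - p)
    return positive_term + negative_term

# An active class predicted confidently (easy) versus poorly (hard).
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.1, 1)
# Ratio of the two losses with and without the focusing term.
ratio_focal = hard / easy
ratio_bce = binary_focal_loss(0.1, 1, gamma=0.0) / binary_focal_loss(0.9, 1, gamma=0.0)
```

With γ = 2, the easy example is scaled by (1 − 0.9)² = 0.01 and the hard one by (1 − 0.1)² = 0.81, so the hard example dominates the gradient far more than it would under the cross-entropy.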
The joint optimization requires complex-valued operations, including the complex-valued matrix inverse in Equations (6) and (7). As in [19,23], the complex-valued operations are implemented with real-valued operations by computing the real and imaginary parts separately. When C is a complex-valued matrix and A and B are the real-valued matrices corresponding to its real and imaginary parts, C can be expressed as C = A + iB. The complex-valued matrix inverse can then be calculated as follows [34]:

$$C^{-1} = \left(A + B A^{-1} B\right)^{-1} - i\,\left(B + A B^{-1} A\right)^{-1}$$
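The real-valued formulation of the complex inverse can be checked numerically. The sketch below assumes the identity C⁻¹ = (A + B A⁻¹ B)⁻¹ − i (B + A B⁻¹ A)⁻¹, which holds when both A and B are invertible, and compares it with NumPy's native complex inverse. The function name `complex_inverse_real_ops` and the random test matrices are assumptions made for illustration.

```python
import numpy as np

def complex_inverse_real_ops(A, B):
    """Invert C = A + iB using only real-valued matrix inverses.

    Assumes that A and B are both invertible real matrices.
    Returns the real and imaginary parts of C^{-1}.
    """
    A_inv = np.linalg.inv(A)
    B_inv = np.linalg.inv(B)
    real_part = np.linalg.inv(A + B @ A_inv @ B)
    imag_part = -np.linalg.inv(B + A @ B_inv @ A)
    return real_part, imag_part

# Hypothetical well-conditioned test matrices.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 4.0 * np.eye(4)
B = rng.standard_normal((4, 4)) + 4.0 * np.eye(4)

real_part, imag_part = complex_inverse_real_ops(A, B)
reference = np.linalg.inv(A + 1j * B)
max_error = np.max(np.abs((real_part + 1j * imag_part) - reference))
```

Because every step is a real-valued matrix product or inverse, the same computation can be written in a framework whose automatic differentiation supports only real tensors.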

Dataset
The proposed algorithm was evaluated with the TAU Spatial Sound Events 2019 dataset, which consists of two datasets, Ambisonic and Microphone Array [35]. The TAU Spatial Sound Events 2019-Ambisonic dataset provides four-channel first-order ambisonic (FOA) recordings, while the TAU Spatial Sound Events 2019-Microphone Array dataset provides four-channel directional microphone recordings from a tetrahedral array configuration. Each dataset consists of 500 audio files, 400 for development and 100 for evaluation. Each recording is one minute long, the sampling frequency is 48,000 Hz, and the signal-to-noise ratio (SNR) between the sound events and the ambient noise is 30 dB. These recordings were synthesized using spatial room impulse responses (IRs) collected from five indoor locations at 504 unique combinations of azimuth, elevation, and distance. The collected IRs were convolved with the sound events of the DCASE 2016 Task 2 dataset, which contains 11 classes of sound events (clearing throat, coughing, door knock, door slam, drawer, laughter, keyboard, keys (putting on table), page-turning, phone ringing, and speech), each consisting of 20 audio files. Finally, each development dataset was divided into four cross-validation [36] splits of 100 recordings each. Additionally, to consider a noisy environment, we mixed the datasets with ambient noise recorded at an indoor location on the Hanyang University campus in Seoul, Korea, at 10 dB SNR. Two simple data augmentation methods (pitch shifting [14] and block mixing [10] using monophonic audio clips) were applied in the training process to reduce overfitting and improve generalization.

Evaluation Metrics
To evaluate the performance of the SED model, we measured the segment-based F-score and error rate (ER) in the same way as in DCASE 2019 Task 3. The F-score and ER were calculated in non-overlapping segments of one second [37,38]. Therefore, the labels and the SED outputs were averaged over each one-second segment to calculate the metrics. First, the F-score, which measures the effectiveness of retrieval, is calculated as follows:

$$F = \frac{2 \sum_{k=1}^{K} TP(k)}{2 \sum_{k=1}^{K} TP(k) + \sum_{k=1}^{K} FP(k) + \sum_{k=1}^{K} FN(k)}$$

where K is the number of segments and TP(k) denotes the number of true positives, which is the total number of sound event classes that were active in both the reference and the prediction for segment k. In addition, FP(k) denotes the number of false positives, which is the number of sound event classes that were active in the prediction but inactive in the reference. Similarly, FN(k) is the number of false negatives, which is the number of sound event classes that were inactive in the prediction but active in the reference. Additionally, the ER, which measures the amount of errors, is given as follows:

$$ER = \frac{\sum_{k=1}^{K} S(k) + \sum_{k=1}^{K} D(k) + \sum_{k=1}^{K} I(k)}{\sum_{k=1}^{K} N(k)}$$

where N(k) is the total number of active sound event classes in the reference. In addition, S(k), D(k), and I(k) are called the substitution, deletion, and insertion, respectively, which are mathematically defined as:

$$S(k) = \min\left(FN(k), FP(k)\right)$$

$$D(k) = \max\left(0, FN(k) - FP(k)\right)$$

$$I(k) = \max\left(0, FP(k) - FN(k)\right)$$

In the ideal case, the F-score and ER become one and zero, respectively.
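The segment-based metrics can be made concrete with a short plain-Python sketch. It assumes that the reference and the prediction have already been reduced to binary class activities per one-second segment; the function name `segment_metrics` and the three-segment example are hypothetical.

```python
def segment_metrics(reference, prediction):
    """Segment-based F-score and error rate.

    reference, prediction: lists of per-segment lists holding 0/1 class
    activities, one entry per sound event class.
    """
    tp_sum = fp_sum = fn_sum = 0
    s_sum = d_sum = i_sum = n_sum = 0
    for ref_k, pred_k in zip(reference, prediction):
        tp = sum(1 for r, p in zip(ref_k, pred_k) if r and p)
        fp = sum(1 for r, p in zip(ref_k, pred_k) if not r and p)
        fn = sum(1 for r, p in zip(ref_k, pred_k) if r and not p)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
        s_sum += min(fn, fp)         # substitutions
        d_sum += max(0, fn - fp)     # deletions
        i_sum += max(0, fp - fn)     # insertions
        n_sum += sum(1 for r in ref_k if r)  # active reference classes
    denominator = 2 * tp_sum + fp_sum + fn_sum
    f_score = 2 * tp_sum / denominator if denominator else 0.0
    error_rate = (s_sum + d_sum + i_sum) / n_sum if n_sum else 0.0
    return f_score, error_rate

# Hypothetical example: three segments, three classes.
reference = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
prediction = [[1, 0, 0], [1, 0, 1], [0, 0, 0]]
f_score, error_rate = segment_metrics(reference, prediction)
```

In this example the second segment contributes one substitution (one missed class and one false class) and the third contributes one deletion, giving TP = 2, FP = 1, FN = 2, and N = 4, so that F = 4/7 and ER = 0.5.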

Experimental Setup
The evaluation was performed with a window length of 40 ms, a hop length of 20 ms, and a fast Fourier transform (FFT) size of 2048 points. Each 60 s file therefore yielded 3000 frames, and the input sequence length T for training was 128. For dereverberation and beamforming, a multi-layer perceptron consisting of three hidden layers with 1024 nodes was used, with ReLU as the activation function of the hidden layers. For the DNN input, we used the log-scale power spectra (folded frequency bins were discarded) as features, spliced with three left and three right context frames. The parameters of the LP filter for the WPE were fixed to (∆, K) = (3, 10). For sound event detection, the number of Mel filters C for the LMFB was 240. The number of CNN filters in each layer was [64, 64, 64], and the max pooling sizes along the frequency axis ($MP_1$, $MP_2$, and $MP_3$) were 6, 5, and 4, respectively. Additionally, the sizes of the two GRU layers and the two fully connected (FC) layers were [128, 128] and [256, 256], respectively, and the dropout rate for the FC layers was 0.5. The configurations of the neural networks are summarized in Tables 1 and 2. The batch size was 16, and early stopping was applied. Batch normalization [39] was applied to all networks, and the networks were optimized by Adam [40]. The focusing parameter γ of the focal loss was set to two.

Tables 3 and 4 show the results with the TAU Spatial Sound Events 2019-Ambisonic development dataset and the TAU Spatial Sound Events 2019-Microphone Array development dataset, respectively. First, by replacing the convolutional block, the F-score increased by approximately 1.6% on average compared to the conventional method in both datasets, and the ER also improved by 0.05. This result showed that extracting features with different types of blocks inside the convolutional block and concatenating them also worked well for the SED.
Next, the performance improved in all cases: when the WPE was combined with the SED, when the MVDR was combined with the SED, and when the WPE and MVDR were connected with the SED and then jointly trained. One notable point of these results was that MVDR was much more useful than WPE. However, this may be because the reverberation in the dataset was not strong. Finally, the focal loss also turned out to be helpful in improving the performance on the unbalanced dataset. In particular, the performance of Split 2, which was slightly lower than that of the other splits, increased by a relatively large margin. Consequently, the average F-score increased by 13.1%, and the ER improved by 0.23 compared to the conventional method. In the DCASE 2019 Task 3 challenge results, two systems showed better performance than our proposed system with this dataset, and they achieved the F-score of 98.2%, while Xue_JDAI_task3_1 [41] achieved the F-score of 93.4%. However, MazzonYasuda_NTT_task3_3 [42] used 134M parameters for a vast ensemble model because the DCASE 2019 challenge did not limit model complexity. In contrast, our system had only 21M parameters.

Tables 5 and 6 show the results at 10 dB SNR for the Ambisonic and Microphone Array development datasets, respectively. Similar to the original 30 dB datasets, the performance in the noisy environment also improved in all cases: when the WPE was combined with the SED, when the MVDR was combined with the SED, and when the WPE and MVDR were attached to the SED and then jointly trained. Table 7 shows the F-score and ER results on the evaluation dataset. Compared to the DCASE 2019 Task 3 algorithms, the proposed algorithm showed 4% better performance in the 10 dB SNR environment.

Conclusions
The CRNN-based SED model, which is combined with the DNN-supported WPE dereverberation and the DNN-supported MVDR beamformer, was jointly trained using a single loss function. Since the DNN-supported WPE dereverberation and the MVDR beamformer are both differentiable, the gradients derived from the SED part can be backpropagated to update all the parameters of the DNN-supported dereverberation and beamforming. As for the loss function, we used the focal loss to compensate for the imbalance in the amount of data between classes. Experimental results showed that the joint training and the focal loss improved the F-score and error rate of the SED, especially in noisy environments.