Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection



Introduction
People experience a variety of audio events with meaningful information that can be useful for human activities. Audio event detection (AED) aims to identify the occurrence of specific sounds in audio recordings. As the amount of multimedia data on the Internet is growing rapidly, analyzing audio events will help in describing and understanding environmental and social activities in video and audio contents. AED is also useful in many other applications, including surveillance, self-driving cars, healthcare, smart home systems, and military applications.
In early studies on AED, several approaches were proposed based on signal processing and machine learning techniques, and recently deep learning based methods have been widely developed. Most of these studies are based on fully supervised learning methods that require strongly labeled data. In strongly labeled data, either audio event examples are directly provided or the exact time of each audio event is given. However, building a large strongly labeled database is a time-consuming and challenging task. For these reasons, only a few large-scale audio event datasets with strong labels are publicly available.
Recently, there have been some studies on weakly supervised AED [1][2][3][4]. These studies focus on learning AED models based on weakly labeled data that provide only the presence or absence of events in the recording. We can obtain weakly labeled datasets much more easily than strongly labeled datasets. For example, we can collect videos uploaded on the Internet and use the tags (user-generated keywords which describe the content of videos) of the videos as weak labels. However, it is problematic to directly use these data for AED since the exact occurrence times of the events are not known, which makes it difficult to learn a model for segment-level predictions.
Most of the AED methods use spectro-temporal representations as input features. Since the spectro-temporal feature of an audio signal, such as log mel spectrogram, can be considered as a 2D image, computer vision techniques can be applied to AED. In recent works on computer vision, deep learning approaches including convolutional neural network (CNN) models such as the residual network (ResNet) [5], the densely connected convolution network (DenseNet) [6], and the squeeze-and-excitation network (SENet) [7] have shown impressive performance. In addition, many studies report that better results are obtained by using structured prediction methods, which consider dependencies of each pixel-level output [8,9].
In this paper, we propose a deep convolutional network based on DenseNet and SENet for weakly supervised AED. We take advantage of the strengthened feature propagation of DenseNet and the channel-wise relationship modeling of SENet. In addition, the correlations among segments in recordings are considered through a recurrent neural network (RNN) and a conditional random field (CRF) [10]. We evaluated our proposed method and compared its performance with a CNN-based baseline approach. Empirical results show that the proposed method outperformed the baseline on the DCASE 2017 task 4 dataset.
The rest of the paper is organized as follows. In Section 2, we introduce previous works on AED and related studies. In Section 3, we present our proposed model for weakly supervised AED. Section 4 describes the experimental settings and performance measurement. In Section 5, we present the experimental results. We draw some conclusions in Section 6.

Related Work
Early works on AED focus on detecting audio events based on various machine learning techniques. Several approaches have been proposed based on hidden Markov models (HMMs) [11,12]. In [12], Gaussian mixture model (GMM)-HMM-based modeling, similar to speech recognition techniques, is proposed to model audio events. Support vector machines (SVMs) [13][14][15] and non-negative matrix factorization (NMF) [16][17][18] are also applied to AED in some studies. Bag-of-words representations are used to represent and detect audio events with various classifiers [19,20]. In [21], multi-label deep neural networks (DNNs) are proposed for detection of temporally overlapping audio events in realistic environments. Many works on AED have been proposed based on CNNs [22][23][24]. RNNs have also been utilized in conjunction with DNNs or CNNs [25]. However, increasing the size of a fully supervised deep learning model is difficult due to a lack of large-scale strongly labeled datasets. This limitation can be somewhat alleviated by model regularization and data augmentation, but it is difficult to overcome the limitation completely.
There have been several studies on analyzing and detecting audio events in a weakly supervised scenario. Weakly supervised AED has been widely studied since the release of AudioSet [26], which contains more than two million 10-s YouTube clips with weak audio labels. Among the early studies of weakly supervised AED, a multiple instance learning (MIL) [27] based approach is proposed in [1]. The authors formulated weakly supervised AED as a MIL problem and proposed MIL methods based on SVM and DNN. Although the training was done using weakly labeled data without temporal information, the authors showed that the temporal localization of audio events could still be recovered. In [2], the authors proposed a unified framework for supervised and weakly supervised learning (SWSL) using a graph-based model. The proposed model could be trained simultaneously on strongly and weakly labeled data.
Deep learning based methods have been widely proposed for weakly supervised AED and many of these methods have employed CNNs [3,4,28,29]. In [3], a CNN is applied with an event-specific Gaussian filter layer, which is designed to improve its learning ability. McFee et al. [4] proposed a CNN structure with adaptive pooling operators to aggregate temporally dynamic predictions. Kumar and Raj [28] used a CNN to scan and produce outputs at small segments and then mapped these segment-level outputs to full recording level outputs. Kumar et al. [29] used transfer learning to effectively convey knowledge from weakly labeled web audio data to the target data. In the DCASE 2017 challenge [30], most of the top performing methods on the weakly labeled task relied on CNNs [31][32][33].
Recent improvements in computer hardware have enabled training very deep CNNs. However, this is not easy due to the problem of vanishing/exploding gradients particularly in lower layers. Many algorithms have been proposed to solve this problem such as ResNet [5]. ResNet introduces a residual block that sums a non-linear transformation of the input and its identity mapping. The identity mapping is implemented through a shortcut connection, which makes the networks avoid the vanishing gradient problem. The shortcut connections help to improve the performance of the networks and obtain faster convergence of training. As an extension of ResNets, a new CNN architecture, called DenseNet, is introduced in [6]. DenseNet is built from stacks of dense blocks and pooling operations. The dense blocks consist of multiple layers with direct connections from any layer to all subsequent layers to improve the information flow between layers.
In [7], the authors focused on the channel relationship and proposed a novel architectural unit, the squeeze-and-excitation (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels. They proposed to squeeze global spatial information into a channel descriptor and modeled channel-wise relationships using a lightweight gating mechanism. They demonstrated that SE blocks brought significant improvements in the performance of the state-of-the-art CNNs at a minimal additional computational cost.
CRFs have been employed to enforce structure consistency in semantic segmentation. In [34], a fully connected CRF is used to consider the structural properties of the segmentation outputs. More recently, deep learning models integrating the densely connected CRF are proposed in many studies. DeepLab [8] proposes deep CNNs with atrous convolution, which is convolution with upsampled filters, and combines the responses at the final layer with a fully connected CRF. In [9], an RNN is introduced to approximate the mean-field iterations of CRF optimization, allowing for an end-to-end training of both the fully convolutional network and the RNN.
In this section, we describe our weakly supervised AED model, referred to as DSNet. The overall structure of the DSNet is depicted in Figure 1. The input of DSNet is a log mel spectrogram image $X \in \mathbb{R}^{N \times M}$, where $N$ denotes the number of segments and $M$ is the number of mel filterbanks. First, convolution is performed on the log mel spectrogram images to extract feature maps. We use four DS blocks, each of which consists of a dense block, an SE block, and a max-pooling layer. Two fully connected layers are applied for segment-level prediction. To detect overlapping audio events simultaneously, we define AED as a multi-label classification problem. For this, segment-level predictions are calculated using sigmoid activation functions at the final fully connected layer. A global pooling layer is applied for clip-level prediction. For structured prediction, DSNet with an RNN is proposed and a fully connected CRF is applied as a post-processing method. The detailed architectures of DSNet and DSNet-RNN are given in Section 4.

DenseNet
In a standard CNN, the output of the $l$th layer $x_l$ is calculated by applying a non-linear transformation to the output of the previous layer $x_{l-1}$:
$$x_l = L_l(x_{l-1}) \tag{1}$$
where $L_l$ is a convolution followed by a non-linear activation function. Conventional CNNs consist of a stack of convolutional layers. However, deeper CNNs are more difficult to train due to vanishing gradients. In ResNet [5], residual blocks are used to train deeply structured neural networks. A residual block adds the identity mapping of the input to the output of the layer. The output $x_l$ of a residual block is given by
$$x_l = H_l(x_{l-1}) + x_{l-1} \tag{2}$$
where $H_l$ is a non-linear transformation, which usually consists of a single layer or a stack of multiple layers. The identity mapping acts as a skip connection from a lower layer to the upper layer, which enables input features to be reused and the gradient to flow directly from the upper layer to the lower layer. DenseNet [6] is built from stacks of dense blocks. To improve the information flow between layers, DenseNet uses skip connections from any layer to all subsequent layers in each dense block. Each layer in a dense block can be expressed as
$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]) \tag{3}$$
where $[\,\cdot\,]$ represents the concatenation of feature maps. DenseNet may look similar to ResNet, which also uses skip connections; however, replacing the summation with concatenation makes a noticeable difference between the two networks. DenseNet is more efficient than ResNet in parameter usage. Thanks to short connections to all feature maps in the architecture, information from previously computed feature maps can be reused easily.
We use four convolutional layers and a single bottleneck layer (a layer that contains few channels compared to the previous layer) in each dense block. To improve computational efficiency, the bottleneck layer compresses all feature maps in the dense block into a reduced number of feature maps using 1 × 1 convolution. For all convolutional layers in the model, each side of the inputs is zero-padded by one pixel to keep the feature map size fixed and batch normalization is applied before a rectified linear unit (ReLU) for better training performance. We use more feature maps on the upper dense blocks to compensate for feature map size reduction in each max pooling layer.
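The dense connectivity and the bottleneck described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the 3 × 3 convolution, batch normalization and ReLU are replaced by a hypothetical stand-in (`conv_like`, a random per-position channel mixing followed by ReLU), and the growth rate and output channel count are assumed values chosen only to make the channel bookkeeping visible.

```python
import numpy as np

def conv_like(x, out_channels, rng):
    """Hypothetical stand-in for 'convolution + batch norm + ReLU'.

    It mixes channels at each position (like a 1x1 convolution) and applies
    ReLU. A real layer in the paper would use a zero-padded 3x3 kernel.
    """
    w = rng.standard_normal((x.shape[-1], out_channels)) * 0.1
    return np.maximum(x @ w, 0.0)

def dense_block(x, growth=8, num_layers=4, out_channels=16, seed=0):
    """Dense block sketch: every layer sees all earlier feature maps."""
    rng = np.random.default_rng(seed)
    features = [x]
    for _ in range(num_layers):
        # Concatenate the block input and all previous layer outputs.
        layer_in = np.concatenate(features, axis=-1)
        features.append(conv_like(layer_in, growth, rng))
    # Bottleneck: compress all concatenated maps into fewer channels.
    stacked = np.concatenate(features, axis=-1)
    return conv_like(stacked, out_channels, rng), stacked.shape[-1]

# Toy input: 10 time steps, 8 frequency bins, 4 channels (assumed sizes).
out, concat_channels = dense_block(np.ones((10, 8, 4)))
print(out.shape, concat_channels)
```

With four layers and an assumed growth rate of 8, the bottleneck receives 4 + 4 × 8 = 36 channels and reduces them to 16, which is the compression role the bottleneck layer plays in each dense block.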

Squeeze-and-Excitation
We use SE blocks [7] to model interdependencies between channels. In the SE block, global spatial information is squeezed into a channel descriptor using global average pooling. A channel descriptor $z \in \mathbb{R}^{C}$ is extracted by averaging the input feature map $U \in \mathbb{R}^{H \times W \times C}$ over its spatial dimensions $H \times W$. To utilize the information aggregated in the squeeze operation, a simple gating mechanism is employed. The channel descriptor $z$ is transformed into a set of channel weights $s \in \mathbb{R}^{C}$, which is given by
$$s = \sigma(W_2\, \delta(W_1 z)) \tag{4}$$
where $\sigma$ and $\delta$, respectively, refer to the sigmoid and rectified linear functions. To reduce model complexity and aid generalization, a bottleneck is formed by $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$. We set the dimensionality-reduction ratio $r$ to 4 in our system. The final output of the SE block is obtained by scaling $U$ with the channel weights $s$ for each channel. In this manner, channels possessing more important information can be emphasized.
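The squeeze, excitation and scaling steps can be written compactly in NumPy. The following is a minimal sketch under assumed sizes (a 10 × 8 feature map with C = 16 channels and random weights); only the reduction ratio r = 4 comes from the text.

```python
import numpy as np

def se_block(u, w1, w2):
    """Squeeze-and-excitation on a feature map u of shape (H, W, C).

    w1 has shape (C // r, C) and w2 has shape (C, C // r).
    """
    z = u.mean(axis=(0, 1))                    # squeeze: global average pooling, (C,)
    hidden = np.maximum(w1 @ z, 0.0)           # delta: ReLU on the reduced descriptor
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigma: sigmoid gate, one weight per channel
    return u * s, s                            # scale each channel of u by its weight

rng = np.random.default_rng(0)
C, r = 16, 4                                   # r = 4 as in the text; C is an assumed size
u = rng.standard_normal((10, 8, C))
w1 = rng.standard_normal((C // r, C)) * 0.1    # random weights, for illustration only
w2 = rng.standard_normal((C, C // r)) * 0.1
scaled, s = se_block(u, w1, w2)
print(scaled.shape, s.shape)
```

Because the gate is a sigmoid, every channel weight lies strictly between 0 and 1, so the block can attenuate uninformative channels but never changes the sign of a feature.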

Global Pooling for Aggregation
The proposed DSNet aims to predict both segment-level and clip-level labels. To train DSNet with only weak labels (clip-level labels), we need to aggregate segment-level predictions to form clip-level predictions. A common approach would be taking an average over all segment predictions corresponding to a clip-level prediction. In this approach, all segments of the clip have the same influence on the clip-level prediction. However, clips with a positive label can also contain negative segments, which disturb the training process. In the multiple instance learning framework, a global max-pooling is applied to aggregate segment-level predictions into a clip-level prediction. In the max pooling approach, the clip-level prediction focuses on the most positive segment in the clip and disturbance from negative segments can be reduced. However, with global max-pooling aggregation, only the most positive segment in each clip is active in training during backpropagation and other segments are ignored.
To take advantage of both methods, we apply the LogSumExp (LSE) function, which is a smooth approximation of the max function. The LSE function is given as
$$y_k = \frac{1}{\alpha} \log\left(\frac{1}{T} \sum_{t=1}^{T} \exp(\alpha\, s_{t,k})\right) \tag{5}$$
where $y_k$ is the clip-level prediction for class $k$, $s_{t,k}$ is the segment-level prediction of the $t$th segment for class $k$, and $T$ is the number of segments in a clip. In Equation (5), $\alpha$ is a hyperparameter (a parameter whose value is set before training) that controls the sharpness of the function. As $\alpha$ increases, the function approaches the max function and, as $\alpha$ decreases toward zero, the function approaches the average function. With the LSE pooling, we can use all the segments of the clip during training and also focus on positive segments in positive clips. We set the parameter $\alpha = 0.5$ in our system. To train our model with only weak labels, we apply the mean square error as the cost function, which is given by
$$C_{mse} = \frac{1}{N} \sum_{n=1}^{N} \lVert L_n - Y_n \rVert_2^2 \tag{6}$$
where $L_n$ denotes the true label for the $n$th clip, $Y_n$ is the clip-level prediction for the $n$th clip, and $N$ is the number of clips in the training data.
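The behavior of LSE pooling between the average and the max can be checked numerically. The sketch below assumes the normalized form of the LSE function with the 1/T factor inside the logarithm (which is what makes the small-α limit equal to the average) and a squared-error cost summed over classes and averaged over clips; the toy segment predictions are made up for illustration.

```python
import numpy as np

def lse_pool(s, alpha=0.5):
    """LogSumExp pooling of segment predictions s with shape (T, K) into (K,)."""
    a = alpha * s
    m = a.max(axis=0)                       # subtract the max for numerical stability
    return (m + np.log(np.mean(np.exp(a - m), axis=0))) / alpha

def mse_cost(labels, preds):
    """Squared error summed over classes and averaged over clips (assumed form)."""
    return np.mean(np.sum((labels - preds) ** 2, axis=1))

# Toy clip with T = 3 segments and K = 2 classes (made-up values).
s = np.array([[0.1, 0.9],
              [0.2, 0.1],
              [0.9, 0.3]])
y = lse_pool(s, alpha=0.5)                  # alpha = 0.5 as in the text
print("mean:", s.mean(axis=0))
print("lse :", y)
print("max :", s.max(axis=0))
```

For any positive α the pooled value lies between the average and the maximum of the segment predictions, so every segment receives a gradient while the most positive segments dominate as α grows.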

RNN-Based Structured Prediction
Segment-level prediction can be performed using DSNet as described above. However, segment-level predictions may not be robust since DSNet does not make good use of long-term contextual information. Better segment-level prediction results can be obtained by considering long-term contextual information and incorporating prior knowledge into our model. To consider long-term dependency between segment predictions, an RNN is applied at the top of DSNet. We refer to DSNet with RNN as DSNet-RNN. The structure of DSNet-RNN is almost the same as that of DSNet except that the dense layer marked with * in Figure 1 is replaced by a single-layer RNN with bi-directional gated recurrent units (Bi-GRUs) [35]. However, in weakly supervised learning, there is a lack of accurate label information on each segment. The incorrect information on each segment may affect other segments through the RNN. To mitigate this problem, it is desirable to train our model by applying the prior knowledge that audio events are generally continuous. To utilize this prior knowledge, we define a prediction smoothness cost $C_{ps}$ in Equation (7), where $y_i$ is the segment prediction of the $i$th segment and $p_i$ denotes its normalized temporal position. The prediction smoothness cost $C_{ps}$ encourages segment predictions to be continuous over time by penalizing nearby segments with different predictions. The cost function for training the DSNet-RNN is given by
$$C = C_{mse} + \lambda\, C_{ps} \tag{8}$$
where $C_{mse}$ is the mean square error cost of Equation (6) and $\lambda$ is a trade-off parameter.
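To make the idea of a prediction smoothness cost concrete, the following NumPy sketch uses one plausible form that matches the description: squared differences between pairs of segment predictions, weighted by a Gaussian kernel on their normalized temporal distance. The exact expression of Equation (7) is not reproduced here, so this form, the pair averaging, and the function name `smoothness_cost` are assumptions made for illustration only.

```python
import numpy as np

def smoothness_cost(y, sigma_ps=0.1):
    """Assumed smoothness cost for segment predictions y of shape (T,).

    Nearby segments (small difference in normalized position) with different
    predictions are penalized more than distant ones.
    """
    t = len(y)
    p = np.linspace(0.0, 1.0, t)                        # normalized temporal positions
    dp = p[:, None] - p[None, :]
    weight = np.exp(-dp ** 2 / (2.0 * sigma_ps ** 2))   # Gaussian weight on temporal distance
    dy = y[:, None] - y[None, :]
    return np.sum(weight * dy ** 2) / (t * t)           # average over all ordered pairs

def total_cost(mse, c_ps, lam=0.01):
    """Weighted sum of the clip-level cost and the smoothness cost."""
    return mse + lam * c_ps

constant = np.full(20, 0.7)                             # perfectly smooth prediction
step = np.concatenate([np.zeros(10), np.ones(10)])      # one continuous event
alternating = np.tile([0.0, 1.0], 10)                   # same activity, fragmented
print(smoothness_cost(constant), smoothness_cost(step), smoothness_cost(alternating))
```

Under this assumed form, a single continuous event is penalized far less than a fragmented prediction with the same number of active segments, which is the behavior the prior on continuous audio events calls for.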

CRF Post-Processing
As the proposed model can produce segment-level predictions, we can determine the border of audio events through post-processing. A common approach is to smooth the segment-level predictions and threshold them for boundary decision. However, since this approach does not take dependency between the segments into account, it is not easy to determine the borders of audio events precisely. To address this issue, we apply CRF for post-processing the segment-level predictions. To reflect the full relationship among segments, we incorporate the fully connected CRF model proposed in [34] into our system.
In the conventional approach, segment-level predictions $y_i$ are smoothed and thresholded for segment-level classification. The threshold value $th_v$ is determined to give the best F1 score on the validation set. In the CRF post-processing approach, the label assignment probability of each class for the $i$th segment, $P(i)$, is calculated as $P(i) = \mathrm{sigmoid}(y_i - th_v)$.
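The conventional smoothing-and-thresholding step can be sketched as follows. The window length of 41 matches the Hanning window reported in the evaluation setup; the normalization of the window to unit sum, the threshold of 0.5, and the toy prediction sequence are assumptions for illustration.

```python
import numpy as np

def smooth_and_threshold(y, threshold, win_len=41):
    """Smooth segment predictions y (shape (T,)) with a Hanning window, then threshold."""
    window = np.hanning(win_len)
    window /= window.sum()                        # unit sum keeps the prediction scale
    smoothed = np.convolve(y, window, mode="same")
    return smoothed, smoothed >= threshold

# Toy predictions: one long event plus an isolated single-segment spike.
y = np.zeros(100)
y[30:70] = 1.0
y[10] = 1.0
smoothed, active = smooth_and_threshold(y, threshold=0.5)
print(active[10], active[50])
```

The isolated spike is averaged away while the long event survives, but the event borders are blurred by the window, which is the limitation that motivates the CRF post-processing.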
The energy function for the fully connected CRF [34] is the sum of unary and pairwise potentials,
$$E(x) = \sum_{i} \theta_i(x_i) + \sum_{i<j} \theta_{ij}(x_i, x_j)$$
where $x_i$ is the label assigned to the $i$th segment, $\theta_i$ represents the unary potential at the $i$th segment, and $\theta_{ij}$ is the pairwise potential between the $i$th and $j$th segments. In the pairwise potential, $\mu(i, j) = 1$ if the $i$th and $j$th segments have different label assignments, and zero otherwise. $p_i$ denotes the temporal position and $m_i$ is the log mel spectrum of the $i$th segment. The hyperparameters $w_{mel}$, $\sigma_{mel}$, $w_{pos}$, and $\sigma_{pos}$ control the Gaussian kernels. The pairwise potential penalizes segments that have similar log mel spectra and positions but different labels. Inference in this model can be performed efficiently using the mean field approximation with message passing through high-dimensional filtering [34].
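A naive mean field inference for one class can be sketched in NumPy as below. Several parts are assumptions rather than the formulation of the paper: the unary potential is taken as the negative log of the label assignment probability, the pairwise kernel is taken as the sum of one Gaussian on the log mel spectra and one Gaussian on the temporal positions (measured in segments), and the messages are computed with a dense T × T kernel matrix instead of the high-dimensional filtering of [34].

```python
import numpy as np

def crf_mean_field(prob, mel, w_mel=1.0, s_mel=1.0, w_pos=1.0, s_pos=25.0, iters=10):
    """Refine per-segment event probabilities prob (T,) using log mel spectra mel (T, M)."""
    t = len(prob)
    eps = 1e-8
    # Unary potentials for labels (inactive, active): negative log-probabilities (assumed).
    unary = -np.log(np.stack([1.0 - prob, prob], axis=1) + eps)
    pos = np.arange(t, dtype=float)
    d_pos = (pos[:, None] - pos[None, :]) ** 2
    d_mel = ((mel[:, None, :] - mel[None, :, :]) ** 2).sum(axis=2)
    # Assumed kernel: Gaussian on spectra plus Gaussian on temporal position.
    kernel = (w_mel * np.exp(-d_mel / (2.0 * s_mel ** 2))
              + w_pos * np.exp(-d_pos / (2.0 * s_pos ** 2)))
    np.fill_diagonal(kernel, 0.0)              # a segment sends no message to itself
    q = np.exp(-unary)
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # Potts compatibility: a label is penalized by the kernel-weighted
        # belief that the other segments take the opposite label.
        penalty = kernel @ (1.0 - q)
        logits = -unary - penalty
        logits -= logits.max(axis=1, keepdims=True)
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)
    return q[:, 1]

# Toy clip: background in the first half, an event in the second half,
# with one inconsistent prediction in each half (made-up values).
prob = np.concatenate([np.full(30, 0.2), np.full(30, 0.8)])
prob[10], prob[45] = 0.7, 0.4
mel = np.concatenate([np.zeros((30, 4)), np.full((30, 4), 3.0)])
refined = crf_mean_field(prob, mel)
print(refined[10] < 0.5, refined[45] > 0.5)
```

In this toy case the two inconsistent segments are pulled back to the label of the spectrally similar neighbors, while the boundary between the two halves stays where the spectrum changes. The dense kernel costs O(T²) memory and time, which is acceptable for a few hundred segments per clip but is not the efficient inference used in [34].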

Dataset
The DCASE 2017 task 4 dataset [30] was published for the task of "Large-scale weakly supervised sound event detection for smart cars" in the DCASE 2017 challenge. The dataset employs a subset of AudioSet by Google [26]. It consists of 17 audio events divided into two categories: "Warning" and "Vehicle". The dataset contains audio classes for self-driving cars, smart cities and related areas. The dataset contains 51,172 training clips, 488 validation clips and 1103 evaluation clips. Every clip is less than 10 s long. Each clip may correspond to more than one audio event and possibly has overlapping audio events. The dataset is obtained by collecting real-life recordings that contain noise and unknown class signals. The training set has weak labels denoting the presence of a given audio event in the clip and no timestamps are provided. For the validation and evaluation sets, strong labels with timestamps are provided for the purpose of performance evaluation.

Metrics
In our work, both clip-level and segment-level evaluation metrics were used. The default segment length used in this work was 100 ms, which is shorter than the 1 s segment length used in the DCASE challenge. This was because our system aims to detect audio events accurately in time via structured prediction. Since the dataset for evaluation has multi-label annotations, we used the metrics proposed in [36]. F1 score with precision (P) and recall (R) was calculated as the primary evaluation metric for both clip-level and segment-level evaluation. For segment-level evaluation, segment-based error rate (ER) was also measured. A detailed explanation of both evaluation metrics is given in [36].
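For reference, the segment-based F1 score and error rate can be computed from binary activity matrices as sketched below. The sketch follows the commonly used definitions of segment-based metrics for polyphonic sound event detection (substitutions, deletions and insertions counted per segment); the function name and the toy matrices are illustrative, and [36] remains the authoritative description.

```python
import numpy as np

def segment_metrics(ref, est):
    """Segment-based precision, recall, F1 and error rate.

    ref and est are binary arrays of shape (num_segments, num_classes).
    """
    ref = np.asarray(ref, dtype=bool)
    est = np.asarray(est, dtype=bool)
    tp = int(np.sum(ref & est))
    fp = int(np.sum(~ref & est))
    fn = int(np.sum(ref & ~est))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    # Error rate: per-segment substitutions, deletions and insertions.
    fn_seg = np.sum(ref & ~est, axis=1)
    fp_seg = np.sum(~ref & est, axis=1)
    subs = int(np.minimum(fn_seg, fp_seg).sum())
    dels = int(np.maximum(0, fn_seg - fp_seg).sum())
    ins = int(np.maximum(0, fp_seg - fn_seg).sum())
    n_ref = int(ref.sum())
    er = (subs + dels + ins) / n_ref if n_ref else 0.0
    return precision, recall, f1, er

# Toy example with 4 segments and 2 classes (made-up annotations).
ref = [[1, 0], [1, 1], [0, 0], [0, 1]]
est = [[1, 0], [0, 1], [1, 0], [1, 0]]
precision, recall, f1, er = segment_metrics(ref, est)
print(precision, recall, f1, er)
```

In the toy example there is one deletion (segment 2), one insertion (segment 3) and one substitution (segment 4) against four reference activities, so the error rate is 0.75 while precision, recall and F1 are all 0.5.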

Feature Extraction
As inputs to the neural networks, we used log mel band features. We extracted 128 mel bands from 0 Hz to 22,050 Hz. We applied a window size of 1100 samples with a shift of 365 samples for frame segmentation to produce 800 frames in a 10-s clip. The logarithms of the mel band energies were calculated and each log mel energy was normalized by subtracting its mean and dividing by its standard deviation computed over the training set. As a result, an 800 × 128 normalized log mel spectrogram image was extracted for each 10-s clip.
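The normalization step can be sketched in NumPy as follows. The sketch assumes that the mean and standard deviation are computed per mel band over all training frames, which is one reading of the description; the mel extraction itself is omitted and random arrays stand in for log mel spectrograms.

```python
import numpy as np

def fit_normalizer(train_log_mels):
    """Per-band mean and standard deviation over all frames of the training clips."""
    stacked = np.concatenate(train_log_mels, axis=0)    # (total_frames, n_mels)
    return stacked.mean(axis=0), stacked.std(axis=0)

def normalize(log_mel, mean, std, eps=1e-8):
    """Subtract the training mean and divide by the training standard deviation."""
    return (log_mel - mean) / (std + eps)

rng = np.random.default_rng(0)
# Stand-ins for log mel spectrograms of three training clips (800 frames x 128 bands).
train = [rng.normal(loc=-5.0, scale=2.0, size=(800, 128)) for _ in range(3)]
mean, std = fit_normalizer(train)
features = normalize(train[0], mean, std)
print(features.shape)
```

The statistics are fitted once on the training set and reused unchanged for the validation and evaluation clips, so all sets share the same input scale.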

DSNet and DSNet-RNN Structures
The specific configuration of the proposed model is described in Table 1. The extracted normalized log mel spectrogram image was used as the input to the neural networks. A convolution layer was used to produce feature maps for the dense blocks. The networks consisted of four dense blocks, each with four convolution layers and one bottleneck layer. The convolution layers consisted of three consecutive operations: 3 × 3 convolution, batch normalization and ReLU. We used a 1 × 1 convolution layer to reduce the number of channels. An SE block and a max pooling layer were placed after each dense block. For segment-level prediction, two dense layers were applied in the DSNet, and a Bi-GRU and a dense layer were applied in the DSNet-RNN. Finally, the segment-level predictions were aggregated through the global pooling layer for clip-level prediction. We set $\sigma_{ps} = 0.1$ in Equation (7) and $\lambda = 0.01$ in Equation (8) to train the DSNet-RNN. The parameter size of the DSNet is 0.32M, which is similar to that of the baseline CNN. The DSNet-RNN has more parameters than the others due to the Bi-GRUs used for structured prediction.

Baseline CNN Structure
To verify the performance of the proposed method, we compared it with a baseline model. In the DCASE 2017 challenge, several CNN-based models were proposed and showed good performance in weakly supervised AED [31][32][33]. We chose a CNN baseline model similar to the models proposed in the DCASE 2017 challenge. The specific configuration of the baseline model is described in Table 2. The audio feature for the baseline was the same as that of the proposed model, an 800 × 128 normalized log mel spectrogram image. The baseline model consisted of four stacks of two convolution layers and a max pooling layer. The last max pooling layer was connected to two dense layers to produce segment-level predictions, and the segment-level predictions were aggregated in the global pooling layer.

Training and Evaluation
The neural network models were implemented using TensorFlow [37]. We set the hyperparameters such that they provided the highest segmental F1 score on the validation set. All networks were trained with Adam (an algorithm for first-order gradient-based optimization of stochastic objective functions) [38]. A dropout [39] rate of 0.1 was applied to the output of the SE blocks and the dense layer with ReLU. We used mini-batches (a subset of data used for updating parameters during one iteration) of 10 clips and a learning rate of 0.0001. We used the validation set for early stopping (stopping training based on the validation performance to avoid overfitting), with the segmental F1 score as the criterion. To deal with the class imbalance in the training set, we applied undersampling to the classes with more than 1000 clips. The networks were trained on NVIDIA Tesla M40 GPUs.
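Early stopping on the validation segmental F1 score can be sketched in plain Python. The patience value and the list of per-epoch scores are hypothetical; the text only states that the validation F1 score was used as the stopping criterion.

```python
def early_stopping_epoch(val_f1_per_epoch, patience=3):
    """Return (best_epoch, best_f1), stopping after `patience` epochs without improvement.

    `patience` is an assumed parameter; it is not specified in the text.
    """
    best_f1, best_epoch, waited = float("-inf"), -1, 0
    for epoch, f1 in enumerate(val_f1_per_epoch):
        if f1 > best_f1:
            best_f1, best_epoch, waited = f1, epoch, 0   # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:
                break                                    # no improvement for `patience` epochs
    return best_epoch, best_f1

# Made-up validation scores: training stops before the late improvement at epoch 6.
scores = [0.30, 0.35, 0.41, 0.40, 0.39, 0.38, 0.50]
best_epoch, best_f1 = early_stopping_epoch(scores, patience=3)
print(best_epoch, best_f1)
```

The example also shows the usual trade-off of this criterion: a small patience saves training time but can stop before a later improvement.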
For evaluation, the optimal thresholds were selected to give the best performance on the validation set. The segment-level predictions were smoothed with a Hanning window of length 41 before thresholding. We set the CRF parameters in Equation (12) to $w_{mel} = 1$, $\sigma_{mel} = 1$, $w_{pos} = 1$ and $\sigma_{pos} = 25$, which showed the best segment-level F1 score on the validation set. To perform multi-label classification, CRF post-processing was performed separately for each class. We employed 10 mean field iterations in the test phase.

Audio Tagging
Table 3 presents the clip-level tagging results on the DCASE 2017 task 4 evaluation set and the parameter sizes of each model. The results show that the DSNet had an absolute improvement of 0.0347 over the baseline CNN in terms of F1 score. The performance of the DSNet indicates that DenseNet and SENet are suitable not only for image processing but also for audio processing. The DSNet-RNN showed almost the same performance as the DSNet in clip-level metrics, which means that the structured prediction has little effect on the clip-level performance.
The class-wise F1 score results for the CNN, DSNet and DSNet-RNN models are presented in Table 4. While there was some variation across classes, the DSNet and DSNet-RNN showed better performance than the CNN on most classes. The performance of the DSNet was considerably better than that of the baseline CNN for the "air horn, truck horn", "police car", "skateboard" and "motorcycle" classes, and the DSNet-RNN showed better performance than the baseline CNN in the "air horn, truck horn", "police car", "screaming" and "motorcycle" classes. The best performing class for all models was "civil defense siren", which consists of long and high volume sounds, and the worst performing class was "car passing by", which consists of short and low volume sounds.

Audio Event Detection
Table 5 presents the segment-level results on the DCASE 2017 task 4 evaluation set. The DSNet and DSNet-RNN outperformed the baseline CNN model in F1 score by 0.0148 and 0.0367, respectively. Similar to the clip-level results, the DSNet performed better than the conventional CNN by using DenseNet and SENet. In particular, the DSNet-RNN showed the best performance in the segment-level results. This indicates that each segment-level prediction benefits from considering contextual information in the neural network.
The weight λ introduced in Equation (8) is a hyperparameter that controls how strongly the cost function depends on the structured prediction term. The effect of the weight λ on the DSNet-RNN is presented in Table 6. The result shows that, when λ = 0, the model did not show a significant performance improvement over the DSNet. This suggests that the flow of uncertain information in the RNN may hinder the training of the model in weakly supervised learning. The overall results show that the performance of the model could be improved by restricting uncertain information flow with appropriate constraints based on prior information. The model had the best performance when λ = 0.01, which was used in training the DSNet-RNN.
Table 7 presents the influence of CRF post-processing on the segment-level performance. All models showed a performance improvement through CRF post-processing. In the DSNet-RNN, the performance improvement was relatively small. This indicates that the DSNet-RNN already reflected contextual information and hence the additional benefit from CRF post-processing was limited. The results of the DSNet-RNN with and without CRF post-processing are visualized in Figure 2. By employing CRF post-processing, we could correct isolated inaccurate predictions and improve the predictions, particularly at the boundaries of the events.

Comparison with the DCASE 2017 Task 4 Results
For comparison, the results of our models and the top results from the DCASE 2017 task 4 are presented in Table 8. In the DCASE 2017 task 4, Xu et al. [33] and Lee et al. [31] showed the best performance in audio tagging (clip-level) and sound event detection (segment-level), respectively. Xu et al. [33] used a learnable gated activation function and Lee et al. [31] used a multi-scale input framework. Both also used fusion or an ensemble of models to obtain their best performance. For a fair comparison, we evaluated the segment-level results of our proposed model at a 1 s time resolution. Our models showed better performance in both clip-level and segment-level results, even without fusion or an ensemble of models. The proposed models outperformed Xu et al. [33] in clip-level F1 score. In the segment-level metrics, the DSNet-RNN achieved an F1 score similar to that of Lee et al. [31] and showed a better performance in ER.
[Table 8, surviving row only: Lee et al. [31]: 0.526, 0.555, 0.660]

Conclusions
In this paper, we proposed DSNet, a combination of DenseNet and SENet, for weakly supervised AED. DSNet allows better information and gradient flow through direct connections between any two layers in dense blocks and adaptively recalibrates channel-wise feature responses using SE blocks. Moreover, we proposed a structured prediction framework and applied it to DSNet. DSNet-RNN utilizes contextual information while minimizing the propagation of uncertainty, and CRF post-processing helps to refine segment-level predictions. Experiments showed that DSNet with structured prediction achieved state-of-the-art results on the DCASE 2017 task 4 dataset.