Confidence Learning for Semi-Supervised Acoustic Event Detection

Abstract: In recent years, the use of synthetic strongly labeled data, weakly labeled data, and unlabeled data has drawn much research attention in semi-supervised acoustic event detection (SAED). The classic self-training method makes predictions for unlabeled data and then selects predictions with high probabilities as pseudo-labels for retraining. Such models have shown their effectiveness in SAED. However, probabilities are poorly calibrated confidence estimates, and samples with low probabilities are ignored. Hence, we introduce a confidence-based semi-supervised acoustic event detection (C-SAED) framework. The C-SAED method learns confidence explicitly and retrains on all data differentially by applying confidence as weights. Additionally, we apply a power pooling function whose coefficient can be trained automatically, in order to use weakly labeled data more efficiently. The experimental results demonstrate that the generated confidence is proportional to the accuracy of the predictions. Our C-SAED framework achieves a relative error rate reduction of 34% compared with the baseline model.


Introduction
Acoustic event detection (AED) is a task for identifying the categories and timestamps of target sound events in continuous audio recordings. As one of the core technologies in non-verbal sound perception and understanding, AED is widely deployed in various applications, such as noise monitoring for smart cities [1], nocturnally migrating bird detection [2], surveillance systems [3], and multimedia indexing [4].
Traditional approaches to AED mainly draw on ideas from speech and music signal processing. Classic features such as Mel frequency cepstral coefficients (MFCCs) are fed into machine learning classifiers, such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) [5,6]. Recently, deep neural networks (DNNs) [7,8], especially convolutional neural networks (CNNs) [4,9] and convolutional recurrent neural networks (CRNNs) [10], have been widely applied to AED tasks and achieve excellent performance. However, these methods depend on a large amount of strongly labeled training data with both event types and timestamps. As the number of acoustic event categories increases, high-quality manual annotation becomes too costly in time and money for practical applications. In comparison, synthetic strongly labeled data, weakly labeled data with only clip-level categories, and unlabeled data are widely available. Therefore, research and competitions [11,12] have turned to semi-supervised acoustic event detection (SAED) with the above data.
SAED inherits the semi-supervised learning paradigm for using unlabeled data. That is, pseudo-labels are generated for unlabeled data as the training target. Therefore, the quality of pseudo-labels plays an important role. There are at least two approaches to improving pseudo-label quality in SAED. One is to improve the model that generates pseudo-labels. As high-level features with a larger compression scale are more suitable for clip-level classification, Lin [13] and Yan [14] introduce an extra branch or model with larger sampling sizes to produce high-level features with a broader view and better clip-level pseudo-labels for unlabeled data. The other is to design a better pseudo-label generation policy. Two effective semi-supervised learning methods have been introduced into SAED: mean teacher and self-training. Mean teacher [15] averages model weights over training steps to form a target-generating teacher model. A modified mean teacher model benefits SAED by employing both frame-level and clip-level consistency losses [14,16-18]. Self-training [19], a simple but effective bootstrapping semi-supervised method, repeatedly retrains the model with part of its predictions as pseudo-labels. Self-training methods adopted in SAED [20-22] retrained only once and employed a small part of the unlabeled data with high probabilities. These approaches filtered unlabeled data by using the posterior distribution to ensure the quality of the pseudo-labels.
These two approaches have shown their respective strengths, and their combination may bring further improvement. Nevertheless, the effect of the combination is beyond the scope of this article. This paper focuses on introducing a better method for generating and using pseudo-labels. We notice that mean teacher and self-training yield impressive performance, but there are three problems with self-training. First, the probability is not a calibrated indicator of the correctness of the model predictions. As modern neural network classifiers are designed to produce output probabilities prone to extreme values, incorrect predictions can be generated with high probabilities [23]. Second, simplified self-training methods [20-22] discard a considerable amount of information. Third, these methods ignore true negative predictions. However, data for SAED are extremely imbalanced, and the large number of correct negative predictions is beneficial to the retraining process.
To improve the training efficiency, we compress the iterations of the framework into two stages. In the first stage, pseudo-labels and an evaluation of pseudo-label quality are generated. In the second stage, unlabeled data are used differentially according to the quality of their pseudo-labels. Specifically, since the posterior probability cannot effectively measure the quality of a pseudo-label, we introduce a method of training confidence in the first stage without additional confidence annotation. Inspired by [24], the first-stage model is made to predict audio events and confidence values at the same time. In the second stage, we use unlabeled data differentially, increasing the weight of data with more reliable labels in the training process.
The contributions of this paper are as follows: (1) C-SAED realizes the co-training of classification and confidence with only classification labels in the first stage by designing a multi-task model. The experimental results show that the generated confidence can effectively measure the correctness of the label. (2) Compared with the traditional self-training method, differentiated training rather than a screening strategy in the second stage effectively improves the utilization efficiency of unlabeled data. Our experiments illustrate that the training effect is significantly improved under the same number of iterations. (3) C-SAED uses the mean teacher model as the backbone of each stage model, which effectively fuses two semi-supervised methods: the consistency principle and pseudo-labels. The error rate (ER) decreases compared with adopting mean teacher only.

Baseline: Mean Teacher
Mean teacher is a consistency regularization method that evaluates unlabeled data under two different noises and then applies a consistency cost between the two predictions. In this case, the model assumes the dual roles of teacher and student. The baseline model makes the following modifications to the mean teacher model in [17]. First, we introduce data augmentation by shifting input features along the time axis (forward and backward, with the shift drawn from a Normal distribution with zero mean and a standard deviation of 16 frames). Second, we adopt a set of median filter window sizes that are proportional to the average duration of the different event categories [13]. Third, the 128-dimensional log Mel spectrogram is extracted at each frame. The window size is 2048 and the hop length is 365. Fourth, the parameters of the feature extractor follow the settings in [18].
To obtain reliable confidence, we add a branch to train confidence in MT-SAED, as illustrated in Figure 1. To generate AED predictions and their corresponding confidence simultaneously without confidence labels, we draw on the successful experience in the field of out-of-distribution detection [24]. The motivation is analogous to a special test that permits giving hints. Candidates are allowed to ask for hints according to their confidence in their answers. Furthermore, a penalty is imposed to prevent candidates from asking for hints on every question. To obtain the highest score, candidates must improve both their ability to answer questions and their ability to self-assess.

Figure 1. MT-SAED model with the confidence branch (block labels: Conv, BiGRU, FC).
MT-SAED is constructed based on the baseline model. There are four outputs in the baseline model: the frame-level output $y_{ft}$ and clip-level output $y_{ct}$ of the teacher model, and the frame-level output $y_{fs}$ and clip-level output $y_{cs}$ of the student model. For each clip, the frame-level outputs $y_{ft}$ and $y_{fs}$ are T × C matrices containing probabilities for each frame, and the clip-level outputs $y_{ct}$ and $y_{cs}$ are 1 × C vectors containing probabilities for the whole clip, where T and C are the numbers of frames and event classes. Power pooling [25] is adopted between the frame-level and clip-level outputs. To enable the model to self-assess, we add a confidence branch in parallel with the original class prediction branch. The confidence branch, which shares the same structure as the frame-level classification branch, applies a fully-connected layer followed by a sigmoid. It generates a confidence value c for the classification result of each sound event at every frame. The output c takes values between 0 and 1. If the model is confident about the classification, c will be closer to 1. Conversely, if the model is uncertain about the correctness of its classification predictions, c will be closer to 0.
A crucial issue for confidence learning is how to train the two tasks with only the classification labels. Following the main idea of giving hints, we construct a new frame-level output of the student model, $y'_{fs}$, from the label $t_f$ and the two outputs $y_{fs}$ and c:
$$y'_{fs} = c \odot y_{fs} + (1 - c) \odot t_f \tag{1}$$
The outputs of the student model, $y'_{fs}$ and $y_{cs}$, are compared with the labels $t_f$ and $t_c$ using the binary cross entropy (BCE) loss, as illustrated in Figure 2. The classification loss can be written as
$$L_{class} = L_{class}^{f} + L_{class}^{c} = \mathrm{BCE}(y'_{fs}, t_f) + \mathrm{BCE}(y_{cs}, t_c) \tag{2}$$
The outputs $y_{fs}$ and $y_{cs}$ are compared with the outputs $y_{ft}$ and $y_{ct}$ by applying the mean square error (MSE) loss. The consistency loss is
$$L_{con} = \mathrm{MSE}(y_{fs}, y_{ft}) + \mathrm{MSE}(y_{cs}, y_{ct}) \tag{3}$$
Figure 2. Framework of C-SAED: Stage one introduces a multi-task system that can generate frame-level classification predictions and corresponding confidence estimates. The pseudo-labels and confidence for weakly labeled data and unlabeled data are produced and applied in stage two. The power pooling function is adopted in the first stage. $n_1$ and $n_2$ are the noises added to the student model and the teacher model. After the weights of the student model (Θ) have been updated with gradient descent, the teacher model weights (Θ′) are updated as an exponential moving average (EMA) of the student weights.
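The EMA update of the teacher weights mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `ema_update`, the dictionary representation of the weights, and the decay value are all assumptions, since the text does not give the smoothing coefficient.

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """Return new teacher weights as an EMA of the student weights.

    teacher, student: dicts mapping parameter names to NumPy arrays.
    decay: assumed smoothing coefficient (not specified in the text).
    """
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

# Toy example with a single two-element parameter.
teacher = {"w": np.array([0.0, 0.0])}
student = {"w": np.array([1.0, 2.0])}
teacher = ema_update(teacher, student, decay=0.9)
print(teacher["w"])  # approximately [0.1, 0.2]
```

The update is applied after each gradient step on the student, so the teacher changes slowly and provides smoother targets for the consistency loss.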
When training with only the $L_{class}$ and $L_{con}$ loss functions, the network becomes lazy at learning the differences between classes. Instead, the model tends to make c approach 0 and receive the ground truth for every sample. Thus, a log penalty is added to the loss function. The confidence loss can be interpreted as a BCE loss with a target of 1:
$$L_c = \mathrm{BCE}(c, 1) = -\log(c) \tag{4}$$
The loss function of the multi-task system is
$$L = L_{class} + \mu L_{con} + \lambda L_c \tag{5}$$
where the parameter µ increases with epochs and λ is a hyperparameter. When λ is too small, the MT-SAED model tends to ask for hints and performs poorly in classification. When λ is too large, the confidence c → 1 and loses its discriminative power. To ensure good classification and good confidence estimation, we first optimize the mean teacher model and the classification branch without $L_c$. Then, the trained parameters are fixed, and L is used to train the confidence branch separately for five epochs.
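The hint mechanism and the log penalty can be sketched in NumPy as below. This is a hedged illustration under stated assumptions: it follows the interpolation idea of [24] (the prediction is blended with the label according to c), it averages the losses over frames and classes, and it leaves out the consistency term. The names `bce` and `hint_losses` are hypothetical.

```python
import numpy as np

def bce(p, t, eps=1e-7):
    """Element-wise binary cross entropy between probabilities p and targets t."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

def hint_losses(y, c, t, lam=0.03):
    """y: frame-level probabilities, c: confidences, t: labels, all of shape (T, C).

    Returns (classification loss, confidence loss, weighted sum).
    The mean reduction is an assumption of this sketch.
    """
    y_hint = c * y + (1.0 - c) * t           # low c means more "hint" from the label
    l_class = bce(y_hint, t).mean()
    l_conf = bce(c, np.ones_like(c)).mean()  # equals the mean of -log(c)
    return l_class, l_conf, l_class + lam * l_conf

# A wrong prediction (0.2 for a positive frame) with high and low confidence.
y = np.array([[0.2]])
t = np.array([[1.0]])
print(hint_losses(y, np.array([[0.99]]), t))  # large class loss, small penalty
print(hint_losses(y, np.array([[0.01]]), t))  # small class loss, large penalty
```

The two printed cases show the trade-off that λ controls: asking for a hint lowers the classification loss but is charged through the log penalty.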

Stage Two: Retraining with Pseudo-Labels and Confidence
In the second stage, the weakly labeled and unlabeled data are sent to the trained MT-SAED model to yield frame-level predictions and confidence estimates. The frame-level posterior probabilities are used as soft pseudo-labels for these data during retraining. Confidence estimates provide a self-assessment of the pseudo-labels. For weakly labeled data, we correct the outputs: for negative clips, pseudo-labels are set to 0 and confidence estimates are set to 1. For strongly labeled data, all confidence estimates are set to 1. To guarantee that every sample contributes while samples are still treated differentially, we interpolate the confidence values with a hyperparameter α to produce a new confidence:
$$c' = \alpha c + (1 - \alpha) \tag{6}$$
The frame-level classification loss $L_{class}^{f}$ is weighted by $c'$ as follows:
$$L_{class}^{f} = \sum_{i} \sum_{k} c'(i,k)\, \mathrm{BCE}\big(y_{fs}(i,k), \hat{t}_f(i,k)\big) \tag{7}$$
where i and k are the indices of frames and classes, and $\hat{t}_f$ is the frame-level target (the strong label or the soft pseudo-label). Samples with high confidence have more accurate pseudo-labels, so we make them more important during retraining. Conversely, other samples contribute a relatively small share of the classification loss. As a result, the information in all strongly labeled, weakly labeled, and unlabeled data is learned differentially. $L_{class}^{c}$ and $L_c$ are omitted. Max pooling is adopted to produce clip-level predictions.
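The confidence-weighted retraining loss can be sketched as follows. The sketch assumes the interpolation takes the form c' = αc + (1 − α), which is consistent with the weights of about 0.7 reported later for near-zero weights at α = 0.3, and it averages rather than sums over frames and classes. The function names are hypothetical.

```python
import numpy as np

def rescale_confidence(c, alpha):
    """Interpolate confidence towards 1 so every sample keeps weight >= 1 - alpha."""
    return alpha * c + (1.0 - alpha)

def weighted_frame_bce(y, pseudo, c, alpha=0.3, eps=1e-7):
    """Confidence-weighted frame-level BCE.

    y: student frame-level probabilities, shape (T, C)
    pseudo: soft pseudo-labels (or strong labels), shape (T, C)
    c: confidence estimates in [0, 1], shape (T, C)
    """
    y = np.clip(y, eps, 1.0 - eps)
    bce = -(pseudo * np.log(y) + (1.0 - pseudo) * np.log(1.0 - y))
    return float((rescale_confidence(c, alpha) * bce).mean())

c = np.array([0.0, 0.5, 1.0])
print(rescale_confidence(c, 0.3))  # approximately [0.7, 0.85, 1.0]
print(rescale_confidence(c, 0.0))  # all ones: equal weighting

y = np.array([[0.9, 0.1]])
pseudo = np.array([[1.0, 0.0]])
conf = np.array([[1.0, 0.2]])
loss_equal = weighted_frame_bce(y, pseudo, conf, alpha=0.0)
loss_conf = weighted_frame_bce(y, pseudo, conf, alpha=1.0)
print(loss_equal, loss_conf)
```

With α = 0 the loss reduces to ordinary unweighted retraining, while α = 1 uses the raw confidence as the weight, so α sets the lower bound 1 − α on how much a low-confidence sample can contribute.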

Pooling Functions
In C-SAED, we use different pooling functions between two-level predictions in two stages, as weak labels are only used in the first stage. To simplify, we briefly introduce two pooling functions for single event detection. Polyphonic SED can be considered as multiple binary classification problems.
In the first-stage training, the pooling function needs to complete two tasks simultaneously: one is to generate weights for frame-level predictions to form the clip-level prediction. The other is to generate frame-level gradients from clip-level gradients.
Because of its gradient backpropagation behavior and its adaptability to sound events with various time scales, the power pooling function [25] is the state-of-the-art method. Its formula and gradient are
$$y_c = \frac{\sum_i y_f(i) \cdot y_f(i)^n}{\sum_i y_f(i)^n} \tag{8}$$
$$\frac{\partial y_c}{\partial y_f(i)} = \frac{(n+1)\, y_f(i)^n - n\, y_c\, y_f(i)^{n-1}}{\sum_j y_f(j)^n} \tag{9}$$
where $y_f(i)$ is the frame-level output probability of frame i, $y_c$ is the clip-level output, and the parameter n is the exponent, which should be non-negative. Here, n is a free parameter to be learned alongside the model parameters, which allows Equation (8) to automatically adapt to and interpolate between separate pooling functions for different sound events. For instance, when n = 0, Equation (8) reduces to mean pooling. When n = 1, Equation (8) simplifies to linear pooling. When n → ∞, Equation (8) approaches max pooling. The gradient backpropagation of power pooling is well suited to SED. For positive clips, larger $y_f$ values are driven to 1 and smaller $y_f$ values are driven to 0, which benefits timestamp detection. For negative clips, $y_f$ is pushed towards $y_c \times n/(n + 1)$. Considering that $y_c$ is trained towards 0, all the $y_f$ values will converge to 0 as desired after enough iterations.
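The special cases of power pooling can be checked numerically with a short sketch. It assumes Equation (8) takes the form y_c = Σ y_f(i)^(n+1) / Σ y_f(i)^n, which reproduces the mean, linear, and max cases listed above, and the gradient is derived from that form by the quotient rule. The function names are hypothetical, and n is treated as a fixed number here rather than a learned parameter.

```python
import numpy as np

def power_pooling(y_f, n):
    """Clip-level probability as a weighted mean of frame probabilities,
    with weights y_f ** n."""
    y_f = np.asarray(y_f, dtype=float)
    w = y_f ** n
    return float((y_f * w).sum() / w.sum())

def power_pooling_grad(y_f, n):
    """Gradient of the clip-level output with respect to each frame output.
    Frame probabilities must be strictly positive (y_f ** (n - 1) is used)."""
    y_f = np.asarray(y_f, dtype=float)
    y_c = power_pooling(y_f, n)
    return ((n + 1) * y_f ** n - n * y_c * y_f ** (n - 1)) / (y_f ** n).sum()

y = np.array([0.1, 0.2, 0.9])
print(power_pooling(y, 0))    # mean pooling: approximately 0.4
print(power_pooling(y, 1))    # linear pooling: sum(y^2) / sum(y)
print(power_pooling(y, 50))   # close to max(y) = 0.9

# The gradient changes sign at y_c * n / (n + 1): frames above the
# threshold get a positive gradient, frames below a negative one.
n = 1
threshold = power_pooling(y, n) * n / (n + 1)
print(threshold, power_pooling_grad(y, n))
```

For a positive clip, increasing y_c therefore raises frames above the threshold and lowers those below it, which is the sharpening behavior described in the text.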
In the second-stage training, only strong labels are applied, so max pooling is the default choice. After the frame-level predictions are smoothed by the median filter, we regard the clip as positive if at least one frame is positive.
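The clip-level decision in the second stage can be sketched for a single event class as below. Several details are assumptions of this sketch rather than statements from the text: the frame probabilities are binarized at a threshold of 0.5 before filtering, a single odd window size is used (the paper uses class-dependent sizes), and the borders are padded by repeating the edge value. The function names are hypothetical.

```python
import numpy as np

def median_filter_1d(x, size):
    """Sliding median with an odd window size; borders repeat the edge value."""
    x = np.asarray(x, dtype=float)
    pad = size // 2
    padded = np.pad(x, pad, mode="edge")
    return np.array([np.median(padded[i:i + size]) for i in range(len(x))])

def clip_is_positive(frame_probs, threshold=0.5, size=3):
    """Max pooling over smoothed binary frame decisions for one event class."""
    binary = (np.asarray(frame_probs) >= threshold).astype(float)
    smoothed = median_filter_1d(binary, size)
    return bool(smoothed.max() > 0.5)

spike = [0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1]      # one isolated active frame
sustained = [0.1, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1]  # three consecutive active frames
print(clip_is_positive(spike))      # False: the median filter removes the spike
print(clip_is_positive(sustained))  # True
```

The example shows why smoothing precedes the max: without the filter, a single spurious frame would be enough to mark the whole clip as positive.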

Dataset and Metrics
We carried out experiments on the DCASE 2019 Task4 dataset [12] and the DCASE 2018 Task4 dataset [11]. Since these two datasets contain the same ten sound events, we take the training set of DCASE 2019 as our training set, including synthetic strongly labeled (2045 clips), weakly labeled (1578 clips), and unlabeled (14,412 clips) data. The validation set of DCASE 2019 (1168 clips) and the evaluation set of DCASE 2018 (800 clips) are used for testing.
These two datasets consist of 10 classes of sound events in a domestic environment: Speech, Dog, Cat, Alarm/bell/ringing, Dishes, Frying, Blender, Running water, Vacuum cleaner, and Electric shaver/toothbrush. The sampling rate is 44,100 Hz. The duration of each audio clip is around 10 s, and multiple audio events may occur at the same time.
In addition, we select clips with multiple events from the validation set of DCASE 2019 and name the set polyset. This set is used to evaluate the performance for the case with two or more acoustic events in a clip. The polyset contains 331 clips.
Experiments were evaluated with the event-based macro-average error rate (ER) with a 200 ms collar on onsets and a collar of 200 ms or 20% of the event length on offsets. The formula is
$$ER = \frac{1}{c} \sum_{i=1}^{c} \frac{I(i) + D(i)}{N(i)} \tag{10}$$
where c is the number of sound event types; I(i), the insertion error, is the number of events of type i output by the model that do not actually appear in the clips; D(i), the deletion error, is the number of events of type i in the clips that were not identified by the model; and N(i) is the number of active events of type i in the reference. A lower ER indicates a more accurate SED system. The specific evaluation details can be found in [26].
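Given per-class error counts, the macro-average ER can be computed as in the sketch below. It assumes ER is the class average of (I(i) + D(i)) / N(i), as the symbol definitions above suggest, and it takes the counts as inputs: the event matching with onset and offset collars is not implemented here. The function name and the example counts are hypothetical.

```python
def macro_error_rate(insertions, deletions, references):
    """Average the per-class error rate (I + D) / N over all classes.

    insertions, deletions, references: sequences with one count per class.
    Classes with no reference events are not handled in this sketch.
    """
    rates = [(i + d) / n for i, d, n in zip(insertions, deletions, references)]
    return sum(rates) / len(rates)

# Two classes: (2 + 3) / 10 = 0.5 and (1 + 0) / 5 = 0.2, so ER is about 0.35.
er = macro_error_rate(insertions=[2, 1], deletions=[3, 0], references=[10, 5])
print(er)
```

Because each class is normalized by its own reference count before averaging, rare classes weigh as much as frequent ones, and ER can exceed 1 when a system produces many insertions.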

Results and Analysis
In this section, we first compare the evaluation ability of posterior probability with that of confidence. Then, our methods are compared with other methods on the test sets. Finally, we analyze the effect of the hyperparameter λ and specify a prior joint distribution for α and λ.

Comparison with Other Methods
We compared the proposed model with the following approaches:
MT18: the official baseline for DCASE 2019 Task4, with the mean teacher structure [17].
Baseline: the modified MT18 method with attention pooling.
MT-SAED: the stage-one model of C-SAED with power pooling.
Prob0.9: only predictions with prob ≥ 0.9 are added to pseudo-labels, and samples are retrained with equal weight [20].
Prob: all samples are retrained with probabilities as weights.
Prob0.5: only predictions with prob ≥ 0.5 are added to pseudo-labels, and samples are retrained with confidence.
Table 1 lists the performance of different models on the three test sets. The results illustrate that the C-SAED models improve significantly on the other models. The ER improvement is mainly due to a significant reduction in insertion (INS) errors. That is, the improvement comes from correcting false positives. When the model was retrained with α = 0, ER also decreased. Although the mean teacher method already makes use of weakly labeled and unlabeled data, applying an appropriate self-training structure can effectively reduce false alarms. However, Prob0.9 introduced massive false positive predictions as pseudo-labels and applied them equally, resulting in many insertion errors. Prob degraded performance because the majority of true negatives received small weights of approximately 0.7 after applying α = 0.3. Meanwhile, many false positives (Figure 3) were introduced with high weights. In contrast, C-SAED with confidence further improved ER by successfully strengthening the attention to true negatives and true positives. The poor results of Prob0.5 reveal the importance of true negatives. In addition, for most models, the error rate on polyset is 0.13 to 0.2 higher than the results on validation 2019. This confirms that polyphonic audio event detection is more challenging than monophonic audio event detection.
Figure 4 shows the class-wise error rates. We omit "Prob0.9", which yielded high ERs, and "MT18", whose original paper does not report its detailed error rates. Some conclusions can be drawn. First, the parameters λ and α affect the INS and deletion (DEL) errors. For example, when λ = 0.01, a large α (α = 1.0) gives lower INS but higher DEL than a small α (α = 0.3). The reason may be that a large α gives low weights to samples with low confidence, which makes the predicted events accurate but may fail on difficult samples. Second, the proposed two-stage scheme outperforms the baseline, MT-SAED, and self-training on most events. Third, different events differ in classification difficulty. For example, alarm/bell/ringing, electric shaver/toothbrush, and vacuum cleaner have relatively low ERs, while frying gives the highest ER.

The Effect of Hyperparameter λ
As mentioned in [24], the parameter λ greatly influences the threshold and distinctiveness of confidence. We adjusted λ within a small range to explore its effect. Figure 5 illustrates that all MT-SAED models trained with λ ∈ [0.01, 0.1] could generate confidence that is positively related to the accuracy of predictions. However, if λ is relatively large, the confidence values cluster together. If λ is relatively small, a majority of confidence estimates have similar accuracy. The line y = x represents the ideal discrimination of prediction quality. Therefore, the smaller the area between the curve and the line y = x, the better the discrimination of confidence. In our experiment, the curve for λ = 0.03 gave the best distinction.
Figure 5 demonstrates that a suitable combination of the parameters λ and α can produce more balanced detection results. INS errors were reduced with only small fluctuations in DEL errors. As an example, when λ = 0.1, α = 1, or λ = 0.01, α = 0.7, the total number of correctly predicted events did not decrease while ER decreased significantly. These combinations increase the lower bound of the confidence values, ensure that samples with low confidence still contribute information, and preserve the distinction of confidence.
In practice, we recommend that the joint setting of λ and α follow these rules. First, λ is expected to lie between 0.01 and 0.1, and α can be set between 0.3 and 1.0. Second, a small λ usually corresponds to a relatively small α. The reason might be that a small λ leads to a broader distribution of confidence and a small α forces the model to pay attention to more samples. Averaging different models trained with different combinations can bring further improvement. Third, for a fixed λ, a larger α is a better choice when the task is simple or the model has already demonstrated relatively high performance.

Conclusions
In this paper, we propose the C-SAED framework with a confidence learning scheme. Our experiments verified that a proper combination of self-training and the mean teacher method is better than employing mean teacher alone. Furthermore, the multi-task structure with a joint learning strategy can generate more reliable confidence values for classification probabilities. The confidence estimates are used as weights to optimize the self-training retraining process, which yields a further improvement. The C-SAED framework can also be extended to other semi-supervised tasks. In addition, although this paper introduces a confidence training method to SAED, confidence can also be applied in other scenarios, such as optimizing focal loss.