Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention

The complexity of polyphonic sounds imposes numerous challenges on their classification. Especially in real life, polyphonic sound events exhibit discontinuity and unstable time-frequency variations. Traditional single acoustic features cannot characterize the key feature information of polyphonic sound events, and this deficiency results in poor model classification performance. In this paper, we propose a convolutional recurrent neural network model based on the temporal-frequency (TF) attention mechanism and feature space (FS) attention mechanism (TFFS-CRNN). The TFFS-CRNN model takes aggregated Log-Mel spectrograms and MFCCs as input, and contains the TF-attention module, the convolutional recurrent neural network (CRNN) module, the FS-attention module and the bidirectional gated recurrent unit (BGRU) module. In polyphonic sound event detection (SED), the TF-attention module can capture the critical temporal-frequency features more capably, while the FS-attention module assigns dynamically learnable weights to different feature dimensions. The TFFS-CRNN model thus improves the characterization of key feature information in polyphonic SED. By using the two attention modules, the model can focus on semantically relevant time frames, key frequency bands and important feature spaces. Finally, the BGRU module learns contextual information. Experiments were conducted on the DCASE 2016 Task3 dataset and the DCASE 2017 Task3 dataset. The results show that the F1-score of the TFFS-CRNN model improved by 12.4% and 25.2% compared with the winning systems of the DCASE challenges, while the ER was reduced by 0.41 and 0.37, respectively. The proposed TFFS-CRNN model achieves better classification performance and lower ER in polyphonic SED.


Introduction
Polyphonic sound event detection (SED) has attracted increasing research attention and has been the subject of numerous challenges [1,2] in recent years; it is mainly used to classify acoustic events and detect their time boundaries. In real environments, multiple audio events may occur simultaneously, so polyphonic SED has important practical application value and theoretical significance [3]. In recent years, polyphonic SED has been applied to smart city traffic systems [4], equipment failure monitoring [5], smart home devices [6,7], telemedicine [8] and wildlife monitoring [9]. Environmental polyphonic sounds do not show regular temporal patterns, such as the phonemes in speech or the rhythm in music; therefore, it is difficult to accurately capture the characteristic information of time frames. In addition, the complexity and variability of polyphonic sound events in real scenes make them even more difficult to detect.
To solve the above problems, various methods have been applied to polyphonic SED. In all polyphonic SED models, the extraction of acoustic features with strong characterization ability and an effective classification algorithm are the keys to improving overall classification performance. The main features include linear predictive coding (LPC) [3], linear predictive cepstral coefficients (LPCC), the discrete wavelet transform (DWT), Mel frequency cepstral coefficients (MFCCs) [3] and Log-Mel spectrograms (Log-Mel). Traditional classifiers include support vector machines (SVMs) [10], Gaussian mixture models (GMM) [11], hidden Markov models (HMM) [12], multi-layer perceptrons (MLP) [13], and so on. However, these traditional models are only applicable to single acoustic events and small datasets [14]. As dataset scale and audio complexity increase, such traditional classification models cannot meet the classification requirements of the system. With the development of machine learning, neural-network classification models have far surpassed traditional classifiers, including feedforward neural networks (FNN), recurrent neural networks (RNN) [15], convolutional neural networks (CNN) [16] and convolutional recurrent neural networks (CRNN) [17][18][19][20]. In recent years, most participants in the DCASE challenge have used deep-learning classification models for SED [1]. CRNN not only has the powerful ability of CNN to capture time-frequency features and process multi-dimensional feature information [21][22][23][24] but also has the advantages of RNN for sequence recognition; therefore, CRNN is well suited to polyphonic SED tasks [24]. In particular, an RNN with a BGRU module provides better access to contextual information [25], which allows the start times and offset times of each sound event to be predicted more accurately.
In recent years, machine learning models based on attention mechanisms have been widely adopted in image recognition [26], machine translation [27], text classification [28] and speech recognition [29,30]. In addition, the relevant literature [31,32] has shown that networks based on attention mechanisms can further improve the classification performance of SED, for example via channel attention [33], spatial attention [34,35], temporal attention [36] and frequency attention [37]. Hu et al. demonstrated the superiority of the channel attention mechanism [38]. Shen et al. presented a TF-attention mechanism for SED [37]. Li et al. focused on temporal attention by calculating weights over the spectrograms [39]. These studies of attention mechanisms focused only on differences between time steps and ignored the importance of different frequency bands and different feature dimensions. Although neural-network classification models have been widely applied in the acoustics field, the following challenges remain in the detection of environmental polyphonic sounds: (1) The temporal-frequency structure of polyphonic sounds is very complicated; events may be continuous (rain), abrupt (gunshots) or periodic (a clock tick). (2) The discontinuity and uncertain duration of polyphonic sounds affect model classification performance. (3) Polyphonic SED models involve larger datasets, more parameters and more feature space dimensions.
To address these challenges in polyphonic SED, this paper proposes an innovative model named TFFS-CRNN. The TFFS-CRNN model contains four modules: the dual-input TF-attention module, the CNN module, the FS-attention module and the RNN with BGRU module. Our model innovatively combines the dual-input TF-attention mechanism and the FS-attention mechanism, filtering out unimportant frequency-band information and weighting the important feature dimensions to characterize the key feature information of polyphonic sound events. The TFFS-CRNN model can thus capture the key temporal-frequency features and spatial features of sound events. In the experiments, the performance of the TFFS-CRNN model was evaluated on the public DCASE 2016 Task3 dataset and DCASE 2017 Task3 dataset and compared with the DCASE challenge's winning system models, which fully demonstrates the superiority of the TFFS-CRNN model in polyphonic SED. Our contributions are as follows: (1) The dual-input TF-attention mechanism is introduced for polyphonic sound events with complex time-frequency structures. (2) In the CRNN module, multi-dimensional higher-order features are extracted to handle the discontinuity of polyphonic sounds. (3) To solve the problem of feature dimension redundancy, we introduce the FS-attention mechanism to weight important dimensions. The rest of the paper is arranged as follows: Section 2 gives a detailed description of the TFFS-CRNN model and its rationale. Section 3 reports and compares the experimental results. Section 4 discusses our findings. Conclusions are given in Section 5.


Methods
The overall structure of the proposed TFFS-CRNN model is shown in Figure 1 and includes five main components: the dual-input TF-attention module, the CNN module, the FS-attention module, the RNN module and the fully connected (FC) output layer. First, the MFCCs feature and Log-Mel spectrograms are extracted as inputs of the dual-input TF-attention module and integrated into the TF-attention spectrograms of the two features. The dual-input TF-attention features not only enhance the representation capability of the features for polyphonic sounds but also highlight important time-frame information and key frequency-band information. The CNN is responsible for extracting multi-dimensional higher-order features from the TF-attention features of the MFCCs and Log-Mel spectrograms. The FS-attention module dynamically learns the importance of each dimension of the multi-dimensional high-order features and gives different attention to different dimensions, extracting important feature space information while ignoring unimportant dimensions. The RNN module uses a 32-unit BGRU to gain contextual information and predict the start times and offset times of polyphonic sounds accurately. Finally, the output features of the BGRU are fed into the FC layer to obtain the classification results of the TFFS-CRNN model. In this section, the whole network model is described in detail.


Dual-Input Temporal-Frequency Attention
This sub-section focuses on the dual-input TF-attention algorithm model shown in Figure 2. The polyphonic SED model introduces a TF-attention mechanism that assigns larger weight values to the critical time frames and critical frequency bands, while the attention weights of time frames and irrelevant bands carrying less information are decreased. Log-Mel features X Log-Mel ∈ R F×T and MFCCs features X MFCCs ∈ R F×T are chosen as inputs to the dual-input TF-attention model. The harmonic spectrograms of the two features clearly display the frequency-band information of polyphonic sounds, and the vertical structure of the percussive spectrograms highlights the difference between time frames and noise; therefore, the TF-attention algorithm uses the harmonic-percussive source separation (HPSS) algorithm [32], through which the harmonic and percussive spectrograms of the Log-Mel and MFCCs features are obtained. The harmonic spectrograms are input to the frequency attention network (F-Attention-Net) and the percussive spectrograms are fed into the temporal attention network (T-Attention-Net); the specific structure is shown in Figure 2. First, the Log-Mel spectrograms X Log-Mel ∈ R F×T or the MFCCs X MFCCs ∈ R F×T are normalized. We then perform convolution operations on the harmonic and percussive spectrograms to extract high-order features; the convolution kernel sizes are (1 × 3) and (2 × 1), respectively. Successive convolution operations reduce the time dimension of the harmonic spectrograms to 1, and the same convolution operation is performed on the frequency dimension of the percussive spectrograms. After that, a (1 × 1) channel convolution kernel is used to compress the channels. In this way, we obtain I F ∈ R F×1 with size (F, 1) and I T ∈ R 1×T with size (1, T).
Finally, we normalize I F ∈ R F×1 and I T ∈ R 1×T using the SoftMax function, thus obtaining the frequency weight F W ∈ R F×1 and the temporal weight T W ∈ R 1×T.
Next, the spectrogram X Log-Mel ∈ R F×T or X MFCCs ∈ R F×T is point-multiplied with the attention weight matrices along the frequency and temporal directions, respectively, to produce the frequency attention spectrogram A F ∈ R F×T and the temporal attention spectrogram A T ∈ R F×T:

A F = F W ⊙ X,  A T = T W ⊙ X,

where X is the input spectrogram and the weights F W and T W are broadcast along the time and frequency axes, respectively.
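As a shape-level illustration (not the trained network), the SoftMax weighting and point-multiplication described above can be sketched in NumPy; here simple mean-pooling stands in for the learned (1 × 3)/(2 × 1) convolutions, and the HPSS step is omitted:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def tf_attention(X):
    """Toy TF-attention on an (F, T) spectrogram. Mean-pooling stands in
    for the learned convolutions of F-Attention-Net / T-Attention-Net."""
    I_F = X.mean(axis=1, keepdims=True)        # (F, 1) frequency saliency
    I_T = X.mean(axis=0, keepdims=True)        # (1, T) temporal saliency
    F_W = softmax(I_F.ravel()).reshape(-1, 1)  # frequency weights, sum to 1
    T_W = softmax(I_T.ravel()).reshape(1, -1)  # temporal weights, sum to 1
    A_F = F_W * X                              # frequency attention spectrogram
    A_T = T_W * X                              # temporal attention spectrogram
    return A_F, A_T

X = np.random.rand(40, 256)                    # F = 40 bands, T = 256 frames
A_F, A_T = tf_attention(X)
assert A_F.shape == A_T.shape == (40, 256)
```

The weights are broadcast against the (F, T) spectrogram exactly as in the expressions above; in the real model the saliency vectors come from the convolutional attention branches rather than mean-pooling.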

As the time and frequency axes of spectrograms carry temporal and frequency feature information, respectively, spectrograms differ substantially from images in the image classification domain. The TF-attention mechanism assigns different weights to each time frame and frequency band, so both temporal and frequency features are enhanced, improving the ability of features to characterize the key information of polyphonic sound events. In general, the two attention branches could be combined in parallel or by concatenation; however, concatenating the two attention mechanisms leads to poor performance. In this paper, three fusion strategies are designed: average combination, weighted combination and channel combination. In the experiment in Section 4.2, the weighted combination strategy performed best, so it was chosen to merge the temporal and frequency attention mechanisms into a unified model. In training, two learnable parameters α and β with α + β = 1 are set up, and the TF-attention spectrogram A T&F is computed as:

A T&F = α·A T + β·A F

Log-Mel-TF-attention and MFCCs-TF-attention fusion strategies: The first strategy, average combination, fuses the obtained Log-Mel-T&F-Attention and MFCCs-T&F-Attention in a 1:1 ratio to produce the final temporal-frequency attention spectrogram LM-T&F-Average:

LM-T&F-Average = 0.5·(A T&F (Log-Mel) + A T&F (MFCCs))

The second strategy is the weighted combination. In training, two learnable parameters λ and µ with λ + µ = 1 are set up, and the L-TF-attention and M-TF-attention spectrograms are fused according to the ratio of these parameters, yielding the TF-attention spectrogram LM-T&F-Weight that fuses the Log-Mel and MFCCs features:

LM-T&F-Weight = λ·A T&F (Log-Mel) + µ·A T&F (MFCCs)

The third strategy is the channel combination.
The two obtained attention spectrograms, L-TF-attention and M-TF-attention, are concatenated as a two-channel output.
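The three fusion strategies can be sketched as follows; the fixed λ below is purely illustrative, whereas in the model λ and µ = 1 − λ are learnable:

```python
import numpy as np

def fuse(L, M, strategy="weight", lam=0.6):
    """Fuse Log-Mel and MFCCs TF-attention spectrograms, both (F, T).
    lam plays the role of the learnable weight λ (µ = 1 − λ)."""
    if strategy == "average":               # LM-T&F-Average, 1:1 ratio
        return 0.5 * (L + M)
    if strategy == "weight":                # LM-T&F-Weight, λ + µ = 1
        return lam * L + (1.0 - lam) * M
    if strategy == "channel":               # channel combination, 2-channel stack
        return np.stack([L, M], axis=0)     # (2, F, T)
    raise ValueError(strategy)

L = np.random.rand(40, 256)
M = np.random.rand(40, 256)
assert fuse(L, M, "average").shape == (40, 256)
assert fuse(L, M, "channel").shape == (2, 40, 256)
```

Note that the average and weighted strategies keep a single-channel (F, T) spectrogram, while the channel strategy changes the input shape seen by the CNN.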

Feature Space Attention
This sub-section focuses on the FS-attention algorithm model shown in Figure 3. In the neural network model, all dimensions of the feature spectrogram are treated equally, thus drowning out some important feature dimensions that contain critical information. The TFFS-CRNN model uses the FS-attention module to obtain the important feature dimensions: each dimension of the multi-dimensional high-order features is assigned a different learnable weight to obtain the space attention features F Attention ∈ R T×K×F.

The feature space attention algorithm is shown in Figure 3 (Permute → Reshape → fully connected feedforward → SoftMax activation → Repeat and Reshape). The multi-dimensional high-order features F LM ∈ R K×T×F are obtained from the CNN module, where K represents the number of channels, F represents the frequency and T represents the time frame length. The high-order features F LM ∈ R K×T×F are input to the FS-attention model, which contains a FC feedforward layer and a SoftMax activation layer to calculate the importance weight of each feature dimension of F LM. The importance weight I, the output of the SoftMax layer, is assigned to the different feature dimensions. First, F LM ∈ R K×T×F is permuted into the 3-dimensional tensor F′ LM.
where R K×T×F→T×K×F denotes that F LM with dimension K × T × F becomes F′ LM with dimension T × K × F. Subsequently, F′ LM is flattened into a 2-dimensional tensor F″ LM by fixing the time dimension T:

F″ LM = Reshape(F′ LM) ∈ R T×KF

Next, F″ LM is input to the feedforward layer. In the FC feedforward layer, the number of hidden units is set to KF, so the dimension of the importance weight I is M = KF. With O j denoting the jth dimensional output of the feedforward layer, the SoftMax activation produces the weight I m that influences the mth dimensional feature of F″ LM:

I m = exp(O m) / Σ j=1..M exp(O j)

The importance weight I is repeated T times, so that its dimension becomes T × K × F, giving the FS-attention tensor I′ ∈ R T×K×F:

I′ = Reshape(Repeat(I, T)) ∈ R T×K×F

where R T×KF→T×K×F denotes that the feature space importance weight I is reshaped from T × KF to T × K × F. The outputs of the FS-attention module can be expressed as:

F Attention = I′ • F′ LM

where "•" represents the Hadamard product. The outputs F Attention of the FS-attention module are then fed into the RNN with the BGRU module.
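A minimal NumPy sketch of the FS-attention computation (permute → flatten → FC + SoftMax → repeat/reshape → Hadamard product); the time-averaging before the FC layer and the random parameters are illustrative assumptions, since in the real model the FC weights are learned:

```python
import numpy as np

def fs_attention(F_LM, W, b):
    """Toy FS-attention. F_LM: (K, T, F) high-order features.
    W: (K*F, K*F) and b: (K*F,) are the FC feedforward parameters."""
    K, T, F = F_LM.shape
    Fp = F_LM.transpose(1, 0, 2)          # permute to (T, K, F)
    Fpp = Fp.reshape(T, K * F)            # flatten, fixing the time axis
    O = Fpp.mean(axis=0) @ W + b          # FC output, (KF,) (time-averaged here)
    e = np.exp(O - O.max())
    I = e / e.sum()                       # SoftMax importance weights, (KF,)
    Ip = np.repeat(I[None, :], T, axis=0).reshape(T, K, F)  # repeat T times
    return Ip * Fp                        # Hadamard product, (T, K, F)

K, T, F = 4, 8, 5
rng = np.random.default_rng(0)
out = fs_attention(rng.random((K, T, F)),
                   rng.standard_normal((K * F, K * F)) * 0.1,
                   np.zeros(K * F))
assert out.shape == (T, K, F)
```

The key point is that a single KF-dimensional weight vector is broadcast over all T time frames, so each feature dimension is scaled consistently across time.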

BGRU
In the RNN module, BGRU is used to learn contextual information from the higher-order attention features, which allows the start times and offset times of each sound event to be predicted more accurately. As shown in Figure 4, the BGRU module consists of two gated recurrent units (GRU) whose training directions are reversed; therefore, the correlation between preceding and following information can be fully exploited, and information from both past and future directions can be introduced. GRU integrates the forget gate and the input gate into a single update gate (denoted as z t), so a GRU unit contains only the update gate and the reset gate. The update gate controls how much of the memory information from h t−1 is retained at the current moment, while the reset gate determines how much of the memory from the previous moment is combined with the new input to form the new memory content. The single-layer GRU and the output vector y are computed as:

z t = δ g(W z x t + U z h t−1 + b z) (20)
r t = δ g(W r x t + U r h t−1 + b r) (21)
h̃ t = tanh(W h x t + U h (r t ⊗ h t−1) + b h) (22)
h t = (1 − z t) ⊗ h t−1 + z t ⊗ h̃ t (23)
y t = δ g(W y h t + b y) (24)

where z t is the output of the update gate, r t is the output of the reset gate, y ∈ [0, 1] N and N represents the number of sound events. The symbol ⊗ denotes element-wise (Hadamard) multiplication. W* and U* in Equations (20)-(24) denote the weight matrices, b* represents the bias terms, and δ g represents the sigmoid activation function. Each dimension of y t gives the probability of a certain sound event occurring at time t. If the hidden layer of the GRU has dimension d, then r t, z t, h̃ t, h t ∈ R d. Using the BGRU module further enhances the performance of the TFFS-CRNN model. The binary cross-entropy loss (BCE loss) of the BGRU output is expressed as:

BCE = − Σ n=1..N [ŷ n log(y n) + (1 − ŷ n) log(1 − y n)]

where N represents the number of sound events and ŷ ∈ {0, 1} N is the binary indicator of sound events.
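A single GRU step and the BCE loss can be sketched as follows; the parameters here are random placeholders rather than trained weights, and only the forward direction of the BGRU is shown:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step: update gate z, reset gate r, candidate state,
    then interpolation between the old and candidate states."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])   # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])   # reset gate
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_tilde                    # new hidden state

def bce_loss(y, y_true):
    """Binary cross-entropy over N event classes."""
    eps = 1e-12
    return -np.sum(y_true * np.log(y + eps) + (1 - y_true) * np.log(1 - y + eps))

rng = np.random.default_rng(0)
d, n_in = 32, 40                          # 32 hidden units, 40-band input
p = {k: rng.standard_normal((d, d)) * 0.1 for k in ("Uz", "Ur", "Uh")}
p.update({k: rng.standard_normal((d, n_in)) * 0.1 for k in ("Wz", "Wr", "Wh")})
p.update({k: np.zeros(d) for k in ("bz", "br", "bh")})
h = np.zeros(d)
for t in range(5):                        # run a short input sequence
    h = gru_step(rng.standard_normal(n_in), h, p)
y = sigmoid(h)                            # per-class probabilities (toy output head)
loss = bce_loss(y, (y > 0.5).astype(float))
assert h.shape == (d,) and loss >= 0.0
```

A BGRU simply runs a second GRU over the reversed sequence and concatenates the two hidden-state streams.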

Experiment
This section describes the experimental datasets, evaluation metrics and experimental configurations in the domain of polyphonic SED [40][41][42]. Experiments are conducted on publicly available datasets to validate the effectiveness of the model, and the results of the proposed method are compared with those of existing methods.

Environmental Sound Datasets
In this study, the DCASE 2016 Task3 and DCASE 2017 Task3 datasets [42,43] were used to evaluate the proposed TFFS-CRNN model, and the experimental results were compared with the winning systems. The datasets are described below.
Both datasets contain environmental sounds of daily life, divided into indoor and outdoor scenes. The audio of the DCASE 2016 Task3 dataset is mono with a 44.1 kHz sampling rate. The DCASE 2017 Task3 dataset contains more street sounds and human voices in real-life recordings; each recording is 3-5 min long and the sampling frequency is also 44.1 kHz. As shown in Table 1, DCASE 2017 Task3 includes two daily environments: an outdoor residential area and an indoor home. Both datasets contain a development set accounting for 70% of the total samples and an evaluation set accounting for 30%. In this study, four-fold cross-validation is used for training and testing.
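The four-fold cross-validation can be sketched as below; note that the DCASE datasets ship with predefined folds, so the random split here is purely illustrative:

```python
import numpy as np

def four_fold_splits(n_files, seed=0):
    """Split file indices into 4 folds; each fold serves once as the
    test set while the remaining three folds are used for training."""
    idx = np.random.default_rng(seed).permutation(n_files)
    folds = np.array_split(idx, 4)
    for k in range(4):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(4) if j != k])
        yield train, test

splits = list(four_fold_splits(20))      # e.g. 20 recordings
assert len(splits) == 4
train, test = splits[0]
assert len(train) + len(test) == 20
```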

Evaluation Metrics
In this paper, we use the widely recognized measurements proposed for polyphonic SED in [40] to compare the performance of different algorithms. The evaluation metrics are the segment-based F1-score (F1) and the error rate (ER). F1 is the harmonic mean of precision (P) and recall (R) and takes values in the range 0-1:

P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2PR / (P + R)

where TP, FP and FN denote true positives, false positives and false negatives, respectively. The ER measures the proportion of misclassified samples and is calculated as:

ER = [Σ t=1..T S(t) + Σ t=1..T I(t) + Σ t=1..T D(t)] / Σ t=1..T N(t)

where t = 1, …, T indexes the segments. Substitution events S(t) count the cases where the model misclassifies sound event A as sound event B. Insertion events I(t) refer to events detected in the model output although no event occurs in the annotations at that moment. Deletion events D(t) refer to sound events that were present but not detected. N(t) is the total number of acoustic events in the annotations.
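A minimal implementation of these segment-based metrics, computed over binary activity matrices (segments × event classes), might look like:

```python
import numpy as np

def segment_metrics(ref, est):
    """Segment-based F1 and ER for multi-label 0/1 activity matrices
    ref, est of shape (T_segments, N_events)."""
    ref, est = np.asarray(ref), np.asarray(est)
    tp = np.sum((ref == 1) & (est == 1))
    fp = np.sum((ref == 0) & (est == 1))
    fn = np.sum((ref == 1) & (est == 0))
    p = tp / max(tp + fp, 1)
    r = tp / max(tp + fn, 1)
    f1 = 2 * p * r / max(p + r, 1e-12)
    # per-segment substitutions, deletions, insertions
    fn_t = np.sum((ref == 1) & (est == 0), axis=1)
    fp_t = np.sum((ref == 0) & (est == 1), axis=1)
    s = np.minimum(fn_t, fp_t).sum()          # S(t): paired miss + false alarm
    d = np.maximum(0, fn_t - fp_t).sum()      # D(t): unpaired misses
    i = np.maximum(0, fp_t - fn_t).sum()      # I(t): unpaired false alarms
    er = (s + d + i) / max(ref.sum(), 1)      # normalized by N(t)
    return f1, er

ref = [[1, 0, 1], [0, 1, 0]]   # 2 segments, 3 event classes
est = [[1, 0, 0], [0, 1, 1]]
f1, er = segment_metrics(ref, est)
assert 0.0 <= f1 <= 1.0
```

For published results, the reference implementation in the `sed_eval` toolbox should be preferred; the sketch above only illustrates the definitions.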

Experimental Configurations
In this research, the TFFS-CRNN model was implemented in Python. All audio data are 44.1 kHz mono wave files, and the dimension of the Log-Mel spectrograms and MFCCs is 40 × 256 (T = 256, F = 40). The frame size is 40 ms with 50% overlap between frames. As shown in Table 2, the hyperparameters of the TFFS-CRNN model were tuned with a random search strategy. The model was trained with the Adam optimizer, and the ReLU activation function was used to introduce non-linearity; the learning rate was 0.0001. We used early stopping to mitigate overfitting. Batch normalization (BN) is used to reduce the sensitivity of the network to the initialization weights, and dropout with a rate of 0.25 is applied after each convolutional layer. The global threshold τ, used to determine the active acoustic events, is set to 0.5: if the output probability of an event class exceeds the threshold, the segment is considered to contain that event class. Table 2 shows the specific parameter settings of the CRNN network from input to output.
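Applying the global threshold τ = 0.5 to the network's per-frame class probabilities can be sketched as:

```python
import numpy as np

def binarize(probs, tau=0.5):
    """Turn per-frame class probabilities of shape (T, N) into 0/1
    event activity using the global threshold tau."""
    return (np.asarray(probs) > tau).astype(int)

probs = np.array([[0.9, 0.2],    # frame 1: event 0 active
                  [0.4, 0.7]])   # frame 2: event 1 active
act = binarize(probs)
assert act.tolist() == [[1, 0], [0, 1]]
```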

Discussion
The performance of the proposed TFFS-CRNN model was evaluated under different fusion strategies, different features, different classifiers and different methods. Accordingly, we designed comparison experiments on the DCASE 2016 Task3 dataset and the DCASE 2017 Task3 dataset.

Comparison of Different Features
Six commonly used features were chosen for the experiments, including the Log-Mel spectrograms, MFCCs and the short-time Fourier transform (STFT). The sampling rate of the audio signals was 44.1 kHz, the frame size was 40 ms and the frame overlap was 50%. The STFT was computed with 1024 points and has a size of 40 × 256. The other three features were fusions of the Log-Mel spectrograms, MFCCs and STFT: F MS is the fusion of MFCCs and STFT, F LS is the fusion of Log-Mel spectrograms and STFT, and F LM is the fusion of Log-Mel spectrograms and MFCCs. In the experiments, the same TFFS-CRNN algorithm was used for all features to compare their classification effect. Table 3 and Figure 5 show the comparison results of the different features on the TFFS-CRNN model.

Compared with the other features, F LM improved the F1 and ER values and has a better feature characterization capability. In experiments on the DCASE 2016 Task3 dataset, the aggregated feature F LM reached a maximum F1 of 60.2% and an ER of 0.40; on the DCASE 2017 Task3 dataset, its F1 and ER were 66.9% and 0.49, respectively. The results indicate that the fusion of Log-Mel spectrograms and MFCCs enhanced the classification performance, whereas the individual features performed worse than the aggregated features.

Comparison of Different Fusion Strategies
Since F LM was the best-performing feature in the previous section, it is used as the input feature in the fusion strategy comparison experiment. As shown in Table 4, three fusion strategies are compared: average combination, weighted combination and channel combination; the principles are shown in Equations (11)-(13). The experimental results show that the second fusion strategy performs best, as its learnable weight factors are more adaptive. We therefore choose the weighted combination to fuse Log-Mel-T&F-Attention and MFCCs-T&F-Attention. In the same way, we experimented with these strategies for fusing Log-Mel-T-Attention and Log-Mel-F-Attention, and the weighted combination again performed best. In this way, the performance of the TF-attention mechanism can be optimally tuned.

Comparison of Different Classifiers
Compared with the other classifiers, the proposed TFFS-CRNN model improved the F1 and reduced the ER. The comparison experiments in this section were conducted under the same conditions, using the same feature F LM. Figure 5 shows the detailed parameters of the TFFS-CRNN network. The compared classifiers include SVM, CNN, RNN, CRNN, FS-CRNN, TF-CRNN and TFFS-CRNN, with the following primary parameter settings. SVM: one-vs-rest SVMs with a sigmoid function. CNN: five convolutional layers with ReLU activation. RNN: two BGRU recurrent layers and a time-distributed fully connected layer. CRNN: three CNN layers and an RNN with BGRU. FS-CRNN: an FS-attention layer, three convolutional layers and two BGRU recurrent layers. TF-CRNN: a TF-attention layer, three convolutional layers and two recurrent layers with BGRU. TFFS-CRNN: a TF-attention layer, three convolutional layers, an FS-attention layer and two recurrent layers with BGRU. Table 5 shows the comparison results of the different classifiers using the feature F LM.

On the DCASE 2016 Task3 dataset, the F1 and ER values of FS-CRNN improved by 3.3% and 7%, respectively, compared with the CRNN; the F1 and ER values of TF-CRNN improved by 13.5% and 15%, respectively, compared with the CRNN; the F1 and ER values of TFFS-CRNN improved by 18.6% and 9%, respectively, compared with the FS-CRNN; TFFS-CRNN improved by 5.1% and 1%, respectively, compared with TF-CRNN; and TFFS-CRNN improved by 21.9% and 16%, respectively, compared with CRNN.
These experiments demonstrate the effectiveness of the attention mechanisms. Compared with the CRNN model, the TF-CRNN and FS-CRNN models improve performance to a certain extent. In particular, the TFFS-CRNN model combines the advantages of the TF-attention mechanism and the FS-attention mechanism at the same time, and its performance is greatly improved. We can conclude that extracting key temporal-frequency feature information and key dimension information can greatly improve the performance of the TFFS-CRNN model in polyphonic SED. The improvement may be due to the increased attention to key information and the enhanced feature representation ability. The other classifiers improved only slightly, because key temporal-frequency information and key feature dimensions were submerged.

Comparison of Different Methods
The proposed model was then compared with existing methods. The compared models are specified below. MFCCs+CRNN [43]: the model uses CRNN as the classifier and MFCCs features as input.
MFCCs+GMM [42]: the model uses GMM as the classifier and MFCCs features as input; this is the baseline model for DCASE 2016 Task3. MFCCs+CNN [44]: the network model is a three-layer CNN, and the input feature is MFCCs.
Log-Mel+CRNN [46]: the network model is a CRNN, and the input feature is Log-Mel spectrograms.
Log-Mel+RNN [15]: the network model uses bidirectional LSTM RNN as the classifier, and the input feature is Log-Mel spectrograms.
Log-Mel+CNN [16]: the network model consists of two convolutional layers and two FC layers, and the input feature is Log-Mel spectrograms.
Log-Mel+CRNN [17]: the network model is a CRNN, and the input feature is Log-Mel spectrograms.
Binaural Mel energy + CRNN(BGRU) [47]: the model uses a capsule network with pixel-based attention and BGRU for polyphonic SED tasks, and the input feature is Binaural Mel energy.
Log-Mel+CRNN [48]: the network model is CRNN with GRU, and the input feature is Log-Mel spectrograms.
The experimental results in Table 6 and Figure 7 show that the proposed TFFS-CRNN model outperforms the other methods. On the two TUT sound event datasets, FLM + TFFS-CRNN achieved the best (F1, ER) performance, with F1 scores of 60.2% and 66.9% and ERs of 0.40 and 0.42, respectively. Compared with the winning systems of the DCASE challenge, F1 improves by 12.4% and 25.2%, and ER is reduced by 0.41 and 0.37; compared with the baseline systems of the DCASE challenge, F1 improves by 25.9% and 24.1%, and ER is reduced by 0.44 and 0.52. Compared with the latest algorithmic models, the TFFS-CRNN model still shows superior performance. The models marked with "*" are the baseline systems of the DCASE challenge, and those marked with "**" are the winning systems. The feature fusion strategy adds rich feature information, and the TF-attention and FS-attention mechanisms increase the weight of key feature information, thus improving the feature representation ability. The experiments fully demonstrate that combining TF-attention and FS-attention can greatly improve performance in polyphonic SED, as the network can focus on the key information in the features. Methods without attention mechanisms drown out critical temporal-frequency information and key feature dimensions, and the characterization capability of a single feature is weaker than that of the fused feature.
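The feature fusion behind FLM can be sketched as frame-wise concatenation of Log-Mel spectrogram and MFCC features along the feature axis. The frame count and feature sizes below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Hypothetical sizes: 500 frames, 40 Mel bands, 20 MFCC coefficients.
n_frames, n_mel, n_mfcc = 500, 40, 20
log_mel = np.random.randn(n_frames, n_mel)   # stand-in for Log-Mel frames
mfcc = np.random.randn(n_frames, n_mfcc)     # stand-in for MFCC frames

# Fuse the two representations per frame, giving the FLM input feature.
fused = np.concatenate([log_mel, mfcc], axis=1)
print(fused.shape)  # (500, 60): each frame now carries both feature types
```

Because the two feature streams are aligned frame by frame, the fused matrix can be fed to the TF-attention and CRNN modules exactly like a single spectrogram-style input.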

Conclusions
In this study, the TFFS-CRNN model is proposed for polyphonic SED. By introducing the TF-attention and FS-attention mechanisms into the basic CRNN architecture, the TF-attention mechanism can effectively capture key temporal-frequency features during representation learning, and the FS-attention mechanism can effectively enhance the features of important dimensions. Experiments on the DCASE 2016 Task3 and DCASE 2017 Task3 datasets show that the F1-score is improved to 60.2% and 66.9%, and the ER is reduced to 0.40 and 0.42, respectively. The TFFS-CRNN model achieves better classification performance than previous models for polyphonic SED; in particular, it is far superior to the baseline and winning systems of the DCASE challenge. In addition, the BGRU module and FC layer aggregate the states of previous and future moments, thus capturing contextual information. To further improve the generalization capability of the TFFS-CRNN model in polyphonic SED tasks, future work should enhance training on weakly labeled datasets, and semi-supervised or unsupervised models remain worthy of further investigation in SED.