Highly Accurate Environmental Sound Classification: Sub-Spectrogram Segmentation versus Temporal-Frequency Attention Mechanism

In the important and challenging field of environmental sound classification (ESC), feature representation ability is a crucial and often decisive factor that directly affects classification accuracy. Classification performance therefore depends to a large extent on whether effective, representative features can be extracted from the environmental sound. In this paper, we first propose a sub-spectrogram segmentation based ESC classification framework with score level fusion, and adopt a convolutional recurrent neural network (CRNN) to improve the classification accuracy. By evaluating numerous truncation schemes, we numerically determine the optimal number of sub-spectrograms and the corresponding band ranges. On this basis, we propose a joint attention mechanism combining temporal and frequency attention, and use a global attention mechanism when generating the attention map. Numerical results show that the two proposed frameworks achieve 82.1% and 86.4% classification accuracy, respectively, on the public environmental sound dataset ESC-50, which corresponds to more than 13.5% improvement over the traditional baseline scheme.


Introduction
As a key technology for recognizing and analyzing environmental audio signals, environmental sound classification (ESC) [1] has developed rapidly during the past few years, with broad applications in home automation, machine hearing, and surveillance. Compared with traditional sound classification tasks, such as music or speech recognition [2], progress in this field has been relatively slow. This is because environmental sounds cover a wide frequency range and exhibit non-stationary characteristics and noise-like fluctuations [3][4][5].
Since traditional ESC methods usually consist of feature extraction and feature-based classification stages, a natural way to improve classification accuracy is to enhance each block separately. For example, zero crossing rate, audio tone, and short-time energy were proposed in [6] to improve feature extraction in low-noise environments, although significant computational complexity is usually required to achieve reasonable classification accuracy. To address this, extracting features in the frequency domain and representing environmental sounds with temporal-frequency spectrograms has recently become the most widely used approach [7], building on recent advances in image processing. Mel-frequency cepstrum coefficients (MFCC) [8] are one example, while the log mel spectrogram (Logmel) [9] and the log Gammatone spectrogram [10] have become more popular in recent years. Another approach is to improve the feature-based classification block; typical examples include K-nearest neighbors [11], random forest [12], support vector machines [13], and Gaussian mixture models [14]. With recent developments in supervised learning, feature-based classification algorithms have also been extended to dictionary learning [4], matrix factorization [15], and deep learning based solutions, such as deep neural networks (DNN) [16].
In recent years, the record classification accuracy has been repeatedly improved simply through different combinations of feature extraction methods and feature-based classification blocks [9,[17][18][19]. For example, when MFCC is combined with two different DNN structures, a multi-layer perceptron (MLP) and a convolutional neural network (CNN), classification accuracies of 44.9% and 53.1% can be achieved on the public environmental sound dataset ESC-50, while combining the log Gammatone spectrogram with a CNN reaches 78.9% on the same dataset. However, these schemes fail to incorporate domain-specific knowledge, and the achievable classification accuracy is in general limited. In addition, most existing research focuses on improving the neural networks themselves, which applies to any machine learning field, while few works consider feature processing methods specific to environmental sound. As far as we are aware, the following issues need to be addressed at the current stage.
• Sub-spectrogram segmentation: The spectrograms of environmental sound deserve closer study, because the low-frequency spectrum usually contains richer information, as explained in [10]. Although the straightforward sub-spectrogram segmentation proposed in [20] is effective in improving acoustic scene classification accuracy, its extension to ESC tasks remains open. Moreover, according to the existing literature, the number of sub-spectrogram segments and the truncation rules still need to be optimized;
• Attention mechanism: Another way to improve ESC performance is to incorporate a human-like attention mechanism [21][22][23][24][25] in the convolutional feature layers, using temporal [24], frequency [26], or channel [27] domain information, or a hybrid of them [27]. However, the previous joint attention scheme [27] combines temporal and channel knowledge without considering the frequency domain characteristics, so the joint time-frequency structure is not fully exploited. As shown later, joint temporal and frequency attention greatly improves ESC accuracy;
• Recurrent architecture with data augmentation: Many sounds, such as a helicopter, exhibit strong correlations across consecutive frames in the time domain, making prediction via a recurrent architecture possible. As shown in [28], exploiting correlations among sequences at different scales can also improve classification accuracy. However, this approach usually requires a large amount of data, which conflicts with the limited dataset sizes available. It is therefore necessary to jointly consider effective dataset expansion methods, such as mixup [29] and SpecAugment [30].
In this paper, a sub-spectrogram segmentation mechanism [31] (part of this paper was published in the 2019 IEEE International Workshop on Signal Processing Systems) is first proposed to address the above concerns; it truncates the entire spectrogram into different pieces so that they can be processed separately. Score level fusion is adopted to combine the classification results from different sub-spectrograms. By evaluating numerous truncation schemes, we numerically determine the optimal number of sub-spectrograms and the corresponding band ranges. On this basis, we propose a joint temporal-frequency attention mechanism to adjust the temporal-frequency feature map, which can be regarded as automatically assigning a weight map to the feature map. Numerical results show that the two proposed frameworks achieve 82.1% and 86.4% classification accuracy on the ESC-50 dataset, respectively, which corresponds to more than 13.5% improvement over the traditional baseline scheme.
The rest of this paper is organized as follows. Section 2 gives a brief introduction to the log Gammatone spectrogram and different types of DNNs, and Section 3 provides an overview of our proposed methods. The proposed sub-spectrogram segmentation and temporal-frequency attention based ESC classification frameworks are introduced in Sections 4 and 5, respectively. Section 6 presents the numerical experiments, and Section 7 concludes the paper.

Preliminary
In this section, we give a brief introduction to the well-known log Gammatone spectrogram and different types of DNNs.

Log Gammatone Spectrogram
By performing a T-point discrete short-time Fourier transform (STFT) on the sampled time domain audio signal s(t), the energy spectrum density |S(m, n)|^2 can be obtained, for m ∈ [1, T/2], n ∈ [0, N − 1]. Applying K-order Gammatone-filter banks then yields the log Gammatone spectrogram,

S_GT(n, k) = log( Σ_{m=1}^{T/2} |S(m, n)|^2 |H(m, k)|^2 ), for k ∈ [1, K],

where H(m, k) denotes the frequency response of the kth Gammatone filter in the mth sub-band. The associated time domain impulse response, h(t), is given by

h(t) = C t^{k−1} e^{−2πbt} cos(2π f_0 t + φ),

where f_0 and φ denote the center frequency and the corresponding phase, k and b > 0 denote the filter order and the decay rate, and C is an empirical constant that scales the overall value. In practice, as in [10], {H(m, k)} is selected to model the characteristics over the entire frequency band (f_L, f_H), e.g., from zero to half of the sampling frequency. Moreover, according to [27], the log Gammatone spectrogram is often stored as a two-channel tensor containing the spectrogram S_GT(n, k) itself and its delta information. Figure 1 shows the log Gammatone spectrograms of four typical sound classes.
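As a rough sketch of this feature extraction pipeline, the snippet below computes an STFT energy spectrum and passes it through a filterbank. Note the filterbank here is a hypothetical triangular stand-in for the Gammatone responses H(m, k), not the authors' exact filters; the function names are ours.

```python
import numpy as np

def stft_power(s, n_fft=1024, hop=512):
    """Energy spectrum density |S(m, n)|^2 from a windowed T-point STFT."""
    n_frames = 1 + (len(s) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([s[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_fft//2 + 1)

def triangular_filterbank(n_bins, K):
    """Hypothetical stand-in for the Gammatone responses H(m, k):
    K overlapping triangular filters spread over the n_bins sub-bands."""
    centers = np.linspace(0, n_bins - 1, K + 2)
    bins = np.arange(n_bins)
    H = np.zeros((K, n_bins))
    for k in range(K):
        lo, c, hi = centers[k], centers[k + 1], centers[k + 2]
        H[k] = np.clip(np.minimum((bins - lo) / (c - lo),
                                  (hi - bins) / (hi - c)), 0.0, None)
    return H

def log_spectrogram(s, n_fft=1024, K=64, eps=1e-10):
    """S_GT(n, k) = log( sum_m |S(m, n)|^2 |H(m, k)|^2 )."""
    power = stft_power(s, n_fft)                  # (N, n_fft//2 + 1)
    H = triangular_filterbank(n_fft // 2 + 1, K)  # (K, n_fft//2 + 1)
    return np.log((H ** 2) @ power.T + eps)       # (K, N)
```

Swapping in a true Gammatone filterbank only changes `triangular_filterbank`; the surrounding pipeline is unchanged.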

Deep Neural Networks
In general, DNN refers to a powerful neural network formed by connecting multiple layers of neurons, such as the multi-layer perceptron (MLP), convolutional neural network (CNN), and recurrent neural network (RNN). The design philosophies of MLP and CNN are roughly the same; the difference is that in an MLP the neurons within each layer are isolated while neurons across neighboring layers are fully connected, whereas in a CNN neurons across neighboring layers are connected through convolution kernels and pooling operations. The convolutional architecture allows a CNN to learn local patterns among input elements, for instance, image pixels or environmental sound spectrograms. The RNN addresses the temporal correlation among input vectors or patterns, which is not considered by the above two structures: it can not only use previous frame-level features but also learn complex temporal dynamics. Previous research has shown that DNNs can handle many challenging tasks in the fields of ESC and computer vision by combining different architectures.

Overview of the Proposed Highly Accurate ESC Frameworks
In this section, we present two approaches for environmental sound classification, namely sub-spectrogram segmentation and temporal-frequency attention; an overview of both is shown in Figure 2.

Overview
In general, the ESC task relies on the observed sound signal s(t), or equivalently the energy spectrum |S(m, n)|^2, to distinguish different sound classes. The classification task over N_cls sound classes can be written as

p_{N_cls} = F(s(t)),

where p_{N_cls} = [p_1, p_2, . . . , p_{N_cls}]^T denotes the probability distribution across the N_cls sound classes. In traditional approaches, the non-linear function F(·) is approximated via the equivalent log Gammatone spectrogram and a neural network parameterized by θ, e.g., p_{N_cls} = G({S_GT(n, k)}; θ).

Sub-Spectrogram Segmentation
Instead of generating the log Gammatone spectrogram over the entire frequency band, we truncate the whole spectrogram into N_ss parts, e.g., (f_L, f_1), . . . , (f_{N_ss−1}, f_H), and use score level fusion when making the decision. The overall operation can be described mathematically as

p_{N_cls} = Σ_{i=1}^{N_ss} ω_i p^i_{N_cls},  with  p^i_{N_cls} = G({S^i_GT(n, k)}; θ_i),

where p^i_{N_cls} and ω_i denote the score of the ith sub-spectrogram and the fusion weight, respectively, and Σ_{i=1}^{N_ss} ω_i = 1. S^i_GT(n, k) denotes the log Gammatone spectrogram generated from the ith band, e.g., from f_{i−1} to f_i (for illustration purposes, we define f_0 = f_L and f_{N_ss} = f_H), and G(·; θ) represents a non-linear mapping between the log Gammatone spectrogram and the classification results.
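A minimal sketch of the two operations described above, band truncation and weighted score fusion, might look as follows; the per-band classifiers themselves are omitted, and the function names are ours.

```python
import numpy as np

def split_bands(spec, freqs, cuts):
    """Truncate a (freq_bins, frames) spectrogram at the cut frequencies
    f_1, ..., f_{Nss-1}; `freqs` gives the center frequency of each bin."""
    idx = np.searchsorted(freqs, cuts)
    return np.split(spec, idx, axis=0)

def fuse_scores(scores, weights):
    """Score level fusion: p = sum_i w_i * p^i, with sum_i w_i = 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * si for wi, si in zip(w, scores))
```

Each element of `scores` would be the softmax output of one sub-spectrogram classifier; normalizing the weights inside `fuse_scores` keeps the fused output a valid probability distribution.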

Figure 2. Comparison of the baseline system, the proposed sub-spectrogram segmentation system, and the proposed temporal-frequency attention mechanism system. In this figure, the first branch denotes the baseline system, which extracts log Gammatone spectrogram features over the entire frequency band; the second branch denotes the proposed sub-spectrogram segmentation method, which extracts log Gammatone spectrograms on several sub-frequency bands as illustrated; and the last branch denotes the proposed temporal-frequency attention mechanism (TFAM) based system.

Temporal-Frequency Attention
The above sub-spectrogram segmentation mechanism only considers the frequency domain and ignores the temporal domain characteristics. To address this issue, we propose a temporal-frequency attention mechanism (TFAM), as illustrated in Figure 2. Given the input log Gammatone spectrogram, {S_GT(n, k)}, we first use a CNN to extract temporal-frequency representations. Mathematically, we generate the feature maps M ∈ R^{T×F×C} on top of the log Gammatone spectrograms as

M = H_1({S_GT(n, k)}; θ_1),

where T, F, and C represent the dimensions of the feature maps and H_1(·; θ_1) represents the non-linear transformation provided by the CNN.
To keep the implementation complexity low, we restrict the attention map to the form W_AT ⊗ 1, where W_AT ∈ R^{T×F} denotes the temporal-frequency attention pattern, 1 ∈ R^C denotes an all-one vector of dimension C, and ⊗ is the Kronecker product as defined in [32]. With the generated attention map, the overall operation can be described as

p_{N_cls} = G(M · (W_AT ⊗ 1); θ),

where · denotes element-wise multiplication and G(·; θ) represents the remaining non-linear mapping to the classification results.
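In array terms, multiplying by W_AT ⊗ 1 simply replicates the T × F attention pattern across all C channels, which numpy broadcasting realizes directly. A small sketch with illustrative dimensions:

```python
import numpy as np

T, F, C = 128, 64, 32
M = np.random.rand(T, F, C)       # feature maps from the CNN
W_at = np.random.rand(T, F)       # temporal-frequency attention pattern

# W_AT (x) 1 replicates the T x F pattern across all C channels, so the
# element-wise product applies the same attention weight to every channel.
attended = M * W_at[:, :, None]   # broadcasting realizes the Kronecker product
```

This avoids ever materializing the full T × F × C attention tensor.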

Proposed Sub-Spectrogram Segmentation Based Classification Framework
In this section, we introduce the sub-spectrogram segmentation based feature extraction, the CRNN based classification, and the score level fusion that together form the proposed sub-spectrogram segmentation based classification framework.

Sub-Spectrogram Segmentation
From Figure 1, we can see that the spectrogram behaves quite differently in different frequency ranges. We first divide the entire log Gammatone spectrogram into two parts, with parameter settings N_ss = 2, f_L = 0 kHz, f_1 = 10 kHz, and f_H = 22.05 kHz. Figure 3 shows how the classification accuracy on the ESC-50 dataset [33] changes with ω_1; it can be concluded that, with an appropriate weight assignment, the proposed sub-spectrogram segmentation outperforms the baseline system.
Secondly, we identify the optimal number of sub-spectrogram segments, N_ss, through extensive numerical studies. Specifically, we evaluate the system performance under different values of N_ss and {f_i}; Table 1 lists the results obtained with the optimal weight coefficients {ω_i}. From Table 1, we can see that the accuracy does NOT increase monotonically with N_ss, and the optimal number of sub-spectrogram segments is N_ss = 4.

CRNN with Mixup
Inspired by the complementary modeling capabilities of CNN and RNN, we combine them into a unified architecture called the convolutional recurrent neural network (CRNN), which approximates the original non-linear function G(·; θ). The complementary modeling capabilities refer, respectively, to using convolution kernels with small receptive fields on the spectrogram features to capture local spectro-temporal patterns, and to learning the temporal relationships of the environmental sound features. Specifically, the features learned by the convolutional layers are first forwarded into a bi-directional gated recurrent unit (GRU) for temporal processing, after which the score of the ith sub-spectrogram, p^i_{N_cls}, is obtained. The detailed architecture of the proposed CRNN and its parameters are presented in Table 2. To avoid the overfitting that may be caused by the limited training dataset, we use the data augmentation method mixup to construct virtual training data and thereby expand the training distribution [34]. mixup generates virtual training data by mixing two training samples, e.g., mixing a crying-baby log Gammatone spectrogram and a dog-bark log Gammatone spectrogram into a virtual feature:

{S̃^i_GT(n, k)} = λ {S^i_GT(n, k)}_j + (1 − λ) {S^i_GT(n, k)}_{j′},

where {S^i_GT(n, k)}_j and {S^i_GT(n, k)}_{j′} are two randomly selected log Gammatone spectrogram samples from the training data. The labels are mixed in the same ratio, and the mixing coefficient is drawn as λ ∼ Beta(α, α), where α is a hyper-parameter [34]. Table 3 shows that using CRNN or mixup alone increases the classification accuracy by 2.3% and 3.2%, respectively, while combining them increases the accuracy by 5% over the baseline system.
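The mixup step above can be sketched in a few lines; the function name and argument layout are ours, but the mixing rule and the Beta(α, α) draw follow [34].

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mix two (spectrogram, one-hot label) pairs with lam ~ Beta(alpha, alpha).

    The same mixing coefficient is applied to the features and the labels,
    producing a virtual training sample on the line between the two inputs.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y
```

Because both labels are one-hot, the mixed label remains a valid probability distribution.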

Score Level Fusion
Finally, we experimented to identify the optimal weights, {ω_i}, in score level fusion. The optimal weights can be obtained by exhaustively searching over all possible combinations of {ω_i}, and the resulting classification accuracies are shown in Table 4. According to the results, score level fusion with optimized weights improves the accuracy by 2.3% to 3.9% over the uniform weight assignment.
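An exhaustive search of this kind can be sketched as a grid scan over weight vectors that sum to one, scoring each candidate on held-out data; the grid step and function name here are our own illustrative choices.

```python
import itertools
import numpy as np

def best_fusion_weights(scores, labels, step=0.1):
    """Exhaustive grid search over fusion weights with sum(w) = 1.

    `scores[i]` holds the (n_samples, n_classes) softmax output of the
    i-th sub-spectrogram classifier; the weight vector with the highest
    accuracy on `labels` is returned.
    """
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best_w, best_acc = None, -1.0
    for w in itertools.product(grid, repeat=len(scores)):
        if not np.isclose(sum(w), 1.0):
            continue  # only weight vectors on the simplex
        fused = sum(wi * si for wi, si in zip(w, scores))
        acc = np.mean(fused.argmax(axis=1) == labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```

The search cost grows exponentially with the number of sub-spectrograms, which is precisely the complexity issue motivating the attention based framework in the next section.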

Proposed Temporal-Frequency Attention Based Classification Framework
In the above method, the segmentation boundaries and the number of segments need to be optimized over a multi-dimensional search space, which is in general computationally prohibitive. In this section, we propose a low-complexity joint temporal-frequency searching mechanism to generate the temporal-frequency attention map, and build a temporal-frequency attention based classification framework with data augmentation. The network structure used here is the same as the CRNN described in Section 4.2.
It is worth mentioning that a similar attention method was introduced in [35]. Apart from the way the temporal and frequency attention are concatenated, the main difference is how the temporal and frequency attention maps are obtained. In this paper, we first use a combination of 1 × 1 convolution and pooling to squeeze the channel information, whereas only a 1 × 1 convolution is used in [35]; the results in Table 5 show that the combined approach performs better. We then use a 3 × 3 convolution to learn the attention map from the channel-squeezed feature, whereas a global average pooling is used in [35]; a learnable attention network is usually able to extract more valuable information from the input feature.

Table 5. Classification accuracy comparison for max and average pooling, and 1 × 1 convolution.

Attention Map Generation
In order to efficiently search for the most important temporal-frequency features of an audio spectrogram, we propose a temporal-frequency attention mechanism (TFAM), as shown in Figure 2. Different from the previous sub-spectrogram segmentation based scheme, TFAM directly learns to focus on the most important frames and frequency bands from the training samples, similar in spirit to semantic segmentation in computer vision. By applying TFAM, the most important temporal-frequency blocks are automatically selected by multiplying the feature map with an attention map, which benefits the subsequent classification.
To generate the attention map W_AT introduced in Section 3, we compute

W_AT = H_2(g(t, f); θ_2),

where H_2(·; θ_2) represents the non-linear transformation defined by a CNN, and g(t, f) is the concatenated spatial map. Mathematically, g(t, f) can be obtained as

g(t, f) = maxpool_c(M) ⊕ avgpool_c(M) ⊕ (M *_{1×1} w),

where ⊕ denotes the concatenation operation along the channel axis and *_{1×1} denotes the 1 × 1 convolution operation. Although different combinations of pooling and convolution operations can be applied, the numerical results in Table 5 show that this concatenated approach achieves the best classification accuracy.
Since the frequency domain characteristics of the spectrogram features remain static over different time frames, we process the temporal and frequency domains separately, as proposed in [36], instead of jointly processing them as an image. Through this approach, we extract global temporal and frequency attention vectors, a_T ∈ R^{T×1×1} and a_F ∈ R^{1×F×1}, and generate the final attention map as

W_AT = a_T ⊗ a_F.

To obtain a_T and a_F, we forward g(t, f) into a standard CNN, which consists of three two-dimensional convolutions with 3 × 3 receptive fields for learning the hidden representations and three one-dimensional max pooling layers for reducing the time, frequency, or channel dimension. This process can be written as

a_T = σ(H_3(g(t, f); θ_3)),  a_F = σ(H_4(g(t, f); θ_4)),

where H_3(·; θ_3) and H_4(·; θ_4) represent the non-linear transformations obtained by the CNN, and σ(·) denotes the sigmoid activation function, which restricts the vector elements to the range (0, 1).

To further improve the ESC accuracy, we cascade the proposed TFAM blocks after the different CRNN pooling layers in Table 2; the simulation results are shown in Table 6. A classification accuracy of up to 83.1% is reached when a TFAM block is cascaded after each CRNN pooling layer, which also outperforms the previous sub-spectrogram segmentation mechanism.
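The TFAM data flow above can be sketched as follows. To keep the sketch self-contained, the learned transformations H_2, H_3, and H_4 are replaced by a random channel projection and simple global averages, so this only illustrates the shapes and the sigmoid/outer-product structure, not the trained behaviour.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tfam_attention(M, rng=None):
    """Sketch of the TFAM map for a (T, F, C) feature tensor."""
    rng = rng or np.random.default_rng(0)
    T, F, C = M.shape

    # Channel squeeze g(t, f): max pooling, average pooling, and a 1x1
    # "convolution" (a learned channel mix, here a random projection),
    # concatenated along the channel axis.
    w_1x1 = rng.standard_normal(C) / np.sqrt(C)
    g = np.stack([M.max(axis=2), M.mean(axis=2), M @ w_1x1], axis=2)  # (T, F, 3)

    # Global temporal / frequency attention vectors (stand-ins for the
    # conv + 1-D pooling stacks), squashed into (0, 1) by the sigmoid.
    a_T = sigmoid(g.mean(axis=(1, 2)))   # (T,)
    a_F = sigmoid(g.mean(axis=(0, 2)))   # (F,)

    # Final T x F attention map as the outer product of the two vectors.
    return np.outer(a_T, a_F)
```

Because both vectors lie in (0, 1), every entry of the resulting map is also in (0, 1), so it acts as a soft mask over the feature map.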

Data Augmentation Schemes
The entire network architecture with the CRNN and four TFAM blocks is depicted in Figure 4, where the final learned feature map, M · (W_AT ⊗ 1), is forwarded to a bi-directional GRU for temporal processing. The overall classification result, p_{N_cls}, is obtained from a fully connected network with output dimension 50 × 1.
Since datasets for environmental sound classification are usually limited in size, the SpecAugment [37] and mixup [34] strategies are adopted to increase the diversity of the training samples. SpecAugment applies multiple temporal and frequency masking schemes to generate masked log Gammatone spectrograms, while mixup randomly mixes pairs of training samples to generate virtual log Gammatone spectrograms and extend the training distribution. By jointly utilizing the above data augmentation schemes, we obtain the classification accuracy results listed in Table 7.

Figure 4. Illustration of our proposed environmental sound classification framework with temporal-frequency attention mechanism (TFAM).
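A SpecAugment-style masking pass can be sketched as below; the mask counts and maximum width are illustrative hyper-parameters of ours, not the paper's settings.

```python
import numpy as np

def spec_augment(spec, n_time_masks=2, n_freq_masks=2, max_width=8, rng=None):
    """Zero out a few random frequency bands and time frames of a
    (freq, time) spectrogram, returning an augmented copy."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(n_freq_masks):
        w = rng.integers(0, max_width + 1)           # mask width in bins
        f0 = rng.integers(0, max(1, n_freq - w))     # mask start bin
        out[f0:f0 + w, :] = 0.0
    for _ in range(n_time_masks):
        w = rng.integers(0, max_width + 1)           # mask width in frames
        t0 = rng.integers(0, max(1, n_time - w))     # mask start frame
        out[:, t0:t0 + w] = 0.0
    return out
```

Each call produces a differently masked copy, so applying it per epoch effectively multiplies the training set.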

Experiments
To demonstrate the effectiveness of our proposed schemes, we numerically evaluate them in this section on the public environmental sound dataset "ESC-50" [33]. The ESC-50 dataset contains 2000 environmental recordings belonging to 50 classes in 5 major categories: animals, natural soundscapes and water sounds, human non-speech sounds, interior/domestic sounds, and exterior/urban noises. All audio samples are 5 s long with a 44.1 kHz sampling frequency. All experiments in this paper use five-fold cross-validation.

Experiment Setup
All experiments are evaluated on an Nvidia P100 GPU for a fair comparison, and all models are trained using the Keras library with the TensorFlow backend. In the training stage, we use mini-batch stochastic gradient descent with Nesterov momentum of 0.9 and a learning rate that starts at 0.1 and is reduced by a factor of 10 every 100 epochs. We choose cross entropy as the loss function and set the batch size to 200. Other important parameters are listed in Table 8. In the following scenarios, the simple CNN architecture shown in Figure 2 is used as the baseline, which models the relation between the log Gammatone spectrogram and the final results, and we numerically compare the classification performance of our proposed schemes against it.
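The step-decay schedule described above is a one-liner; this sketch is our reading of the stated schedule, not the authors' code.

```python
def learning_rate(epoch, initial=0.1, factor=10.0, every=100):
    """Step decay matching the stated schedule: start at `initial`
    and divide by `factor` every `every` epochs."""
    return initial / (factor ** (epoch // every))
```

In Keras this could be wired in via `keras.callbacks.LearningRateScheduler(learning_rate)` together with `keras.optimizers.SGD(momentum=0.9, nesterov=True)`, matching the optimizer settings quoted above.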

Effect of Sub-Spectrogram Segmentation
We analyze the results under different N_ss, {f_i}, and {ω_i}, as shown in Table 9. We first choose several candidate cut frequencies f_i and combine them to obtain a variety of configurations, such as different N_ss, or the same N_ss with different {f_i}. We then assign different {ω_i} to each configuration and test the classification performance of the resulting models.
The effect of N_ss has been analyzed in Section 4.1, and the optimal number is N_ss = 4. Here, we also compare the classification accuracy for three ways of selecting {f_i}: more segments in the low-frequency portion, roughly uniform segmentation, and more segments in the high-frequency portion. Table 9 shows that a higher classification accuracy is obtained when more segments are placed in the low-frequency portion. Two conclusions can be drawn from the curve in Figure 3: first, the low-frequency band contains a large proportion of the characteristics of environmental sounds; second, the high-frequency band, although it contains fewer characteristics, is still indispensable for ESC. Therefore, to obtain better performance, we appropriately increase the {ω_i} of the low-frequency segments during fusion; in Table 9, all {ω_i} are optimal for their corresponding configurations.

Accuracy under Sub-Spectrogram Segmentation Based Classification Framework
Furthermore, we compare the classification accuracy under different combinations of mixup, CNN, RNN, segmentation, and score level fusion. As shown in Table 10, the classification accuracy improves when they are used together. Specifically, our highest classification accuracy is 82.1%, an absolute improvement of 9.2% over the baseline system.

Accuracy under Temporal-Frequency Attention Based Classification Framework
We finally combine different strategies to improve the overall classification accuracy, including the CRNN architecture, the data augmentation schemes, and the proposed TFAM blocks. As shown in Table 11, by jointly utilizing all of the above strategies, we achieve a classification accuracy of up to 86.4%, a 3.9% improvement over the sub-spectrogram segmentation based classification framework. This clearly demonstrates the effectiveness of the proposed method.

In addition, Figure 5 shows the confusion matrix for the configuration achieving 86.4% classification accuracy. The confusion matrix displays the correct and incorrect classifications for each class; the horizontal and vertical axes represent the predicted labels and the true labels of the 50 environmental sound classes, respectively. The correspondence between the numbers 1 to 50 and the sound classes is: 1-Dog; 2-Rooster; 3-Pig; 4-Cow; 5-Frog; 6-Cat; 7-Hen; 8-Insect (flying); 9-Sheep; 10-Crow; 11-Rain; 12-Sea waves; 13-Crackling fire; 14-Crickets; 15-…

Finally, we compare the classification accuracy of the proposed method with existing methods, as shown in Table 12. Compared with most existing methods, the proposed method has a clear advantage in classification accuracy.

Conclusions
In this paper, we have proposed two effective environmental sound classification frameworks based on sub-spectrogram segmentation and temporal-frequency attention. The proposed frameworks jointly consider the recurrent network architecture, data augmentation policies, and feature enhancement schemes to improve classification accuracy on ESC-50. Numerical results show that the proposed frameworks achieve 82.1% and 86.4% classification accuracy on the ESC-50 dataset, respectively, which corresponds to more than 13.5% improvement over the traditional baseline scheme.