1. Introduction
Acoustic scene classification (ASC) has become an important research topic in recent years [1,2,3]. The goal of ASC is to categorize audio recordings into a set of predefined classes.
Deep neural networks are currently the best-performing techniques in a wide range of applications in computer vision, bioinformatics, medical disease study, robotics, and audio processing [4]. Many network structures and architectures were first developed in the context of computer vision or image classification and later adapted to other application domains. ASC has likewise adopted variants of the two-dimensional Convolutional Neural Network (CNN) from computer vision in recent studies [5,6,7,8]. Variants of VGG [9], ResNet [10], and DenseNet [11] are state-of-the-art architectures from image classification that have been successfully applied to acoustic scene classification problems [5,6,7,8,12]. As with the networks used in computer vision [13], millions of parameters are often required when applying deep neural networks to acoustic scene classification [7]. Such large networks require substantial computational power for training and present challenges for deployment on mobile phones or other low-power devices. As a consequence, low-complexity neural network solutions are a topic of great interest in acoustic classification applications. In this study, we propose a method to decompose a traditional 2D convolution operation [14] into a series of small convolution operators in order to design a low-complexity neural network for acoustic scene classification.
In order to design a smaller network for a given task, one can start by training a large network for performance and then train a network of similar structure with fewer parameters to match the output of the original network on the training set [15]. Alternatively, pruning a network zeros out a large fraction of the network parameters to reduce complexity [16,17,18]. In pruning schemes, a low-complexity network is a model that has a small number of non-zero parameters within the original complex network structure. A pruned network can be obtained by iteratively zeroing out a small fraction of the lowest-magnitude parameters and then retraining the model until a desired compression ratio is reached [16,18].
In contrast, designing network structures that inherently have only a small number of parameters directly reduces the computational cost in both training and inference. MobileNets [13,19] are examples of networks that reduce the number of parameters required while maintaining reasonable performance. Key features of these networks include separable convolutions, depth-wise separable convolutions, and linear bottlenecks [19,20]. Our study aims at finding a low-complexity architecture for audio. The key idea is the distinction between a 2D spectrogram of an audio clip, in which the time and frequency dimensions represent fundamentally different characteristics, and a 2D image, in which both dimensions are spatial. For example, in image classification an image and its transpose would normally be assigned the same class, while this would rarely be the case for a spectrogram and its transposed version.
Our approach shares some similarities with EEGNet-based networks [21,22], which are low-complexity networks used in brain-machine interface (BMI) applications. EEGNet classifies electroencephalography (EEG) signals by using multiple 1D convolutions along the temporal and spatial dimensions in the different layers of the network. The success of EEGNets in providing low-computational-complexity networks with high accuracy in BMI applications supports our idea of exploiting 1D convolutions to design low-complexity networks for acoustic scene applications. There are two main differences between our approach and EEGNets. First, our study focuses on acoustic scene classification from audio recordings, which have their own distinct characteristics. Second, our approach builds a decomposition of 2D convolution into a series of 1D convolutions and then applies the decomposition to high-performance deep neural networks in different acoustic scene applications.
The main contributions of this paper are as follows. First, time-frequency separable convolution is introduced to decompose 2D convolution for acoustic scene classification. Second, we show how to apply time-frequency separable convolution to a given network structure in order to reduce the number of parameters significantly while maintaining similar performance. In our experiments, we reduce the total number of parameters by a factor of 14 for a simple CNN and by more than a factor of 6 for a complex ResNet.
In this study, we demonstrate our contributions through low-complexity architectures on the datasets from DCASE 2020 Task 1 Subtask B [8] and DCASE 2021 Task 1 Subtask A [23]. For the DCASE 2020 Task 1 dataset, we extend the baseline network from DCASE 2020 [8] to work with binaural audio and use it as our baseline for comparison. For the DCASE 2021 dataset, we selected as our baseline the much more complex residual network solution [24], which achieved high performance on that dataset. To make the trade-off clear, we limited architectural changes, so our solutions are obtained mainly by replacing the 2D convolution operations in the baseline networks with our proposed decomposition.
The rest of this paper is organized as follows. First, time-frequency separable convolution is described, and the datasets used in the experiments are introduced. Next, for each experiment, the dataset is explained, the corresponding baseline network is outlined, and the proposed network is described. Lastly, a discussion of the experimental results is followed by the conclusion.
2. Time-Frequency Separable Convolution
When applying a deep neural network to acoustic scene classification, the spectrogram of an audio clip is treated as a 2D image and fed into a deep convolutional neural network. From the sample spectrogram in Figure 1, we can see that the frequency and time axes are not interchangeable; therefore, simply applying 2D convolution operations to an audio spectrogram fails to exploit the unique characteristics of the audio domain. As a result, we propose the time-frequency separable convolution structure.
Before diving into the details of the proposed structure, we first review 2D convolution in an audio context as shown in Figure 2. Given a multiple-channel two-dimensional input $P$, the output at frequency bin $f$ and time step $t$ is defined as

$$O_{f,t} = \sum_{k=1}^{C_{\mathrm{in}}} \sum_{i=f}^{f+F-1} \sum_{j=t}^{t+T-1} W_{i-f+1,\, j-t+1,\, k}\, P_{i,j,k} \qquad (1)$$

where $P_{i,j,k}$ is the input at frequency $i$, time step $j$, and channel $k$; $W_{i-f+1,\, j-t+1,\, k}$ is the corresponding weight of the convolution layer; $C_{\mathrm{in}}$ is the number of input channels to the convolution layer; and $F$ and $T$ specify the size of the convolution operator. Note that $P_{i,j,k}$ is set to zero if $i$ or $j$ is larger than the number of frequencies or time steps of the input, respectively.
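For readers who prefer code, a standard 2D convolution over a spectrogram can be expressed in a few lines of PyTorch. The snippet below is our illustrative sketch, not code from this paper; the tensor layout (batch, channel, frequency, time) and all sizes are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

# Assumed sizes: 2 input channels (e.g., binaural), 64 output channels,
# a 5 x 5 (frequency x time) kernel, and a 128-bin, 400-frame spectrogram.
conv2d = nn.Conv2d(in_channels=2, out_channels=64,
                   kernel_size=(5, 5), padding="same", bias=False)

spec = torch.randn(8, 2, 128, 400)  # (batch, channel, frequency, time)
out = conv2d(spec)                  # -> (8, 64, 128, 400)

# Parameter count: C_in * C_out * F * T = 2 * 64 * 5 * 5 = 3200.
print(sum(p.numel() for p in conv2d.parameters()))
```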
The 2D convolution can be decomposed into the separable time-frequency structure by a two-step process. First, each channel of the input is convolved along the frequency axis before a convolution is applied along the time axis, as shown in Figure 3. Equations (2) and (3) describe these steps in detail:

$$F^{(k)}_{f,t} = \sum_{i=f}^{f+F-1} W^{\mathrm{freq}}_{i-f+1,\,k}\, P_{i,t,k} \qquad (2)$$

$$T^{(k)}_{f,t} = \sum_{j=t}^{t+T-1} W^{\mathrm{time}}_{j-t+1,\,k}\, F^{(k)}_{f,j} \qquad (3)$$

where $W^{\mathrm{freq}}_{k}$ and $W^{\mathrm{time}}_{k}$ are the parameters of the convolutions along the frequency and time axes of input channel $k$, respectively. Second, the outputs from the frequency convolutions and the time convolutions are concatenated to form the intermediate input $I$, as defined in Equation (4):

$$I = \left[F^{(1)}, \ldots, F^{(C_{\mathrm{in}})},\, T^{(1)}, \ldots, T^{(C_{\mathrm{in}})}\right] \qquad (4)$$

The intermediate input $I$, which has $2C_{\mathrm{in}}$ channels, is fed to a $1 \times 1$ convolution as depicted in Figure 4. Equation (5) expresses the final output of the time-frequency separable convolution:

$$O_{f,t} = \sum_{k=1}^{2C_{\mathrm{in}}} W^{1 \times 1}_{k}\, I_{f,t,k} \qquad (5)$$
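To make the decomposition concrete, the following is a minimal PyTorch sketch of Equations (2)-(5) as we read them; it is our illustration, not the authors' released code. The per-channel frequency and time convolutions are implemented as depthwise convolutions (groups=c_in), and "same" zero padding is an assumption used to keep the branch shapes aligned for concatenation.

```python
import torch
import torch.nn as nn

class TFSeparableConv(nn.Module):
    """Sketch of time-frequency separable convolution, Equations (2)-(5)."""

    def __init__(self, c_in: int, c_out: int, f: int = 5, t: int = 5):
        super().__init__()
        # Equation (2): per-channel convolution along the frequency axis.
        self.freq_conv = nn.Conv2d(c_in, c_in, kernel_size=(f, 1),
                                   padding="same", groups=c_in, bias=False)
        # Equation (3): per-channel convolution along the time axis,
        # applied to the frequency-convolved output.
        self.time_conv = nn.Conv2d(c_in, c_in, kernel_size=(1, t),
                                   padding="same", groups=c_in, bias=False)
        # Equation (5): 1 x 1 convolution over the 2*C_in concatenated channels.
        self.pointwise = nn.Conv2d(2 * c_in, c_out, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_out = self.freq_conv(x)             # frequency step, Equation (2)
        t_out = self.time_conv(f_out)         # time step, Equation (3)
        i = torch.cat([f_out, t_out], dim=1)  # intermediate input I, Equation (4)
        return self.pointwise(i)              # final output, Equation (5)
```

Setting groups=c_in makes each 1D kernel operate on a single input channel, which is what keeps the frequency and time branches cheap; the bulk of the remaining parameters sits in the 1 × 1 convolution.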
The proposed convolution structure decomposes the representation into separable time and frequency components, thereby reducing the number of parameters compared to the traditional 2D convolution. For example, if a hidden convolution layer of a neural network has $C_{\mathrm{in}}$ input channels and $C_{\mathrm{out}}$ output channels, where the size of the convolution is $F \times T$, then the total number of parameters of the 2D convolution, $N_{\mathrm{2D}}$, is given by Equation (6):

$$N_{\mathrm{2D}} = C_{\mathrm{in}} \times C_{\mathrm{out}} \times F \times T \qquad (6)$$

while building a similar hidden layer using our time-frequency separable convolution requires the number of parameters $N_{\mathrm{TF}}$ provided by Equation (7):

$$N_{\mathrm{TF}} = C_{\mathrm{in}}(F + T) + 2\,C_{\mathrm{in}} C_{\mathrm{out}} \qquad (7)$$

As a consequence, the compression ratio is provided by Equation (8):

$$R = \frac{N_{\mathrm{2D}}}{N_{\mathrm{TF}}} = \frac{C_{\mathrm{out}}\, F\, T}{F + T + 2\,C_{\mathrm{out}}} \qquad (8)$$
For example, if a hidden layer has $C_{\mathrm{in}} = 64$, $C_{\mathrm{out}} = 64$, $F = 5$, and $T = 5$, then by applying Equations (6) and (7), the numbers of parameters for the traditional convolution and the proposed convolution are 102,400 and 8832, respectively. Hence, the compression ratio in terms of parameters required is roughly 11.5 in the given example.
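The arithmetic can be checked mechanically; the snippet below (assuming the hypothetical TFSeparableConv class sketched earlier) reproduces the counts from Equations (6) and (7):

```python
import torch.nn as nn

c_in, c_out, f, t = 64, 64, 5, 5

n_2d = c_in * c_out * f * t               # Equation (6): 102,400
n_tf = c_in * (f + t) + 2 * c_in * c_out  # Equation (7): 8832
print(n_2d, n_tf, n_2d / n_tf)            # 102400 8832 ~11.59

# The same numbers fall out of the layers themselves:
conv = nn.Conv2d(c_in, c_out, kernel_size=(f, t), bias=False)
tfsc = TFSeparableConv(c_in, c_out, f, t)  # hypothetical class from the earlier sketch
assert sum(p.numel() for p in conv.parameters()) == n_2d
assert sum(p.numel() for p in tfsc.parameters()) == n_tf
```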
The time-frequency separable convolution structure can easily be extended to increase its representational power. For example, a non-linear activation function can be applied to the output of certain stages in the time-frequency convolution structure. In our implementation, batch normalization [25] and rectified linear units (ReLU) [25] are applied before the $1 \times 1$ 2D convolution. Furthermore, when the proposed structure is applied to the input of a model, a 2D convolution of size $F \times 1$, which is a convolution along the frequency axis, should be chosen so that the model can learn extra low-level features in the frequency dimension. Because of the flexibility of the proposed structure, the total number of parameters in some practical implementations can be generalized by adding an extra positive term $E$, as shown in Equation (9):

$$N = C_{\mathrm{in}}(F + T) + 2\,C_{\mathrm{in}} C_{\mathrm{out}} + E \qquad (9)$$
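As one possible instance of such an extension (our assumption of where the extra parameters come from, not the paper's exact configuration), batch normalization contributes learnable scale and shift parameters that land in the term $E$:

```python
import torch
import torch.nn as nn

class TFSeparableConvBN(nn.Module):
    """TF-separable convolution with BN + ReLU before the 1 x 1 convolution.
    The BatchNorm affine parameters are one source of the extra positive
    term E in Equation (9)."""

    def __init__(self, c_in: int, c_out: int, f: int = 5, t: int = 5):
        super().__init__()
        self.freq_conv = nn.Conv2d(c_in, c_in, (f, 1), padding="same",
                                   groups=c_in, bias=False)
        self.time_conv = nn.Conv2d(c_in, c_in, (1, t), padding="same",
                                   groups=c_in, bias=False)
        self.bn = nn.BatchNorm2d(2 * c_in)  # adds 4 * c_in parameters (part of E)
        self.act = nn.ReLU()
        self.pointwise = nn.Conv2d(2 * c_in, c_out, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_out = self.freq_conv(x)
        t_out = self.time_conv(f_out)
        i = self.act(self.bn(torch.cat([f_out, t_out], dim=1)))
        return self.pointwise(i)
```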
5. Discussions and Conclusions
All classes in the datasets used in our experiments are balanced; therefore, classification accuracy is a reasonable performance metric. From the experimental results, we can conclude that in many acoustic scene classification tasks, utilizing the proposed time-frequency separable convolution structure can lead to a neural network model offering high performance while requiring many fewer parameters. In a simple audio classification task, the proposed time-frequency separable convolution structure actually led to improvements in both log loss and accuracy. Given a well-performing model on a complex audio classification task, simply replacing the convolutional layers with time-frequency separable convolutions leads to at least a 6-fold decrease in the number of parameters with only small differences in classification performance. As a result, the time-frequency separable convolution structure is a promising configuration for learning audio features in audio classification applications. In addition, we also show that network architectures employing time-frequency separable convolution can be combined with mix-up data augmentation for additional performance improvement.
The correlation between frequency components at a given time and the patterns of frequencies over a time window in spectrograms are essential features for developing audio classifiers. We believe the proposed time-frequency separable convolution structure forces the networks to capture these features through the convolutions along the frequency and time axes. Therefore, it significantly reduces the number of parameters while maintaining the high performance of the networks in acoustic scene classification. In addition, our proposed networks also suggest that, if 1D convolutions are configured properly, they can be very efficient building blocks for low-complexity solutions in audio applications.
We approached the low-complexity solution by designing a general model with fewer parameters; therefore, other model compression techniques such as pruning can still be applied on top of our proposed model for further reductions in size. An extension of our work could explore the combination of time-frequency separable convolution with pruning techniques to create a model that is smaller in size but still performs well. The time-frequency separable convolution may also benefit other audio-oriented machine learning applications by directly exploiting the time and frequency characteristics of the audio domain.