Clustering by Errors: A Self-Organized Multitask Learning Method for Acoustic Scene Classification

Acoustic scene classification (ASC) aims to infer information about the environment from audio segments. Inter-class similarity is a significant issue in ASC, as acoustic scenes with different labels may sound quite similar. In this paper, the similarity relations among scenes are correlated with the classification error. A class hierarchy construction method based on classification error is then proposed and integrated into a multitask learning framework. Experiments show that the proposed multitask learning method improves the performance of ASC. On the TUT Acoustic Scenes 2017 dataset, we obtain an ensemble fine-grained accuracy of 81.4%, which is better than the state-of-the-art. With multitask learning, the basic Convolutional Neural Network (CNN) model is improved by about 2.0 to 3.5 percentage points, depending on the spectrogram type. The coarse category accuracies (for two to six super-classes) range from 77.0% to 96.2% with single models. On the revised version of the LITIS Rouen dataset, we achieve an ensemble fine-grained accuracy of 83.9%. The multitask learning models obtain an improvement of 1.6% to 1.8% over their basic models. The coarse category accuracies range from 94.9% to 97.9% for two to six super-classes with single models.


Introduction
Acoustic scene classification (ASC) refers to the task of associating a semantic label to an audio stream that identifies the environment in which it has been produced [1]. This task takes as input a relatively long sound clip and outputs a predicted acoustic scene class, e.g., home, park, or bus. Classifying scenes by audio data has unique advantages. The recording of audio data is not restricted by camera angle, illumination conditions, etc. As a result, the equipment for sound collection can be installed in a wider range of locations where object occlusion is no longer a problem, and the collection can run indiscriminately in dark environments. Moreover, the storage cost of audio data is relatively low compared to image or video data. Recently, ASC has shown huge potential in many industrial and business applications [2,3], such as surveillance, life-logging, and advanced multimedia retrieval [4].
Inter-class similarity is a common challenge in machine learning research [5]. However, it is particularly prominent in ASC, as scene labels are commonly annotated according to the spatial function of the place where the audio segments are recorded. Consequently, there are audio segments that are quite similar in terms of acoustic characteristics yet are assigned different labels, e.g., the segments of a library and those of an office. It is therefore challenging to distinguish these similar scenes even for humans, and they are often misclassified by machine learning algorithms.
Furthermore, in most cases, misclassification occurs among similar scenes. For example, in the TUT Acoustic Scenes 2017 dataset [6], the scene of the beach is misclassified in most cases as a residential area. Additionally, home is frequently misclassified as a library, and so on. These errors seem forgivable considering the similarities existing among the scenes. For example, pedestrians, laughter, blowing wind, and other sounds exist in both beach and residential areas. The scenes of home and library may have common aspects, for instance, the quietness, low-voice speaker, and phone ring. Hence, acoustic scenes tend to be misclassified as those having similar characteristics.
Based on the above, here we propose learning the similarity relations of acoustic scenes by taking advantage of the classification errors. In our method, we use the spectral clustering algorithm on the confusion matrix. The scene (class) set of a certain acoustic scene dataset is then divided into several subsets according to the similarity relation of the corresponding acoustic scenes. Each subset is assigned a super-class label. Using this approach, a two-level class hierarchy can be easily built in the label space of the acoustic scene dataset.
Ye et al. [7] proposed an acoustic event taxonomy construction approach based on between-dictionary distances. Li et al. [8] also proposed an acoustic scene clustering method using agglomerative hierarchical clustering on deep embeddings extracted by a Convolutional Neural Network (CNN). Their taxonomies heavily depend on the quality of the learnt acoustic features and the selected distance metric. In contrast, our construction approach is a simple solution that does not need any feature embedding.
In this paper, the two-level class hierarchy is further integrated into a multitask learning framework for ASC. To take advantage of the relevance between the super-class (coarse category) and original class (fine-grained category), a regularization method is adopted to optimize the training. Note that multitask learning is not a new idea for ASC. Tonami et al. [9] proposed a multitask learning-based solution for joint analysis of acoustic events and scenes where each sample was given both scene and event labels through manual annotation. Abrol et al. [10] also proposed a multitask model which was trained with hierarchical coarse and fine labels for ASC. They manually created a two-level class hierarchy by arranging the fine scene classes into coarse classes. In our proposed self-organized multitask learning method, the original label space is organized into a hierarchical structure by learning the similarity relationship from the confusion matrix. In our method, manual annotation is not required, and the class hierarchy is constructed automatically solely based on the original dataset. This is the reason that the proposed method is called "self-organized" multitask learning.
The proposed method is evaluated comprehensively on two publicly available datasets: the TUT Acoustic Scenes 2017 and the LITIS Rouen [11] datasets. As shown in Figure 1, there are 15 acoustic scenes and 3 super-classes in the TUT Acoustic Scenes 2017 dataset; it is arranged as a two-level class hierarchy by the original dataset publishers. We compare the constructed class hierarchy with the original one in the experiments. The LITIS Rouen dataset provides single-level classes. The experiments demonstrate that a single-level class dataset can also benefit from the proposed method.

Related Works
Audio classification has become a hot topic in the field of signal processing. As an essential part of audio classification, ASC has been one of the main tasks in the IEEE DCASE Challenges (2013, 2016-2021). In conventional ASC techniques, cepstrum coefficients, as well as other handcrafted audio features, are classified by Gaussian Mixture Models or similar classifiers. More recently, data augmentation has been widely applied to enlarge ASC training sets [31,32]. Salamon et al. [32] augmented the data by deforming the audio signal directly before converting it into log-Mel spectrograms; the applied deformations included time stretching, pitch shifting, dynamic range compression, and background noise addition. Nevertheless, not all augmentation techniques are helpful. Samples augmented far from their originals are harmful to classification performance. To solve this problem, Lu et al. [33] proposed a metric learning-based framework to ensure appropriate augmentation for the training data. In [31], a GAN-based [34] method was used to generate additional samples for ASC; these samples were selected by an SVM hyperplane to ensure augmentation quality.
Zhong et al. [35] proposed a random erasing method for CNN data augmentation. In this method, a rectangular region is randomly selected within an image, and the pixels in the region are replaced with random values. The method is easy to implement, and random erasing keeps most of the information in the original image. As a result, the filtering operation performed in [31,33] to remove harmful augmented samples is not necessary here. Gharib et al. [36] applied a similar random erasing method to ASC and achieved an improvement of 0.13 percentage points over their baseline system.
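A minimal NumPy sketch of random erasing follows. The area and aspect-ratio ranges below are common choices and the function name is our own; they are assumptions, not the exact settings used in [35] or [36].

```python
import numpy as np

def random_erase(img, rng, area_frac=(0.02, 0.2), aspect=(0.3, 3.3)):
    """Erase one randomly placed rectangle in `img` (H x W array) with
    uniform random values. Parameter ranges are assumed, not from [35]."""
    h, w = img.shape[:2]
    out = img.copy()
    for _ in range(10):  # retry until the sampled rectangle fits
        target = rng.uniform(*area_frac) * h * w   # target erased area
        ratio = rng.uniform(*aspect)               # height/width ratio
        eh = int(round(np.sqrt(target * ratio)))
        ew = int(round(np.sqrt(target / ratio)))
        if 0 < eh < h and 0 < ew < w:
            top = rng.integers(0, h - eh)
            left = rng.integers(0, w - ew)
            out[top:top + eh, left:left + ew] = rng.uniform(
                img.min(), img.max(), size=(eh, ew))
            break
    return out
```

Because only a small rectangle is overwritten, most of the original spectrogram content survives, which is why no extra filtering of augmented samples is needed.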
Mixup [37] is another interesting data augmentation method. It constructs a new example as a linear interpolation of two random examples from the training set and of their labels. Xu et al. [18] used a multi-channel CNN for ASC and applied mixup to improve prediction accuracy. In this paper, a class hierarchy construction method is proposed that appends super-class labels to the training examples. The method does not increase the size of the dataset; however, it extends the label space of the samples and thus provides more information. Our experiments demonstrate that multitask learning with a two-level class hierarchy can effectively enhance the generalization of the CNN model. Although both mixup and class hierarchy construction change the label space, mixup converts one-hot labels into ratio-type labels, whereas class hierarchy construction provides additional one-hot labels by constructing super-classes.
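Mixup can be sketched in a few lines of NumPy. The Beta-distribution parameter alpha = 0.2 below is a typical choice in the literature, not necessarily the setting used in [18], and the function name is our own.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup [37]: blend two training examples and their one-hot labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing ratio ~ Beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2       # interpolate the inputs
    y = lam * y1 + (1.0 - lam) * y2       # labels become ratio-type, not one-hot
    return x, y
```

For example, mixing one-hot labels [1, 0] and [0, 1] yields the ratio-type label [lam, 1 - lam], illustrating the label-space change contrasted with the super-class construction above.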

Overview
In this paper, we propose a self-organized multitask learning method. The proposed solution includes two stages: in the first stage, a two-level class hierarchy is automatically constructed using a basic model; in the second stage, the final classifier is obtained by training a multitask learning model using the constructed super-class labels together with the original fine-grained labels. As shown in Figure 2, the proposed method for ASC includes the following four steps: (1) Preparing spectrograms: transforming the raw audio segments into spectrograms that are suitable for CNN models. (2) Getting a basic model: training a single-task CNN model as a basic model using the spectrograms and the original fine-grained scene labels. (3) Constructing a class hierarchy: testing the validation set on the basic model to obtain a confusion matrix; spectral clustering is then performed on the confusion matrix to generate super-classes. (4) Getting the final model: training a multitask CNN model as the final classifier to predict both the original scene class and the constructed super-class using the hierarchical labels.

Spectrograms Generation
To apply CNN models, spectrograms are generated from the audio segments using certain signal processing methods, e.g., the Short-Time Fourier Transform (STFT) [38], the Constant-Q Transform (CQT) [39], and Mel Frequency Cepstral Coefficients (MFCC) [40]. The spectrograms are split into multiple patches and fed into the CNN model. A spectrogram is considered a time-frequency representation of the acoustic scene [41]. As CNN is effective in learning spatially local correlations from images, it can exploit the spectral and temporal information in the spectrograms. However, different spectrogram types emphasize different frequency ranges, which can be used to characterize different acoustic scenes. Therefore, CNN models are widely used as deep feature extractors, and fusion of multiple spectrograms is usually applied in ASC for performance enhancement [17,42,43].
In this paper, we generate three kinds of spectrograms, namely STFT, CQT, and log-Mel spectrograms, and evaluate the proposed method on each of these representations. Details about spectrogram generation are described in Section 4.1.

Basic Model
Using these spectrograms with fine-grained labels, a CNN model is trained to classify the acoustic scenes; the trained CNN model is referred to as the basic model. A VGG-like network [17] is adopted here as the basic model, and its structure is illustrated in Table 1 (the symbol C in the last two columns represents the number of classes). After training on the different spectrograms, several basic models become available. Specifically, three basic CNN models are evaluated in this paper, namely the VGG-STFT, VGG-CQT, and VGG-Log-Mel models.
Without loss of generality, the CNN architecture and the spectrogram type are not specified below. Suppose we are given a set of n training samples TS = {(x_1, y_1^o), . . . , (x_n, y_n^o)} with y_i^o ∈ {1, . . . , C} indicating the fine-grained acoustic scene class label of image x_i, i ∈ [1, n] (namely a spectrogram patch); the superscript o denotes the original labels of the dataset.
The CNN network consists of multiple convolutional and pooling layers. At the end of the network, the output layer uses a softmax activation function to assign a probability to each possible class, so there are C nodes in the output layer. Let P(y_i^o | x_i) be the probability corresponding to the ground-truth class of x_i. There are L nodes in the next-to-last layer, which are mapped to the C output nodes by a fully connected layer. Let W_{v,u} (v ∈ [1, C]; u ∈ [1, L]) denote the weights of the connections between these two layers. A negative log-likelihood loss is adopted in the basic model, i.e., the sum of −log P(y_i^o | x_i) over the training samples.
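The softmax output and the negative log-likelihood loss of the basic model can be sketched in NumPy as follows; the function names are our own, and the mean (rather than sum) reduction is an assumption.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over logits of shape (n, C)."""
    z = z - z.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll_loss(logits, labels):
    """Mean negative log-likelihood of the ground-truth classes.
    `logits`: (n, C) pre-softmax outputs; `labels`: (n,) class indices."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))
```

A confident, correct prediction drives its term toward zero, while a uniform output over C classes yields a loss of log C per sample.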

Super-Class Labels Construction
Constructing the super-classes by merging similar acoustic scenes is a natural and straightforward approach. However, it is difficult to identify scenes with similar acoustic properties. Generally, the audio segments are transformed into some kind of embedding, and distances defined on the embedding space are used to group the scenes into coarse categories. However, learning the embedding and defining the distance are not easy, and the clustering results are hard to explain.
In our research, we use misclassification information to approximate the similarities among classes. Specifically, a certain set (e.g., the validation set) of samples is evaluated on a basic model. These predicted results are counted into a confusion matrix. Finally, a spectral clustering algorithm is applied to construct super-class labels and thus expand the label space. The pipeline is illustrated in Figure 3.

For a certain basic CNN model, such as the VGG-STFT, a confusion matrix F can be calculated, with F_{ci,cj} denoting the number of samples of class c_i that are classified as class c_j by that model. After pre-processing, F is transformed into a matrix D to ensure symmetry. Using this matrix D, we apply spectral clustering [44] to divide the original C classes into N subsets H_1, . . . , H_N. The proposed clustering algorithm is provided in Algorithm 1.
Each subset is assigned a super-class label. The training set TS can then be rewritten as TS = {(x_1, <y_1^o, y_1^e>), . . . , (x_n, <y_n^o, y_n^e>)} with y_i^e ∈ {1, . . . , N} indicating the super-class label of x_i, i ∈ [1, n], where the superscript e denotes the expanded label. The number of super-classes in the above construction (i.e., N) is a hyperparameter and is selected by experiments.

Algorithm 1. Clustering algorithm in super-class generation
Input: the confusion matrix F; the number of clusters N.
Output: super-class clusters H_1, . . . , H_N.
1. Set the diagonal elements of F to zero.
2. Normalize each row of F.
3. Form a diagonal matrix B from the normalized matrix, and proceed with spectral clustering [44] to obtain the N clusters.
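As a concrete illustration, the clustering step can be sketched with NumPy. This is a hedged sketch: the paper's exact row normalization and the spectral clustering variant of [44] are not fully specified in this excerpt, so a standard unnormalized-Laplacian recipe with a simple k-means step is used, and the function and parameter names are our own.

```python
import numpy as np

def super_class_clusters(F, N, seed=0):
    """Cluster C scene classes into N super-classes from a confusion
    matrix F, where F[i, j] counts class-i samples predicted as class j."""
    F = np.asarray(F, dtype=float).copy()
    np.fill_diagonal(F, 0.0)                  # keep only misclassifications
    row_sums = F.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    P = F / row_sums                          # row-normalized confusion rates
    D = (P + P.T) / 2.0                       # symmetric affinity matrix
    B = np.diag(D.sum(axis=1))                # diagonal degree matrix
    L = B - D                                 # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :N]                        # embedding: N smallest eigenvectors
    # simple k-means on the spectral embedding
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), size=N, replace=False)]
    for _ in range(100):
        dists = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([U[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(N)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels                             # labels[c] = super-class of class c
```

Classes that are frequently confused with each other receive high mutual affinity and therefore end up in the same super-class.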

Multitask Learning Model
In the second stage, the constructed two-level class hierarchy is incorporated into a multitask learning framework. The structure of the multitask learning model is illustrated in Figure 4. As can be seen, the discrimination of the super-class becomes an additional task in the classification process; consequently, the basic model is transformed into a multitask learning paradigm. The details are provided in the following.

To make the model aware of the inter-class similarities, we add another output layer onto the basic CNN model, leaving all other details of the model unchanged. The newly added layer has N output nodes and is fully connected to the original next-to-last layer. The weights of the newly added connections are denoted as U_{mj,mi} (mj ∈ [1, N]; mi ∈ [1, L]). We then update the reconstruction error of the new model into a multitask learning form, in which γ ∈ [0, 1] controls the proportion between the original task and the new task. The weight vector W_t = (W_{t,1}, . . . , W_{t,L}) for an original class t should capture similar high-level patterns [28,45] as the weight vector U_{s(t)} = (U_{s(t),1}, . . . , U_{s(t),L}) for the super-class s(t) of class t. Therefore, we introduce a corresponding regularization term into the loss function; the coefficients α and β are set to 0.0001. After performing the self-organized multitask learning method, we obtain boosted models, e.g., VGG-STFT-ML (from VGG-STFT), VGG-CQT-ML (from VGG-CQT), and VGG-Log-Mel-ML (from VGG-Log-Mel). As expected, our experiments confirm that the updated models outperform the basic models.
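With the notation above, the multitask objective can be sketched in the following form. This is a hedged reconstruction rather than the paper's verbatim equations: we assume the reconstruction error is a γ-weighted combination of the two negative log-likelihood terms, that α weights the class/super-class similarity regularizer, and that β weights a generic weight-decay term Ω(θ).

```latex
% Multitask reconstruction error (assumed form): original task vs. super-class task
E_{\mathrm{mt}} = -\gamma \sum_{i=1}^{n} \log P(y_i^{o} \mid x_i)
                  - (1-\gamma) \sum_{i=1}^{n} \log P(y_i^{e} \mid x_i)

% Similarity regularizer: each class weight vector W_t stays close to
% the weight vector of its super-class s(t)
R = \sum_{t=1}^{C} \bigl\lVert W_t - U_{s(t)} \bigr\rVert_2^2

% Final loss (\Omega(\theta) is an assumed standard weight-decay term)
E = E_{\mathrm{mt}} + \alpha R + \beta\, \Omega(\theta)
```

Under this form, setting γ = 1 recovers the basic single-task model, while the regularizer R ties each fine-grained class to its super-class so the two tasks share high-level patterns.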
Note that the CNN model is a building block in the proposed framework; it can be replaced by any other popular CNN architecture, such as ResNet [46] or GoogleNet [47]. The backbone of the multitask learning network need not be restricted to that of the basic model; we keep most of the layers unchanged in the multitask learning model to facilitate performance comparison.
Furthermore, the class hierarchy construction and self-organized multitask learning approaches are not limited to CNN. Hence, similar ideas apply to other models such as RNN/LSTM, DNN, and DBN, and might be suitable for other applications.

Experiment Setup
The TUT Acoustic Scenes 2017 dataset [6] and LITIS Rouen dataset [11] (a revised version) are selected to evaluate the performance of our method. The TUT Acoustic Scenes 2017 dataset includes Development and Evaluation sets. We have trained the model on the Development set and evaluated it on the Evaluation set. We also follow the four-fold split provided by the dataset publishers.
The LITIS Rouen dataset is one of the commonly used publicly available datasets for ASC. However, it tends to provide over-optimistic results, as some examples cut from the same long recordings are distributed into the training set and the test set, respectively. To avoid this "album effect", Rakotomamonjy [48] proposed a revised version of the dataset, which is the version used in our experiments.
Three kinds of spectrograms are generated for the evaluation experiments: STFT, CQT, and log-Mel spectrograms. Spectrograms are generated for each channel (left and right) of the audio clips. To generate STFT spectrograms, the window size is set to 16 ms (706 points) and the hop length to 9.75 ms (430 points) at 44.1 kHz for the TUT Acoustic Scenes 2017 dataset. For the LITIS Rouen dataset, the window size is 32 ms (706 points) and the hop length is 19.5 ms (430 points) at 22.05 kHz.
Logarithmic power spectral densities (10 log 10 PSD) are utilized to plot the spectrograms, which are generated in a one-sided fashion. The sizes of the spectrograms are 1024 × 354 pixels and 1537 × 354 pixels for the two datasets. The spectrograms are divided into patches with a width of 143 pixels and a shift step of 126 pixels. The size of each patch is 143 × 354 pixels. Therefore, we obtain 8 and 12 patches for each spectrogram on the two datasets, respectively.
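The patch splitting described above can be sketched as follows. How the original pipeline handles a tail segment shorter than the patch width is not stated in this excerpt, so this sketch simply drops it; the function name is our own.

```python
import numpy as np

def split_patches(spec, width=143, step=126):
    """Split a spectrogram (freq_bins x time_frames) into fixed-width
    patches using a sliding window with the given shift step. Any tail
    shorter than `width` is dropped (an assumption, see lead-in)."""
    total = spec.shape[1]
    starts = range(0, total - width + 1, step)
    return np.stack([spec[:, s:s + width] for s in starts])
```

With the stated width of 143 pixels and shift step of 126 pixels, consecutive patches overlap by 17 pixels, so no time frames between patch starts are skipped.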
CQT spectrograms are generated using the Python library Librosa 0.5.0. In the generation function, the sampling rate is set to 22.05 kHz, the filter scale to 2, and the number of frequency bins to 110; other parameters are kept at their default values. CQT spectrograms with sizes of 862 × 110 pixels and 1292 × 110 pixels are generated for the TUT Acoustic Scenes 2017 and the LITIS Rouen datasets, respectively. The spectrograms are split into patches with a 143-pixel width and an 80-pixel shift step. Consequently, we obtain 10 and 15 patches per spectrogram per channel on the two datasets, respectively.
We then extract log-scaled Mel-spectrograms with 128 Mel-bands, using a window size of 92.8 ms (2048 points at 22.05 kHz) and a hop length of 46.4 ms. The sizes of the log-Mel spectrograms for the two datasets are 430 × 128 pixels and 646 × 128 pixels, respectively. The patch width is set to 143 pixels and the shift step to 71 pixels, yielding 5 and 8 patches per spectrogram on the two datasets. All patches are resized to 143 × 143 pixels before they are fed into the CNN networks. In addition, patches derived from the two channels are treated as separate samples. Note that the above settings of hop lengths and shift steps were decided and justified in our previous work [17] and are reused here for convenience.
The experiments are implemented on the TensorFlow [49] platform. A mini-batch size of 256 is used, together with an early stopping strategy with a patience of 30 and a maximum of 200 epochs. We use the Adam [50] optimizer with a learning rate of 0.0001. In the following experiments, γ in Equation (4) is set to 0.6 for all the multitask models. Example-level majority voting accuracy is selected as the performance metric.
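The example-level majority voting metric can be sketched as follows. This is a minimal NumPy sketch: the function names are our own, and breaking ties toward the smaller class index is an assumption.

```python
import numpy as np

def majority_vote(patch_preds):
    """Aggregate per-patch predicted class indices into one example-level
    prediction by majority vote (ties broken toward the smaller index)."""
    counts = np.bincount(np.asarray(patch_preds))
    return int(np.argmax(counts))

def voting_accuracy(per_example_patch_preds, true_labels):
    """Example-level accuracy after majority voting over each example's
    patch predictions."""
    votes = [majority_vote(p) for p in per_example_patch_preds]
    return float(np.mean(np.asarray(votes) == np.asarray(true_labels)))
```

Each audio example contributes several patches (from both channels); its final label is the class receiving the most patch-level votes, and accuracy is computed over examples rather than patches.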

Selection of Super-Class Number
In the self-organized multitask learning models, the number of super-classes is an important parameter that is closely related to the performance. For a given dataset with C classes, the maximum and minimum super-class numbers are C−1 and 2, respectively. One option is to train and test the multitask learning models with all possible super-class numbers and select the model with the highest accuracy. This, however, causes a substantial waste of computing resources. To observe the differences, we have implemented the multitask learning model using CQT spectrograms on both datasets with all super-class numbers. The corresponding accuracies are presented in Figures 5 and 6.

As seen in Figure 5, on the TUT Acoustic Scenes 2017 dataset the first five models (with super-class numbers 2, 3, 4, 5, and 6) achieve higher accuracies than the other models. In other words, a multitask learning model with a very large super-class number is not competitive in performance. Similarly, in Figure 6, the first five models (except the one with 4 super-classes) provide higher accuracies on the LITIS Rouen dataset.
Although the decline in performance with increasing super-class number is not pronounced, the best results are still achieved with relatively small super-class numbers. Consequently, only the multitask learning models with two to six super-classes are explored in the following experiments.
Note that, as shown in Figures 5 and 6, the multitask learning models outperform the basic model in most cases.

Evaluations on the TUT Acoustic Scenes 2017 Dataset
Using the proposed method, the basic CNN models are trained on the STFT, CQT, and log-Mel spectrograms. Based on a specific basic model, a confusion matrix is generated for each validation set. To obtain more stable divisions, we repeat the process three times. The final confusion matrix is calculated as the sum of the twelve resulting confusion matrices (three repetitions × four splits), as shown in Figure 7.
Figure 7. Confusion matrices of the basic models; the class indices include forest path (6), grocery store (7), home (8), library (9), metro station (10), office (11), park (12), residential area (13), train (14), and tram (15).
Table 2. Clustering details of the constructed super-classes for the TUT Acoustic Scenes 2017 dataset. The classes marked by the same shape with the same color are grouped into the same super-class.
Based on the confusion matrices, the original 15 acoustic scenes are grouped into two to six super-classes. The division details are listed in Table 2. For example, using the confusion matrix generated by the STFT basic model (Figure 7a), the 15 acoustic scenes can be divided into two super-classes: the classes of bus, car, train, and tram are grouped as one super-class (represented as the blue squares); the other 11 classes are grouped as another super-class (represented as the red circles). Compared with the original super-classes (namely Indoor, Outdoor, and Vehicle categories, see Figure 1a), there are some interesting findings with the constructed divisions. First, for the two super-classes' divisions, each of them keeps one original super-class and merges the other two into another super-class. For the three super-classes' divisions, the results for CQT and log-Mel model are identical to the original Indoor, Outdoor, and Vehicle divisions. The four super-classes' divisions for STFT, CQT, and log-Mel models are the same. In addition, the five super-classes' divisions for CQT and log-Mel models are also identical. In fact, in most cases, the divisions are very similar to each other. The above results confirm the robustness of the class hierarchy construction method.
In general, we rate the divisions for the log-Mel model as the best. For example, its three-super-class division is identical to the original division, and its two-super-class division merges the Outdoor and Vehicle categories into one super-class, which seems more reasonable. We believe this superiority in divisions is due to the high performance of the log-Mel basic model (see Table 3). Hence, the basic classifier and the samples evaluated to create the confusion matrix should be well chosen.

To demonstrate the effectiveness of the proposed multitask learning method, the performances of the multitask models with different super-class numbers using different spectrograms are given in Table 3. The super-class divisions used in these experiments are presented in Table 2. The experiments are carried out three times and the results are reported as the average and standard deviation in percentage. For the STFT models, the best accuracy is 62.0%, obtained by the multitask model with three super-classes; this is an improvement of 2.0% over the basic model. Similarly, an accuracy of 69.5% is achieved by the multitask CQT model with four super-classes, an improvement of 2.0% over the basic model.
The accuracy of the multitask CQT model with three super-classes, whose division is identical to the original Indoor, Outdoor, and Vehicle division, is 68.4%. Thus, the model using the original manually grouped division is outperformed by the one using super-classes generated by the proposed method. This means that even datasets with hierarchical labels benefit from the proposed method.
The same situation can be observed in the log-Mel models. The accuracy of the multitask log-Mel model using three super-classes (which are the same as the original division) is 72.1%; however, the best accuracy among the multitask log-Mel models is 72.8%, an improvement of 3.5% over the basic model. All the multitask models (with two to six super-classes) outperform their corresponding basic models in Table 3.
A one-sided paired t-test was applied to assess the statistical significance of the difference in 15-scene accuracy between each basic model and its best-performing multitask counterpart. The results revealed statistically significant improvements (significance level < 0.05) on the STFT, CQT, and log-Mel models, respectively.
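The test above can be sketched as follows. The accuracy values are illustrative stand-ins, not the paper's raw per-run numbers, and a critical-value comparison replaces a p-value lookup to stay dependency-free.

```python
import math

def paired_t(a, b):
    """One-sided paired t statistic for H1: mean(a) > mean(b)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Illustrative accuracies from three repeated runs (not the paper's numbers).
basic     = [0.692, 0.688, 0.695]
multitask = [0.726, 0.730, 0.728]

t = paired_t(multitask, basic)
# Critical value for df = 2, one-sided alpha = 0.05, is about 2.920;
# the improvement is significant when t exceeds it.
print(t > 2.920)  # True
```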
According to the above confusion matrices, the 19 scene classes in the LITIS Rouen dataset are grouped into two to six super-classes, as shown in Table 4. The divisions indicate the following findings. First, the outputs of the class hierarchy construction method are stable and robust. For example, the three-super-class divisions for the STFT and log-Mel models are the same; likewise, the five-super-class and six-super-class divisions for the STFT and CQT models, respectively, are identical. The two-super-class and four-super-class divisions for the STFT and log-Mel models are very similar as well, differing only in the placement of a single class. Second, the classes bus, train, metro Rouen, car, high-speed train, and metro Paris are grouped into one super-class in almost all cases; this is the equivalent of the Vehicle category in the TUT Acoustic Scenes 2017 dataset. However, the class plane is separated from the Vehicle super-class into a one-element super-class, which seems more reasonable, as the plane is a kind of non-ground transportation. The classes restaurant, billiard pool hall, and student hall are also regularly clustered as a fixed combination. The features they have in common, medium-sized indoor spaces and people's close-talk, may produce similar acoustic characteristics.

Table 4. Clustering details of the constructed super-class for the LITIS Rouen dataset. The classes marked by the same shape with the same color are grouped into the same super-class.

[Figure caption fragment, class indices: ... car (11), tubestation (12), high speed train (13), kid game hall (14), metro Paris (15), billiard pool hall (16), student hall (17), pedestrian street (18), and train station hall (19).]
Table 5 compares the accuracies of the basic models and their corresponding multitask learning models. The experiments are repeated three times and the average and standard deviation of the accuracies are provided. For the STFT models, the best accuracy is 76.0%, achieved by the multitask model with four super-classes, an improvement of 1.6% over the basic model. Additionally, the five multitask models all outperform the basic one. For the CQT models, the multitask model with two super-classes achieves the best accuracy, an improvement of 1.9% over the basic model; again, the five multitask CQT models are all superior to the basic model. For the log-Mel models, the best accuracy of 78.1% is achieved by the multitask model with two super-classes, an improvement of 1.8% over the basic one. Similarly, the results of the t-test revealed the statistical significance (significance level < 0.05) of the accuracy improvement of the corresponding best-performing multitask model over the basic model on the STFT, CQT, and log-Mel models, respectively. According to these results, the classification performance on the LITIS Rouen dataset is significantly improved by constructing super-classes and integrating them into the multitask learning framework. Consequently, the proposed self-organized multitask learning method is also helpful for acoustic scene datasets with single-level labels.
According to the extended super-class labels for each sample, the super-class results predicted by the multitask learning models are also evaluated and shown in Table 5. High accuracies have been achieved on the super-class classification tasks. For instance, for the log-Mel models, an accuracy of 97.9% is achieved for the two super-classes classification and it is 94.9% for the six super-classes classification.
[...] super-classes, and the multitask log-Mel model with two super-classes. The ensemble results, as well as the state-of-the-art results for the two datasets, are displayed in Table 6. As shown in Table 6, the ensemble result using the three best multitask models for the TUT Acoustic Scenes 2017 dataset is 81.4%. This is higher than the accuracies of most of the state-of-the-art techniques listed, except for the model in [31], which used GAN data augmentation. The ensemble result using the three best multitask models for our revised version of LITIS Rouen is 83.9%. The accuracy of Rouen-15 (81.8%) [48] is listed only for a rough comparison, as the evaluated datasets are different. To comprehensively evaluate the multitask learning method, the three basic models are also late fused using the same ensemble method. The ensemble result on the TUT Acoustic Scenes 2017 dataset is 77.8% and that on the LITIS Rouen dataset is 78.1%. Hence, the proposed ensemble results outperform the corresponding basic ensemble results by 3.6% and 5.8%, respectively, on the two datasets.
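The late-fusion step can be sketched as simple probability averaging over the per-model class posteriors. The paper's exact fusion rule is not detailed in this excerpt, so the averaging scheme and the probability values below are illustrative assumptions.

```python
def late_fuse(prob_lists):
    """Average class-probability vectors from several models."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]

# Hypothetical per-model probabilities for one clip over three classes.
stft   = [0.6, 0.3, 0.1]
cqt    = [0.5, 0.4, 0.1]
logmel = [0.7, 0.2, 0.1]

fused = late_fuse([stft, cqt, logmel])
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted)  # 0
```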

Similarity Relationship of Acoustic Scenes
The experimental results confirm our assumption that the similarity relations among acoustic scenes are reflected by the classification errors. For example, according to the confusion matrices (Figure 7), in most cases the classes beach, city center, and park are each misclassified as residential area (for instance, 1167 beach samples are misclassified as residential area in Figure 7a). Additionally, residential area is misclassified as park, grocery store as café, and home as library and vice versa. Likewise, train is misclassified as tram, car as bus, and so on. Similarly, for the LITIS Rouen dataset (Figure 8), in most cases shop is misclassified as market, metro Rouen as metro Paris, and train station hall as market. These scene pairs are similar and understandable from a human point of view, which convinces us that the similarity relationship among scenes can be learned from the confusion matrix.
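Reading similarity off the off-diagonal entries, as done above, can be sketched with a small helper; the labels and counts below are illustrative, not taken from Figure 7 or 8.

```python
def top_confusions(cm, labels, k=3):
    """Return the k largest off-diagonal (true, predicted) label pairs."""
    pairs = [(cm[i][j], labels[i], labels[j])
             for i in range(len(cm)) for j in range(len(cm))
             if i != j]
    return [(t, p) for _, t, p in sorted(pairs, reverse=True)[:k]]

labels = ["beach", "park", "residential area"]
cm = [
    [80, 5, 15],   # rows: true class, columns: predicted class
    [3, 85, 12],
    [2, 10, 88],
]
print(top_confusions(cm, labels, k=2))
# [('beach', 'residential area'), ('park', 'residential area')]
```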

Advantages of the Super-Class Construction Method
Unlike classical acoustic scene/event taxonomy methods [7,8], the proposed super-class construction method depends only on the classification results of a basic classifier; it does not need any feature embedding. The method is simple and effective. As shown in Tables 2 and 3, identical class hierarchies can be obtained by classifiers using different spectrograms. Furthermore, the method does not restrict the type of basic classifier: although a CNN model is applied in this paper, it could be an SVM, a random forest, or another model. In this sense, the proposed method has good robustness and general applicability. On the other hand, because the super-classes are constructed from the similarity relations among scenes, they are more explainable than embedding-based results. Finally, the proposed method is also capable of constructing a multi-level class hierarchy.
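A minimal sketch of error-based super-class construction: symmetrize the confusion matrix into a class-similarity score, then greedily merge the most-confused pair of clusters until the desired number of super-classes remains. The linkage rule (maximum summed confusion between cluster members) is an assumption; the paper's exact clustering criterion may differ.

```python
def build_superclasses(confusion, n_super):
    """Group fine-grained classes into n_super super-classes by confusion."""
    n = len(confusion)
    # Similarity between classes i and j: confusions in both directions.
    sim = [[confusion[i][j] + confusion[j][i] for j in range(n)] for i in range(n)]
    clusters = [{i} for i in range(n)]

    def link(a, b):
        return max(sim[i][j] for i in a for j in b)

    while len(clusters) > n_super:
        # Merge the pair of clusters whose members are most confused.
        a, b = max(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda xy: link(clusters[xy[0]], clusters[xy[1]]),
        )
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters

# Toy 4-class confusion matrix: classes 0/1 and 2/3 confuse each other.
cm = [
    [90, 8, 1, 1],
    [7, 91, 1, 1],
    [1, 1, 88, 10],
    [1, 1, 9, 89],
]
print(sorted(sorted(c) for c in build_superclasses(cm, 2)))  # [[0, 1], [2, 3]]
```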
Although the construction method is proposed for ASC, it can easily be extended to acoustic event clustering and other audio taxonomy tasks.

Foundation of Self-Organized Multitask Learning
As shown in the experiments, the proposed self-organized multitask learning method improves ASC performance over the corresponding basic models. The improvement comes from three sources. First, the constructed super-class labels provide more information for supervised learning. Second, teaching the model to classify the fine-grained category along with the coarse category accords with the cognitive principle of "learning the easy things first" [56]. Third, according to [57], the performance of a harder task can be improved by using information obtained from easier tasks, and predicting the super-class is the easier task here. As shown in Table 5, high super-class accuracies are achieved by our models. For example, the two- to six-super-class accuracies of the multitask learning models using log-Mel spectrograms on the LITIS Rouen dataset are 97.9%, 97.9%, 95.6%, 95.6%, and 94.9%, respectively. Consequently, to keep the auxiliary task easy, the number of super-classes should not be too large. This also explains why the models with larger super-class numbers do not obtain competitive results (see Figures 5 and 6).

Regularization by Similarity Relation
The relevance between super-classes and original classes is expressed as a regularization term in the multitask learning loss function (see Equation (5)). According to our experiments, this regularization slightly improves performance. The evaluation experiments are performed only on the multitask learning models with three super-classes using STFT, CQT, and log-Mel spectrograms on both datasets. The improvements on the TUT Acoustic Scenes 2017 dataset are 0.3%, 0.6%, and 0.8%, and those on the LITIS Rouen dataset are 0.4%, 0.4%, and 0.6%, for the STFT, CQT, and log-Mel models, respectively.
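Equation (5) is not reproduced in this excerpt. The sketch below therefore assumes a plausible form: fine-grained cross-entropy plus coarse cross-entropy plus a consistency regularizer tying the coarse head to the fine-grained probabilities aggregated per super-class. The weights `lam_c` and `lam_r`, the squared-error penalty, and all probability values are illustrative, not the paper's.

```python
import math

def cross_entropy(probs, label):
    return -math.log(probs[label])

def multitask_loss(fine_probs, coarse_probs, fine_label, coarse_label,
                   superclass_of, lam_c=0.5, lam_r=0.1):
    # Aggregate fine-grained probabilities into implied coarse probabilities.
    implied = [0.0] * len(coarse_probs)
    for cls, p in enumerate(fine_probs):
        implied[superclass_of[cls]] += p
    # Squared-error consistency between coarse head and implied distribution.
    reg = sum((c, i) and (c - i) ** 2 for c, i in zip(coarse_probs, implied))
    return (cross_entropy(fine_probs, fine_label)
            + lam_c * cross_entropy(coarse_probs, coarse_label)
            + lam_r * reg)

fine   = [0.7, 0.2, 0.05, 0.05]   # four fine-grained classes
coarse = [0.85, 0.15]             # two super-classes
# Classes 0 and 1 belong to super-class 0; classes 2 and 3 to super-class 1.
loss = multitask_loss(fine, coarse, 0, 0, superclass_of=[0, 0, 1, 1])
```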

Conclusions
In this paper, the similarity relation among acoustic scenes is utilized to construct a two-level class hierarchy. The class hierarchy is further incorporated into a self-organized multitask learning framework. The experimental results show that the proposed multitask learning method can improve the classification performance effectively using different spectrograms.
The best improvements of the STFT, CQT, and log-Mel multitask models over their corresponding basic models are 2.0%, 2.0%, and 3.5%, respectively, on the TUT Acoustic Scenes 2017 dataset. Corresponding to two to six super-classes, the coarse category classification accuracies range from 93.1% to 77.0% for the STFT models, from 96.2% to 84.0% for the CQT models, and from 94.9% to 85.2% for the log-Mel models. By applying the late fusion strategy, a fine-grained category accuracy of 81.4% is achieved on this dataset. On the LITIS Rouen dataset, the best improvements over the corresponding basic models are 1.6%, 1.9%, and 1.8% for the STFT, CQT, and log-Mel multitask models, respectively. The super-class accuracies range from 97.1% to 95.2%, from 97.7% to 95.1%, and from 97.9% to 94.9% for two to six super-classes, for the STFT, CQT, and log-Mel models, respectively. An ensemble fine-grained category accuracy of 83.9% is achieved.
According to the experiments, the following conclusions can be drawn: (1) The similarity-relation-based class hierarchy construction method is effective and reasonable. (2) The constructed class hierarchy can be utilized in multitask learning to improve ASC performance effectively. (3) For a hierarchically arranged dataset, the hierarchy automatically constructed by our method may perform better in ASC than the original one. (4) In self-organized multitask learning, the number of super-classes should be chosen carefully; multitask models with large super-class numbers do not obtain competitive results. (5) The relevance between coarse and fine-grained classes can be utilized as regularization to improve ASC performance. (6) By arranging the class hierarchy, the self-organized multitask learning method provides a feasible way to improve the performance of a given model.
In future work, we will extend this confusion matrix based super-class construction method into the domain of acoustic event taxonomy. Furthermore, to improve the performance of scene identification, the multimodal fusion method will be explored. Specifically, image and sensor data, etc., can be employed to enhance the audio data in the scene identification task.