Environment Sound Classification Using a Two-Stream CNN Based on Decision-Level Fusion

With the popularity of deep learning-based models in various categorization problems and their proven robustness compared to conventional methods, a growing number of researchers have applied such methods to environment sound classification tasks in recent years. However, the performance of existing models that use auditory features such as the log-mel spectrogram (LM) and mel frequency cepstral coefficients (MFCC), or the raw waveform, to train deep neural networks for environment sound classification (ESC) is unsatisfactory. In this paper, we first propose two combined features to give a more comprehensive representation of environment sounds. Then, a four-layer convolutional neural network (CNN) is presented to improve the performance of ESC with the proposed aggregated features. Finally, the CNNs trained with different features are fused using the Dempster–Shafer evidence theory to compose the TSCNN-DS model. The experiment results indicate that our combined features with the four-layer CNN are appropriate for environment sound taxonomic problems and dramatically outperform other conventional methods. The proposed TSCNN-DS model achieves a classification accuracy of 97.2%, which is the highest taxonomic accuracy on the UrbanSound8K dataset compared to existing models.


Introduction
Intelligent sound recognition (ISR) is a technology for identifying sound events that exist in the real environment. It is mainly based on analyzing the characteristics of human auditory awareness and embedding such perceptual ability in machines or robots. Environmental sound classification (ESC), also known as sound event recognition, serves as a fundamental and essential step of ISR. The main goal of ESC is to precisely identify the class of a detected sound, such as children playing, car horn or gunshot. With the popular applications of ISR in audio surveillance systems [1] and healthcare [2], the ESC problem has received increasing attention in recent years. Depending on the properties of the sound source, sound signals can be roughly classified into human voice, music, and environmental sound. Recent developments have brought great improvements in automatic speech recognition (ASR) [3] and music information recognition (MIR) [4]. However, because environmental sounds are considerably non-stationary, such signals cannot be described as speech or music only. We can expect that systems developed for ASR and MIR will be inefficient when applied to ESC tasks. Therefore, it is essential to develop an efficient ISR system for environment sound recognition. appropriate auditory features and novel neural network models to achieve high categorization accuracy for ESC tasks.
In order to address these two deficiencies, we propose a novel four-layer stacked CNN architecture based on two combined auditory features and a DS theory-based information fusion method, called TSCNN. The proposed system consists of three components: feature extraction and combination, CNN training, and DS theory-based decision-level fusion. We extract five auditory features: log-mel spectrogram (LM), MFCC, chroma, spectral contrast and tonnetz (to simplify the description in the rest of the paper, we refer to the last three features as CST). Then, LM and CST are combined into one feature set (LMC), MFCC and CST are aggregated into another (MC), and each set is used to train one of two CNNs. Finally, the outputs of the softmax layers of these two CNNs are fused by DS theory to exploit both combined features. The experimental results indicate that the taxonomic accuracy of the proposed architecture surpasses both LMCNet (the CNN using the LMC feature) and MCNet (the CNN using the MC feature), and also outperforms existing models on the UrbanSound8K [27] dataset. To the best of our knowledge, this is the first time that the classification accuracy of a CNN-based ISR system has exceeded 97% in ESC tasks.
The remainder of this paper is organized as follows. Section 2 introduces related work on environment sound recognition. Section 3 describes the feature extraction and the architecture of the proposed model. The experiment results and detailed analysis are given in Section 4. Section 5 presents the conclusions of our work.

Related Works
With growing evidence that CNN-based models outperform conventional methods in various categorization tasks, they have been applied to sound recognition tasks in recent years. Ref. [28] first evaluated the performance of CNNs in ESC tasks. In that work, an ESC system consisting of a 2-layer CNN with max-pooling and 2 fully connected layers is proposed. Log-mel spectrograms are extracted as the auditory feature to train the neural network. The experiment results indicate that the classification accuracy of this model is 5.6% higher than that of traditional methods. Ref. [12] proposes using a CNN with smoothed and de-noised spectrogram image features in sound recognition tasks. Ref. [29] presents a CNN model using mel-spectrograms as features. The performance of three types of neural network layers as classifiers is investigated: a fully connected layer, a convolutional layer, and a convolutional layer without max-pooling. The results indicate that using a convolutional layer as the classifier outperforms using a fully connected layer. Ref. [24] presents a six-layer CNN model for acoustic event recognition. In that work, log-mel spectrograms with their first and second order derivatives are extracted for each recording without segmentation. Then, multiple instance learning is applied, and the softmax layer is replaced by an aggregation layer to aggregate the outputs of each network. Data augmentation is applied to prevent over-fitting and improve the robustness of the model. CNNs have a strong ability to extract features directly from raw inputs, which has been verified in various image recognition problems. Based on this, Ref. [30] proposes using a CNN to extract features from the raw waveform and using an SVM or extreme learning machine as the classifier in ESC tasks. The results show that this architecture outperforms a CNN trained on MFCC. However, the accuracy is only 70.74%. Ref. [26] uses raw waveforms to train CNNs as well. In that work, the question of how many layers are most suitable for CNNs is studied. Extensive experiments show that deeper networks do not give better performance. Meanwhile, the results also indicate that models using waveforms only achieve performance comparable to that of models using log-mel features.
Traditional CNN models have several drawbacks for auditory tasks. For example, pooling layers are generally applied in CNN models for feature dimensionality reduction; however, pooling can lead to information loss and hinder the performance of neural networks. Therefore, a considerable number of works attempt to use improved CNNs for ESC tasks. Dilated convolution layers are exploited for ESC [31,32] to avoid the above-discussed obstacles. Several works exploit CNN models that were originally developed for image recognition tasks and achieve outstanding performance in ESC as well. In Ref. [25], the environment sound classification accuracy of AlexNet and GoogLeNet [33] is evaluated on the UrbanSound8K, ESC-10 and ESC-50 [34] datasets. Spectrogram (Spec), MFCC and Cross Recurrence Plot (CRP) feature sets are extracted and concatenated as a three-channel image feature to train both models. The experiment results indicate that image recognition models can also obtain good taxonomic accuracy for sound recognition problems. The authors of Ref. [35] propose an end-to-end ESC system using a convolutional neural network. In this model, raw waveforms are used as inputs and two convolution layers are applied to extract features. Then, three max-pooling layers are applied for feature dimensionality reduction, followed by two fully connected layers as the classifier. A VGGNet [36] based ESC system is presented in Ref. [6], where the convolution filters are set to 1-D for learning frequency patterns and temporal patterns, respectively. Ref. [37] proposes a CNN-based model called WaveNet, which uses multi-scale features so that the CNN learns comprehensive information about environment sounds. First, features are extracted from one recording through the first convolution layer using three filter sizes. The second convolution layer uses corresponding pooling strides to equalize the dimensions of these features, and the three features are then concatenated to form the multi-scale features. This feature is further combined with a log-mel spectrogram and performs better than other systems on the ESC-50 dataset. The DS-CNN model presented in Ref. [20] also uses the raw waveform and log-mel spectrogram as inputs to train a CNN-based ESC system. The difference between WaveNet and DS-CNN is that WaveNet combines the two kinds of features together, while in DS-CNN two different CNNs use the raw waveform and log-mel spectrogram as inputs, respectively, and their outputs are fused by DS theory.
From these works, we can see that most ESC models use the raw waveform directly or a single auditory feature to train neural networks. However, after a comprehensive investigation of a considerable number of sound recognition works, Ref. [5] pointed out that aggregated features give better performance than single features in ESC problems. Meanwhile, the classification accuracy reported in these recently published works shows that CNN-based ESC or ISR systems still have great potential for further progress. Hence, we aim to find efficient aggregated features and an appropriate CNN architecture to improve the performance of environment sound categorization.

Two-Stream CNN with Decision-Level Fusion
In this section, we first describe the feature extraction and combination method. Then, the structure of CNN model and DS theory-based information fusion algorithm will be presented.

Feature Extraction and Combination
Several works [4,9,38,39] have shown that aggregated features achieve higher classification accuracy than single features in both ASR and MIR. The same feature combination approach is adopted in our work to classify environment sounds.
As the log-mel spectrogram and MFCC are the most widely used auditory features in sound recognition, these two feature sets are extracted first. Then, chroma [40], spectral contrast [41] and tonnetz [42] are extracted using the Librosa [43] library. The log-mel spectrogram, chroma, spectral contrast and tonnetz are aggregated to form the LMC feature set, and MFCC is combined with chroma, spectral contrast and tonnetz to form the MC feature set. Both feature sets are combined in a linear way, and their time-frequency representations are shown in Figure 1.

Structure of the MCNet and LMCNet
The two networks of TSCNN both contain four convolution layers and one fully connected layer. The framework of the proposed four-layer CNN is shown in Figure 2; the architecture of the model is as follows:
(1) The first layer uses 32 kernels with a 3 × 3 receptive field and a stride of 2 × 2, followed by batch normalization. The Rectified Linear Unit (ReLU) is used as the activation function.
(2) The second layer uses the same settings as the first layer: 32 convolution kernels with a receptive field of 3 × 3 and a stride of 2 × 2, followed by batch normalization and ReLU activation. The difference is that the second layer applies max-pooling for dimensionality reduction of the feature maps.
(3) The third layer uses 64 convolution kernels with a receptive field of 3 × 3 and a stride of 2 × 2, followed by batch normalization and ReLU activation.
(4) The fourth layer uses 64 convolution kernels with a receptive field of 3 × 3 and a stride of 2 × 2, followed by batch normalization and ReLU activation.
(5) The fifth layer is a fully connected layer with 1024 hidden units and a sigmoid activation function.
(6) The output layer has ten units, matching the number of classes in the dataset, followed by the softmax activation function.
At the training stage, we use a dropout probability of 0.5 for the second layer, the fourth layer and the fully connected layer to prevent overfitting. The CNN is trained with a variant of stochastic gradient descent [44]. The batch size is set to 32, all weight parameters are subject to L2 regularization, and the learning rate is set to 0.001 with a momentum of 0.9. Cross-entropy is used as the loss function. At the testing stage, all parameters are the same as at the training stage, except that dropout is not applied.
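The update rule implied by these settings can be sketched in a few lines. The text does not say which SGD variant is used or what the L2 coefficient is, so the sketch below assumes classical momentum, and `L2_COEFF` is a hypothetical placeholder; only the learning rate and momentum come from the text.

```python
# Learning rate and momentum are quoted from the text. The L2 coefficient is
# NOT given in the source, so the value below is a hypothetical placeholder.
LEARNING_RATE = 0.001
MOMENTUM = 0.9
L2_COEFF = 1e-4  # assumption for illustration only


def sgd_momentum_step(weights, grads, velocity,
                      lr=LEARNING_RATE, momentum=MOMENTUM, l2=L2_COEFF):
    """One step of SGD with classical momentum and L2 regularization.

    An L2 penalty of 0.5 * l2 * w**2 adds l2 * w to each weight's gradient.
    Returns the updated (weights, velocity) as new lists.
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g_total = g + l2 * w                  # loss gradient plus L2 term
        v_new = momentum * v - lr * g_total   # accumulate velocity
        new_velocity.append(v_new)
        new_weights.append(w + v_new)         # move along the velocity
    return new_weights, new_velocity


# Toy usage: minimise f(w) = w**2 (gradient 2w), starting from w = 1.0.
w, v = [1.0], [0.0]
for _ in range(5000):
    w, v = sgd_momentum_step(w, [2.0 * w[0]], v)
```

In a real training loop the gradients would come from back-propagating the cross-entropy loss over a mini-batch of 32 samples; the toy quadratic only shows that the update converges.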

Dempster-Shafer Evidence Theory-Based Information Fusion
Dempster-Shafer evidence theory (DS theory), also known as belief function theory, was originally established in Ref. [45]. Like Bayesian probability, DS theory deals with quantified beliefs. Its main idea is the notion of evidence and how different pieces of evidence should be combined in order to make inferences [46].
The basis of DS theory is to establish a frame of discernment $\Theta$ and a set of hypotheses $A = \{A_1, A_2, \ldots, A_n\}$, where $n$ is the number of hypotheses and each $A_i$ is an element of the power set $P(\Theta)$. The mass function, or basic probability assignment, $M$ is a mapping $P(\Theta) \to [0, 1]$ that distributes a mass value to each hypothesis $A_i \subseteq \Theta$. The mass function represents the level of trust in each element itself. A mass function must satisfy two constraints:
(1) $\sum_{A \subseteq \Theta} M(A) = 1$, which means that the masses assigned to all subsets $A$ of $\Theta$ sum to 1.
(2) $M(\emptyset) = 0$, which indicates that the mass function cannot allocate any value to the empty set. A mass function with this characteristic is called a normalized mass function.
In this work, each sound category in the dataset is treated as an element of the subset A under the frame of discernment Θ. Here, n = 10, corresponding to the number of classes in UrbanSound8K, and the elements are mutually independent. To solve reasoning problems, the mass functions representing different pieces of evidence must be combined in a meaningful way. Here, we use Dempster's rule to combine the two mass functions derived from the two CNNs. This combination rule allows normalized mass functions obtained over the same frame of discernment to be combined.
The softmax outputs of LMCNet and MCNet are used as the mass functions $M_1(B)$ and $M_2(C)$. The combination $M_{1 \oplus 2} = M_1 \oplus M_2$ of the two mass functions based on Dempster's rule $\oplus$ is defined as:
$$M_{1 \oplus 2}(A) = \alpha \sum_{B \cap C = A} M_1(B)\, M_2(C), \quad A \subseteq \Theta,\ A \neq \emptyset,$$
where $\alpha = \left(1 - \sum_{B \cap C = \emptyset} M_1(B)\, M_2(C)\right)^{-1}$ is a normalization constant ensuring that the combined mass function is normalized. With the LMCNet, MCNet and the DS theory-based information fusion method, we propose the TSCNN. The overall framework of this ISR system is shown in Figure 3.
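As an illustration of the fusion step, the following minimal sketch applies Dempster's rule to two softmax-style vectors. It assumes that each network assigns mass only to single classes (singleton hypotheses), in which case two hypotheses intersect only when they are the same class, and the rule reduces to an element-wise product followed by renormalization. The function name and the example vectors are hypothetical and not taken from the paper.

```python
def dempster_combine_singletons(m1, m2):
    """Combine two mass functions that are defined only on singleton classes.

    For singletons, B and C intersect only when they are the same class, so
    the combined mass of class i is m1[i] * m2[i], divided by the total
    non-conflicting mass (one minus the conflict between the two sources).
    """
    if len(m1) != len(m2):
        raise ValueError("mass functions must cover the same classes")
    agreement = [a * b for a, b in zip(m1, m2)]
    total = sum(agreement)  # equals 1 - conflict
    if total == 0.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    return [x / total for x in agreement]


# Hypothetical three-class outputs of two networks (not real results).
m_lmc = [0.6, 0.3, 0.1]
m_mc = [0.5, 0.1, 0.4]
fused = dempster_combine_singletons(m_lmc, m_mc)
predicted_class = fused.index(max(fused))
```

In this toy case both sources favour the first class, so the fused mass for it is larger than either individual mass, which is the reinforcing behaviour that decision-level fusion relies on.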

Experiment and Analysis
The UrbanSound8K dataset includes 8732 labeled urban sounds (the length is less than or equal to 4 s) collected from the real-world, totaling 9.7 h. The dataset is separated into 10 audio event classes: air conditioner (ac), car horn (ch), children playing (cp), dog bark (db), drilling (dr), engine idling (ei), gunshot (gs), jackhammer (jh), siren (si) and street music (sm).
The same feature extraction method presented in Ref. [28] is used in this work. All sound clips are converted to single-channel wave files with a sampling frequency of 22,050 Hz. Then, they are divided into 41 frames with an overlap of 50% (each frame is about 23 ms). We use the default channel settings of Librosa to extract the chroma, spectral contrast and tonnetz features. For the MFCC extraction, the first twenty coefficients with their first and second order derivatives are used, resulting in 60-dimensional feature vectors. The number of log-mel spectrogram channels is set to 60, so that its dimension equals that of the MFCC. Thus, both spectrograms are represented as matrices of size 41 × 60. The feature sizes of chroma, spectral contrast and tonnetz are 41 × 12, 41 × 7 and 41 × 6, respectively. Therefore, the sizes of LMC and MC are both 41 × 85. Figure 4 shows a graphical representation of how the features are learned by the proposed four-layer CNN.
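The dimension bookkeeping above can be checked with a short sketch. It assumes that combining features "in a linear way" means concatenating the per-frame vectors along the feature axis; the helper names are hypothetical, and zero-filled placeholders stand in for real features.

```python
N_FRAMES = 41          # frames per input, as stated in the text
LM_DIM = 60            # log-mel channels
MFCC_DIM = 20 * 3      # 20 coefficients plus first and second order derivatives
CST_DIM = 7 + 6 + 12   # total width of the chroma, spectral contrast and tonnetz blocks


def zeros(n_frames, dim):
    """Placeholder feature matrix (n_frames x dim) standing in for real features."""
    return [[0.0] * dim for _ in range(n_frames)]


def concat_features(*blocks):
    """Concatenate feature blocks frame by frame along the feature axis."""
    n_frames = len(blocks[0])
    if any(len(block) != n_frames for block in blocks):
        raise ValueError("all feature blocks must have the same number of frames")
    return [
        [value for block in blocks for value in block[t]]
        for t in range(n_frames)
    ]


def shape(matrix):
    return (len(matrix), len(matrix[0]))


lm = zeros(N_FRAMES, LM_DIM)
mfcc = zeros(N_FRAMES, MFCC_DIM)
cst = zeros(N_FRAMES, CST_DIM)

lmc = concat_features(lm, cst)    # log-mel + CST
mc = concat_features(mfcc, cst)   # MFCC + CST
```

Both `shape(lmc)` and `shape(mc)` evaluate to `(41, 85)`, matching the input size reported for the two networks.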
It can be seen from Figure 4 that the feature maps derived from the first and second convolutional layers have the same size as the input feature. After 2 × 2 max pooling, the size of the input feature maps for the third convolutional layer is 21 × 43. Since max pooling is not performed after convolutional layer 3, the size of the input feature maps for the fourth convolutional layer is 21 × 43 as well. Then, feature maps of size 11 × 22 are derived from the last hidden layer and fed to the fully connected layer, which has 1024 hidden units. The output is a 1 × 10 tensor, since the UrbanSound8K dataset has 10 classes.
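The reported feature-map sizes can be reproduced with simple arithmetic. The sketch below assumes that each size reduction is a 2 × 2 downsampling with ceiling division (as with "same"-style padding), and that a second such reduction follows the fourth convolutional layer; the text reports the resulting 11 × 22 size but does not state how it is obtained, so this is an assumption.

```python
import math


def downsample_same(shape, factor=(2, 2)):
    """Spatial size after downsampling with ceiling division ("same" padding)."""
    height, width = shape
    return (math.ceil(height / factor[0]), math.ceil(width / factor[1]))


input_shape = (41, 85)                    # LMC or MC input
conv2_out = input_shape                   # reported: unchanged through layers 1 and 2
conv3_in = downsample_same(conv2_out)     # after 2 x 2 max pooling
conv4_in = conv3_in                       # no pooling after layer 3
last_maps = downsample_same(conv4_in)     # assumed second 2 x 2 reduction

# With 64 kernels in the fourth layer, the flattened input to the fully
# connected layer would have this many values under the assumptions above.
flattened_size = last_maps[0] * last_maps[1] * 64
```

Under these assumptions `conv3_in` is `(21, 43)` and `last_maps` is `(11, 22)`, which agrees with the sizes quoted from Figure 4.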
For each experiment, ten-fold cross-validation is performed to evaluate the proposed ISR model on the UrbanSound8K dataset. The combined features and the four-layer CNN architecture are the two main contributions of this work. Hence, we first analyze the effectiveness of the CNN model trained with combined features. Meanwhile, the influence of the number of convolution layers (six and eight) on the CNN-based ESC system is also investigated. The additional convolution layers in the CNNs used for comparison have the same receptive field of 3 × 3 and stride of 2 × 2, and batch normalization is performed on each layer with ReLU as the activation function.
Dropout with a rate of 0.5 is applied to the sixth and eighth convolution layers in the two additional CNN models, respectively. Table 1 presents the number of parameters and the memory cost of the CNNs with different numbers of convolutional layers. Furthermore, the classification performance of a feature-level fusion method is also presented. We combine LM, MFCC and CST to form a new feature set called MLMC, in order to further investigate the influence of various feature combination strategies in ESC tasks. The feature size of MLMC is 41 × 145. The spectrogram of MLMC is shown in Figure 5. The experiment results are shown in Tables 2, 4 and 5. Each table presents the class-wise classification accuracy and the average accuracy over ten-fold cross-validation of the three combined features and the proposed TSCNN-DS model on the UrbanSound8K dataset. Table 2 describes the experimental results of each method with four-layer CNN models. We can see that the LMC and MC feature combinations perform well in the four-layer CNN-based ISR system. The taxonomic accuracy of five and six classes is higher than 95% using LMC and MC, respectively. In contrast, aggregating all feature sets does not improve the performance and in fact makes it slightly worse. LMCNet and MCNet achieve 95.2% and 95.3%, which are 22.5 and 22.6 percentage points higher than the model presented in Ref. [28], respectively. The MLMC feature combination has the worst performance among the four models; however, it is still 21.9 percentage points higher than the 72.7% of Piczak's model. For all methods, the classification accuracy of every category is higher than 90%, except for gunshot with LMC and MLMC. The proposed TSCNN-DS model reaches 97.2%, which is 24.5 percentage points higher than Piczak's work, and it significantly improves the classification accuracy of gunshots (95.4%).
Moreover, to further illustrate whether the proposed TSCNN-DS model outperforms LMCNet, MCNet and the four-layer CNN using the MLMC feature set, we show the standard deviation and time cost in Table 3. The classification accuracy obtained by TSCNN-DS is 2% and 1.9% higher than that of LMCNet and MCNet, respectively.
It is also shown in Table 3 that the standard deviation of TSCNN-DS is much smaller than that of the three other methods, which further demonstrates that the fusion model outperforms the three single models. The mean time cost for LMCNet, MCNet, MLMC and TSCNN-DS is 0.023 s, 0.024 s, 0.028 s and 0.077 s, respectively. The test is done in Python under Microsoft Windows 10 x64 on a computer with an Intel Core i7-8700 CPU, two GTX 1080 GPUs (each with 8 GB of memory) and 32 GB of RAM. Although the time cost of the proposed model is roughly three times that of a single neural network, the computational cost of TSCNN-DS is still acceptable for real-time ESC tasks.
It can be seen in Table 4 that the six-layer CNN-based models perform slightly worse than the methods using the four-layer CNN. LMCNet, MCNet, MLMC-CNN and TSCNN are 2.2%, 6.0%, 1.9% and 2.3% worse, respectively, than the four-layer CNN-based models. The categorization accuracy of gunshot is less than 90% for all methods, and less than 80% for the LMC and MC feature sets. The classification accuracy of dog barking with MCNet fails to reach 90%, and the taxonomic accuracy of MCNet on children playing drops dramatically to 69.4%. The MLMC feature does not improve the classification performance either: the accuracy of children playing and gunshot fails to reach 90%. The same situation also appears in the TSCNN model. Nevertheless, the proposed TSCNN model still achieves the best classification result (94.9%).
From Table 5, we can see that the performance of all methods is unsatisfactory with the eight-layer CNN. For most categories, and for all methods on average, the taxonomic accuracy is less than 90%. This indicates that adding more layers may not give better results, and that an appropriate number of layers and suitable parameter settings are the most important components of deep learning models.
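The evaluation protocol behind the averages and standard deviations can be sketched as follows. This is only an illustration: `train_and_eval` is a hypothetical callback standing in for training a network on nine folds and testing on the held-out one, the dummy evaluator returns a made-up constant rather than a real result, and the use of the population standard deviation is an assumption, since the text does not say which estimator is reported.

```python
from statistics import mean, pstdev


def cross_validate(fold_ids, train_and_eval):
    """Hold out each fold once and collect the per-fold test accuracy.

    Returns (mean accuracy, standard deviation, list of fold accuracies).
    """
    accuracies = []
    for test_fold in fold_ids:
        train_folds = [f for f in fold_ids if f != test_fold]
        accuracies.append(train_and_eval(train_folds, test_fold))
    return mean(accuracies), pstdev(accuracies), accuracies


def dummy_train_and_eval(train_folds, test_fold):
    """Hypothetical stand-in that returns a constant; NOT a real accuracy."""
    return len(train_folds) / (len(train_folds) + 1)


folds = list(range(1, 11))  # ten folds
avg_acc, std_acc, fold_accs = cross_validate(folds, dummy_train_and_eval)
```

A smaller standard deviation across folds indicates that a model's accuracy depends less on which fold is held out, which is the sense in which the fused model is described as more stable.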
In general, we find that the proposed LMC and MC features are effective with the proposed ISR system, which demonstrates the advantage of the proposed feature combination strategies in ESC tasks. The TSCNN-DS model outperforms the other models for all CNN architectures with different numbers of convolution layers. Moreover, the four-layer CNN achieves the best taxonomic accuracy compared with the six-layer and eight-layer CNN models for all methods. Meanwhile, the classification accuracy of all methods with the proposed four-layer CNN is higher than that of existing models. These results demonstrate the effectiveness of the TSCNN-DS model based on the proposed four-layer CNN and the DS theory fusion method.
In order to make a comprehensive comparison, we also investigate the two-stream CNN with a layer stacking method. This model combines the outputs of the second convolution layer of both CNNs, and the concatenated feature maps are then used as inputs for the next convolution layers. We test this stacked CNN with 4, 6 and 8 layers as well. The parameter settings of the convolution layers and fully connected layers are the same as in the 4-, 6- and 8-layer CNNs described above. The classification accuracy of these stacked CNNs on the UrbanSound8K dataset is shown in Table 6. Clearly, the stacked four-layer CNN model achieves the highest classification accuracy (86.4%) among the three models, which is 6.6% and 6.3% higher than the stacked six- and eight-layer CNNs, respectively. This result further shows that a proper number of layers and parameters is the key to a deep learning-based ISR system, and further confirms the advantage of the proposed four-layer CNN.
Finally, we compare our TSCNN-DS model with several existing CNN-based ISR models presented in Refs. [6,20,25,28,32,35]. The results are shown in Table 7. LMCNet uses the LMC feature set and achieves an accuracy of 95.2%, which is 22.5% higher than the Ref. [28] model, which uses LM features. Meanwhile, it is 11.5% higher than the Ref. [32] model, which uses a combined LM and Gammatone spectrogram feature. Furthermore, the accuracy of LMCNet is 3% higher than that of the model presented in Ref. [20], which also applies DS theory as a fusion method to fuse two CNN models. The classification accuracy of MCNet is 95.3%, which is much higher than the 72.7% of the model proposed in Ref. [28] and 2.3% higher than the Ref. [25] model with MFCC-based aggregated features. Finally, the proposed DS theory-based TSCNN-DS model obtains the highest taxonomic accuracy (97.2%) among all the ESC models. The performance of our algorithm is much higher than that of the Ref. [28] model and is 5% higher than that of the Ref. [20] model, which uses the same fusion strategy. To the best of our knowledge, this is the first time that the categorization accuracy has exceeded 95% on the UrbanSound8K dataset, and it is the highest accuracy compared with existing models.
Table 7. Comparison of classification accuracy with other models on the UrbanSound8K dataset. Our result is shown in bold.

Conclusions
In this paper, we proposed the TSCNN-DS model for intelligent sound recognition problems. It consists of two four-layer convolutional neural networks, LMCNet and MCNet, trained with two combined features, the LMC and MC feature sets, respectively. The outputs of the softmax layers of both networks are then fused through DS evidence theory; the result is the predicted category of an environment sound. The performance of the two CNNs with the novel combined feature sets and of the entire framework is tested on the UrbanSound8K dataset and compared with existing models published in recent years. LMCNet and MCNet reach 95.2% and 95.3% on the UrbanSound8K dataset, which are 22.5% and 22.6% higher than Piczak's model [28], respectively. Meanwhile, both networks are slightly more accurate than recent ESC models that use the same feature (LM or MFCC) to form a combined eigenvector. These results indicate that the proposed CNN is more effective for environment sound classification tasks, owing to the appropriate parameter settings and the comprehensive representation of sound recordings provided by the combined feature sets. TSCNN-DS achieves 97.2% on the UrbanSound8K dataset, which is 4.2% higher than the state-of-the-art method (the Ref. [25] model) and 5% higher than the Ref. [20] model, in which the same fusion algorithm is used to fuse two CNNs.
DS theory can substantially improve the taxonomic performance of a single CNN model in ESC problems. However, Table 2 shows that the accuracy on repeated discrete sounds (car horn, dog barking and gunshot) is worse than on the other sound classes. This may be caused by the number of convolutional layers, which prevents the model from extracting enough feature maps to comprehensively represent the important information in sound signals. Another possibility is that the features (LMC and MC) neglect some information needed to represent such discrete sound signals. Improving the categorization accuracy on these kinds of sounds with the TSCNN-DS model will be part of our future work. Both new feature extraction methods and novel CNN architectures should be developed to overcome these problems and improve the classification performance. Meanwhile, the computational cost should also be considered in order to build an ISR model that can be applied in real time.