An Ensemble One-Dimensional Convolutional Neural Network with Bayesian Optimization for Environmental Sound Classification

Abstract: With the growth of deep learning across various classification problems, many researchers have applied deep learning methods to environmental sound classification tasks. This paper introduces an end-to-end method for environmental sound classification based on a one-dimensional convolutional neural network (1D CNN) with Bayesian optimization and ensemble learning, which learns feature representations directly from the audio signal. Several convolutional layers capture the signal and learn filters relevant to the classification problem; each convolutional layer is followed by batch normalization and a max-pooling layer, with a fully-connected layer and a categorical softmax function as the output. The proposed method can handle audio signals of any length, as a sliding window divides the signal into overlapping frames. Hyperparameter selection and model evaluation were accomplished with Bayesian optimization and cross-validation. Multiple models with different settings were developed through Bayesian optimization to ensure network convergence in both convex and non-convex optimization. The UrbanSound8K benchmark dataset of 8732 audio samples was used to evaluate the performance of the proposed end-to-end model. The experiments achieved a classification accuracy of 94.46%, which is 5% higher than existing end-to-end approaches, with fewer trainable parameters. Eight measurement indices, namely sensitivity, specificity, accuracy, precision, recall, F-measure, area under the ROC curve, and area under the precision-recall curve, were used to measure model performance. The proposed approach outperformed state-of-the-art approaches that use hand-crafted features as input on the selected measurement indices and in time complexity.
Our results indicate that the proposed end-to-end Bayesian ensemble 1D CNN is superior and more efficient for environmental sound classification applications, owing to appropriate hyperparameter selection, compared with other state-of-the-art approaches. Sensitivity, specificity, accuracy, precision, recall, F-measure, area under the ROC curve, and area under the precision-recall curve were used as measurement indices of model performance. Statistical analysis reveals that the enhancements are significant, with a classification accuracy of 94.46% on the UrbanSound8K dataset, which is higher than state-of-the-art end-to-end methods by 5.4%. Future research will look at more precise methods of selecting hyperparameters and at additional datasets and repositories. Finding an adaptive way to adjust the search bounds automatically is also recommended, as is investigating hybrid networks, which have shown much better performance in other domains.


Related Work
The research community has paid increasing attention to environmental sound recognition problems with the popularity of CNNs and their applications, which exceed conventional methods on various categorization problems [36,45]. Deep learning [34,35] is a branch of machine learning characterized by a hierarchical multi-level architecture with successive phases of data processing [35]. It can extract and learn features directly from raw waveform signals, which has proven effective on several problems in language and image recognition [55,56]. CNNs are widely applied as a modeling algorithm for multi-variable time-series input data, owing to the challenges of hand-crafted feature-based methods [32,45,46,49,57]. However, most of these methods require the input data to be transformed into a 2D representation, which is fed to CNN architectures such as AlexNet and VGG that were originally developed for object recognition problems [58-60].
According to many studies, CNNs have fewer parameters to optimize than other deep neural networks [61]. Deep learning methods, particularly CNNs, are also easier to train than traditional methods, since they provide a structure that performs feature extraction and classification in a single block instead of performing hand-crafted feature extraction and classification separately. Many applications consider the log-mel feature of audio signals an effective audio recognition representation: it is calculated for each sound frame from the magnitude of each frequency band, and the features of each frame can be arranged along the time axis to create a feature map [62]. This allows convolutional layers in CNNs to be used as a classifier over the log-mel feature map. This method was first proposed for environmental sound classification by Piczak (2015) [45], who extracted the log-mel and delta-log-mel features from each frame, built a two-dimensional static feature map from them, and then used a two-channel CNN for classification.
Similarly, Salamon and Bello (2017) [32] used a two-channel classification network with a log-mel feature map. They modified the network structure to consist of three convolutional layers and one fully-connected layer. In addition, they utilized data augmentation techniques to increase the variation of the training data and train the network efficiently, which enhanced classification accuracy by 6%. In the same year, Dai et al. (2017) [63] introduced several deep convolutional models that incorporate residual learning and apply down-sampling and batch normalization in the initial CNN layers, achieving an accuracy of 72% on the UrbanSound8K dataset.
Meanwhile, many researchers have attempted to learn features automatically from raw waveforms [64,65]; however, these attempts could not outperform the classification accuracy obtained with log-mel features [49]. For instance, the experiments of Boddapati et al. (2017) [58] achieved an average accuracy of 92.5% by using mel-frequency, spectrogram, and cross-recurrence representations alongside AlexNet and GoogLeNet as classifiers with 2D input representations. One of the primary benefits of 2D representations is their efficiency in summarizing waveforms with high-dimensional spectrograms into a compact representation. However, they rely on large amounts of data to learn without overfitting, which makes sound classification with two-dimensional CNNs difficult [47].
1D CNNs have been widely used in major engineering applications, especially those focused on signal processing, where their state-of-the-art performance has been highlighted and verified along with their unique properties [34]. Studies in end-to-end environmental sound research remain limited, however. Hoshen et al. (2015) [50] created an end-to-end multi-channel 1D CNN for audio recognition, in which the time difference between channels indicates the input position, with an error rate of 27.1% for a single channel. They stated that learning directly from the sound waveform matches log-mel filter-bank performance. Tokozume et al. (2017) [49] developed a method named between-class learning, which mixes two audio samples and then trains neural networks to predict the mixing ratio of these samples. According to their experiments, that method improved efficiency for different architectures used in sound recognition tasks. In addition, using the between-class learning approach and a 1D CNN, they developed an end-to-end method that outperforms conventional learning techniques on various datasets.
End-to-end applications using 1D CNNs are becoming common in signal processing because of their structure, which can learn from waveforms directly [34,50]. Zhu et al. (2017) [66] proposed an end-to-end speech recognition approach based on waveforms and multi-scale convolutions that learns the input data representation directly from the signal. They used several 1D convolutional layers with different kernels to extract features and then concatenated the features with a pooling layer to ensure a consistent sampling frequency; their experiments reported an error rate of 23.28%. Ravanelli and Bengio (2018) [44] developed another end-to-end technique for speech recognition in which the low and high cut-off frequencies of the filters are learned from the data. Their model reduced the number of parameters by learning effective filters in the first layer, achieving a sentence error rate of 0.85% on the TIMIT dataset. In the same year, Zeghidour et al. [67] developed an end-to-end architecture for speech recognition using a 1D CNN with learnable filter-banks instead of mel-filters. Furthermore, Abdoli et al. (2019) [47] proposed a sound classification method based on a 1D CNN, using a gamma-tone filter bank for feature generation; their method achieved a classification accuracy of 89% on the UrbanSound8K dataset.
It has been suggested that hybrid models integrating two or more models can achieve higher prediction accuracy [17]. A robust prediction can be obtained by combining several CNNs that learn from raw audio waveforms as one or two representations of the signal. Li et al. (2018) [23] combined two different networks: one learns from the audio waveform directly, and the second uses log-mel features to learn high-level representations of the waveform. Both models are trained independently, and the two models' predictions are combined with the Dempster-Shafer method. This ensemble method achieved an accuracy of 92.2% on the UrbanSound8K dataset, a 2% improvement in recognition accuracy over then-current end-to-end models.

Methodology
Deep learning neural networks offer flexibility and scalability that increase in proportion to data availability. A downside of this versatility is that they are sensitive to the details of the training data, and each run of the stochastic learning procedure can find a different set of weights that generates different predictions; we refer to this as the high variance of neural networks [35]. Creating several models rather than a single model and combining them is a practical approach to reducing this variance. It decreases the uncertainty of predictions and can lead to more stable, and sometimes better, predictions than those of each member model. CNNs have made substantial progress on numerous recognition problems by replacing manually engineered feature techniques. The model is built as a sequence of layers, starting with convolutional and pooling layers. In the convolutional layers, units are organized into feature maps and linked to local patches in the feature maps of the previous layers, and the result is passed through a nonlinear activation function such as ReLU [39]. In this way, CNNs capture data that is correlated within local groups of values and invariant to location. Several convolutional layers, pooling layers, and nonlinear activation functions are stacked one after another, followed by dense layers, and finally a fully-connected layer with a softmax function as the output classification layer.
The proposed network was developed to optimize the network parameters and map the input data according to the hierarchical feature extraction mechanism of the convolutional layers. To increase the region covered by subsequent receptive fields, we stacked pooling layers after each convolutional layer and before the batch normalization layers. The output of the final convolutional layer is then flattened and used as the input to several stacked fully-connected layers. The primary drawback of 1D CNNs is that input samples must have a fixed length, whereas sounds captured from the environment can have different durations. Splitting the audio signal into multiple fixed-length frames with a sliding window of suitable width is one way to overcome this restriction, so we applied a variable-width window in our approach to adapt each audio signal to the proposed model's input layer. The width of the window depends mainly on the signal's sample rate, and sequential audio frames can overlap to make the most of the data. Following Abdoli et al. (2019) [47], a sample rate of 16 or 18 kHz is considered an acceptable compromise between input sample quality and the computational complexity of the model. Our proposed architecture aims to manage variable-duration audio and learn a distinctive representation directly from the audio waveform that obtains acceptable classification performance on various environmental sounds.
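The sliding-window framing described above can be sketched in a few lines of numpy. The window duration and 50% overlap below are illustrative choices, not the exact values from the paper:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, win_dur=1.0, overlap=0.5):
    """Split a variable-length 1D audio signal into fixed-length,
    overlapping frames suitable for a 1D CNN input layer."""
    win = int(sample_rate * win_dur)          # frame width in samples
    hop = int(win * (1.0 - overlap))          # stride between frames
    if len(signal) < win:                     # pad short clips with zeros
        signal = np.pad(signal, (0, win - len(signal)))
    n_frames = 1 + (len(signal) - win) // hop
    return np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])

# A 2.5 s clip at 16 kHz yields 4 overlapping 1 s frames.
clip = np.random.randn(40000)
frames = frame_signal(clip)
print(frames.shape)  # (4, 16000)
```

Each row of the result is one fixed-length input sample; frame-level predictions can later be aggregated per clip.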

Ensemble Learning
Ensemble learning has become a hot topic recently, and many studies have demonstrated that its performance is superior to that of a single classifier [68,69]. Ensemble machine learning algorithms combine the predictions of several learning models into a single "ensemble" model to maximize efficiency [68]. Predictions that are good in different ways result in a more stable, and often better, prediction than that of any individual member. Bagging, boosting, and stacking are traditional approaches to ensemble learning [70]. Although ensemble classification often achieves better accuracy than an individual classifier, the enhancement comes at the cost of computational time [71]. In essence, ensemble learning is proposed to decrease the randomness of the outcome of a one-time prediction. Multiple predictors can be obtained in two primary ways: (i) by using various algorithms to process the same dataset, or (ii) by using the same machine learning algorithm with changes to its hyperparameters. In general, changing an algorithm's hyperparameters is crucial for optimization and is not related to integration. Ensemble prediction models can therefore be divided into two types, heterogeneous and homogeneous ensemble models [72], which correspond to (i) and (ii), respectively.
Training each model on different folds of the training data is one way of achieving differences between models. Models are naturally trained on different subsets of the training data using resampling methods such as cross-validation and the bootstrap [73]. For a deep neural network, learning the weights involves solving a high-dimensional non-convex optimization problem [74]. The difficulty in solving this optimization problem is that there are many "good" solutions, and the learning algorithm can bounce around and settle in one of them [75]. This is referred to as convergence to a stochastic optimization solution, where a collection of unique weight values defines a solution. Stacking is a type of heterogeneous ensemble strategy: it combines many different base classifiers into a robust meta-classifier to increase the generalization potential of the resulting strong classifier. The ensemble approach takes advantage of the learning abilities of both the base classifiers and the meta-classifier, significantly improving classification accuracy, as shown in Algorithm 1.

Algorithm 1. Stacking ensemble learning.
1: Input: training set D; base learners h_1, ..., h_T
2: Step 1: learn the base classifiers
3: For t = 1 to T Do:
4:   Learn h_t based on D
5: End For
6: Step 2: build up a new dataset for prediction
7: For i = 1 to m Do:
8:   Collect the base-classifier predictions for sample i as new features
9: End For
10: Step 3: learn the meta-classifier H on the new dataset
11: Return H
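The three steps of stacking can be sketched with scikit-learn. The base classifiers, meta-classifier, and toy dataset below are stand-ins chosen for illustration, not the models used in the paper, and for brevity the meta-features are built on the training data itself rather than on out-of-fold predictions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for extracted audio features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: learn base classifiers h_t on D.
base = [DecisionTreeClassifier(random_state=0), GaussianNB()]
for h in base:
    h.fit(X_tr, y_tr)

# Step 2: build a new dataset from the base-classifier class probabilities.
meta_tr = np.hstack([h.predict_proba(X_tr) for h in base])
meta_te = np.hstack([h.predict_proba(X_te) for h in base])

# Step 3: learn the meta-classifier H on the new dataset.
H = LogisticRegression(max_iter=1000).fit(meta_tr, y_tr)
print(H.score(meta_te, y_te))
```

In a full implementation, Step 2 would use cross-validated (out-of-fold) probabilities to avoid leaking training labels into the meta-learner.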

Proposed Architecture
Ensemble models can achieve lower generalization error than individual models, but they are difficult to construct with deep neural networks because of the computational cost of training each model. Alternatively, multiple model snapshots can be saved during a single training run and their predictions combined to create the ensemble classifier. However, this method has drawbacks: the saved models may be nearly identical, leading to similar prediction errors, so that combining their predictions brings little benefit. Using an aggressive learning rate to force significant changes in the model weights between snapshots is one approach to promoting diverse models saved during a single training run. Several callbacks were used to monitor and track the models during training. We used checkpointing to save the network weights only when classification accuracy on the validation dataset improved; this lets us capture the model at the end of an epoch or batch during training. Pseudocode of the proposed method is shown in Algorithm 2.

Algorithm 2. Proposed ensemble training with checkpointing.
1: Split the dataset into k folds
2: For Each fold in k folds Do:
3:   For Each predictor in the ensemble Do:
4:     Train the learner on the training set of the fold
5:     Validate class probabilities from the learner on the fold
6:     Create a prediction matrix of class probabilities
7:   End For
8: End For
9: Calculate probabilities across learners
10: Get the loss value of the loss function with the probabilities
11: If loss < previous loss Then: save the model weights
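The checkpoint-on-improvement rule can be shown with a minimal pure-Python sketch; the "training" step and validation metric are random stand-ins, not the paper's network:

```python
import copy
import random

def train_with_checkpointing(epochs=10, seed=0):
    """Minimal sketch of the checkpoint rule: after each epoch,
    keep a copy of the weights only if the validation loss improved."""
    random.seed(seed)
    weights = [0.0] * 4                      # stand-in for network weights
    best_loss, best_weights = float("inf"), None
    for _ in range(epochs):
        weights = [w + random.gauss(0, 0.1) for w in weights]  # "training" step
        val_loss = random.uniform(0.3, 0.6)                    # stand-in metric
        if val_loss < best_loss:             # checkpoint only on improvement
            best_loss = val_loss
            best_weights = copy.deepcopy(weights)
    return best_loss, best_weights

loss, w = train_with_checkpointing()
print(loss <= 0.6, len(w))  # True 4
```

In Keras, this behavior corresponds to the `ModelCheckpoint` callback with `monitor='val_accuracy'` and `save_best_only=True`.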

Experiments
This section provides an overview of the experimental environment, data preprocessing, and hyperparameter configurations. The proposed architecture aims to manage variable-duration audio signals and learn directly from the audio waveform with strong classification performance on various environmental sounds. There are six kinds of layers in the proposed model: (i) an input layer; (ii) convolutional layers for feature extraction from the input data; (iii) max-pooling layers for reducing dimensionality and enhancing the robustness of selected features; (iv) a flatten layer to convert the feature maps into a single array; (v) a fully-connected layer to integrate the extracted features; and (vi) a categorical softmax function as an output layer to represent the distribution over the different classes. In addition, we divided the dataset using five-fold cross-validation following the technique used in [47], and a single training fold was used as a validation set for tuning the hyperparameters of the Bayesian models. We implemented our network using TensorFlow/Keras [76], taking advantage of its native support for asynchronous execution and its flexibility.
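The flow through the six layer types can be illustrated with a minimal numpy forward pass. The filter counts, kernel size, and input length below are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, filters):
    """(ii) convolution: each row of `filters` is one learned kernel (ReLU)."""
    k = filters.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, k)  # (T-k+1, k)
    return np.maximum(windows @ filters.T, 0).T               # (F, T-k+1)

def max_pool(fmap, size=2):
    """(iii) max-pooling halves the time dimension of each feature map."""
    t = fmap.shape[1] // size * size
    return fmap[:, :t].reshape(fmap.shape[0], -1, size).max(axis=2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.standard_normal(128)                              # (i) input frame
fmap = max_pool(conv1d(x, rng.standard_normal((8, 3))))   # (ii) + (iii)
flat = fmap.ravel()                                       # (iv) flatten
W, b = rng.standard_normal((10, flat.size)), np.zeros(10)
probs = softmax(W @ flat + b)    # (v) fully-connected + (vi) softmax over 10 classes
print(probs.shape, round(probs.sum(), 6))  # (10,) 1.0
```

The actual model stacks several such convolution/pooling stages and learns the filter and dense weights by backpropagation in TensorFlow/Keras.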

Feature Extraction
Several studies have shown that aggregated features achieve higher environmental sound classification accuracy than single speech-recognition features [77]. Previous studies have found that feature extraction techniques such as log-mel, mel-frequency cepstral coefficients, and the mel spectrogram are the most suitable and frequently used auditory features in sound recognition [40,78]. They capture different patterns in the audio data, so networks based on these inputs can exploit their complementary relationships to further enhance recognition. All feature sets are combined linearly alongside their time-frequency representations [41], and we used the same feature combination in our work. The Librosa [79] package was used for feature extraction with 60 bands to cover the frequency range of the sound segments, representing each segment's features as (frequency × time × channel).
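The (frequency × time) layout of such feature maps can be shown with a numpy-only log-spectrogram; this is a simplified stand-in for the log-mel features the paper computes with librosa (no mel filterbank, and the FFT/hop sizes are illustrative):

```python
import numpy as np

def log_spectrogram(signal, n_fft=1024, hop=512):
    """Log-magnitude spectrogram as a (frequency x time) feature map;
    a numpy stand-in for the log-mel features computed with librosa."""
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    frames = frames * np.hanning(n_fft)        # taper each frame
    mag = np.abs(np.fft.rfft(frames, axis=1))  # (time, frequency)
    return np.log1p(mag).T                     # (frequency, time)

sig = np.random.randn(22050)                   # 1 s of audio at 22.05 kHz
fmap = log_spectrogram(sig)
print(fmap.shape)  # (513, 42)
```

Stacking several such maps (e.g., log-mel, MFCC, mel spectrogram) along a third axis gives the (frequency × time × channel) representation described above.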

Dataset and Preprocessing
A large amount of training data is needed in multi-class sound classification problems, particularly with a high-dimensional feature vector. The dataset in this paper comes from UrbanSound8K [53] and includes 8732 real-world urban sound clips in ten classes (each with a duration of at most 4 s), totalling 9.7 h. The dataset covers ten categories of audio events: street music (SM), jackhammer (JH), siren (SI), drilling (DR), engine idling (EI), dog bark (DB), air conditioner (AC), car horn (CH), children playing (CP), and gunshot (GS), with 126 male and 125 female speakers, as shown in Table 1 and Figure 1. All sound clips are converted to single-channel wave files at a frequency of 22,050 Hz. Similar to the augmentation procedure in [80], we set the window size to 1024 (approximately 47 ms), use half the window size as the hop size, and divide the signal into 50% overlapping segments. Data preprocessing includes normalization, splitting, and transformation. We used a stratified split strategy to ensure a fair distribution of the categories across the various subsets. The dataset was randomly divided into five subsets of equal size: we split the dataset with k-fold cross-validation (k = 5) and use 20% of the training folds for validation and hyperparameter tuning. We carried out cross-validation of our networks five times. Our approach's data splitting and transformation procedure is given in Equation (1),
where δ_n^m represents an instance of the 1D signal features, ζ_n^m denotes the class output, 1 ≤ i ≤ m, 1 ≤ j ≤ n, m is the size of the dataset, and n is the size of the input features.
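The stratified five-fold split described above can be sketched in numpy; the round-robin fold assignment below is one simple way to keep every class evenly represented in every fold (scikit-learn's `StratifiedKFold` provides the same guarantee):

```python
import numpy as np

def stratified_folds(labels, k=5, seed=0):
    """Assign each sample to one of k folds so that every class is
    spread evenly across the folds (a stratified split)."""
    rng = np.random.default_rng(seed)
    folds = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        folds[idx] = np.arange(len(idx)) % k   # round-robin over folds
    return folds

labels = np.repeat(np.arange(10), 100)         # 10 classes, 100 clips each
folds = stratified_folds(labels)
# every fold holds exactly 20 clips of every class
counts = [np.bincount(labels[folds == f], minlength=10) for f in range(5)]
print(all((c == 20).all() for c in counts))  # True
```

For each cross-validation run, one fold serves as the test set and 20% of the remaining training folds as the validation set.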

Hyperparameter Tuning
Hyperparameter tuning is the procedure of determining the values of the network configuration that contribute to more accurate classification [81]; it strongly influences machine learning algorithms. Hyperparameter tuning is typically done manually (trial and error). However, this approach takes time, as it tests all the possible hyperparameters that may result in acceptable performance. Other approaches, such as grid search [82,83], genetic algorithms [84], and particle swarm optimization [85,86], are widely used as hyperparameter tuning methods. Grid and random search divide the hyperparameter values into several intervals to build a grid over the search space, then cross over all grid points to find the best hyperparameter values. This requires that many hyperparameters be tested routinely by automatically re-training the model for each hyperparameter value [82], which makes it useful for mapping the problem space and revealing more optimization possibilities. However, these methods and metaheuristic algorithms are limited at large network scales, and they also take time because they require several experiments to estimate each value.
Many researchers have been attracted to BO as an efficient method for tuning network hyperparameters [87]. Unlike traditional optimization methods, it can obtain optimal hyperparameter values from a small number of samples [88], and it does not need an explicit expression of the objective function [89]. BO is very effective at solving this kind of optimization problem [87,90]. Nonetheless, for hyperparameter tuning in deep neural networks, the time required to evaluate the validation error for even a few hyperparameter settings remains a bottleneck. BO is somewhat more sophisticated than manual tuning, since it combines prior information about the unknown function with sample information to obtain a posterior over the function's distribution; based on this posterior, we can infer where the function attains its optimal value [91]. In other words, model-based approaches for automatically configuring hyperparameters can generate a surrogate model of some unknown function that would otherwise be too expensive to evaluate. In BO, rather than treating the objective function as a black box about which we can only obtain point-wise information, regularity assumptions are made and used to actively learn a model of the objective function. The resulting algorithms are practical and provably find the global optimum of the objective function while evaluating it at only a few parameter settings [92].
These hyperparameters include the number of filters, kernel size, optimizer, momentum, learning rate, batch size, and dropout rate. In the experiments, the initial hyperparameter configurations are randomly generated with shuffled training and validation data. Bayesian optimization searches for the best hyperparameters and fits a new model for the best k-fold cross-validation selection. For each point in the search space, cross-validated Bayesian optimization provides cross-validation estimates of the performance statistics. Different data splits between folds may produce different optimal tuning parameters, so the hyperparameters with the lowest average cross-validation error were selected; we refer to this as the optimal cross-validation choice of hyperparameters. A detailed list of hyperparameters and search bounds is shown in Table 2. We trained the network with adaptive optimizers such as Adam, RMSprop, AdaBound, and EAG [93] with an initial learning rate of 10^-4, β1 = 0.90, β2 = 0.99, filters = 128, and kernel = 2. We used an initial dropout probability of 0.25 in the ninth layer to prevent network overfitting at the training stage. The batch size is set to a value in the range [4, 128], while all weight parameters are subject to L2 regularization. When analyzing the influence of a given hyperparameter, the remaining parameters are fixed at their default values. The detailed default model is shown in Table 3.
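The Gaussian-process loop behind BO can be shown on a one-dimensional toy problem: fit a GP surrogate to the samples collected so far, maximize an expected-improvement acquisition over the search bounds, and evaluate the objective at the chosen point. The quadratic "validation error" and the learning-rate bounds below are purely illustrative, not the paper's objective:

```python
import math
import numpy as np

def val_error(x):
    """Stand-in validation error over x = log10(learning rate);
    its minimum sits at x = -3 (i.e., lr = 1e-3)."""
    return (x + 3) ** 2 / 10 + 0.1

def rbf(a, b, ls=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def bayes_opt(n_iters=10, noise=1e-6):
    grid = np.linspace(-5, -1, 81)          # search bounds: lr in [1e-5, 1e-1]
    X, y = [-5.0, -1.0], [val_error(-5), val_error(-1)]   # initial samples
    for _ in range(n_iters):
        Xa, ya = np.array(X), np.array(y)
        K = rbf(Xa, Xa) + noise * np.eye(len(Xa))
        Ks = rbf(grid, Xa)
        Kinv = np.linalg.inv(K)
        mu = Ks @ Kinv @ ya                 # GP posterior mean on the grid
        var = 1.0 - np.einsum('ij,jk,ik->i', Ks, Kinv, Ks)
        sd = np.sqrt(np.maximum(var, 1e-12))
        imp = min(y) - mu                   # expected-improvement acquisition
        z = imp / sd
        cdf = 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))
        pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
        x_next = grid[int(np.argmax(imp * cdf + sd * pdf))]
        X.append(float(x_next))
        y.append(val_error(x_next))
    return X[int(np.argmin(y))]

best = bayes_opt()
print(10 ** best)  # a learning rate near 1e-3
```

Libraries such as scikit-optimize or KerasTuner implement the same loop over multi-dimensional, mixed discrete/continuous search spaces like the one in Table 2.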

Training Procedure
We selected the top five models, denoted M = {M_0, M_1, ..., M_4}, based on Bayesian optimization. The main differences between them are the optimization function, learning rate, and dropout rate. The proposed ensemble architecture is inspired by Abdoli et al. [47], who proposed a deep CNN architecture that learns sound representations directly, with changes to the network structure and hyperparameters. Batch normalization was applied to minimize overfitting, and dropout is used after each convolutional layer's activation function to reduce the dimensionality of the feature maps. Finally, a softmax activation with a categorical cross-entropy loss function is used, with ten output units representing the number of labeled classes. We also applied 3-, 5-, and 10-fold cross-validation to all methods to determine our deep architecture's performance with different folds. Each time, k − 1 subsets are combined as the training set and the remaining subset is used for testing; this is repeated until every subset has been used. Table 3 presents the network structure with initial hyperparameters.

Evaluation Metrics
Assessing the generalization performance and efficiency of our proposed architecture requires a practical and feasible experimental estimation method and standard measurements. Eight measurement indices, namely sensitivity, specificity, accuracy, precision, recall, F-measure, area under the ROC curve, and area under the precision-recall curve, were used to measure model performance. Each task's output was assessed with different metrics depending on the task, as different performance metrics often lead to distinct decision outcomes, especially when evaluating the capabilities of various classifiers.
Each sample falls into one of four cases according to the combination of its true class and the algorithm's predicted class: true-positive (TP), false-positive (FP), true-negative (TN), and false-negative (FN). Sensitivity is the probability that members of a class are correctly categorized, whereas specificity is the probability that non-members are classified correctly. Precision, in contrast, is the probability that a sample predicted as positive actually belongs to the class. The F-score or F-measure combines the precision and recall of the model; it measures the network's accuracy on the selected dataset and evaluates binary classification systems, which classify examples into 'positive' or 'negative'. ROC curves plot sensitivity on the y-axis against 1 − specificity on the x-axis across decision thresholds. AUROC denotes the area under the ROC curve, which indicates the model's goodness-of-fit: a perfect model has an AUROC of 1, and random output is indicated by an AUROC of 0.5. The closer the AUROC is to 1, the better the model's performance [94].
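The metrics above follow directly from the four confusion-matrix counts; the counts in the example call are made up for illustration:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics from the four confusion-matrix cases."""
    sensitivity = tp / (tp + fn)            # recall: positives found
    specificity = tn / (tn + fp)            # negatives correctly rejected
    precision = tp / (tp + fp)              # predicted positives that are right
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(sensitivity=sensitivity, specificity=specificity,
                precision=precision, accuracy=accuracy, f_measure=f_measure)

# Hypothetical counts for one class in a one-vs-rest evaluation.
m = classification_metrics(tp=90, fp=10, tn=880, fn=20)
print(round(m["sensitivity"], 3), round(m["specificity"], 3))  # 0.818 0.989
```

For the multi-class setting, these are computed per class one-vs-rest and then averaged.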

Hardware Requirements
Processing and training such a deep model with such a large dataset requires particular computational effort. This experimental research was performed on an Intel(R) Core(TM) desktop workstation with a Core i7-9700 CPU @ 3.40 GHz, 32 GB of RAM, and a 512 GB SSD hard drive. The graphics card is an Nvidia GeForce GTX 960M with 8 GB of VRAM. The GPU was not needed because of the low computational complexity of the 1D CNN architecture.

Software Requirements
All experiments were performed on Windows 10 with the Anaconda package manager and Python 3.7.9. Different libraries/packages are used in our implementation, such as Pandas and Keras [95]. The proposed models were developed using TensorFlow/Keras [76] to build a baseline for all learning models, alongside Scikit-learn [96] for the data preprocessing phase. Keras makes it very easy to add, drop, and control layers in our proposed architecture. Librosa [79] was used for audio analysis and feature extraction.

Results
This section explains our experimental results in detail, presents our proposed model's efficiency compared with other benchmark approaches, and shows how the selected hyperparameters influence our recognition models. The results show that the Bayesian optimization algorithm based on the Gaussian method can achieve higher accuracy from a few data samples, based on our experiments, compared with manual and other search methods. Performance was assessed over 400 iterations with an initial batch size of 42 samples, early termination, and a patience of 12, which is the number of iterations showing no improvement. The early stopping mode was set to "auto" to track increases or decreases in validation accuracy automatically.
Based on Bayesian optimization, we selected the top five models over 25 epochs. The number of convolutional layers and their numbers of filters play an important role in detecting high-level features. We also observed that increasing the number of convolutional layers significantly impacts recognition accuracy with waveform input, with four-convolutional-layer networks achieving the best performance, and that increasing the number of stacked layers with the same feature map does not improve recognition accuracy. The convolutional filter search array is set to {16, ..., 512} with a step size of 2, and the network uses a strategy of repeated stacking. We used multiple optimizers, such as Adam, AdaBound, and EAG, to optimize the network under both convex and non-convex optimization. We trained the networks with various initial learning rates (0.1, 0.01, and 0.001) and found that 0.001 performed best in all models. To find the optimal batch size, we compared models with batch sizes in the range [16, ..., 256]; a batch size of 54 obtained the best results in the conducted experiments. Finally, our models adopt 50 training epochs, as validation accuracy showed no further improvement beyond that point. Table 4 summarizes the best configurations for the different models. The hyperparameters of the epoch with the highest validation accuracy are taken as the final hyperparameters for each selected model. Bayesian optimization achieved good results after a few iterations, with no significant improvements apparent in extra iterations; 1D CNN accuracy improved by 2% compared with the default configurations. The performance of the five models on the test set is summarized in Table 5, and the curves for the validation set are displayed in Figures 2 and 3. As shown in the training curves, training finished after 63 epochs due to no further improvement in validation accuracy.
Model loss decreased over time for both the training and validation datasets, indicating that the model learned from the dataset as expected. Similarly, as learning efficiency improved, model training and validation accuracy increased over time. The variations in model loss and accuracy across epochs are due to the hidden dropout layers. The small gap between the training and validation curves implies that overfitting was minimal; dropout was used to prevent overfitting and improve computational efficiency. The maximum and minimum model loss and accuracy on the validation dataset were (0.353, 0.469) and (88.8%, 91.4%), respectively. The average accuracies over 25 iterations achieved by the 1D CNN on the test set for the selected models were 89.6%, 91.4%, 90.1%, 89.7%, and 90.0%, respectively, as shown in Table 6. Table 6 and Figure 3 show the accuracy and loss values of all the above-mentioned models over the given number of training epochs. From Figure 3, we conclude that the ensemble model's convergence speed is much faster than that of the other methods and that it tends to converge in fewer rounds with a lower loss value. These results also demonstrate that the proposed method guarantees faster convergence and a lower loss value than other methods. Based on the stated evaluation criteria, the performance of the proposed 1D CNN on the test set is shown in Table 6.
The accuracies of the top five classifiers selected by Bayesian optimization were 89%, 89%, 92%, 93%, and 94%, respectively. The highest sensitivity obtained was 93.10% and the highest specificity 99.49%; specificity exceeded 99% on all validation data. The overall average accuracy was 94.46%, with a maximum of 95.21% across all experimental iterations. Performance measurements are reported for the weights that produced the best results in the study. Figure 4 gives a graphical comparison of the performance of the different ensembles. The ensemble model is built by integrating five base predictors, and the experimental results show that its performance is considerably higher than that of any individual model, as shown in Table 6 and Figure 3. The ensemble's superiority has two causes: (1) the base predictors are methods that already produce satisfying results, and their variety helps construct a high-accuracy ensemble; (2) the ensemble model treats every base predictor equally, so no prior information is required.
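The equal-weight fusion of base predictors described above can be sketched as averaging the class-probability (softmax) vectors of the base models and taking the arg-max; the probability values below are hypothetical and serve only to illustrate the mechanism.

```python
def ensemble_predict(prob_lists):
    """Average the class-probability vectors produced by several base
    predictors (all weighted equally, with no prior information) and
    return the index of the highest mean probability."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    mean = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

# Three hypothetical base predictors scoring a 3-class example:
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
pred = ensemble_predict(probs)   # class with the highest averaged probability
```

Because each model contributes the same weight, a class that several base predictors agree on can win even when no single model is confident about it.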
Moreover, the box plot in Figure 5 shows the accuracy of the different network structures on different subsets under five-fold cross-validation. The results show that the architecture of the stacked convolutional layers influences recognition accuracy: the best architecture (the five-layer network) is 3.2% more accurate than other 1D CNN approaches. We further compared the proposed model's performance by examining the confusion matrix and the AUC score. The AUC scores show that the proposed method produced better results than the other methods, with scores of 0.99; Figure 6 shows the ROC curves and the calculated AUC scores. Figure 7a shows the confusion matrix of the best Bayesian-optimized model and Figure 7b that of the ensemble model. Values along the diagonal represent the number of correctly classified samples for each class, so each confusion matrix displays the prediction accuracy of each of the 10 sound classes. The comparison demonstrates that the ensemble model's recognition accuracy is higher, with an average increase of 4.7%, implying that the ensemble model learns higher-level and richer features from the raw waveform. On average, Bayesian optimization and the ensemble model enhanced overall performance by 1% and 5%, respectively.
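The per-class reading of the confusion matrices above follows directly from their definition: with rows as true classes and columns as predictions, each diagonal entry divided by its row sum gives that class's recognition accuracy. A minimal sketch, using a small hypothetical 3-class matrix rather than the paper's actual 10-class results:

```python
def per_class_accuracy(cm):
    """Per-class recognition accuracy from a confusion matrix whose rows
    are true classes and columns are predicted classes: the diagonal
    entry divided by the row sum."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

# Hypothetical 3-class confusion matrix (illustration only).
cm = [
    [45,  3,  2],   # true class 0
    [ 4, 40,  6],   # true class 1
    [ 1,  2, 47],   # true class 2
]
acc = per_class_accuracy(cm)                              # per-class accuracy
overall = sum(cm[i][i] for i in range(3)) / sum(map(sum, cm))  # overall accuracy
```

Comparing these per-class values before and after ensembling is exactly the comparison made between Figures 7a and 7b.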
The confusion matrices indicate that the CH and ST labels are the most challenging classes for the 1D CNN, whereas the EN and GU classes are well classified by the proposed method; DB, CP, and CH show relatively low performance. As can be seen by comparing the confusion matrices before and after ensembling, recognition accuracy varies per sound class. We show only the confusion matrices of the best Bayesian model and the proposed ensemble model.
Our accuracy reached a maximum of 95.1%, which is higher than the 89% and 90.2% achieved by Abdoli et al. (2019) [47] and Li et al. (2018) [23], respectively, and also higher than the 73.7% of Piczak (2015) [45] and the 79% of Salamon and Bello (2017) [32]. These results indicate that our ensemble 1D CNN achieves significant sound-recognition improvements, and that our model structure for ESC with raw-waveform input attains higher recognition accuracy. When ensemble theory was applied to merge the Bayesian models, accuracy increased by an average of 1.26% with each added model up to four models, and by 5% with the full proposed model, demonstrating our model's strength in recognizing environmental sound events. Our algorithm performs slightly below the two-stream CNN of Su et al. (2019) [97], which reports 97% accuracy; however, that model has 15.9 M trainable parameters, six times more than our proposed method. Our experiments on the UrbanSound8K dataset confirm these findings: the average five-fold cross-validation accuracy of the models was 90.36% before combination and 94.46% after model ensembling. This also shows that ensemble methods can combine various types of predictions to yield high-accuracy results that outperform the current state of the art. To the best of our knowledge, this is the first time recognition accuracy has surpassed 94.4% using a 1D CNN as the classifier on the UrbanSound8K dataset, which demonstrates our approach's strength in environmental sound recognition applications. Compared with the other algorithms, the experimental results show that ensemble learning enhanced the classification accuracy of the 1D CNN by nearly 5%; the stacking ensemble model also improves, to varying degrees, on traditional machine learning algorithms, stronger boosting algorithms, and deep learning algorithms.
The comparison results for all approaches are summarized in Table 7.
The main contribution of this paper is a special 1D CNN architecture whose hyperparameters are set using Bayesian optimization and ensemble learning. We trained several 1D CNNs with various network configurations to analyze the impact of different structures on recognition performance; first, we tested architectures with different numbers of convolution layers to analyze their influence on feature extraction. Ensembling multiple models with different configurations is the second main contribution of this work. Because short audio segments do not contain enough information to train the 1D CNNs properly, this study may be specific to the UrbanSound8K dataset and may not generalize to other audio classification tasks or datasets. However, the low computational complexity of the 1D CNN architecture makes it well suited to mobile or handheld devices with limited power.
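The segment-length issue above arises because the 1D CNN has a fixed input size: arbitrary-length signals are handled by cutting the waveform into overlapped, fixed-length frames with a sliding window, as described earlier. A minimal sketch of that framing step, with hypothetical `frame_len` and `hop` parameters (the overlap is `frame_len - hop` samples):

```python
def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapped, fixed-length frames so that a
    network with a fixed input size can process any signal length.
    Trailing samples that do not fill a whole frame are dropped here;
    a real pipeline might zero-pad them instead."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len])
    return frames

# A 10-sample toy signal cut into frames of 4 with 50% overlap (hop = 2).
frames = frame_signal(list(range(10)), frame_len=4, hop=2)
```

At inference time, each frame is classified independently and the frame-level predictions are aggregated (for instance by averaging class probabilities) into one label for the whole recording.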

Approach                    Year   Representation   Mean Accuracy   # of Parameters
M18 CNN [47]                2017   1D               72%             3.7 M
EnvNet-v2 [63]              2017   1D               78%             101 M
RawNet [23]                 2018   1D               87%             377 K
1D CNN Rand [47]            2019   1D               87%             256 K
1D CNN Gamma [47]           2019   1D               89%             550 K
Proposed Ensemble 1D CNN    2021   1D               94%             1.9 M

Future work will explore an adaptive way to automatically adjust the search process and to let the neural network architecture evolve by automatically reshaping, adding, and removing layers. Furthermore, hybrid networks such as CNN-LSTM [98], which merge the advantages of both approaches, have shown much better performance in other domains. Obtaining a suitable training dataset is a key downside of neural networks: if the training data is insufficient or inappropriate, the network will easily learn the dataset's bias and generally make poor predictions on data with trends the training data has never shown. Besides, it is typically more difficult to trace why a network does not behave as intended. We used only UrbanSound8K for network training in this paper, which may introduce bias by overlooking the effects of unusual audio events. A more practical approach might be to train the network in a semi-supervised manner, boosting performance on unseen data by combining limited real data with data generated during training.

Conclusions
This paper proposed an end-to-end Bayesian ensemble one-dimensional CNN for environmental sound classification, achieving higher accuracy with fewer trainable parameters. The model learns its representation directly from the audio waveform, without additional feature extraction or signal processing. We selected the best-performing models found by Bayesian optimization and fused them through an ensemble mechanism. The baseline models consist of convolution layers, each followed by batch normalization and max pooling, with a fully connected layer and a categorical softmax output. The UrbanSound8K benchmark dataset of 8732 audio samples was used to evaluate the proposed model's performance. Our results indicate that, thanks to appropriate hyperparameter selection, the proposed end-to-end Bayesian Ensemble 1D CNN is more accurate and more efficient for environmental sound classification than other state-of-the-art approaches. Sensitivity, specificity, accuracy, precision, recall, F-measure, area under the ROC curve, and area under the precision-recall curve were used as measurement indices of the model's performance. Statistical analysis reveals that the enhancements are significant, with a classification accuracy of 94.46% on the UrbanSound8K dataset, 5.4% higher than state-of-the-art end-to-end methods. Future research will look at more precise methods of selecting hyperparameters and at additional datasets and repositories; an adaptive way to adjust the search bounds automatically would also be worthwhile. Furthermore, hybrid networks, which have shown much better performance in other domains, merit consideration.