Article

Research on Traffic Sound Event Detection Based on Multi-Scale Feature Fusion

College of Engineering and Technology, Southwest University, Chongqing 400715, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(5), 2359; https://doi.org/10.3390/app16052359
Submission received: 26 January 2026 / Revised: 20 February 2026 / Accepted: 25 February 2026 / Published: 28 February 2026

Abstract

The perception system of an intelligent vehicle must accurately recognize the surrounding traffic environment, which is crucial for safe and efficient autonomous driving. However, most existing systems rely solely on visual perception. Auditory perception is a complementary approach that can address the limitations of visual perception and provide a more reliable basis for decision-making. Sound event detection (SED) is the core technique for implementing auditory perception. This study proposes a traffic sound event detection model based on PANNs-CNN10. First, acoustic features of traffic sounds are extracted using a Filter Bank (FBank). FBank retains the frequency-domain information of the bandpass filters in its entirety, facilitating the learning of complex feature relationships. Additionally, a multi-scale convolution block is introduced in the intermediate layers of the network to enable it to learn features at different scales and improve the expressiveness of the model. Furthermore, a hybrid multi-scale attention module and a Shuffle Attention module are introduced in the intermediate and deep layers. These modules effectively capture the correlations between channels and enhance the network's ability to extract key features. These improvements yield the Traffic Sound Event Detection Convolutional Neural Network (TSED-CNN) model. TSED-CNN achieves an accuracy of 96.378% in traffic sound event detection, an improvement of 1.705 percentage points over the baseline model. The results show that the proposed method accurately detects traffic sound events and further enhances the ability of intelligent vehicles to perceive the surrounding traffic environment.

1. Introduction

With the widespread application of artificial intelligence techniques [1,2,3,4], the level of intelligence in various industries has continuously improved. In particular, the performance of autonomous driving systems, one of the key techniques in the field of intelligent vehicles, has also been continuously enhanced. However, most of the environmental perception systems integrated into current autonomous driving systems are based primarily on visual image recognition methods [5,6,7], neglecting the importance of auditory perception. Combining auditory perception with visual perception can improve the stability and accuracy of the perception system's judgment of the driving environment. In harsh driving environments, visual sensors are highly susceptible to interference [8] and may be unable to accurately assess the surroundings. In such cases, the auditory perception system, assisted by sound sensors, can acquire sound information from the environment and thus support a more accurate judgment of the surrounding environment's state.
Early sound event detection methods were primarily based on feature extraction, where the spectral features of extracted audio were combined with statistical models such as Hidden Markov Models or Gaussian Mixture Models for classification and detection [9,10,11]. However, these methods are only suitable for simple acoustic environments, and their modeling capabilities are limited. With the rise of machine learning techniques, more complex classification algorithms such as Support Vector Machines [12,13], Random Forests [14,15], and K-Nearest Neighbors [16] were applied to sound event detection. Although these methods improved recognition accuracy and robustness, they remain confined to limited audio environments and struggle to handle the complex acoustic conditions encountered in real-world scenarios.
After 2010, deep learning techniques advanced rapidly. Due to their powerful feature extraction and automatic learning capability, significant progress was made in sound event detection. Deep learning frameworks, such as Convolutional Neural Networks (CNN) [17,18], Recurrent Neural Networks (RNN) [19], and Long Short-Term Memory networks (LSTM) [20], have achieved widespread applications across various domains. Compared to traditional machine learning, deep learning has the ability to automatically extract features, allowing it to handle more complex and diverse audio data. It demonstrates superior performance in processing large-scale datasets, which enables the extraction of more comprehensive and nuanced features.
In the past decade, deep learning has been increasingly applied to sound event detection across various fields. Piczak [21] was the first to propose the application of Convolutional Neural Networks to the task of environmental sound classification, demonstrating how these networks can be used to extract spectrogram features to improve the performance of audio classification. Xie et al. [22] investigated the classification and monitoring of bird calls using two different deep learning frameworks. The impact of three feature extraction methods on model performance was also compared. Wang et al. [23] proposed the use of a lightweight MobileNetV3_esnet model for distinguishing between estrus and non-estrus sow sounds, with the model’s size being reduced to only 5.94 MB. Bardou et al. [24] employed a CNN model for lung sound classification, and its performance was evaluated against alternative classification approaches based on SVM, KNN, and GMM. Models trained with hand-crafted LBP features were further compared to those utilizing MFCC features. Choi et al. [25] utilized a CNN model to detect and classify water pipeline leakage sounds, and its performance was assessed in comparison with an SVM model. Li et al. [26] developed a modified CNN model for heart sound classification, integrating a variety of manually extracted features for improved accuracy.
Machine learning and deep learning techniques have also been gradually applied to the field of intelligent transportation. Luitel et al. [27] introduced a two-level classification method utilizing classifiers such as ANN and RF. In this approach, urban sound events were first analyzed at the signal level, and feature extraction was subsequently performed to classify sounds such as bus engine noise, bus horn sounds, car horn noise, and train whistles. Comparisons were also made with methods that directly extract features. Uchino et al. [28] proposed a two-level acoustic vehicle detection system designed for high-traffic environments. This system reduced the detection range during the pre-fitting stage by using estimated information and improved vehicle detection accuracy in the post-fitting stage through neighborhood point extraction. Hao et al. [29] presented a method for acoustic non-line-of-sight vehicle detection. This method utilized direction-of-arrival features and time-frequency features, calculated from microphone array data, as image inputs. A parallel neural network was designed and trained to detect the direction of moving vehicles obstructed at intersections. Jiang et al. [30] developed a traffic scene sound event detection method based on Graph Convolutional Networks (GCN). This method was designed to capture multimodal information, thereby enhancing model performance. Nithya et al. [31] introduced a TB-MFCC multifuse feature extraction method that combined data augmentation with feature extraction. The extracted features were then used to train a convolutional neural network for classifying emergency vehicle sounds. Liang et al. [32] proposed a network HADNet based on a 1D CNN and a multi-head attention mechanism for detecting abnormal sound on the highway.
Currently, the methods proposed for traffic sound event detection in intelligent vehicles are still limited, and there are still some unresolved issues. The types of traffic sound events are complex and diverse, requiring model networks that can stably perform multi-class sound event classification. Moreover, to ensure that the intelligent vehicle’s decision-making system can take appropriate actions, high accuracy is essential. During driving, environmental noise is highly variable, so high robustness must be a key consideration when processing sound data and building networks, ensuring that the model can accurately focus on relevant features for training and learning.
To address the above issues, a TSED-CNN model based on the basic convolutional neural network architecture PANNs-CNN10 is proposed in this work. First, a multi-scale convolutional structure is introduced in the TSED-CNN, in which two convolutional kernels of different sizes are used for the convolution operations, and residual connections are applied throughout. Second, a lightweight hybrid multi-scale attention module is implemented to minimize the increase in parameter count while enabling the network to focus on learning critical features, thereby enhancing the robustness of the model. Finally, a Shuffle Attention module is used to further enhance the expressive capability of the TSED-CNN.
The rest of this paper is structured as follows. In Section 2, methods for sound signal feature extraction are introduced, with an analysis of different feature extraction methods and the selection of the most appropriate one. Section 3 details the basic network model and its structural improvements, including an in-depth description of the convolutional blocks and attention modules used, along with their roles in the network. Experimental setup and results are discussed in Section 4. Finally, Section 5 presents the conclusions.

2. Feature Extraction Method

Raw audio typically exhibits high-dimensional and complex time-series characteristics, and training directly on these unprocessed signals can lead to increased computational complexity and slower model convergence. Therefore, feature extraction is essential during the model training process to eliminate redundant information and capture the key characteristics of audio signals.
Mel-Frequency Cepstral Coefficients (MFCC) and Filter Banks (FBank) are the most commonly used methods for extracting features from audio signals in machine learning. The differences between MFCC and FBank are shown in Figure 1. Under different noise conditions, MFCC features exhibit poor adaptability, resulting in models with lower robustness. Furthermore, the MFCC feature extraction process involves a discrete cosine transform (DCT), which can discard some information, reducing the completeness of the features and making it harder for the network to learn effectively. In contrast, the FBank feature extraction process is identical except that the DCT step is omitted. This allows the FBank method to fully preserve the frequency-domain information of the bandpass filters, enabling the model to learn more complex feature relationships during training. Consequently, models that use FBank features exhibit enhanced robustness and versatility when handling complex acoustic environments.
In this work, the FBank method is chosen as the feature extraction method for audio signals. Audio features are extracted using the FBank method implemented in Python, with a sampling rate set to 44.1 kHz and the number of Mel filters set to 80. The FBank feature maps are shown in Figure 2.
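As a rough illustration of the pipeline the FBank method follows (pre-emphasis, framing and windowing, power spectrum, triangular Mel filterbank, log energies, with no DCT step), a minimal NumPy sketch is given below. Only the 44.1 kHz sampling rate and the 80 Mel filters are taken from the text; all other parameter values (frame length, frame step, FFT size, pre-emphasis coefficient) are illustrative assumptions, and a production system would more likely use a library such as librosa or torchaudio.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(signal, sr=44100, n_mels=80, frame_len=0.025, frame_step=0.010, n_fft=2048):
    # Pre-emphasis to boost high frequencies
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing with a Hamming window
    flen = int(round(frame_len * sr))
    fstep = int(round(frame_step * sr))
    n_frames = 1 + max(0, (len(emphasized) - flen) // fstep)
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)
    # Power spectrum of each frame
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft
    # Triangular Mel filterbank, equally spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log Mel energies = FBank features; unlike MFCC, no DCT is applied
    feats = np.dot(power, fb.T)
    return np.log(np.maximum(feats, 1e-10))

# One second of a 440 Hz tone at the paper's 44.1 kHz sampling rate
t = np.arange(44100) / 44100.0
feats = fbank(np.sin(2 * np.pi * 440.0 * t))
print(feats.shape)  # (frames, 80)
```

Omitting the final DCT is what distinguishes this output from MFCC: the log Mel energies keep the filterbank's frequency-domain information intact.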

3. TSED-CNN Model

3.1. Sound Event Detection Model

Large-Scale Pretrained Audio Neural Networks (PANNs) [33] are audio neural networks pretrained on large-scale audio datasets. PANNs comprise four families of network architectures (CNN, ResNet, MobileNet, and 1D CNN) for training comparisons, with two or more variants of different depths proposed for each family.
Within the CNN class of PANNs, three distinct network structures with varying depths are proposed: CNN6, CNN10, and CNN14. The specific structures of these networks are shown in Figure 3. The 6-layer CNN network is composed of four convolutional layers and two fully connected layers, with a kernel size of 5 × 5. The 10-layer CNN network and the 14-layer CNN network are constructed with four convolutional blocks and six convolutional blocks, respectively, where each block contains two convolutional layers utilizing 3 × 3 convolutional kernels. A stack of two 3 × 3 convolutional layers has an effective receptive field equivalent to that of a single 5 × 5 convolutional layer. This approach maintains the original receptive field while increasing the depth of the network.
This approach also reduces the number of parameters. If the input and output channels of a stack of two 3 × 3 convolutional layers are both K, the number of parameters in this stack is 2 × (3² × K²) = 18K², whereas a single 5 × 5 convolutional layer has 5² × K² = 25K² parameters. Since each convolutional layer is followed by a nonlinear activation, the stack of two 3 × 3 convolutional layers also enhances the model's nonlinear expressive capability. Batch normalization is applied between the convolutional layers, and the ReLU function is used for nonlinear activation to stabilize and accelerate training. Downsampling is performed using average pooling with a 2 × 2 kernel in each convolutional block. After the final convolutional layer, global pooling reduces each feature map to a scalar. Additionally, dropout is applied after each pooling layer and fully connected layer to mitigate the risk of overfitting.
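The parameter comparison above can be checked with a few lines (biases are ignored, and K = 64 is an arbitrary example channel count):

```python
def conv_params(kernel, c_in, c_out):
    """Weight count of one 2D convolutional layer (bias ignored)."""
    return kernel * kernel * c_in * c_out

K = 64  # example channel count; the comparison holds for any K

stacked_3x3 = 2 * conv_params(3, K, K)  # 2 * 3^2 * K^2 = 18 K^2
single_5x5 = conv_params(5, K, K)       # 5^2 * K^2 = 25 K^2

print(stacked_3x3, single_5x5)  # 73728 102400
```

The stacked design saves 7K² parameters while keeping the same receptive field and adding an extra nonlinearity.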

3.2. The Improvement of Network Structure

After conducting a comprehensive comparison of the performance of each network, PANNs-CNN10 is selected as the baseline model in this work.
The experimental audio data in this study were recorded in natural and complex acoustic environments. The sample data may exhibit issues such as important sound events occurring only at certain moments, important sound events being masked by noise, and different sound events being sustained for different periods of time. Therefore, one of the key focuses of network structure adjustment is to enhance the robustness of the model.
The accuracy of the model is another key optimization focus. These CNN networks are typically stacked through repeated convolution and pooling operations, with hyperparameters such as convolution kernels and pooling methods optimized to achieve a more efficient structure. In general, as the depth of the CNN network increases, more detailed feature information can be extracted, leading to enhanced model performance. However, when the network is stacked to a certain depth, issues such as gradient explosion, gradient vanishing, or network degradation may arise, leading to a decline in model performance. In the training of large-scale data and deep networks, convolution operations may also introduce a large number of redundant features, making it difficult for the network to learn useful features and leading to lower accuracy rates.
Aiming at the above problems, a traffic sound event detection network named TSED-CNN is proposed here, and the improved structure of this network is shown in Figure 4. The specific structural improvements are as follows:
A multi-scale convolutional block (MSCB) embedded with a channel attention module is introduced into the network to improve the expressive power of the model, while residual connectivity is used to avoid problems that may be caused by the depth of the network being too deep. A hybrid multi-scale attention (HMSA) module is added to address the problem of feature redundancy. This module can learn the interdependencies between channels, filter out irrelevant feature information, improve feature selection capabilities, and enhance the model’s robustness. Finally, a Shuffle Attention module [34] is added after the last convolutional block to further optimize the high-level features extracted by the network.

3.2.1. Multi-Scale Convolutional Block

To enable the network to learn features at different scales, a multi-scale convolutional block is introduced. First, a 1 × 1 convolution operation is applied to reduce the dimensionality of input features, thereby decreasing computational complexity and the number of parameters. Then, 3 × 3 convolution and 5 × 5 convolution are performed on the dimensionality-decreased features. The 3 × 3 convolution captures localized, detailed features, while the 5 × 5 convolution captures a larger range of overall features. This multi-scale convolution structure enables the network to capture information at different levels within the same layer, enhancing the feature representation capability. On this basis, the Squeeze-and-Excitation (SE) [35] module is embedded into the structure. The SE module enhances the model’s ability to represent important features by dynamically adjusting the weights for each feature channel. The structure of the SE module is shown in Figure 5.
The module is embedded behind the multi-scale convolutional layers. The input $X \in \mathbb{R}^{H' \times W' \times C'}$ is transformed into the feature map $U \in \mathbb{R}^{H \times W \times C}$ by the transformation $F_{tr}$, which is a convolution operator. The formula is as follows:
u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s
Global average pooling is then applied to the input feature map to aggregate its spatial dimensions, compressing the global spatial information into channel descriptors. This compresses the feature map U containing global information into a feature vector $z \in \mathbb{R}^{C}$. The c-th element of z is calculated as follows:
z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)
The vector z contains contextual information, alleviating the issue of channel dependencies. To utilize the gathered compressed information, an excitation operation is performed. A gating mechanism consisting of two fully connected layers is used. The first fully connected layer reduces the dimensionality of channel descriptors to reduce the computational effort. The second fully connected layer restores the original channel dimensions, followed by a Sigmoid function for nonlinear activation. This process generates a set of channel weights.
s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \, \delta(W_1 z))
where δ denotes the ReLU activation and σ the Sigmoid function.
Finally, the output of the excitation operation is used to recalibrate the original input feature map. Each channel feature of the input feature map is weighted by the previously obtained attentional weights, thereby enabling dynamic enhancement of the salient features.
\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c
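The squeeze, excitation, and scale steps above can be sketched as a short NumPy forward pass. The weight matrices here are random placeholders and the dimensions and reduction ratio r are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def se_block(U, W1, W2):
    """Squeeze-and-Excitation forward pass on a (C, H, W) feature map."""
    # Squeeze: global average pooling yields the channel descriptor z
    z = U.mean(axis=(1, 2))
    # Excitation: FC -> ReLU -> FC -> Sigmoid produces channel weights s
    s = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ z, 0.0))))
    # Scale: recalibrate each channel of U by its attention weight
    return U * s[:, None, None]

C, H, W, r = 8, 4, 4, 2                       # r: bottleneck reduction ratio
U = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C // r, C)) * 0.1   # dimensionality-reducing FC layer
W2 = rng.standard_normal((C, C // r)) * 0.1   # dimensionality-restoring FC layer
out = se_block(U, W1, W2)
print(out.shape)  # (8, 4, 4)
```

Because every channel weight passes through the Sigmoid, each output channel is a damped copy of its input, which is the dynamic recalibration the equations describe.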
The outputs of multi-scale convolutions are then fused using a concatenation operation, preserving the information extracted at different scales by the two convolution operations. A 1 × 1 convolution is used for dimensionality expansion, restoring the number of channels to the original number of channels. As the depth of the network increases, issues such as gradient vanishing, gradient explosion, and degradation may occur. Therefore, residual connections are employed. By introducing skip connections, low-level features are directly passed to higher layers, thereby mitigating the aforementioned issues. The overall structure of this block is shown in Figure 6.
This structure not only retains the efficiency of residual learning but also further enhances the flexibility and discriminative power of feature representation. It can effectively enhance the important features related to sound events and suppress irrelevant noise features, enabling the network to perform better when handling complex data.
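The overall data flow of the multi-scale convolutional block can be sketched as follows. This is a simplified NumPy illustration: the embedded SE channel attention is omitted for brevity, the weights are random placeholders, the channel counts are illustrative, and the loop-based convolution favors clarity over speed.

```python
import numpy as np

def conv2d(x, w):
    """'Same'-padded 2D convolution; x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    H, W = x.shape[1:]
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(xp[:, i:i + k, j:j + k] * w[o])
    return out

def mscb(x, w_reduce, w3, w5, w_expand):
    """Multi-scale block sketch: 1x1 reduce -> parallel 3x3 and 5x5 branches
    -> concatenate -> 1x1 expand -> residual add (SE attention omitted)."""
    r = np.maximum(conv2d(x, w_reduce), 0)              # 1x1 dimensionality reduction + ReLU
    y = np.concatenate([conv2d(r, w3), conv2d(r, w5)])  # fuse the two scales
    return x + conv2d(np.maximum(y, 0), w_expand)       # 1x1 expansion + skip connection

rng = np.random.default_rng(0)
C, H, W, Cr = 8, 6, 6, 4
x = rng.standard_normal((C, H, W))
out = mscb(x,
           rng.standard_normal((Cr, C, 1, 1)) * 0.1,
           rng.standard_normal((Cr, Cr, 3, 3)) * 0.1,
           rng.standard_normal((Cr, Cr, 5, 5)) * 0.1,
           rng.standard_normal((C, 2 * Cr, 1, 1)) * 0.1)
print(out.shape)  # (8, 6, 6)
```

The final 1 × 1 convolution restores the original channel count so the skip connection can be added elementwise, which is what lets low-level features pass directly to higher layers.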

3.2.2. Hybrid Multi-Scale Attention Module

The attention mechanism assigns different weights to different parts of the input data, allowing the model to ignore irrelevant features and focus on more critical ones. It improves the model’s representational capacity. This work uses a hybrid multi-scale attention module [36,37] that does not require channel dimensionality reduction. By capturing the inter-channel correlations and effectively integrating global and local information, it enhances the neural network’s ability to capture important features. The overall structure of this module is shown in Figure 7.
Before grouping is performed, the feature dependencies between channels are first strengthened. The input feature map is passed through global average pooling to obtain a channel descriptor. A fast 1D convolution operation is then used to capture the inter-channel correlations. This convolution adaptively adjusts its kernel size based on the channel dimension, allowing it to operate effectively across networks of different scales. It also avoids the dimensionality reduction and heavy computation associated with fully connected layers, improving computational efficiency. The attention weights for each channel are then obtained with the Sigmoid activation function and applied to the input feature map.
The obtained feature map is divided into groups, with each group containing adjacent channels, generating multiple grouped feature maps to capture finer local features. The grouped feature maps are then subjected to global average pooling in both horizontal and vertical directions. It can generate compact descriptions of the features and extract global features relevant to each direction. The pooled features are processed by a 1 × 1 convolutional layer to generate a joint feature representation, and the attention weights for both horizontal and vertical directions are generated using the Sigmoid activation function. The original grouped feature maps are weighted and adjusted using the obtained attention weights, enhancing the response of significant features. Meanwhile, Group Normalization (GN) is applied to normalize the feature distribution and enhance numerical stability, generating the feature map x1. Furthermore, a 3 × 3 convolution is applied to the third feature group to extract higher-level feature details, generating the feature map x2. Both x1 and x2 are subjected to global average pooling and Softmax nonlinearity to calculate the importance weights of global context. These attention weights from both parts are then fused through matrix multiplication and activated using the Sigmoid nonlinear activation function. Finally, the adjusted and weighted feature maps are restored to the original dimensions, yielding the final fusion result.
This module integrates spatial and channel attention mechanisms. The ability to capture complex features is enhanced through a multi-scale strategy, and the correlations between different channels are effectively emphasized, thereby highlighting important features. In the context of traffic sound event detection, background noise can be efficiently separated from target sound signals, and subtle sound features can be better extracted. As a result, the robustness, generalization capability, and recognition accuracy of the model are significantly improved.
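The first stage of this module, global average pooling followed by a fast 1D convolution whose kernel size adapts to the channel dimension, resembles ECA-style channel attention. A minimal NumPy sketch under that assumption is shown below; the γ and b values of the kernel-size heuristic and the uniform convolution kernel are illustrative placeholders, not values from the paper.

```python
import numpy as np

def adaptive_kernel_size(channels, gamma=2, b=1):
    """Kernel size grows with log2(C) and is forced odd (ECA-style heuristic)."""
    k = int(abs((np.log2(channels) + b) / gamma))
    return k if k % 2 else k + 1

def channel_attention(x, kernel):
    """1D conv over the pooled channel descriptor; no dimensionality reduction."""
    C = x.shape[0]
    z = x.mean(axis=(1, 2))                        # global average pooling
    pad = kernel.size // 2
    zp = np.pad(z, pad, mode="edge")
    conv = np.array([zp[i:i + kernel.size] @ kernel for i in range(C)])
    w = 1.0 / (1.0 + np.exp(-conv))                # Sigmoid attention weights
    return x * w[:, None, None]

rng = np.random.default_rng(1)
C = 64
k = adaptive_kernel_size(C)                        # 3 for C = 64
x = rng.standard_normal((C, 8, 8))
out = channel_attention(x, np.full(k, 1.0 / k))
print(k, out.shape)  # 3 (64, 8, 8)
```

Because the 1D convolution only mixes each channel with its neighbors, the cost stays linear in the channel count, which is why this stage avoids the overhead of fully connected layers.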

4. Experimental Analysis

4.1. Dataset

The traffic sound data here is divided into 10 categories [38]: ambulance car, bus, civil defense siren, fire truck, motorcycle, police car, reversing beeps, screaming, truck, and vehicle collision. The original audio samples are selected from the DCASE 2017 Challenge Task 4 dataset. To simulate real traffic conditions, the original dataset is overlaid with actual traffic sounds recorded at urban intersections. During model training, a large amount of sample data is required. When the number of training samples is limited, overfitting may occur. Therefore, data augmentation techniques are applied to the original audio dataset to expand the training set. Several augmentation methods are employed, including Doppler frequency shift, convolutional reverberation, and phase modification. These can increase the diversity of training audio data and improve the robustness and generalization ability of the model. The final dataset consists of 7037 audio clips, which are randomly divided into a training set of 5629 clips and a validation and test set of 1408 clips at a ratio of 4:1. Each audio clip has a duration of 4 s and a sampling rate of 44.1 kHz.
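A minimal sketch of the 4:1 random split described above; the helper name and fixed seed are illustrative, since the paper does not specify its splitting code:

```python
import random

def split_dataset(items, ratio=4, seed=42):
    """Shuffle and split items into train vs. validation/test at ratio:1."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) * ratio // (ratio + 1)
    return items[:cut], items[cut:]

# 7037 clips split 4:1, as stated in the text
train, val_test = split_dataset(range(7037))
print(len(train), len(val_test))  # 5629 1408
```

Flooring the 80% cut reproduces the stated counts of 5629 training clips and 1408 validation/test clips.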

4.2. Experimental Environment and Evaluation Metrics

The experimental platform for training and testing the traffic sound event detection dataset in this study is as follows: the operating system is 64-bit Windows 10, the processor is a 12th Gen Intel® Core™ i5-12400F 2.50 GHz, the graphics card is an NVIDIA GeForce GTX 1650 (4 GB), the memory is 16 GB, the deep learning framework is PyTorch 2.4.0, the CUDA version is 11.8, the programming language version is Python 3.11.10, and the programming platform is PyCharm 2023.3.1.
During training, the Adam optimizer is used, and the loss function is the Binary Cross-Entropy Loss. The learning rate is set to 0.001, the batch size is 64, and the number of training epochs is 60. The model with the highest accuracy is saved, along with the latest three models, to ensure that the training process can be resumed in case of an interruption.
This study uses evaluation metrics such as accuracy, precision, recall, and F1-score to assess the performance of the sound event detection model. The specific formulas are as follows:
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
Precision = \frac{TP}{TP + FP}
Recall = \frac{TP}{TP + FN}
F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}
In these formulas, TP denotes True Positives; FP denotes False Positives; TN denotes True Negatives; and FN denotes False Negatives.
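These metrics follow directly from the confusion-matrix counts; the counts used below are purely hypothetical, for illustration:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for one class
acc, prec, rec, f1 = classification_metrics(tp=90, fp=5, tn=95, fn=10)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# 0.925 0.947 0.9 0.923
```

F1 is the harmonic mean of precision and recall, so it penalizes a model that trades one off heavily against the other.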

4.3. Analysis of Experimental Results

PANNs include three convolutional neural networks with different depths. To select the appropriate baseline model for subsequent experiments, this study conducts comparative experiments on these three networks. The experimental results are shown in Table 1.
As shown in this table, the network can extract higher-level and more abstract feature information with the increase in depth, thereby enhancing the model’s expressive ability. The performance of CNN10 surpasses that of CNN6 in terms of accuracy, precision, recall, and F1-score, while the model’s number of parameters is much smaller than that of CNN14. Based on these results, PANNs-CNN10 is selected as the baseline model for this study.
The improved TSED-CNN model is validated on the dataset, and the resulting confusion matrix is shown in Figure 8. In the confusion matrix, the horizontal axis represents the predicted labels and the vertical axis the true labels. It can be observed that the recognition accuracy for sound events with more distinctive features, such as the ambulance car and the police car, reaches 100%. However, the recognition accuracy for motorcycle and truck sounds is relatively lower, possibly because their feature characteristics are similar to those of other sound events. The recognition rate for the reversing beep is also lower, likely due to its short duration: its critical features occur only at specific moments and are difficult to capture. The overall accuracy reaches 96.378%, which is sufficient to support reliable operation of the detection system.

4.4. Ablation Test

To evaluate the effectiveness of the proposed improvements in traffic sound event detection and to analyze the optimization impact of each improvement method on the network, an ablation test was designed and conducted. The improvements include inserting Multi-Scale Convolution Blocks (MSCB) and the Hybrid Multi-Scale Attention (HMSA) module between the intermediate convolution layers, and adding the Shuffle Attention (SA) module in the deeper layers. The experimental results are shown in Table 2.
The results of the ablation experiment show that each of the improvement methods had a positive impact on the network’s performance. The hybrid multi-scale attention module has shown significant improvements across all aspects, with accuracy, precision, recall, and F1-score increasing by 1.279%, 1.269%, 1.277%, and 0.01291, respectively. Both the hybrid multi-scale attention module and the Shuffle Attention module are lightweight modules, which effectively minimize additional computational costs and reduce the number of model parameters by avoiding the use of fully connected layers. This design enables substantial performance improvements to be achieved without significantly increasing the computational burden. After the improvements, the accuracy of the traffic sound event detection network increased from 94.673% to 96.378%, a 1.705% improvement, and all other evaluation metrics also showed improvements.

4.5. Analysis and Discussion

To further evaluate the effectiveness of the improved network for traffic sound event detection, the Res2Net, CAM++, TDNN, and EcapaTDNN networks were trained and tested using the same dataset and experimental environment for comparison. The experimental results are shown in Table 3.
The results of the comparison experiment show that the performance of the improved TSED-CNN model outperforms that of other mainstream models in terms of accuracy, precision, recall, and F1-score. Compared to the TDNN network, the accuracy improves significantly by 2.202%. In comparison with Res2Net and CAM++, the improved TSED-CNN achieves the highest accuracy while maintaining a comparable number of parameters, with improvements of 1.776% and 0.852%, respectively. Along with high accuracy, the improved network incorporates efficient attention modules to ensure the robustness of the model, making it more suitable for sound event detection in complex traffic environments with noise.

5. Conclusions

This work proposes a traffic sound event detection model, TSED-CNN, which is based on the PANNs-CNN10. The issues of insufficient robustness to traffic noise and limited classification categories are addressed by the model, and the recognition accuracy of traffic sound event detection is improved. The research results are as follows:
(1) Based on the PANNs-CNN network, the proposed TSED-CNN model incorporates a multi-scale convolution block in the intermediate layers, integrates a channel attention module, and employs residual connections. These modifications enable the network to effectively capture features at different scales while maintaining direct information flow from low-level to high-level layers. The addition of hybrid multi-scale and Shuffle Attention modules further enhances the model’s ability to focus on crucial features. As a result, the TSED-CNN model demonstrates significant improvements in performance compared to the baseline model.
(2) Comparative experiments with widely used models in sound event detection, including Res2Net, CAM++, TDNN, and EcapaTDNN, show that the TSED-CNN model consistently outperforms these models across all evaluation metrics. This highlights the superiority of TSED-CNN in traffic sound event detection.
(3) The TSED-CNN model is demonstrated to achieve not only high accuracy and precision but also robust performance under various traffic noise conditions. This indicates its strong potential for real-world applications, such as intelligent transportation systems and urban noise monitoring. Furthermore, the relatively small increase in the number of parameters is noted, which suggests that the model successfully balances performance and computational efficiency.

Author Contributions

Y.Z. contributed to the conceptualization, method development, implementation, experiments, and original draft writing. L.Y. supervised the study, reviewed, and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Due to ethical restrictions, the raw data cannot be made publicly available. However, de-identified data may be obtained from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. FBank and MFCC process diagram.
Figure 2. FBank feature maps for each type of sound.
Figure 3. PANNs-CNN structure diagram.
Figure 4. TSED-CNN structure diagram.
Figure 5. SE Module structure diagram.
Figure 6. Multi-scale convolutional block structure diagram.
Figure 7. Hybrid multi-scale attention module structure diagram.
Figure 8. TSED-CNN confusion matrix diagram.
Table 1. Results of Baseline Model Comparison Test.

Model    Accuracy/%    Precision/%    Recall/%    F1-Score    Param/MB
CNN6     94.176        94.279         94.186      0.94201     18.29
CNN10    94.673        94.811         94.678      0.94689     19.82
CNN14    95.384        95.390         95.389      0.95383     318.81
Table 2. Results of Ablation Tests.

Model           Accuracy/%    Precision/%    Recall/%    F1-Score    Param/MB
CNN10           94.673        94.811         94.678      0.94689     19.82
CNN10 + MSCB    95.241        95.332         95.245      0.95247     20.59
CNN10 + HMSA    95.952        96.080         95.955      0.95980     19.87
CNN10 + SA      95.312        95.432         95.319      0.95340     19.82
TSED-CNN        96.378        96.394         96.380      0.96382     20.63
Table 3. Comparison Experiment Results with Mainstream Models.

Model        Accuracy/%    Precision/%    Recall/%    F1-Score    Param/MB
Res2Net      94.602        94.672         94.608      0.94599     22.46
CAM++        95.526        95.591         95.533      0.95525     28.73
TDNN         94.176        94.204         94.181      0.94184     11.09
EcapaTDNN    94.815        94.828         94.823      0.94802     24.78
TSED-CNN     96.378        96.394         96.380      0.96382     20.63
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Zheng, Y.; Yao, L. Research on Traffic Sound Event Detection Based on Multi-Scale Feature Fusion. Appl. Sci. 2026, 16, 2359. https://doi.org/10.3390/app16052359