Search Results (47)

Search Parameters:
Keywords = UrbanSound8k

25 pages, 2093 KiB  
Article
Deep Learning-Based Speech Enhancement for Robust Sound Classification in Security Systems
by Samuel Yaw Mensah, Tao Zhang, Nahid Al Mahmud and Yanzhang Geng
Electronics 2025, 14(13), 2643; https://doi.org/10.3390/electronics14132643 - 30 Jun 2025
Viewed by 798
Abstract
Deep learning has emerged as a powerful technique for speech enhancement, particularly in security systems where audio signals are often degraded by non-stationary noise. Traditional signal processing methods struggle in such conditions, making it difficult to detect critical sounds like gunshots, alarms, and unauthorized speech. This study investigates a hybrid deep learning framework that combines Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Generative Adversarial Networks (GANs) to enhance speech quality and improve sound classification accuracy in noisy security environments. The proposed model is trained and validated using real-world datasets containing diverse noise distortions, including VoxCeleb for benchmarking speech enhancement and UrbanSound8K and ESC-50 for sound classification. Performance is evaluated using industry-standard metrics such as Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Signal-to-Noise Ratio (SNR). The architecture includes multi-layered neural networks, residual connections, and dropout regularization to ensure robustness and generalizability. Additionally, the paper addresses key challenges in deploying deep learning models for security applications, such as computational complexity, latency, and vulnerability to adversarial attacks. Experimental results demonstrate that the proposed DNN + GAN-based approach significantly improves speech intelligibility and classification performance in high-interference scenarios, offering a scalable solution for enhancing the reliability of audio-based security systems.
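
As a rough illustration of the mask-based CNN + RNN pattern this abstract describes, here is a minimal PyTorch sketch. The layer sizes, the GRU choice, and the omission of the GAN stage are assumptions for illustration, not the authors' architecture.

```python
# A minimal sketch of a CNN + RNN mask-based speech enhancer (assumed design).
import torch
import torch.nn as nn

class CnnRnnEnhancer(nn.Module):
    """Predicts a time-frequency mask over a magnitude spectrogram."""
    def __init__(self, n_freq=257, hidden=128):
        super().__init__()
        # CNN front end: local time-frequency feature extraction
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1), nn.ReLU(),
        )
        # RNN back end: temporal context across frames
        self.gru = nn.GRU(n_freq, hidden, num_layers=2,
                          batch_first=True, dropout=0.2)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, spec):                 # spec: (batch, frames, n_freq)
        x = self.conv(spec.unsqueeze(1)).squeeze(1)
        x, _ = self.gru(x)
        return self.mask(x) * spec           # masked (enhanced) spectrogram

noisy = torch.rand(4, 100, 257)              # dummy batch of spectrograms
enhanced = CnnRnnEnhancer()(noisy)
print(enhanced.shape)                        # torch.Size([4, 100, 257])
```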

16 pages, 1166 KiB  
Article
Research on Acoustic Scene Classification Based on Time–Frequency–Wavelet Fusion Network
by Fengzheng Bi and Lidong Yang
Sensors 2025, 25(13), 3930; https://doi.org/10.3390/s25133930 - 24 Jun 2025
Viewed by 405
Abstract
Acoustic scene classification aims to recognize the scenes corresponding to sound signals in the environment, but audio differences across cities and recording devices can affect a model's accuracy. In this paper, a time–frequency–wavelet fusion network is proposed to improve model performance by focusing on three dimensions: the time dimension of the spectrogram, the frequency dimension, and the high- and low-frequency information extracted by a wavelet transform through a time–frequency–wavelet module. Multidimensional information is fused through a gated temporal–spatial attention unit, and a visual state space module is introduced to enhance the contextual modeling capability of audio sequences. In addition, Kolmogorov–Arnold network layers are used in place of multilayer perceptrons in the classifier. The experimental results show that the proposed method achieves a 56.16% average accuracy on the TAU Urban Acoustic Scenes 2022 mobile development dataset, an improvement of 6.53% over the official baseline system. This performance improvement demonstrates the effectiveness of the model in complex scenarios. In addition, the accuracy of the proposed method on the UrbanSound8K dataset reached 97.60%, which is significantly better than existing methods, further verifying the generalization ability of the proposed model in the acoustic scene classification task.
(This article belongs to the Section Intelligent Sensors)
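
The wavelet branch can be illustrated with a discrete wavelet transform that splits a spectrogram into low- and high-frequency components. A hedged sketch using PyWavelets; the wavelet family and decomposition level are illustrative choices, not the paper's exact time–frequency–wavelet module.

```python
# Splitting a (mel) spectrogram into approximation/detail components (sketch).
import numpy as np
import pywt

spec = np.random.rand(128, 431)          # dummy mel spectrogram (mels, frames)

# 1-level DWT along the frequency axis: cA ~ low-frequency content,
# cD ~ high-frequency detail, each at roughly half the frequency resolution.
cA, cD = pywt.dwt(spec, wavelet="db4", axis=0)
print(cA.shape, cD.shape)                # approximately (67, 431) (67, 431)
```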

28 pages, 13595 KiB  
Article
Open-Set Recognition of Environmental Sound Based on KDE-GAN and Attractor–Reciprocal Point Learning
by Jiakuan Wu, Nan Wang, Huajie Hong, Wei Wang, Kunsheng Xing and Yujie Jiang
Acoustics 2025, 7(2), 33; https://doi.org/10.3390/acoustics7020033 - 28 May 2025
Viewed by 731
Abstract
While open-set recognition algorithms have been extensively explored in computer vision, their application to environmental sound analysis remains understudied. To address this gap, this study investigates how to effectively recognize unknown sound categories in real-world environments by proposing a novel Kernel Density Estimation-based Generative Adversarial Network (KDE-GAN) for data augmentation combined with Attractor–Reciprocal Point Learning for open-set classification. Specifically, our approach addresses three key challenges: (1) How to generate boundary-aware synthetic samples for robust open-set training: A closed-set classifier’s pre-logit layer outputs are fed into the KDE-GAN, which synthesizes samples mapped to the logit layer using the classifier’s original weights. Kernel Density Estimation then enforces Density Loss and Offset Loss to ensure these samples align with class boundaries. (2) How to optimize feature space organization: The closed-set classifier is constrained by an Attractor–Reciprocal Point joint loss, maintaining intra-class compactness while pushing unknown samples toward low-density regions. (3) How to evaluate performance in highly open scenarios: We validate the method using UrbanSound8K, AudioEventDataset, and TUT Acoustic Scenes 2017 as closed sets, with ESC-50 categories as open-set samples, achieving AUROC/OSCR scores of 0.9251/0.8743, 0.7921/0.7135, and 0.8209/0.6262, respectively. The findings demonstrate the potential of this framework to enhance environmental sound monitoring systems, particularly in applications requiring adaptability to unseen acoustic events (e.g., urban noise surveillance or wildlife monitoring).
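
The kernel-density idea behind the Density Loss can be sketched in a few lines: fit a KDE on real pre-logit features and score GAN-synthesized samples against it, penalizing samples that fall in high-density class cores. Shapes and the loss form below are assumptions, not the paper's exact formulation.

```python
# KDE scoring of synthetic features against real class features (sketch).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 8))          # pre-logit features, one class
fake_feats = rng.normal(loc=2.0, size=(64, 8))  # generator outputs

kde = gaussian_kde(real_feats.T)                # KDE expects (dims, samples)
density = kde(fake_feats.T)                     # density at each fake sample

# Penalizing high-density placements pushes synthetic samples toward the
# class boundary rather than the class core.
density_loss = density.mean()
print(f"density loss: {density_loss:.3e}")
```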

11 pages, 1273 KiB  
Article
Validation of a Swine Cough Monitoring System Under Field Conditions
by Luís F. C. Garrido, Gabriel S. T. Rodrigues, Leandro B. Costa, Diego J. Kurtz and Ruan R. Daros
AgriEngineering 2025, 7(5), 140; https://doi.org/10.3390/agriengineering7050140 - 6 May 2025
Viewed by 821
Abstract
Precision livestock farming technologies support health monitoring on farms, yet few studies have evaluated their effectiveness under field conditions using reliable gold standards. This study evaluated a commercially available technology for detecting cough sounds in pigs on a commercial farm. Audio was recorded over six days using 16 microphones across two pig barns. A total of 1110 cough sounds were labelled by an on-site observer using a cough induction methodology, and 8938 other sounds from farm recordings and open-source datasets (ESC-50, UrbanSound8K, and AudioSet) were labelled. A hybrid deep learning model combining Convolutional Neural Networks and Recurrent Neural Networks was trained and evaluated using these labels. A total of 34 audio features were extracted from 1 s segments, including validated descriptors (e.g., MFCC), unverified external features, and proprietary features. Features were evaluated through 10-fold cross-validation based on classification performance and runtime, resulting in eight final features. The final model showed high performance (recall = 98.6%, specificity = 99.7%, precision = 98.8%, accuracy = 99.6%, F1-score = 98.6%). The technology tested was shown to be efficient for monitoring cough sounds in a commercial swine production facility. It is recommended to test the technology in other environments to evaluate the effectiveness in different farm settings.
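
Per-segment feature extraction of the validated-descriptor kind (1 s windows, MFCCs) can be sketched with librosa. The file name and the 13-coefficient choice are illustrative; the paper's full 34-feature set, including proprietary features, is not reproduced.

```python
# MFCC extraction from 1-second segments (sketch with a hypothetical file).
import librosa
import numpy as np

y, sr = librosa.load("barn_audio.wav", sr=16000)    # hypothetical recording
seg_len = sr                                        # 1-second segments

features = []
for start in range(0, len(y) - seg_len + 1, seg_len):
    seg = y[start:start + seg_len]
    mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13)
    features.append(mfcc.mean(axis=1))              # summarize over frames
X = np.vstack(features)                             # (n_segments, 13)
print(X.shape)
```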

18 pages, 4389 KiB  
Article
How Vegetation Structure Shapes the Soundscape: Acoustic Community Partitioning and Its Implications for Urban Forestry Management
by Yilin Zhao, Zhenkai Sun, Zitong Bai, Jiali Jin and Cheng Wang
Forests 2025, 16(4), 669; https://doi.org/10.3390/f16040669 - 11 Apr 2025
Viewed by 462
Abstract
Urban green spaces are critical yet understudied areas where anthropogenic and biological sounds interact. This study investigates how vegetation structure mediates the acoustic partitioning of urban soundscapes and informs sustainable forestry management. Through the principal component analysis (PCA) of 1–11 kHz frequency bands, we identified anthropogenic sounds (1–2 kHz) and biological sounds (2–11 kHz). Within bio-acoustic communities, PCA further revealed three positively correlated sub-clusters (2–4 kHz, 5–6 kHz, and 6–11 kHz), suggesting cooperative niche partitioning among avian, amphibian, and insect vocalizations. Linear mixed models highlighted vegetation’s dual role: mature tree stands (explaining 19.9% variance) and complex vertical structures (leaf-height diversity: 12.2%) significantly enhanced biological soundscapes (R2m = 0.43) while suppressing anthropogenic noise through canopy stratification (32.3% variance explained). Based on our findings, we suggest that an acoustic data-driven framework—comprising (1) the preservation of mature stands with multi-layered canopies to enhance bioacoustic resilience, (2) strategic planting of mid-story vegetation to disrupt low-frequency noise propagation, and (3) real-time soundscape monitoring to balance biophony and anthropophony allocation—can contribute to promoting sustainable urban forestry management.
(This article belongs to the Section Urban Forestry)
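
The band-wise PCA step can be illustrated with scikit-learn: given per-recording sound levels in 1 kHz bands from 1–11 kHz, PCA separates co-varying band groups (e.g., a 1–2 kHz anthropogenic component versus higher-frequency biological components). The data below are random stand-ins, not the study's measurements.

```python
# PCA over frequency-band sound levels (sketch with placeholder data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
band_levels = rng.normal(size=(200, 10))   # 200 recordings x ten 1-kHz bands

pca = PCA(n_components=3)
scores = pca.fit_transform(band_levels)
print(pca.explained_variance_ratio_)       # variance captured per component
print(pca.components_[0])                  # band loadings of component 1
```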

20 pages, 634 KiB  
Article
SATRN: Spiking Audio Tagging Robust Network
by Shouwei Gao, Xingyang Deng, Xiangyu Fan, Pengliang Yu, Hao Zhou and Zihao Zhu
Electronics 2025, 14(4), 761; https://doi.org/10.3390/electronics14040761 - 15 Feb 2025
Viewed by 612
Abstract
Audio tagging, as a fundamental task in acoustic signal processing, has demonstrated significant advances and broad applications in recent years. Spiking Neural Networks (SNNs), inspired by biological neural systems, exploit event-driven computing paradigms and temporal information processing, enabling superior energy efficiency. Despite the increasing adoption of SNNs, the potential of event-driven encoding mechanisms for audio tagging remains largely unexplored. This work presents a pioneering investigation into event-driven encoding strategies for SNN-based audio tagging. We propose the SATRN (Spiking Audio Tagging Robust Network), a novel architecture that integrates temporal–spatial attention mechanisms with membrane potential residual connections. The network employs a dual-stream structure combining global feature fusion and local feature extraction through inverted bottleneck blocks, specifically designed for efficient audio processing. Furthermore, we introduce an event-based encoding approach that enhances the resilience of Spiking Neural Networks to disturbances while maintaining performance. Our experimental results on the UrbanSound8K and FSD50K datasets demonstrate that the SATRN achieves comparable performance to traditional Convolutional Neural Networks (CNNs) while requiring significantly less computation time and showing superior robustness against noise perturbations, making it particularly suitable for edge computing scenarios and real-time audio processing applications.
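
The leaky integrate-and-fire (LIF) dynamics that SNN audio models build on can be sketched in a few lines, with a soft reset that retains residual membrane potential in the spirit the abstract describes. Constants are illustrative; SATRN's actual blocks are not reproduced.

```python
# One LIF timestep: leaky integration, spiking, soft reset (sketch).
import torch

def lif_step(x, v, tau=2.0, v_th=1.0):
    v = v + (x - v) / tau          # leak toward the input current
    spikes = (v >= v_th).float()   # fire where the threshold is crossed
    v = v - spikes * v_th          # soft reset keeps residual potential
    return spikes, v

T, batch, dim = 8, 4, 32
inputs = torch.rand(T, batch, dim)
v = torch.zeros(batch, dim)
outputs = []
for t in range(T):                 # event-driven: only spikes propagate
    s, v = lif_step(inputs[t], v)
    outputs.append(s)
print(torch.stack(outputs).mean())  # overall firing rate
```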

21 pages, 2188 KiB  
Article
Urban Sound Recognition in Smart Cities Using an IoT–Fog Computing Framework and Deep Learning Models: A Performance Comparison
by Buket İşler
Appl. Sci. 2025, 15(3), 1201; https://doi.org/10.3390/app15031201 - 24 Jan 2025
Cited by 1 | Viewed by 1495
Abstract
Rapid urbanization presents significant challenges in energy consumption, noise control, and environmental sustainability. Smart cities aim to address these issues by leveraging information technologies to enhance operational efficiency and urban liveability. In this context, urban sound recognition supports environmental monitoring and public safety. This study provides a comparative evaluation of three machine learning models—convolutional neural networks (CNNs), long short-term memory (LSTM), and dense neural networks (Dense)—for classifying urban sounds. The analysis used the UrbanSound8K dataset, a static dataset designed for environmental sound classification, with mel-frequency cepstral coefficients (MFCCs) applied to extract core sound features. The models were tested in a fog computing architecture on AWS to simulate a smart city environment, chosen for its potential to reduce latency and optimize bandwidth for future real-time sound-recognition applications. Although real-time data were not used, the simulated setup effectively assessed model performance under conditions relevant to smart city applications. According to macro and weighted F1-score metrics, the CNN model achieved the highest accuracy at 90%, followed by the Dense model at 84% and the LSTM model at 81%, with the LSTM model showing limitations in distinguishing overlapping sound categories. These simulations demonstrated the framework’s capacity to enable efficient urban sound recognition within a fog-enabled architecture, underscoring its potential for real-time environmental monitoring and public safety applications.
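
The reported comparison rests on macro and weighted F1 scores, which are straightforward to compute with scikit-learn. A minimal sketch with dummy predictions standing in for the CNN/LSTM/Dense outputs:

```python
# Macro and weighted F1 on ten UrbanSound8K-style classes (sketch).
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(2)
y_true = rng.integers(0, 10, size=500)          # ten class labels
y_pred = np.where(rng.random(500) < 0.9,        # a roughly 90%-accurate model
                  y_true, rng.integers(0, 10, size=500))

print("macro F1:   ", f1_score(y_true, y_pred, average="macro"))
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
```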

16 pages, 5641 KiB  
Article
Research on Battery Electric Vehicles’ DC Fast Charging Noise Emissions: Proposals to Reduce Environmental Noise Caused by Fast Charging Stations
by David Clar-Garcia, Hector Campello-Vicente, Miguel Fabra-Rodriguez and Emilio Velasco-Sanchez
World Electr. Veh. J. 2025, 16(1), 42; https://doi.org/10.3390/wevj16010042 - 14 Jan 2025
Cited by 4 | Viewed by 2850
Abstract
The potential of electric vehicles (EVs) to support the decarbonization of the transportation sector, crucial for meeting greenhouse gas reduction targets under the Paris Agreement, is obvious. Despite their advantages, the adoption of electric vehicles faces limitations, particularly those related to battery range and charging times, which significantly impact the time needed for a trip compared to their combustion engine counterparts. However, recent improvements in fast charging technology have enhanced these aspects, making EVs more suitable for both daily and long-distance trips. EVs can now deal with long trips, with travel times only slightly longer than those of internal combustion engine (ICE) vehicles. Fast charging capabilities and infrastructure, such as 350 kW chargers, are essential for making EV travel times comparable to ICE vehicles, with brief stops every 2–3 h. Additionally, EVs help reduce noise pollution in urban areas, especially in noise-saturated environments, contributing to an overall decrease in urban sound levels. However, this research highlights a downside of DC (Direct Current) fast charging stations: high-frequency noise emissions during fast charging, which can disturb nearby residents, especially in urban and residential areas. This noise, a result of the growing fast charging infrastructure, has led to complaints and even operational restrictions for some charging stations. Noise-related disturbances are a significant urban issue. The World Health Organization identifies noise as a key contributor to health burdens in Europe, even when noise annoyance is subjective, influenced by individual factors like sensitivity, genetics, and lifestyle, as well as by the specific environment. This paper analyzes the sound emission of a broad sample of DC fast charging stations from leading EU market brands. The goal is to provide tools that assist manufacturers, installers, and operators of rapid charging stations in mitigating the aforementioned sound emissions in order to align these infrastructures with Sustainable Development Goals 3 and 11 adopted by all United Nations Member States in 2015. Full article
(This article belongs to the Special Issue Fast-Charging Station for Electric Vehicles: Challenges and Issues)
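
The high-frequency emission at the heart of the measurement campaign can be approximated with standard tools: band-pass the recording above 1 kHz and report the band level relative to full-band energy. A hedged sketch assuming a mono recording and illustrative band edges; the paper's instrumentation and procedure are not reproduced.

```python
# High-frequency band level of a charger recording, in dB (sketch).
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt

sr, x = wavfile.read("charger_noise.wav")        # hypothetical measurement
x = x.astype(np.float64)

sos = butter(4, 1000, btype="highpass", fs=sr, output="sos")
hf = sosfilt(sos, x)

full_rms = np.sqrt(np.mean(x ** 2))
hf_rms = np.sqrt(np.mean(hf ** 2))
print(f"high-frequency share: {20 * np.log10(hf_rms / full_rms):.1f} dB")
```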

21 pages, 8333 KiB  
Article
Urban-Scale Acoustic Comfort Map: Fusion of Social Inputs, Noise Levels, and Citizen Comfort in Open GIS
by Farzaneh Zarei, Mazdak Nik-Bakht, Joonhee Lee and Farideh Zarei
Processes 2024, 12(12), 2864; https://doi.org/10.3390/pr12122864 - 13 Dec 2024
Viewed by 1317
Abstract
With advancements in the Internet of Things (IoT), diverse and high-resolution data sources, such as environmental sensors and user-generated inputs from mobile devices, have become available to model and estimate citizens’ acoustic comfort in urban environments. These IoT-enabled data sources offer scalable insights in real time into both objective parameters (e.g., noise levels and environmental conditions) and subjective perceptions (e.g., personal comfort and soundscape experiences), which were previously challenging to capture comprehensively by using traditional methods. Despite this, there remains a lack of a clear framework explicitly presenting the role of these diverse inputs in determining acoustic comfort. This paper contributes by (1) exploring the relationship between attributes governing the physical aspect of the built environment (sensory data) and the end-users’ characteristics/inputs/sensations (such as their acoustic comfort level) and how these attributes can correlate/connect; (2) developing a CityGML-based framework that leverages semantic 3D city models to integrate and represent both objective sensory data and subjective social inputs, enhancing data-driven decision making at the city level; and (3) introducing a novel approach to crowdsourcing citizen inputs to assess perceived acoustic comfort indicators, which inform predictive modeling efforts. Our solution is based on CityGML’s capacity to store and explain 3D city-related shapes with their semantic characteristics, which are essential for city-level operations such as spatial data mining and thematic queries. To do so, a crowdsourcing method was used, and 20 perceptive indicators were identified from the existing literature to evaluate people’s perceived acoustic attributes and types of sound sources and their relations to the perceived soundscape comfort. Three regression models—K-Nearest Neighbor (KNN), Support Vector Regression (SVR), and XGBoost—were trained on the collected data to predict acoustic comfort at bus stops in Montréal based on physical and psychological attributes of travellers. In the best-performing scenario, which incorporated psychological attributes and measured noise levels, the models achieved a normalized mean squared error (NMSE) as low as 0.0181, a mean absolute error (MAE) of 0.0890, and a root mean square error (RMSE) of 0.1349. These findings highlight the effectiveness of integrating subjective and objective data sources to accurately predict acoustic comfort in urban environments.
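
The regression comparison can be sketched with scikit-learn: KNN and SVR as in the paper, with a gradient-boosted regressor standing in for XGBoost to avoid an extra dependency. Features and comfort scores below are random placeholders for the crowdsourced bus-stop data.

```python
# Comparing regressors on placeholder comfort data with MAE and RMSE (sketch).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 21))                 # 20 indicators + a noise level
y = rng.random(300)                            # perceived comfort in [0, 1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("KNN", KNeighborsRegressor()), ("SVR", SVR()),
                    ("GBoost (XGBoost stand-in)", GradientBoostingRegressor())]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mae = mean_absolute_error(y_te, pred)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    print(f"{name}: MAE={mae:.4f} RMSE={rmse:.4f}")
```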

17 pages, 2093 KiB  
Article
Investigation of Data Augmentation Techniques in Environmental Sound Recognition
by Anastasios Loukas Sarris, Nikolaos Vryzas, Lazaros Vrysis and Charalampos Dimoulas
Electronics 2024, 13(23), 4719; https://doi.org/10.3390/electronics13234719 - 28 Nov 2024
Viewed by 1260
Abstract
The majority of sound events that occur in everyday life, such as those caused by animals or household devices, belong to the environmental sound family. This audio category has not been researched as thoroughly as music or speech recognition. One main bottleneck in the design of environmental data-driven monitoring automation is the lack of sufficient data representing each of a wide range of categories. In the context of audio data, an important way to increase the available data is to augment existing datasets. In this study, four of the most widespread time domain data augmentation techniques are studied, along with their effects on the recognition of environmental sounds, using the UrbanSound8K dataset, which consists of ten classes. The confusion matrix, and the metrics that can be calculated from it, were used to examine the effect of the augmentation, with a convolutional neural network trained on the original set serving as the reference classifier. Although the parameters of the techniques were chosen conservatively, they helped the model to better cluster the data, especially in the four classes in which confusion was high in the initial classification. Furthermore, to address the difficulty that arises when large datasets are augmented, a web application is presented in which users can upload their own data and apply these augmentation techniques to both the audio excerpt and its time-frequency representation, the spectrogram.
(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)
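
Four widespread time-domain augmentations of the kind the study evaluates can be sketched with librosa and NumPy; the paper's exact techniques and parameters may differ from these illustrative choices.

```python
# Time-domain audio augmentation on a synthetic clip (sketch).
import librosa
import numpy as np

sr = 22050
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 440 * t).astype(np.float32)  # 1 s test tone
rng = np.random.default_rng(4)

noisy     = y + 0.005 * rng.standard_normal(len(y))           # additive noise
louder    = np.clip(y * 1.5, -1.0, 1.0)                       # gain change
stretched = librosa.effects.time_stretch(y, rate=0.9)         # slow down 10%
shifted   = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # up 2 semitones

for name, sig in [("noisy", noisy), ("louder", louder),
                  ("stretched", stretched), ("shifted", shifted)]:
    print(name, sig.shape)
```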

16 pages, 3506 KiB  
Article
HADNet: A Novel Lightweight Approach for Abnormal Sound Detection on Highway Based on 1D Convolutional Neural Network and Multi-Head Self-Attention Mechanism
by Cong Liang, Qian Chen, Qiran Li, Qingnan Wang, Kang Zhao, Jihui Tu and Ammar Jafaripournimchahi
Electronics 2024, 13(21), 4229; https://doi.org/10.3390/electronics13214229 - 28 Oct 2024
Cited by 1 | Viewed by 1374
Abstract
Video surveillance is an effective tool for traffic management and safety, but it may face challenges in extreme weather, low visibility, areas outside the monitoring field of view, or during nighttime conditions. Therefore, abnormal sound detection is used in traffic management and safety as an auxiliary tool to complement video surveillance. In this paper, a novel lightweight method for abnormal sound detection on embedded systems, named HADNet, is proposed, based on a 1D CNN and a Multi-Head Self-Attention Mechanism. First, the 1D CNN is employed for local feature extraction, which minimizes information loss from the audio signal during time-frequency conversion and reduces computational complexity. Second, the proposed block based on the Multi-Head Self-Attention Mechanism not only effectively mitigates the issue of vanishing gradients, but also enhances detection accuracy. Finally, a joint loss function is employed to detect abnormal audio. This choice helps address issues related to unbalanced training data and class overlap, thereby improving model performance on imbalanced datasets. The proposed HADNet method was evaluated on the MIVIA Road Events and UrbanSound8K datasets. The results demonstrate that the proposed method for abnormal audio detection on embedded systems achieves a high accuracy of 99.6% and an efficient detection time of 0.06 s. This approach proves to be robust and suitable for practical applications in traffic management and safety. By addressing the challenges posed by traditional video surveillance methods, HADNet offers a valuable and complementary solution for enhancing safety measures in diverse traffic conditions.
(This article belongs to the Special Issue Fault Detection Technology Based on Deep Learning)
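
The pattern the abstract names, a 1D convolutional front end on the raw waveform followed by multi-head self-attention, can be sketched in PyTorch. Channel counts, strides, and the head count are illustrative, not HADNet's.

```python
# 1D CNN front end + multi-head self-attention over time (sketch).
import torch
import torch.nn as nn

class TinyHadNet(nn.Module):
    def __init__(self, n_classes=2, dim=64):
        super().__init__()
        self.frontend = nn.Sequential(      # raw audio -> downsampled features
            nn.Conv1d(1, 32, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(32, dim, kernel_size=9, stride=4, padding=4), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, wav):                 # wav: (batch, samples)
        x = self.frontend(wav.unsqueeze(1)).transpose(1, 2)  # (B, T, dim)
        x, _ = self.attn(x, x, x)           # self-attention over time
        return self.head(x.mean(dim=1))     # pooled logits

logits = TinyHadNet()(torch.randn(2, 16000))   # two 1 s clips at 16 kHz
print(logits.shape)                            # torch.Size([2, 2])
```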

18 pages, 3589 KiB  
Article
Improved Patch-Mix Transformer and Contrastive Learning Method for Sound Classification in Noisy Environments
by Xu Chen, Mei Wang, Ruixiang Kan and Hongbing Qiu
Appl. Sci. 2024, 14(21), 9711; https://doi.org/10.3390/app14219711 - 24 Oct 2024
Cited by 1 | Viewed by 1592
Abstract
In urban environments, noise significantly impacts daily life and presents challenges for Environmental Sound Classification (ESC). The structural influence of urban noise on audio signals complicates feature extraction and audio classification for environmental sound classification methods. To address these challenges, this paper proposes a Contrastive Learning-based Audio Spectrogram Transformer (CL-Transformer) that incorporates a Patch-Mix mechanism and adaptive contrastive learning strategies while simultaneously improving and utilizing adaptive data augmentation techniques for model training. Firstly, a combination of data augmentation techniques is introduced to enrich environmental sounds. Then, the Patch-Mix feature fusion scheme randomly mixes patches of the enhanced and noisy spectrograms during the Transformer’s patch embedding. Furthermore, a novel contrastive learning scheme is introduced to quantify loss and improve model performance, synergizing well with the Transformer model. Finally, experiments on the ESC-50 and UrbanSound8K public datasets achieved accuracies of 97.75% and 92.95%, respectively. To simulate the impact of noise in real urban environments, the model is evaluated using the UrbanSound8K dataset with added background noise at different signal-to-noise ratios (SNR). Experimental results demonstrate that the proposed framework performs well in noisy environments.
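
The Patch-Mix step can be illustrated after patch embedding: randomly swap a fraction of token positions between a clean(-augmented) spectrogram and a noisy one. Token shapes and the mix ratio below are assumptions, not the paper's settings.

```python
# Patch-level mixing of clean and noisy spectrogram tokens (sketch).
import torch

def patch_mix(clean_tokens, noisy_tokens, mix_ratio=0.3):
    """clean_tokens, noisy_tokens: (batch, n_patches, dim)."""
    b, n, _ = clean_tokens.shape
    swap = torch.rand(b, n, 1) < mix_ratio          # per-patch swap mask
    mixed = torch.where(swap, noisy_tokens, clean_tokens)
    lam = swap.float().mean(dim=(1, 2))             # fraction actually mixed,
    return mixed, lam                               # usable in a mixed loss

clean = torch.randn(4, 196, 768)                    # e.g., 14x14 ViT patches
noisy = torch.randn(4, 196, 768)
mixed, lam = patch_mix(clean, noisy)
print(mixed.shape, lam)
```

Returning the realized mix fraction per example mirrors how mixup-style schemes weight the loss between the two sources.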

19 pages, 3589 KiB  
Article
Investigation of Bird Sound Transformer Modeling and Recognition
by Darui Yi and Xizhong Shen
Electronics 2024, 13(19), 3964; https://doi.org/10.3390/electronics13193964 - 9 Oct 2024
Cited by 1 | Viewed by 1869
Abstract
Birds play a pivotal role in ecosystem and biodiversity research, and accurate bird identification contributes to the monitoring of biodiversity, understanding of ecosystem functionality, and development of effective conservation strategies. Current methods for bird sound recognition often involve processing bird songs into various acoustic features or fusion features for identification, which can result in information loss and complicate the recognition process. At the same time, recognition methods based on raw bird audio have not received widespread attention. Therefore, this study proposes a bird sound recognition method that utilizes multiple one-dimensional convolutional neural networks to directly learn feature representations from raw audio data, simplifying the feature extraction process. We also apply positional embedding convolution and multiple Transformer modules to enhance feature processing and improve accuracy. Additionally, we introduce a trainable weight array to control the importance of each Transformer module for better generalization of the model. Experimental results demonstrate our model’s effectiveness, with an accuracy rate of 99.58% for the public dataset Birds_data, as well as 98.77% for the Birdsound1 dataset, and 99.03% for the UrbanSound8K environmental sound dataset.
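
Two of the abstract's ideas, learning directly from raw audio with 1D convolutions and weighting several Transformer modules with a trainable array (softmax-normalized here), can be sketched as follows. All sizes are illustrative, not the paper's configuration.

```python
# Raw-audio 1D conv front end + trainably weighted Transformer modules (sketch).
import torch
import torch.nn as nn

class WeightedTransformerStack(nn.Module):
    def __init__(self, dim=64, n_modules=3, n_classes=10):
        super().__init__()
        self.frontend = nn.Conv1d(1, dim, kernel_size=80, stride=16)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(n_modules))
        self.weights = nn.Parameter(torch.zeros(n_modules))  # trainable array
        self.head = nn.Linear(dim, n_classes)

    def forward(self, wav):                     # wav: (batch, samples)
        x = self.frontend(wav.unsqueeze(1)).transpose(1, 2)
        outs = []
        for block in self.blocks:               # collect each module's output
            x = block(x)
            outs.append(x.mean(dim=1))
        w = torch.softmax(self.weights, dim=0)  # importance of each module
        pooled = sum(wi * oi for wi, oi in zip(w, outs))
        return self.head(pooled)

print(WeightedTransformerStack()(torch.randn(2, 16000)).shape)  # (2, 10)
```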

20 pages, 17050 KiB  
Article
Near- and Far-Field Acoustic Characteristics and Sound Source Localization Performance of Low-Noise Propellers with Gapped Gurney Flap
by Ryusuke Noda, Kotaro Hoshiba, Izumi Komatsuzaki, Toshiyuki Nakata and Hao Liu
Drones 2024, 8(6), 265; https://doi.org/10.3390/drones8060265 - 14 Jun 2024
Cited by 4 | Viewed by 2648
Abstract
With the rapid industrialization utilizing multi-rotor drones in recent years, an increase in urban flights is expected in the near future. This may potentially result in noise pollution due to the operation of drones. This study investigates the near- and far-field acoustic characteristics of low-noise propellers inspired by Gurney flaps. In addition, we examine the impact of these low-noise propellers on the sound source localization performance of drones equipped with a microphone array, which are expected to be used for rescuing people in disasters. Results from in-flight noise measurements indicate significant noise reduction mainly in frequency bands above 1 kHz in both the near- and far-field. An improvement in the success rate of sound source localization with low-noise propellers was also observed. However, the influence of the position of the microphone array with respect to the propellers is more pronounced than that of propeller shape manipulation, suggesting the importance of considering the positional relationships. Computational fluid dynamics analysis of the flow field around the propellers suggests potential mechanisms for noise reduction in the developed low-noise propellers. The results obtained in this study hold potential for contributing to the development of integrated drones aimed at reducing noise and improving sound source localization performance.
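
A standard building block of microphone-array sound source localization is GCC-PHAT time-delay estimation between channel pairs. A hedged NumPy sketch; the paper's actual array processing and drone setup are not reproduced.

```python
# GCC-PHAT delay estimation between two microphone channels (sketch).
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Return the estimated delay (s) of `sig` relative to `ref`."""
    n = len(sig) + len(ref)
    X = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cc = np.fft.irfft(X / (np.abs(X) + 1e-12), n=n)   # PHAT weighting
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

fs = 16000
t = np.arange(fs) / fs
ref = np.sin(2 * np.pi * 800 * t) * np.exp(-3 * t)    # synthetic source
sig = np.roll(ref, 40)                                # 40-sample delay
print(gcc_phat(sig, ref, fs))                         # ~40 / 16000 = 2.5e-3 s
```

Pairwise delays like this one are what an array geometry then converts into a bearing estimate, which is also where propeller self-noise enters the problem.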

23 pages, 3711 KiB  
Article
ESC-NAS: Environment Sound Classification Using Hardware-Aware Neural Architecture Search for the Edge
by Dakshina Ranmal, Piumini Ranasinghe, Thivindu Paranayapa, Dulani Meedeniya and Charith Perera
Sensors 2024, 24(12), 3749; https://doi.org/10.3390/s24123749 - 9 Jun 2024
Cited by 8 | Viewed by 1952
Abstract
The combination of deep learning and IoT plays a significant role in modern smart solutions, providing the capability of handling task-specific real-time offline operations with improved accuracy and minimised resource consumption. This study presents a novel hardware-aware neural architecture search approach, called ESC-NAS, to design and develop deep convolutional neural network architectures specifically tailored for handling raw audio inputs in environmental sound classification applications under limited computational resources. The ESC-NAS process consists of a novel cell-based neural architecture search space built with 2D convolution, batch normalization, and max pooling layers, capable of extracting features from raw audio. A black-box Bayesian optimization search strategy explores the search space, and the resulting model architectures are evaluated through hardware simulation. The models obtained from the ESC-NAS process achieved the optimal trade-off between model performance and resource consumption compared to the existing literature. The ESC-NAS models achieved accuracies of 85.78%, 81.25%, 96.25%, and 81.0% for the FSC22, UrbanSound8K, ESC-10, and ESC-50 datasets, respectively, with optimal model sizes and parameter counts for edge deployment.
(This article belongs to the Section Internet of Things)
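
The search loop can be illustrated with a toy stand-in: sample cell configurations from a small space, build candidate CNNs from conv/BN/pool cells, and screen them by a resource proxy (parameter count). The real system uses black-box Bayesian optimization and hardware simulation; this random-search sketch only illustrates the loop, and the parameter budget is an assumption.

```python
# Toy random search over a cell-based CNN search space (sketch).
import random
import torch.nn as nn

def build_candidate(n_cells, width):
    layers, in_ch = [], 1
    for _ in range(n_cells):                # cell: conv -> BN -> ReLU -> pool
        layers += [nn.Conv2d(in_ch, width, 3, padding=1),
                   nn.BatchNorm2d(width), nn.ReLU(), nn.MaxPool2d(2)]
        in_ch = width
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(width, 10)]
    return nn.Sequential(*layers)

random.seed(0)
budget = 100_000                            # edge parameter budget (assumed)
for trial in range(5):
    cfg = dict(n_cells=random.choice([2, 3, 4]),
               width=random.choice([16, 32]))
    model = build_candidate(**cfg)
    n_params = sum(p.numel() for p in model.parameters())
    ok = "fits" if n_params <= budget else "too large"
    print(f"trial {trial}: {cfg} -> {n_params} params ({ok})")
```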
