Search Results (12)

Search Parameters:
Keywords = urban environmental sound recognition

28 pages, 13595 KiB  
Article
Open-Set Recognition of Environmental Sound Based on KDE-GAN and Attractor–Reciprocal Point Learning
by Jiakuan Wu, Nan Wang, Huajie Hong, Wei Wang, Kunsheng Xing and Yujie Jiang
Acoustics 2025, 7(2), 33; https://doi.org/10.3390/acoustics7020033 - 28 May 2025
Viewed by 739
Abstract
While open-set recognition algorithms have been extensively explored in computer vision, their application to environmental sound analysis remains understudied. To address this gap, this study investigates how to effectively recognize unknown sound categories in real-world environments by proposing a novel Kernel Density Estimation-based Generative Adversarial Network (KDE-GAN) for data augmentation combined with Attractor–Reciprocal Point Learning for open-set classification. Specifically, our approach addresses three key challenges: (1) How to generate boundary-aware synthetic samples for robust open-set training: A closed-set classifier’s pre-logit layer outputs are fed into the KDE-GAN, which synthesizes samples mapped to the logit layer using the classifier’s original weights. Kernel Density Estimation then enforces Density Loss and Offset Loss to ensure these samples align with class boundaries. (2) How to optimize feature space organization: The closed-set classifier is constrained by an Attractor–Reciprocal Point joint loss, maintaining intra-class compactness while pushing unknown samples toward low-density regions. (3) How to evaluate performance in highly open scenarios: We validate the method using UrbanSound8K, AudioEventDataset, and TUT Acoustic Scenes 2017 as closed sets, with ESC-50 categories as open-set samples, achieving AUROC/OSCR scores of 0.9251/0.8743, 0.7921/0.7135, and 0.8209/0.6262, respectively. The findings demonstrate the potential of this framework to enhance environmental sound monitoring systems, particularly in applications requiring adaptability to unseen acoustic events (e.g., urban noise surveillance or wildlife monitoring). Full article
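The density-scoring idea behind KDE-based open-set recognition can be sketched in one dimension. This is a toy stand-in: the paper operates on pre-logit network features with learned losses, while the class names, feature values, bandwidth, and threshold below are purely illustrative.

```python
import math

def gaussian_kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimator over the sample points."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
    return density

# Toy 1-D "embeddings" for two known classes (hypothetical values).
class_embeddings = {
    "siren": [0.9, 1.0, 1.1, 1.05],
    "drill": [3.0, 3.1, 2.9, 3.05],
}
estimators = {c: gaussian_kde(pts, bandwidth=0.2) for c, pts in class_embeddings.items()}

def open_set_predict(x, threshold=0.1):
    """Assign the densest known class, or 'unknown' if every density is low."""
    scores = {c: est(x) for c, est in estimators.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"
```

A sample near a known cluster is assigned to it, while a sample in a low-density region falls back to "unknown" — the same accept/reject logic that AUROC and OSCR scores evaluate.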

21 pages, 2188 KiB  
Article
Urban Sound Recognition in Smart Cities Using an IoT–Fog Computing Framework and Deep Learning Models: A Performance Comparison
by Buket İşler
Appl. Sci. 2025, 15(3), 1201; https://doi.org/10.3390/app15031201 - 24 Jan 2025
Cited by 1 | Viewed by 1501
Abstract
Rapid urbanization presents significant challenges in energy consumption, noise control, and environmental sustainability. Smart cities aim to address these issues by leveraging information technologies to enhance operational efficiency and urban liveability. In this context, urban sound recognition supports environmental monitoring and public safety. This study provides a comparative evaluation of three machine learning models—convolutional neural networks (CNNs), long short-term memory (LSTM), and dense neural networks (Dense)—for classifying urban sounds. The analysis used the UrbanSound8K dataset, a static dataset designed for environmental sound classification, with mel-frequency cepstral coefficients (MFCCs) applied to extract core sound features. The models were tested in a fog computing architecture on AWS to simulate a smart city environment, chosen for its potential to reduce latency and optimize bandwidth for future real-time sound-recognition applications. Although real-time data were not used, the simulated setup effectively assessed model performance under conditions relevant to smart city applications. According to macro and weighted F1-score metrics, the CNN model achieved the highest accuracy at 90%, followed by the Dense model at 84% and the LSTM model at 81%, with the LSTM model showing limitations in distinguishing overlapping sound categories. These simulations demonstrated the framework’s capacity to enable efficient urban sound recognition within a fog-enabled architecture, underscoring its potential for real-time environmental monitoring and public safety applications. Full article
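The MFCC front-end used above rests on the mel scale. A minimal sketch of the HTK-style mel mapping and filter-center spacing follows; the full MFCC pipeline would add framing, FFT, filterbank energies, log, and DCT, typically via a library such as librosa.

```python
import math

def hz_to_mel(f):
    """HTK-style mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centers(n_filters, f_min, f_max):
    """Center frequencies (Hz) of triangular mel filters, equally spaced in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

centers = mel_filter_centers(10, 0.0, 8000.0)
```

The equal spacing in mel, not hertz, is what concentrates filters at low frequencies where urban sound classes differ most.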

17 pages, 2093 KiB  
Article
Investigation of Data Augmentation Techniques in Environmental Sound Recognition
by Anastasios Loukas Sarris, Nikolaos Vryzas, Lazaros Vrysis and Charalampos Dimoulas
Electronics 2024, 13(23), 4719; https://doi.org/10.3390/electronics13234719 - 28 Nov 2024
Viewed by 1263
Abstract
The majority of sound events that occur in everyday life, like those caused by animals or household devices, can be included in the environmental sound family. This audio category has not been researched as much as music or speech recognition. One main bottleneck in the design of environmental data-driven monitoring automation is the lack of sufficient data representing each of a wide range of categories. In the context of audio data, an important method to increase the available data is the process of the augmentation of existing datasets. In this study, some of the most widespread time domain data augmentation techniques are studied, along with their effects on the recognition of environmental sounds, through the UrbanSound8K dataset, which consists of ten classes. The confusion matrix and the metrics that can be calculated based on the matrix were used to examine the effect of the augmentation. Also, to address the difficulty that arises when large datasets are augmented, a web-based data augmentation application was created. To evaluate the performance of the data augmentation techniques, a convolutional neural network architecture trained on the original set was used. Moreover, four time domain augmentation techniques were used. Although the parameters of the techniques applied were chosen conservatively, they helped the model to better cluster the data, especially in the four classes in which confusion was high in the initial classification. Furthermore, a web application is presented in which the user can upload their own data and apply these data augmentation techniques to both the audio extract and its time frequency representation, the spectrogram. Full article
(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis)
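Time-domain augmentations of the kind studied above can be sketched as plain waveform transforms. The specific operations and parameter values here are illustrative, not the paper's chosen configuration.

```python
import random

def apply_gain(samples, gain):
    """Scale amplitude by a linear gain factor (not dB)."""
    return [s * gain for s in samples]

def time_shift(samples, shift):
    """Circularly shift the waveform by `shift` samples."""
    shift %= len(samples)
    if shift == 0:
        return samples[:]
    return samples[-shift:] + samples[:-shift]

def add_noise(samples, amount, seed=0):
    """Mix in uniform noise scaled by `amount` (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [s + amount * rng.uniform(-1.0, 1.0) for s in samples]

wave = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
augmented = add_noise(time_shift(apply_gain(wave, 0.8), 2), amount=0.01)
```

Each transform preserves the clip length, so augmented copies can be fed to the same network input as the originals.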

18 pages, 3589 KiB  
Article
Improved Patch-Mix Transformer and Contrastive Learning Method for Sound Classification in Noisy Environments
by Xu Chen, Mei Wang, Ruixiang Kan and Hongbing Qiu
Appl. Sci. 2024, 14(21), 9711; https://doi.org/10.3390/app14219711 - 24 Oct 2024
Cited by 1 | Viewed by 1596
Abstract
In urban environments, noise significantly impacts daily life and presents challenges for Environmental Sound Classification (ESC). The structural influence of urban noise on audio signals complicates feature extraction and audio classification for environmental sound classification methods. To address these challenges, this paper proposes a Contrastive Learning-based Audio Spectrogram Transformer (CL-Transformer) that incorporates a Patch-Mix mechanism and adaptive contrastive learning strategies while simultaneously improving and utilizing adaptive data augmentation techniques for model training. Firstly, a combination of data augmentation techniques is introduced to enrich environmental sounds. Then, the Patch-Mix feature fusion scheme randomly mixes patches of the enhanced and noisy spectrograms during the Transformer’s patch embedding. Furthermore, a novel contrastive learning scheme is introduced to quantify loss and improve model performance, synergizing well with the Transformer model. Finally, experiments on the ESC-50 and UrbanSound8K public datasets achieved accuracies of 97.75% and 92.95%, respectively. To simulate the impact of noise in real urban environments, the model is evaluated using the UrbanSound8K dataset with added background noise at different signal-to-noise ratios (SNR). Experimental results demonstrate that the proposed framework performs well in noisy environments. Full article
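Evaluating at a fixed signal-to-noise ratio, as in the noisy-environment tests above, amounts to scaling the noise before mixing. A minimal sketch, with toy waveforms in place of real audio:

```python
import math

def power(samples):
    """Mean squared amplitude of a waveform."""
    return sum(s * s for s in samples) / len(samples)

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so signal power / noise power matches `snr_db`, then mix."""
    target_noise_power = power(signal) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]

sig = [1.0, -1.0, 1.0, -1.0]        # toy clean waveform
noi = [0.5, 0.5, -0.5, -0.5]        # toy background noise
mixed = mix_at_snr(sig, noi, 10.0)  # mixture at 10 dB SNR
```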

13 pages, 2294 KiB  
Article
An Automatic Classification System for Environmental Sound in Smart Cities
by Dongping Zhang, Ziyin Zhong, Yuejian Xia, Zhutao Wang and Wenbo Xiong
Sensors 2023, 23(15), 6823; https://doi.org/10.3390/s23156823 - 31 Jul 2023
Cited by 8 | Viewed by 3317
Abstract
With the continuous promotion of “smart cities” worldwide, how to combine smart cities with modern advanced technologies (Internet of Things, cloud computing, artificial intelligence) has become a hot topic. However, due to the non-stationary nature of environmental sound and the interference of urban noise, it is challenging for a model with a single input to extract sufficiently rich features and achieve ideal classification results, even with deep learning methods. To improve the accuracy of environmental sound classification (ESC), we propose a dual-branch residual network (dual-resnet) based on feature fusion. For data pre-processing, a loop-padding method is proposed to patch shorter clips so that more useful information can be obtained. To prevent overfitting, we use time–frequency data augmentation to expand the dataset. After uniform pre-processing of all the original audio, the dual-branch residual network automatically extracts frequency-domain features from the log-Mel spectrogram and the log-spectrogram, and the two feature streams are then fused to make the audio representation more comprehensive. The experimental results show that, compared with other models, classification accuracy on the UrbanSound8K dataset is improved to varying degrees. Full article
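The loop-padding idea, repeating a short clip until it fills a fixed-length window, can be sketched directly:

```python
def loop_pad(samples, target_len):
    """Repeat a short clip end-to-end until it reaches `target_len` samples,
    then truncate; clips longer than the target are simply cut."""
    if not samples:
        raise ValueError("empty clip")
    out = []
    while len(out) < target_len:
        out.extend(samples)
    return out[:target_len]
```

Unlike zero-padding, every sample of the padded clip carries signal, which is what lets the network see "more useful information" from short recordings.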

27 pages, 1600 KiB  
Article
Evaluating the Performance of Pre-Trained Convolutional Neural Network for Audio Classification on Embedded Systems for Anomaly Detection in Smart Cities
by Mimoun Lamrini, Mohamed Yassin Chkouri and Abdellah Touhafi
Sensors 2023, 23(13), 6227; https://doi.org/10.3390/s23136227 - 7 Jul 2023
Cited by 10 | Viewed by 3958
Abstract
Environmental Sound Recognition (ESR) plays a crucial role in smart cities by accurately categorizing audio using well-trained Machine Learning (ML) classifiers. This application is particularly valuable for cities that analyze environmental sounds to gain insight and data. However, deploying deep learning (DL) models on resource-constrained embedded devices, such as a Raspberry Pi (RPi) or Tensor Processing Units (TPUs), poses challenges. In this work, we evaluate an existing pre-trained model for deployment on RPi and TPU platforms in addition to a laptop. We explored the impact of the retraining parameters and compared the sound classification performance across three datasets: ESC-10, BDLib, and Urban Sound. Our results demonstrate the effectiveness of the pre-trained model for transfer learning in embedded systems. On the laptop, the accuracy rates reached 96.6% for ESC-10, 100% for BDLib, and 99% for Urban Sound. On the RPi, the accuracy rates were 96.4% for ESC-10, 100% for BDLib, and 95.3% for Urban Sound, while on the RPi with a Coral TPU, the rates were 95.7% for ESC-10, 100% for BDLib, and 95.4% for Urban Sound. Utilizing pre-trained models reduces the computational requirements, enabling faster inference. Leveraging pre-trained models in embedded systems accelerates the development, deployment, and performance of various real-time applications. Full article
(This article belongs to the Special Issue AI-Assisted Condition Monitoring and Fault Diagnosis)

12 pages, 2587 KiB  
Article
A Ground Moving Target Detection Method for Seismic and Sound Sensor Based on Evolutionary Neural Networks
by Kunsheng Xing, Nan Wang and Wei Wang
Appl. Sci. 2022, 12(18), 9343; https://doi.org/10.3390/app12189343 - 18 Sep 2022
Cited by 2 | Viewed by 2551
Abstract
The accurate identification of moving target types in alert areas is a fundamental task for unattended ground sensors. Considering that the seismic and sound signals generated by ground moving targets in urban areas are easily affected by environmental noise, and that the power consumption of unattended ground sensors must be kept low, this paper proposes a ground moving target detection method based on evolutionary neural networks. The technique automates both the selection of feature extraction methods and the design of the evolving neural network structure. The experimental results show that the improved model achieves high recognition accuracy with a smaller feature vector and lower network complexity. Full article
(This article belongs to the Special Issue Wireless Sensor Networks in Smart Environments — 2nd Volume)
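An evolutionary search of the kind described, choosing which features to keep under a complexity penalty, can be sketched with a toy genetic loop. The usefulness scores, per-feature cost, and fitness function below are hypothetical stand-ins for classifier accuracy and sensor power cost, not the paper's actual objective.

```python
import random

# Hypothetical per-feature usefulness and a cost per kept feature
# (stand-ins for accuracy contribution vs. power/complexity on the sensor).
USEFULNESS = [0.9, 0.1, 0.8, 0.05, 0.7, 0.2]
COST_PER_FEATURE = 0.15

def fitness(mask):
    """Reward useful features, penalize feature-vector size."""
    gain = sum(u for u, keep in zip(USEFULNESS, mask) if keep)
    return gain - COST_PER_FEATURE * sum(mask)

def evolve(generations=200, pop_size=8, seed=1):
    """Elitist hill climb: keep the best half, mutate one bit per parent."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in USEFULNESS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        for p in parents:
            child = p[:]
            child[rng.randrange(len(child))] ^= 1  # flip one feature on/off
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

Real neuroevolution would also mutate layer sizes and connections, but the select-mutate-keep loop is the same.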

11 pages, 3250 KiB  
Article
A Deep Attention Model for Environmental Sound Classification from Multi-Feature Data
by Jinming Guo, Chuankun Li, Zepeng Sun, Jian Li and Pan Wang
Appl. Sci. 2022, 12(12), 5988; https://doi.org/10.3390/app12125988 - 12 Jun 2022
Cited by 13 | Viewed by 5065
Abstract
Automated environmental sound recognition has clear engineering benefits; it allows audio to be sorted, curated, and searched. Unlike music and language, environmental sound is loaded with noise and lacks the rhythm and melody of music or the semantic sequence of language, making it difficult to find common features representative enough of various environmental sound signals. To improve the accuracy of environmental sound recognition, this paper proposes a recognition method based on multi-feature parameters and time–frequency attention module. It begins with a pretreatment that relies on multi-feature parameters to extract the sound, which supplements the phase information lost by the Log-Mel spectrogram in the current mainstream methods, and enhances the expressive ability of input features. A time–frequency attention module with multiple convolutions is designed to extract the attention weight of the input feature spectrogram and reduce the interference coming from the background noise and irrelevant frequency bands in the audio. Comparative experiments were conducted on three general datasets: environmental sound classification datasets (ESC-10, ESC-50) and an UrbanSound8K dataset. Experiments demonstrated that the proposed method performs better. Full article
(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Deep Learning)
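At its core, the time–frequency attention step re-weights spectrogram regions by normalized scores. A minimal sketch over time frames: in the paper the scores come from convolutional layers, whereas here they are fixed numbers over a toy spectrogram.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(spectrogram, scores):
    """Weight each time frame of a (frames x bins) spectrogram by its
    softmax-normalized attention score."""
    weights = softmax(scores)
    return [[w * v for v in frame] for w, frame in zip(weights, spectrogram)]

spec = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]]  # 3 frames x 2 frequency bins
scores = [0.1, 2.0, -1.0]                    # hypothetical learned scores
weighted = attend(spec, scores)
```

Frames with low scores (background noise, irrelevant bands) are suppressed rather than removed, so gradients still flow through them during training.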

31 pages, 3640 KiB  
Article
Environmental Sound Recognition on Embedded Systems: From FPGAs to TPUs
by Jurgen Vandendriessche, Nick Wouters, Bruno da Silva, Mimoun Lamrini, Mohamed Yassin Chkouri and Abdellah Touhafi
Electronics 2021, 10(21), 2622; https://doi.org/10.3390/electronics10212622 - 27 Oct 2021
Cited by 23 | Viewed by 4947
Abstract
In recent years, Environmental Sound Recognition (ESR) has become a relevant capability for urban monitoring applications. The techniques for automated sound recognition often rely on machine learning approaches, which have increased in complexity in order to achieve higher accuracy. Nonetheless, such machine learning techniques often have to be deployed on resource and power-constrained embedded devices, which has become a challenge with the adoption of deep learning approaches based on Convolutional Neural Networks (CNNs). Field-Programmable Gate Arrays (FPGAs) are power efficient and highly suitable for computationally intensive algorithms like CNNs. By fully exploiting their parallel nature, they have the potential to accelerate the inference time as compared to other embedded devices. Similarly, dedicated architectures to accelerate Artificial Intelligence (AI) such as Tensor Processing Units (TPUs) promise to deliver high accuracy while achieving high performance. In this work, we evaluate existing tool flows to deploy CNN models on FPGAs as well as on TPU platforms. We propose and adjust several CNN-based sound classifiers to be embedded on such hardware accelerators. The results demonstrate the maturity of the existing tools and how FPGAs can be exploited to outperform TPUs. Full article
(This article belongs to the Special Issue Advanced Application of FPGA in Embedded Systems)

18 pages, 1634 KiB  
Article
An Ensemble of Convolutional Neural Networks for Audio Classification
by Loris Nanni, Gianluca Maguolo, Sheryl Brahnam and Michelangelo Paci
Appl. Sci. 2021, 11(13), 5796; https://doi.org/10.3390/app11135796 - 22 Jun 2021
Cited by 84 | Viewed by 9447
Abstract
Research in sound classification and recognition is rapidly advancing in the field of pattern recognition. One important area in this field is environmental sound recognition, whether it concerns the identification of endangered species in different habitats or the type of interfering noise in urban environments. Since environmental audio datasets are often limited in size, a robust model able to perform well across different datasets is of strong research interest. In this paper, ensembles of classifiers are combined that exploit six data augmentation techniques and four signal representations for retraining five pre-trained convolutional neural networks (CNNs); these ensembles are tested on three freely available environmental audio benchmark datasets: (i) bird calls, (ii) cat sounds, and (iii) the Environmental Sound Classification (ESC-50) database for identifying sources of noise in environments. To the best of our knowledge, this is the most extensive study investigating ensembles of CNNs for audio classification. The best-performing ensembles are compared and shown to either outperform or perform comparatively to the best methods reported in the literature on these datasets, including on the challenging ESC-50 dataset. We obtained a 97% accuracy on the bird dataset, 90.51% on the cat dataset, and 88.65% on ESC-50 using different approaches. In addition, the same ensemble model trained on the three datasets managed to reach the same results on the bird and cat datasets while losing only 0.1% on ESC-50. Thus, we have managed to create an off-the-shelf ensemble that can be trained on different datasets and reach performances competitive with the state of the art. Full article
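Combining classifiers of the kind evaluated above typically reduces to soft voting: averaging per-class probabilities across models. A minimal sketch, with labels and model outputs invented for illustration:

```python
def ensemble_average(prob_lists):
    """Average per-class probabilities from several classifiers (soft voting)."""
    n = len(prob_lists)
    return [sum(ps) / n for ps in zip(*prob_lists)]

def predict(prob_lists, labels):
    """Return the label with the highest ensemble-averaged probability."""
    avg = ensemble_average(prob_lists)
    return labels[max(range(len(avg)), key=avg.__getitem__)]

labels = ["bird", "cat", "noise"]      # hypothetical class set
model_outputs = [                      # one probability vector per CNN
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.1, 0.4],
]
winner = predict(model_outputs, labels)
```

Averaging probabilities rather than hard votes lets a confident minority model outweigh two uncertain ones, which is part of why ensembles of diverse CNNs transfer well across datasets.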

19 pages, 1922 KiB  
Article
An Efficient Audio Coding Scheme for Quantitative and Qualitative Large Scale Acoustic Monitoring Using the Sensor Grid Approach
by Félix Gontier, Mathieu Lagrange, Pierre Aumond, Arnaud Can and Catherine Lavandier
Sensors 2017, 17(12), 2758; https://doi.org/10.3390/s17122758 - 29 Nov 2017
Cited by 12 | Viewed by 5338
Abstract
The spreading of urban areas and the growth of human population worldwide raise societal and environmental concerns. To better address these concerns, the monitoring of the acoustic environment in urban as well as rural or wilderness areas is an important matter. Building on the recent development of low cost hardware acoustic sensors, we propose in this paper to consider a sensor grid approach to tackle this issue. In this kind of approach, the crucial question is the nature of the data that are transmitted from the sensors to the processing and archival servers. To this end, we propose an efficient audio coding scheme based on third octave band spectral representation that allows: (1) the estimation of standard acoustic indicators; and (2) the recognition of acoustic events at state-of-the-art performance rate. The former is useful to provide quantitative information about the acoustic environment, while the latter is useful to gather qualitative information and build perceptually motivated indicators using for example the emergence of a given sound source. The coding scheme is also demonstrated to transmit spectrally encoded data that, reverted to the time domain using state-of-the-art techniques, are not intelligible, thus protecting the privacy of citizens. Full article
(This article belongs to the Collection Smart Communication Protocols and Algorithms for Sensor Networks)
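The third-octave representation at the heart of this coding scheme places band centers a factor of 2^(1/3) apart. A minimal sketch of the base-2 center frequencies around the 1 kHz reference band; the governing standards also define band edges and nominal frequencies, which are omitted here.

```python
def third_octave_centers(n_bands, f_ref=1000.0):
    """Base-2 third-octave band center frequencies, centered on the
    1 kHz reference band: f_c(k) = f_ref * 2**(k/3)."""
    half = n_bands // 2
    return [f_ref * 2.0 ** (k / 3.0) for k in range(-half, n_bands - half)]

centers = third_octave_centers(31)  # spans most of the audible range
```

Transmitting per-band energies on this grid is far cheaper than raw audio, supports Leq-style indicators directly, and (as the paper shows) cannot be inverted into intelligible speech.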

31 pages, 1678 KiB  
Article
Evaluation of Three Electronic Noses for Detecting Incipient Wood Decay
by Manuela Baietto, Alphus D. Wilson, Daniele Bassi and Francesco Ferrini
Sensors 2010, 10(2), 1062-1092; https://doi.org/10.3390/s100201062 - 29 Jan 2010
Cited by 61 | Viewed by 16850
Abstract
Tree assessment methodologies, currently used to evaluate the structural stability of individual urban trees, usually involve a visual analysis followed by measurements of the internal soundness of wood using various instruments that are often invasive, expensive, or inadequate for use within the urban environment. Moreover, most conventional instruments do not provide an adequate evaluation of decay that occurs in the root system. The intent of this research was to evaluate the possibility of integrating conventional tools, currently used for assessments of decay in urban trees, with the electronic nose, an innovative tool used in diverse fields and industries for various applications such as quality control in manufacturing, environmental monitoring, medical diagnoses, and perfumery. Electronic-nose (e-nose) technologies were tested for the capability of detecting differences in volatile organic compounds (VOCs) released by wood decay fungi and wood from healthy and decayed trees. Three e-noses, based on different types of operational technologies and analytical methods, were evaluated independently (not directly compared) to determine the feasibility of detecting incipient decays in artificially-inoculated wood. All three e-nose devices were capable of discriminating between healthy and artificially-inoculated, decayed wood with high levels of precision and confidence. The LibraNose quartz microbalance (QMB) e-nose generally provided higher levels of discrimination of sample unknowns, but not necessarily more accurate or effective detection than the AromaScan A32S conducting polymer and PEN3 metal-oxide (MOS) gas sensor e-noses for identifying and distinguishing woody samples containing different agents of wood decay. However, the conducting polymer e-nose had the greater advantage for identifying unknowns from diverse woody sample types due to the associated software capability of utilizing prior-developed, application-specific reference libraries with aroma pattern-recognition and neural-net training algorithms. Full article
(This article belongs to the Special Issue Microbial Sensors and Biosensors)