
Advances in Acoustic Sensors and Deep Audio Pattern Recognition

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Electronic Sensors".

Deadline for manuscript submissions: 20 September 2024 | Viewed by 8554

Special Issue Editor


Dr. Stavros Ntalampiras
Guest Editor
Department of Computer Science, University of Milan, 20133 Milan, Italy
Interests: audio analysis; AI; computer vision; robotics; deep learning

Special Issue Information

Dear Colleagues,

Recently, there has been a steadily increasing demand for generalized sound-recognition technologies focused on non-speech signals, including environmental sounds, music, and animal vocalizations. The field is advancing rapidly, and most of the literature adopts solutions based on deep architectures. Although such solutions offer significant performance improvements, several aspects remain open, such as interpretability and out-of-distribution learning, and these currently form the center of attention of much ongoing research.

We invite original papers, communications, and review articles covering the latest advances in acoustic sensors and audio pattern recognition technologies, with a focus on the following topics and applications:

  • Self-supervised learning; cooperative deep learning methods; continual learning; multi-task learning; small-footprint models; graph neural networks; deep generative models; out-of-distribution generalization; few-shot learning; adversarial machine learning; transfer and reinforcement learning; and interpretable, verifiable, reliable, explainable, auditable, robust, and unbiased modeling.
  • Computational auditory scene analysis, bioacoustics, medical acoustics, music information retrieval, privacy in smart-home assistants, and acoustic sensor networks.

Dr. Stavros Ntalampiras
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • deep learning
  • interpretable, verifiable, reliable, explainable, auditable, robust, and unbiased modeling
  • adversarial machine learning
  • out-of-distribution learning

Published Papers (8 papers)


Research

21 pages, 19137 KiB  
Article
Soundscape Characterization Using Autoencoders and Unsupervised Learning
by Daniel Alexis Nieto-Mora, Maria Cristina Ferreira de Oliveira, Camilo Sanchez-Giraldo, Leonardo Duque-Muñoz, Claudia Isaza-Narváez and Juan David Martínez-Vargas
Sensors 2024, 24(8), 2597; https://doi.org/10.3390/s24082597 - 18 Apr 2024
Viewed by 413
Abstract
Passive acoustic monitoring (PAM) through acoustic recorder units (ARUs) shows promise in detecting early landscape changes linked to functional and structural patterns, including species richness, acoustic diversity, community interactions, and human-induced threats. However, current approaches primarily rely on supervised methods, which require prior knowledge of collected datasets. This reliance poses challenges due to the large volumes of ARU data. In this work, we propose an unsupervised framework using autoencoders to extract soundscape features. We applied this framework to a dataset from Colombian landscapes captured by 31 AudioMoth recorders. Our method generates clusters based on autoencoder features and represents cluster information with prototype spectrograms using centroid features and the decoder part of the neural network. Our analysis provides valuable insights into the distribution and temporal patterns of various sound compositions within the study area. By utilizing autoencoders, we identify significant soundscape patterns characterized by recurring and intense sound types across multiple frequency ranges. This comprehensive understanding of the study area’s soundscape allows us to pinpoint crucial sound sources and gain deeper insights into its acoustic environment. Our results encourage further exploration of unsupervised algorithms in soundscape analysis as a promising alternative path for understanding and monitoring environmental changes.
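To illustrate the kind of pipeline this abstract describes (autoencoder features, clustering, and prototype spectrograms decoded from cluster centroids), a minimal Python sketch follows. It is not the authors' implementation; the network sizes, the 64-bin by 128-frame input, and the use of k-means are illustrative assumptions.

    # Minimal sketch (not the authors' code): autoencoder features -> k-means
    # clusters -> prototype spectrograms decoded from the cluster centroids.
    # Input size (64 mel bins x 128 frames, flattened) and k=8 are assumptions.
    import torch
    import torch.nn as nn
    from sklearn.cluster import KMeans

    class SpectrogramAE(nn.Module):
        def __init__(self, n_in=64 * 128, n_latent=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(),
                                         nn.Linear(256, n_latent))
            self.decoder = nn.Sequential(nn.Linear(n_latent, 256), nn.ReLU(),
                                         nn.Linear(256, n_in))

        def forward(self, x):
            z = self.encoder(x)
            return self.decoder(z), z

    def cluster_prototypes(model, specs, n_clusters=8):
        """Cluster latent features and decode centroids into prototype spectrograms."""
        model.eval()
        with torch.no_grad():
            _, z = model(specs)                    # latent feature per segment
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(z.numpy())
        centroids = torch.tensor(km.cluster_centers_, dtype=torch.float32)
        with torch.no_grad():
            prototypes = model.decoder(centroids)  # one prototype spectrogram per cluster
        return km.labels_, prototypes.reshape(n_clusters, 64, 128)

    specs = torch.randn(500, 64 * 128)             # stand-in for real log-spectrogram segments
    labels, prototypes = cluster_prototypes(SpectrogramAE(), specs)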

12 pages, 4523 KiB  
Article
A 90.9 dB SNDR 95.3 dB DR Audio Delta–Sigma Modulator with FIA-Assisted OTA
by Gongxing Huang, Cong Wei and Rongshan Wei
Sensors 2024, 24(5), 1449; https://doi.org/10.3390/s24051449 - 23 Feb 2024
Viewed by 567
Abstract
This paper presents a low-power, high-gain integrator design that uses a cascode operational transconductance amplifier (OTA) with floating inverter–amplifier (FIA) assistance. Compared to a traditional cascode, the proposed integrator can achieve a gain of 80 dB, while reducing power consumption by 30%. Upon completing the analysis, the values of the FIA drive capacitor and the clock scheme for the FIA-assisted OTA were obtained. To enhance the dynamic range (DR) and mitigate quantization noise, a tri-level quantizer was employed. The design of the feedback digital-to-analog converter (DAC) was simplified, as it does not use additional mismatch shaping techniques. A third-order, discrete-time delta–sigma modulator was designed and fabricated in a 0.18 μm complementary metal-oxide semiconductor (CMOS) process. It operated on a 1.8 V supply, consuming 221 µW with a 24 kHz bandwidth. The measured SNDR and DR were 90.9 dB and 95.3 dB, respectively.
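As a quick check on the reported numbers, the standard converter figures of merit can be computed directly from the abstract. These are generic textbook definitions (effective number of bits and the Schreier figure of merit), not the paper's own analysis.

    # Standard ADC figures of merit computed from the reported measurements
    # (generic definitions, not the paper's own analysis).
    import math

    sndr_db, dr_db = 90.9, 95.3        # measured SNDR and dynamic range
    power_w, bw_hz = 221e-6, 24e3      # power consumption and signal bandwidth

    enob = (sndr_db - 1.76) / 6.02                           # effective number of bits
    fom_schreier = dr_db + 10 * math.log10(bw_hz / power_w)  # Schreier FoM in dB

    print(f"ENOB ≈ {enob:.1f} bits")                # ≈ 14.8 bits
    print(f"Schreier FoM ≈ {fom_schreier:.1f} dB")  # ≈ 175.7 dB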

18 pages, 12509 KiB  
Article
EnViTSA: Ensemble of Vision Transformer with SpecAugment for Acoustic Event Classification
by Kian Ming Lim, Chin Poo Lee, Zhi Yang Lee and Ali Alqahtani
Sensors 2023, 23(22), 9084; https://doi.org/10.3390/s23229084 - 10 Nov 2023
Cited by 2 | Viewed by 995
Abstract
Recent successes in deep learning have inspired researchers to apply deep neural networks to Acoustic Event Classification (AEC). While deep learning methods can train effective AEC models, they are susceptible to overfitting due to the models’ high complexity. In this paper, we introduce EnViTSA, an innovative approach that tackles key challenges in AEC. EnViTSA combines an ensemble of Vision Transformers with SpecAugment, a novel data augmentation technique, to significantly enhance AEC performance. Raw acoustic signals are transformed into Log Mel-spectrograms using Short-Time Fourier Transform, resulting in a fixed-size spectrogram representation. To address data scarcity and overfitting issues, we employ SpecAugment to generate additional training samples through time masking and frequency masking. The core of EnViTSA resides in its ensemble of pre-trained Vision Transformers, harnessing the unique strengths of the Vision Transformer architecture. This ensemble approach not only reduces inductive biases but also effectively mitigates overfitting. In this study, we evaluate the EnViTSA method on three benchmark datasets: ESC-10, ESC-50, and UrbanSound8K. The experimental results underscore the efficacy of our approach, achieving impressive accuracy scores of 93.50%, 85.85%, and 83.20% on ESC-10, ESC-50, and UrbanSound8K, respectively. EnViTSA represents a substantial advancement in AEC, demonstrating the potential of Vision Transformers and SpecAugment in the acoustic domain.
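To make the SpecAugment step concrete, the sketch below applies time and frequency masking to a log-mel spectrogram. The mask widths and the use of the per-spectrogram mean as the fill value are illustrative choices, not the settings used in the paper.

    # Minimal SpecAugment-style masking on a log-mel spectrogram (freq_bins x frames).
    # Mask widths and fill value are illustrative, not the paper's settings.
    import numpy as np

    def spec_augment(spec, n_freq_masks=1, n_time_masks=1, max_f=8, max_t=16, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        out = spec.copy()
        fill = out.mean()
        n_freq, n_time = out.shape
        for _ in range(n_freq_masks):              # frequency masking
            f = rng.integers(0, max_f + 1)
            f0 = rng.integers(0, max(1, n_freq - f))
            out[f0:f0 + f, :] = fill
        for _ in range(n_time_masks):              # time masking
            t = rng.integers(0, max_t + 1)
            t0 = rng.integers(0, max(1, n_time - t))
            out[:, t0:t0 + t] = fill
        return out

    augmented = spec_augment(np.random.randn(64, 128))   # e.g. 64 mel bins x 128 frames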

19 pages, 679 KiB  
Article
Online Continual Learning in Acoustic Scene Classification: An Empirical Study
by Donghee Ha, Mooseop Kim and Chi Yoon Jeong
Sensors 2023, 23(15), 6893; https://doi.org/10.3390/s23156893 - 3 Aug 2023
Viewed by 951
Abstract
Numerous deep learning methods for acoustic scene classification (ASC) have been proposed to improve the classification accuracy of sound events. However, only a few studies have focused on continual learning (CL) wherein a model continually learns to solve issues with task changes. Therefore, in this study, we systematically analyzed the performance of ten recent CL methods to provide guidelines regarding their performances. The CL methods included two regularization-based methods and eight replay-based methods. First, we defined realistic and difficult scenarios such as online class-incremental (OCI) and online domain-incremental (ODI) cases for three public sound datasets. Then, we systematically analyzed the performance of each CL method in terms of average accuracy, average forgetting, and training time. In OCI scenarios, iCaRL and SCR showed the best performance for small buffer sizes, and GDumb showed the best performance for large buffer sizes. In ODI scenarios, SCR adopting supervised contrastive learning consistently outperformed the other methods, regardless of the memory buffer size. Most replay-based methods have an almost constant training time, regardless of the memory buffer size, and their performance increases with an increase in the memory buffer size. Based on these results, GDumb and SCR should be considered first among continual learning methods for ASC.
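Most of the replay-based methods compared in this study maintain a fixed-size memory buffer of past examples; one common way to fill such a buffer from an online stream is reservoir sampling. The sketch below shows only that generic buffer logic, not any specific method from the study.

    # Reservoir-sampling replay buffer, a generic building block of replay-based
    # continual-learning methods (not any single method's implementation).
    import random

    class ReplayBuffer:
        def __init__(self, capacity):
            self.capacity = capacity
            self.buffer = []       # stores (example, label) pairs
            self.n_seen = 0        # number of stream items observed so far

        def add(self, example, label):
            """Keep each incoming stream item with probability capacity / n_seen."""
            self.n_seen += 1
            if len(self.buffer) < self.capacity:
                self.buffer.append((example, label))
            else:
                j = random.randrange(self.n_seen)
                if j < self.capacity:
                    self.buffer[j] = (example, label)

        def sample(self, batch_size):
            """Draw a replay mini-batch to mix with the incoming batch."""
            k = min(batch_size, len(self.buffer))
            return random.sample(self.buffer, k)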

18 pages, 2268 KiB  
Article
Affective Neural Responses Sonified through Labeled Correlation Alignment
by Andrés Marino Álvarez-Meza, Héctor Fabio Torres-Cardona, Mauricio Orozco-Alzate, Hernán Darío Pérez-Nastar and German Castellanos-Dominguez
Sensors 2023, 23(12), 5574; https://doi.org/10.3390/s23125574 - 14 Jun 2023
Viewed by 1030
Abstract
Sound synthesis refers to the creation of original acoustic signals with broad applications in artistic innovation, such as music creation for games and videos. Nonetheless, machine learning architectures face numerous challenges when learning musical structures from arbitrary corpora. This issue involves adapting patterns borrowed from other contexts to a concrete composition objective. Using Labeled Correlation Alignment (LCA), we propose an approach to sonify neural responses to affective music-listening data, identifying the brain features that are most congruent with the simultaneously extracted auditory features. For dealing with inter/intra-subject variability, a combination of Phase Locking Value and Gaussian Functional Connectivity is employed. The proposed two-step LCA approach embraces a separate coupling stage of input features to a set of emotion label sets using Centered Kernel Alignment. This step is followed by canonical correlation analysis to select multimodal representations with higher relationships. LCA enables physiological explanation by adding a backward transformation to estimate the matching contribution of each extracted brain neural feature set. Correlation estimates and partition quality represent performance measures. The evaluation uses a Vector Quantized Variational AutoEncoder to create an acoustic envelope from the tested Affective Music-Listening database. Validation results demonstrate the ability of the developed LCA approach to generate low-level music based on neural activity elicited by emotions while maintaining the ability to distinguish between the acoustic outputs.
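The two-step coupling described here rests on Centered Kernel Alignment followed by canonical correlation analysis. A minimal sketch of those two generic building blocks (linear CKA and scikit-learn's CCA) is given below; it uses random placeholder data and is not the authors' full LCA pipeline.

    # Generic building blocks of the described pipeline: linear CKA between two
    # feature matrices, then CCA to extract correlated multimodal components.
    # Placeholder data; this is not the authors' LCA implementation.
    import numpy as np
    from sklearn.cross_decomposition import CCA

    def linear_cka(X, Y):
        """Linear CKA between samples-by-features matrices X and Y."""
        X = X - X.mean(axis=0)
        Y = Y - Y.mean(axis=0)
        hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
        norm_x = np.linalg.norm(X.T @ X, "fro")
        norm_y = np.linalg.norm(Y.T @ Y, "fro")
        return hsic / (norm_x * norm_y)

    # Example: brain connectivity features vs. auditory features (random stand-ins).
    rng = np.random.default_rng(0)
    brain = rng.standard_normal((200, 32))
    audio = rng.standard_normal((200, 20))

    print("linear CKA:", linear_cka(brain, audio))
    cca = CCA(n_components=4).fit(brain, audio)
    brain_c, audio_c = cca.transform(brain, audio)   # correlated multimodal representations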

15 pages, 4233 KiB  
Article
Acoustic Source Localization in CFRP Composite Plate Based on Wave Velocity-Direction Function Fitting
by Yu Zhang, Yu Feng, Xiaobo Rui, Lixin Xu, Lei Qi, Zi Yang, Cong Hu, Peng Liu and Haijiang Zhang
Sensors 2023, 23(6), 3052; https://doi.org/10.3390/s23063052 - 12 Mar 2023
Viewed by 1256
Abstract
Composite materials are widely used, but they are often subjected to impacts from foreign objects, causing structural damage. To ensure the safety of use, it is necessary to locate the impact point. This paper investigates impact sensing and localization technology for composite plates and proposes a method of acoustic source localization for CFRP composite plates based on wave velocity-direction function fitting. This method divides the grid of composite plates, constructs the theoretical time difference matrix of the grid points, and compares it with the actual time difference to form an error matching matrix to localize the impact source. In this paper, finite element simulation combined with a lead-break experiment is used to explore the wave velocity-angle function relationship of Lamb waves in composite materials. The simulation experiment is used to verify the feasibility of the localization method, and the lead-break experimental system is built to locate the actual impact source. The results show that the acoustic emission time-difference approximation method can effectively solve the problem of impact source localization in composite structures, and the average localization error is 1.44 cm and the maximum localization error is 3.35 cm in 49 experimental points with good stability and accuracy.
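The localization scheme described here (a grid of candidate points, theoretical time differences built from a direction-dependent wave velocity, and error matching against the measured arrival-time differences) can be sketched as below. The velocity function and sensor layout are placeholders, not the fitted values from the paper.

    # Grid-search localization from arrival-time differences with a
    # direction-dependent wave velocity. Velocity function and sensor
    # positions are placeholders, not the paper's fitted values.
    import numpy as np

    def velocity(theta):
        """Placeholder wave velocity (m/s) as a function of propagation angle."""
        return 5000.0 + 1000.0 * np.cos(2 * theta)

    def locate(sensors, measured_dt, grid_x, grid_y):
        """Return the grid point whose theoretical time differences best match."""
        best_err, best_point = np.inf, None
        for x in grid_x:
            for y in grid_y:
                d = sensors - np.array([x, y])
                theta = np.arctan2(d[:, 1], d[:, 0])
                tof = np.linalg.norm(d, axis=1) / velocity(theta)   # time of flight per sensor
                dt = tof[1:] - tof[0]                               # differences vs. reference sensor
                err = np.sum((dt - measured_dt) ** 2)
                if err < best_err:
                    best_err, best_point = err, (x, y)
        return best_point

    sensors = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5]])  # metres
    true_src = np.array([0.3, 0.2])                   # synthetic source for a self-check
    d = sensors - true_src
    tof = np.linalg.norm(d, axis=1) / velocity(np.arctan2(d[:, 1], d[:, 0]))
    measured_dt = tof[1:] - tof[0]                    # stands in for first-arrival picks
    grid = np.linspace(0.0, 0.5, 51)
    print(locate(sensors, measured_dt, grid, grid))   # recovers approximately (0.3, 0.2)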

19 pages, 3109 KiB  
Article
Improvement of Acoustic Models Fused with Lip Visual Information for Low-Resource Speech
by Chongchong Yu, Jiaqi Yu, Zhaopeng Qian and Yuchen Tan
Sensors 2023, 23(4), 2071; https://doi.org/10.3390/s23042071 - 12 Feb 2023
Viewed by 1063
Abstract
Endangered languages, as intangible cultural resources that cannot be renewed, generally have low-resource characteristics. Automatic speech recognition (ASR) is an effective means to protect such languages. However, for a low-resource language, native speakers are few and labeled corpora are insufficient. ASR thus suffers from deficiencies including high speaker dependence and overfitting, which greatly harm recognition accuracy. To tackle these deficiencies, the paper puts forward an approach to audiovisual speech recognition (AVSR) based on LSTM-Transformer. The approach introduces visual modality information, including lip movements, to reduce the dependence of acoustic models on speakers and on the quantity of data. Specifically, the new approach, through the fusion of audio and visual information, enhances the expression of speakers’ feature space, thus achieving the speaker adaptation that is difficult in a single modality. The approach also includes experiments on speaker dependence and evaluates to what extent audiovisual fusion is dependent on speakers. Experimental results show that the CER of AVSR is 16.9% lower than those of traditional models (optimal performance scenario), and 11.8% lower than that for lip reading. The accuracy for recognizing phonemes, especially finals, improves substantially. For recognizing initials, the accuracy improves for affricates and fricatives, where the lip movements are obvious, and deteriorates for stops, where the lip movements are not obvious. In AVSR, the generalization onto different speakers is also better than in a single modality, and the CER can drop by as much as 17.2%. Therefore, AVSR is of great significance in studying the protection and preservation of endangered languages through AI.
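A minimal sketch of the kind of audio-visual fusion this abstract describes: an LSTM encodes the lip-movement features, a Transformer encoder processes the acoustic features, and the two streams are concatenated before classification. The layer sizes and the late-concatenation choice are illustrative assumptions, not the paper's LSTM-Transformer architecture.

    # Illustrative audio-visual fusion model: LSTM over lip features, Transformer
    # encoder over acoustic features, late concatenation before classification.
    # Layer sizes and fusion choice are assumptions, not the paper's architecture.
    import torch
    import torch.nn as nn

    class AVFusionModel(nn.Module):
        def __init__(self, audio_dim=80, lip_dim=68, hidden=128, n_classes=100):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, hidden)
            enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                                   batch_first=True)
            self.audio_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
            self.lip_enc = nn.LSTM(lip_dim, hidden, batch_first=True)
            self.classifier = nn.Linear(2 * hidden, n_classes)

        def forward(self, audio, lips):
            a = self.audio_enc(self.audio_proj(audio)).mean(dim=1)   # pooled audio stream
            _, (h, _) = self.lip_enc(lips)                           # last LSTM hidden state
            fused = torch.cat([a, h[-1]], dim=-1)                    # audio-visual fusion
            return self.classifier(fused)

    model = AVFusionModel()
    logits = model(torch.randn(2, 120, 80), torch.randn(2, 30, 68))  # (batch, time, features)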

20 pages, 6023 KiB  
Article
Detecting Lombard Speech Using Deep Learning Approach
by Krzysztof Kąkol, Gražina Korvel, Gintautas Tamulevičius and Bożena Kostek
Sensors 2023, 23(1), 315; https://doi.org/10.3390/s23010315 - 28 Dec 2022
Cited by 2 | Viewed by 1543
Abstract
Robust detection of Lombard speech in noise is challenging. This study proposes a strategy to detect Lombard speech using a machine learning approach for applications such as public address systems that work in near real time. The paper starts with the background concerning the Lombard effect. Then, the assumptions of the work performed for Lombard speech detection are outlined. The proposed framework combines convolutional neural networks (CNNs) and various two-dimensional (2D) speech signal representations. To reduce the computational cost without abandoning the 2D representation-based approach, a strategy for threshold-based averaging of the Lombard effect detection results is introduced. The pseudocode of the averaging process is also included. A series of experiments are performed to determine the most effective network structure and 2D speech signal representation. Investigations are carried out on German and Polish recordings containing Lombard speech. All 2D speech signal representations are tested with and without augmentation. Augmentation here means using the alpha channel to store additional data: the gender of the speaker, the F0 frequency, and the first two MFCCs. The experimental results show that Lombard and neutral speech recordings can be clearly discerned with high detection accuracy. It is also demonstrated that the proposed speech detection process is capable of working in near real time. These are the key contributions of this work.
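The threshold-based averaging of frame-level detection results mentioned in the abstract can be sketched roughly as follows; the window length and threshold are illustrative values only, and this is not the pseudocode given in the paper.

    # Rough sketch of threshold-based averaging of frame-level Lombard-detection
    # scores; window length and threshold are illustrative values only.
    import numpy as np

    def detect_lombard(frame_probs, window=10, threshold=0.5):
        """Average CNN frame probabilities over a sliding window and threshold them."""
        probs = np.asarray(frame_probs, dtype=float)
        kernel = np.ones(window) / window
        averaged = np.convolve(probs, kernel, mode="valid")   # moving average
        return averaged > threshold                           # True = Lombard speech detected

    decisions = detect_lombard(np.random.rand(100))           # stand-in for CNN frame scores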