Search Results (1,042)

Search Parameters:
Keywords = audio systems

22 pages, 1659 KB  
Article
Lightweight Depression Detection Using 3D Facial Landmark Pseudo-Images and CNN-LSTM on DAIC-WOZ and E-DAIC
by Achraf Jallaglag, My Abdelouahed Sabri, Ali Yahyaouy and Abdellah Aarab
BioMedInformatics 2026, 6(1), 8; https://doi.org/10.3390/biomedinformatics6010008 (registering DOI) - 4 Feb 2026
Abstract
Background: Depression is a common mental disorder, and its early and objective diagnosis is challenging. Recent advances in deep learning show promise for processing audio and video content when screening for depression. Nevertheless, most current methods rely on raw video processing or multimodal pipelines, which are computationally costly, difficult to interpret, and raise privacy issues, restricting their use in actual clinical settings. Methods: To overcome these constraints, we introduce, for the first time, a lightweight, purely visual deep learning framework based on spatiotemporal 3D facial landmark representations extracted from the clinical interview videos in the DAIC-WOZ and Extended DAIC-WOZ (E-DAIC) datasets. Our method does not use raw video or any form of semi-automated multimodal fusion. Instead of processing raw video streams, which is computationally expensive and poorly suited to investigating specific variables, we take a temporal series of 3D landmarks, convert it into pseudo-images (224 × 224 × 3), and feed these into a CNN-LSTM framework that analyzes both the spatial configuration and the temporal dynamics of facial behavior. Results: The experiments yield macro-average F1 scores of 0.74 on DAIC-WOZ and 0.762 on E-DAIC, demonstrating robust performance under heavy class imbalance, with a variability of ±0.03 across folds. Conclusion: These results indicate that landmark-based spatiotemporal modeling is a promising direction for lightweight, interpretable, and scalable automatic depression detection, and they point to opportunities for embedding such systems in real-world mental health applications. Full article
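
As a rough illustration of the pipeline this abstract describes, the sketch below turns a window of 3D facial landmarks into a 224 × 224 × 3 pseudo-image and classifies a sequence of such images with a small CNN-LSTM; the landmark count, window length, and layer sizes are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch (not the authors' code): a window of 3D facial landmarks is
# normalised and resized into a 224x224x3 pseudo-image, and a sequence of such
# images is classified by a small CNN-LSTM.
import numpy as np
import tensorflow as tf


def landmarks_to_pseudo_image(window, size=224):
    """window: (frames, landmarks, 3) array of x/y/z coordinates."""
    w = window.astype(np.float32)
    w -= w.min(axis=(0, 1), keepdims=True)                  # per-channel normalisation
    w /= w.max(axis=(0, 1), keepdims=True) + 1e-8
    img = tf.image.resize(w, (size, size))                  # treat (frames, landmarks) as H x W
    return tf.cast(img * 255.0, tf.uint8)


def build_cnn_lstm(seq_len=8, size=224):
    """A per-window CNN feature extractor followed by an LSTM over windows."""
    inp = tf.keras.Input((seq_len, size, size, 3))
    cnn = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 255),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPool2D(4),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
    ])
    x = tf.keras.layers.TimeDistributed(cnn)(inp)
    x = tf.keras.layers.LSTM(64)(x)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # depressed / not depressed
    return tf.keras.Model(inp, out)


# Example: one synthetic sequence of 8 windows, 30 frames x 68 landmarks each.
windows = np.random.rand(8, 30, 68, 3)
seq = tf.stack([landmarks_to_pseudo_image(w) for w in windows])
model = build_cnn_lstm()
print(model(tf.expand_dims(tf.cast(seq, tf.float32), 0)).shape)  # (1, 1)
```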
41 pages, 5589 KB  
Review
Advances in Audio-Based Artificial Intelligence for Respiratory Health and Welfare Monitoring in Broiler Chickens
by Md Sharifuzzaman, Hong-Seok Mun, Eddiemar B. Lagua, Md Kamrul Hasan, Jin-Gu Kang, Young-Hwa Kim, Ahsan Mehtab, Hae-Rang Park and Chul-Ju Yang
AI 2026, 7(2), 58; https://doi.org/10.3390/ai7020058 - 4 Feb 2026
Abstract
Respiratory diseases and welfare impairments impose substantial economic and ethical burdens on modern broiler production, driven by high stocking density, rapid pathogen transmission, and limited sensitivity of conventional monitoring methods. Because respiratory pathology and stress directly alter vocal behavior, acoustic monitoring has emerged as a promising non-invasive approach for continuous flock-level surveillance. This review synthesizes recent advances in audio classification and artificial intelligence for monitoring respiratory health and welfare in broiler chickens. We have reviewed the anatomical basis of sound production, characterized key vocal categories relevant to health and welfare, and summarized recording strategies, datasets, acoustic features, machine-learning and deep-learning models, and evaluation metrics used in poultry sound analysis. Evidence from experimental and commercial settings demonstrates that AI-based acoustic systems can detect respiratory sounds, stress, and welfare changes with high accuracy, often enabling earlier intervention than traditional methods. Finally, we discuss current limitations, including background noise, data imbalance, limited multi-farm validation, and challenges in interpretability and deployment, and outline future directions for scalable, robust, and practical sound-based monitoring systems in broiler production. Full article
(This article belongs to the Section AI Systems: Theory and Applications)
37 pages, 3465 KB  
Article
Transmitting Images in Difficult Environments Using Acoustics, SDR and GNU Radio Applications
by Michael Alldritt and Robin Braun
Electronics 2026, 15(3), 678; https://doi.org/10.3390/electronics15030678 - 4 Feb 2026
Abstract
This paper explores the feasibility of using acoustic wave propagation, particularly in the ultrasonic range, as a solution for data transmission in environments where traditional radio frequency (RF) communication is ineffective due to signal attenuation—such as in liquids or dense media like metal or stone. Leveraging GNU Radio and commercially available audio hardware, a low-cost SDR (Software Defined Radio) system was developed to transmit data blocks (e.g., images, text, and audio) through various substances. The system employs BFSK (Binary Frequency Shift Keying) and BPSK (Binary Phase Shift Keying), operates at ultrasonic frequencies (typically 40 kHz), and was validated under real-world conditions, including water, viscous substances, and flammable liquids such as hydrocarbon fuels. Experimental results demonstrate reliable, continuous communication at Nyquist–Shannon sampling rates, with effective demodulation and file reconstruction. The methodology builds on concepts originally developed for Ad Hoc Sensor Networks in shipping containers, extending their applicability to submerged and RF-hostile environments. The modularity and flexibility of the GNU Radio platform allow for rapid adaptation across different media and deployment contexts. This work provides a reproducible and scalable communication solution for scenarios where RF transmission is impractical, offering potential applications in underwater sensing, industrial monitoring, railways, and enclosed infrastructure diagnostics. Across controlled laboratory experiments, the system achieved 100% successful reconstruction of transmitted image files up to 100 kB and sustained packet delivery success exceeding 98% under stable coupling conditions. Full article
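
The toy NumPy sketch below illustrates the BFSK principle the system uses; it is not the paper's GNU Radio flowgraph, and the sample rate, tone spacing, and baud rate are assumptions.

```python
# Toy BFSK illustration: bits are mapped to one of two ultrasonic tones and
# recovered by correlating each symbol against both tones.
import numpy as np

FS = 192_000              # sample rate (Hz); must exceed twice the highest tone
F0, F1 = 38_000, 42_000   # tones for bit 0 / bit 1, around the 40 kHz band
BAUD = 1_000              # symbols per second
SPS = FS // BAUD          # samples per symbol


def modulate(bits):
    t = np.arange(SPS) / FS
    return np.concatenate(
        [np.sin(2 * np.pi * (F1 if b else F0) * t) for b in bits])


def demodulate(signal):
    t = np.arange(SPS) / FS
    ref0 = np.exp(-2j * np.pi * F0 * t)
    ref1 = np.exp(-2j * np.pi * F1 * t)
    bits = []
    for k in range(len(signal) // SPS):
        sym = signal[k * SPS:(k + 1) * SPS]
        # correlate against each tone and pick the stronger one
        bits.append(int(abs(np.dot(sym, ref1)) > abs(np.dot(sym, ref0))))
    return bits


payload = [1, 0, 1, 1, 0, 0, 1, 0]
rx = modulate(payload) + 0.2 * np.random.randn(len(payload) * SPS)  # add noise
assert demodulate(rx) == payload
```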
5 pages, 1305 KB  
Proceeding Paper
Audiovisual Fusion Technique for Detecting Sensitive Content in Videos
by Daniel Povedano Álvarez, Ana Lucila Sandoval Orozco and Luis Javier García Villalba
Eng. Proc. 2026, 123(1), 11; https://doi.org/10.3390/engproc2026123011 - 2 Feb 2026
Abstract
The detection of sensitive content in online videos is a key challenge for ensuring digital safety and effective content moderation. This work proposes the Multimodal Audiovisual Attention (MAV-Att), a multimodal deep learning framework that jointly exploits audio and visual cues to improve detection accuracy. The model was evaluated on the LSPD dataset, comprising 52,427 video segments of 20 s each, with optimized keyframe extraction. MAV-Att consists of dual audio and image branches enhanced by attention mechanisms to capture both temporal and cross-modal dependencies. Trained using a joint optimisation loss, the system achieved F1-scores of 94.9% on segments and 94.5% on entire videos, surpassing previous state-of-the-art models by 6.75%. Full article
(This article belongs to the Proceedings of First Summer School on Artificial Intelligence in Cybersecurity)
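
The sketch below shows one plausible shape of such a dual-branch attention fusion model in Keras; it is not the MAV-Att implementation, and the feature dimensions, sequence lengths, and layer sizes are assumptions.

```python
# Schematic dual-branch fusion: per-modality self-attention over time, then
# cross-modal attention between audio and keyframe features, then a joint head.
import tensorflow as tf

T_A, D_A = 50, 64      # assumed audio frames and feature size per segment
T_V, D_V = 10, 128     # assumed keyframes and visual feature size per segment

audio_in = tf.keras.Input((T_A, D_A))
video_in = tf.keras.Input((T_V, D_V))

def temporal_branch(x, dim):
    x = tf.keras.layers.Dense(dim)(x)
    # self-attention over time within one modality, with a residual connection
    x = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=dim // 4)(x, x) + x
    return tf.keras.layers.LayerNormalization()(x)

a = temporal_branch(audio_in, 64)
v = temporal_branch(video_in, 64)

# cross-modal attention: visual tokens attend to audio tokens and vice versa
va = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)(v, a)
av = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)(a, v)

fused = tf.keras.layers.Concatenate()([
    tf.keras.layers.GlobalAveragePooling1D()(va),
    tf.keras.layers.GlobalAveragePooling1D()(av),
])
out = tf.keras.layers.Dense(1, activation="sigmoid")(fused)   # sensitive / safe
model = tf.keras.Model([audio_in, video_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```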
12 pages, 1323 KB  
Proceeding Paper
Edge AI System Using Lightweight Semantic Voting to Detect Segment-Based Voice Scams
by Shao-Yong Lu and Wen-Ping Chen
Eng. Proc. 2025, 120(1), 14; https://doi.org/10.3390/engproc2025120014 - 2 Feb 2026
Abstract
Real-time telecom scam detection is difficult without cloud AI, particularly for privacy-sensitive, low-resource environments. We developed a lightweight, offline voice scam detector using on-device audio segmentation, automatic speech recognition (ASR), and semantic similarity. Four detection strategies were implemented. We used Whisper ASR and DeepSeek to process 5 s speech chunks. An analysis of 120 synthetic and paraphrased Mandarin phone call dialogues reveals the A4 voting strategy’s superior performance in optimizing early detection and minimizing false positives, achieving an F1 score of 0.90, a 2.5% false positive rate, and a mean response time of under 4 s. The system is deployable on ESP32 for offline mobile inference. The proposed architecture provides a robust and scalable defense against threats targeting vulnerable user groups, such as older adults. It introduces a new method for real-time voice threat mitigation on devices through interpretable segment-level semantic analysis. Full article
(This article belongs to the Proceedings of 8th International Conference on Knowledge Innovation and Invention)
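
A rough sketch of this kind of segment-level screening is shown below, using openai-whisper and sentence-transformers as stand-ins; the scam templates, similarity threshold, and the simple sliding majority vote are assumptions and do not reproduce the paper's A4 strategy.

```python
# Transcribe 5 s chunks, score each transcript against scam phrase templates by
# embedding similarity, and flag the call once a majority of recent segments
# exceed a threshold. Thresholds and templates are illustrative only.
import whisper                                   # openai-whisper
from sentence_transformers import SentenceTransformer, util

asr = whisper.load_model("base")
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
scam_templates = [
    "Your bank account is frozen, transfer the money now",
    "This is the prosecutor's office, you are involved in a case",
]
template_vecs = embedder.encode(scam_templates, convert_to_tensor=True)


def segment_is_suspicious(wav_path, threshold=0.55):
    text = asr.transcribe(wav_path)["text"]
    vec = embedder.encode(text, convert_to_tensor=True)
    return float(util.cos_sim(vec, template_vecs).max()) >= threshold


def call_is_scam(segment_paths, window=4, min_votes=3):
    votes = [segment_is_suspicious(p) for p in segment_paths]
    # sliding majority vote over the last `window` segments
    return any(sum(votes[i:i + window]) >= min_votes
               for i in range(max(1, len(votes) - window + 1)))
```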
30 pages, 6824 KB  
Article
Audiovisual Gun Detection with Automated Lockdown and PA Announcing IoT System for Schools
by Tareq Khan
IoT 2026, 7(1), 15; https://doi.org/10.3390/iot7010015 - 31 Jan 2026
Abstract
Gun violence in U.S. schools not only causes loss of life and physical injury but also leaves enduring psychological trauma, damages property, and results in significant economic losses. One way to reduce this loss is to detect the gun early, notify the police as soon as possible, and implement lockdown procedures immediately. In this project, a novel gun detector Internet of Things (IoT) system is developed that automatically detects the presence of a gun either from images or from gunshot sounds, and sends notifications with exact location information to first responders' smartphones over the Internet within a second. The device also sends wireless commands using the Message Queuing Telemetry Transport (MQTT) protocol to close smart door locks in classrooms and automatically make announcements over the public address (PA) system. The proposed system will remove the burden of manually calling the police and implementing the lockdown procedure during such traumatic situations. Police can arrive sooner, helping to stop the shooter earlier, get the injured to hospital more quickly, and save more lives. Two custom deep learning models are used: (a) one detects guns from image data with an accuracy of 94.6%, and (b) the other detects gunshot sounds from audio data with an accuracy of 99%. No single gun detector device in the literature can detect guns from both image and audio data, implement a lockdown, and make PA announcements automatically. A prototype of the proposed gun detector IoT system and a smartphone app were developed and tested in real time with gun replicas and blank guns. Full article
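
The sketch below illustrates how detection events could be fanned out over MQTT with the paho-mqtt client; the broker address, topic names, and payload format are hypothetical and not taken from the paper.

```python
# Illustrative only: publish lockdown and PA commands over MQTT once a gun is
# detected. Broker, topics, and payload schema are hypothetical.
import json
import time
import paho.mqtt.client as mqtt

client = mqtt.Client()   # paho-mqtt 1.x constructor; 2.x also takes a CallbackAPIVersion
client.connect("broker.example.local", 1883)    # hypothetical school broker
client.loop_start()


def trigger_lockdown(room_id, source):
    alert = {"room": room_id, "source": source, "ts": time.time()}
    # close smart door locks and start the PA announcement
    client.publish("school/locks/close", json.dumps(alert), qos=1)
    client.publish("school/pa/announce",
                   json.dumps({"message": "Lockdown. Lock doors and stay inside.",
                               **alert}), qos=1)


if __name__ == "__main__":
    trigger_lockdown(room_id="B-214", source="gunshot-audio")
```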
16 pages, 3367 KB  
Article
Utilizing Multimodal Logic Fusion to Identify the Types of Food Waste Sources
by Dong-Ming Gao, Jia-Qi Song, Zong-Qiang Fu, Zhi Liu and Gang Li
Sensors 2026, 26(3), 851; https://doi.org/10.3390/s26030851 - 28 Jan 2026
Abstract
It is a challenge to identify food waste sources in all-weather industrial environments, as variable lighting conditions can compromise the effectiveness of visual recognition models. This study proposes and validates a robust, interpretable, and adaptive multimodal logic fusion method in which sensor dominance is dynamically assigned based on real-time illuminance intensity. The method comprises two foundational components: (1) a lightweight MobileNetV3 + EMA model for image recognition; and (2) an audio model employing Fast Fourier Transform (FFT) for feature extraction and Support Vector Machine (SVM) for classification. The key contribution of this system lies in its environment-aware conditional logic. The image model MobileNetV3 + EMA achieves an accuracy of 99.46% within the optimal brightness range (120–240 cd·m⁻²), significantly outperforming the audio model. However, its performance degrades significantly outside the optimal range, while the audio model maintains an illumination-independent accuracy of 0.80, a recall of 0.78, and an F1 score of 0.80. When light intensity falls below the threshold of 84 cd·m⁻², the audio recognition results take precedence. This strategy ensures robust classification accuracy under variable environmental conditions, preventing model failure. Validated on an independent test set, the fusion method achieves an overall accuracy of 90.25%, providing an interpretable and resilient solution for real-world industrial deployment. Full article
(This article belongs to the Special Issue Multi-Sensor Data Fusion)
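
A minimal sketch of the environment-aware gating logic is shown below; the 84 cd·m⁻² switching point comes from the abstract, while the predictor outputs and class labels are placeholders.

```python
# Environment-aware gating: below a given light level the audio classifier's
# decision wins, otherwise the image classifier's decision is used.
LIGHT_THRESHOLD = 84.0   # cd/m^2, switching point reported in the abstract


def fuse_prediction(illuminance, image_pred, audio_pred):
    """image_pred / audio_pred: (label, confidence) tuples from the two models."""
    if illuminance < LIGHT_THRESHOLD:
        return audio_pred        # lighting too poor for reliable vision
    return image_pred            # vision model dominates in good lighting


# Example with placeholder model outputs and placeholder source labels
print(fuse_prediction(40.0, ("restaurant", 0.31), ("canteen", 0.82)))   # audio wins
print(fuse_prediction(180.0, ("restaurant", 0.97), ("canteen", 0.80)))  # image wins
```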
40 pages, 2047 KB  
Review
A Comparative Study of Emotion Recognition Systems: From Classical Approaches to Multimodal Large Language Models
by Mirela-Magdalena Grosu (Marinescu), Octaviana Datcu, Ruxandra Tapu and Bogdan Mocanu
Appl. Sci. 2026, 16(3), 1289; https://doi.org/10.3390/app16031289 - 27 Jan 2026
Abstract
Emotion recognition in video (ERV) aims to infer human affect from visual, audio, and contextual signals and is increasingly important for interactive and intelligent systems. Over the past decade, ERV has evolved from handcrafted features and task-specific deep learning models toward transformer-based vision–language models and multimodal large language models (MLLMs). This review surveys this evolution, with an emphasis on engineering considerations relevant to real-world deployment. We analyze multimodal fusion strategies, dataset characteristics, and evaluation protocols, highlighting limitations in robustness, bias, and annotation quality under unconstrained conditions. Emerging MLLM-based approaches are examined in terms of performance, reasoning capability, computational cost, and interaction potential. By comparing task-specific models with foundation model approaches, we clarify their respective strengths for resource-constrained versus context-aware applications. Finally, we outline practical research directions toward building robust, efficient, and deployable ERV systems for applied scenarios such as assistive technologies and human–AI interaction. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
18 pages, 10692 KB  
Article
Short-Time Homomorphic Deconvolution (STHD): A Novel 2D Feature for Robust Indoor Direction of Arrival Estimation
by Yeonseok Park and Jun-Hwa Kim
Sensors 2026, 26(2), 722; https://doi.org/10.3390/s26020722 - 21 Jan 2026
Abstract
Accurate indoor positioning and navigation remain significant challenges, with audio sensor-based sound source localization emerging as a promising sensing modality. Conventional methods, often reliant on multi-channel processing or time-delay estimation techniques such as Generalized Cross-Correlation, encounter difficulties regarding computational complexity, hardware synchronization, and reverberant environments where time-difference-of-arrival cues are masked. While machine learning approaches have shown potential, their performance depends heavily on the discriminative power of input features. This paper proposes a novel feature extraction method named Short-Time Homomorphic Deconvolution, which transforms multi-channel audio signals into a 2D Time × Time-of-Flight representation. Unlike prior 1D methods, this feature effectively captures the temporal evolution and stability of time-of-flight differences between microphone pairs, offering a rich and robust input for deep learning models. We validate this feature using a lightweight Convolutional Neural Network integrated with a dual-stage channel attention mechanism, designed to prioritize reliable spatial cues. The system was trained on a large-scale dataset generated via simulations and rigorously tested using real-world data acquired in an ISO-certified anechoic chamber. Experimental results demonstrate that the proposed model achieves precise Direction of Arrival estimation with a Mean Absolute Error of 1.99 degrees in real-world scenarios. Notably, the system exhibits remarkable consistency between simulation and physical experiments, proving its effectiveness for robust indoor navigation and positioning systems. Full article
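
The sketch below builds a 2D time × lag feature from a microphone pair using plain regularised frequency-domain deconvolution per frame, as a simplified stand-in for the paper's homomorphic (cepstral) deconvolution; frame length, hop, and regularisation are assumptions.

```python
# Per short frame, estimate the relative impulse response between two channels
# by regularised spectral division; stacking frames gives a 2D time x lag map
# whose per-row peak tracks the time-of-flight difference.
import numpy as np


def short_time_deconvolution(x_ref, x_mic, frame=1024, hop=512, eps=1e-3):
    n_frames = 1 + (len(x_ref) - frame) // hop
    win = np.hanning(frame)
    rows = []
    for k in range(n_frames):
        a = x_ref[k * hop:k * hop + frame] * win
        b = x_mic[k * hop:k * hop + frame] * win
        A, B = np.fft.rfft(a), np.fft.rfft(b)
        H = B * np.conj(A) / (np.abs(A) ** 2 + eps)   # regularised deconvolution
        rows.append(np.fft.irfft(H, n=frame))
    return np.array(rows)                             # shape (n_frames, frame)


# Example: the second channel is a 12-sample-delayed copy of the first
rng = np.random.default_rng(0)
x1 = rng.standard_normal(16_000)
x2 = np.roll(x1, 12)
feat = short_time_deconvolution(x1, x2)
print(feat.shape, int(np.argmax(feat.mean(axis=0))))   # lag peak near 12
```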
41 pages, 2850 KB  
Article
Automated Classification of Humpback Whale Calls Using Deep Learning: A Comparative Study of Neural Architectures and Acoustic Feature Representations
by Jack C. Johnson and Yue Rong
Sensors 2026, 26(2), 715; https://doi.org/10.3390/s26020715 - 21 Jan 2026
Abstract
Passive acoustic monitoring (PAM) using hydrophones enables acoustic data to be collected in large and diverse quantities, necessitating a reliable automated classification system. This paper presents a data-processing pipeline and a set of neural networks designed for a humpback-whale-detection system. A collection of audio segments is compiled from publicly available audio repositories and extensively curated by manual examination, editing, and clipping to produce a dataset that minimizes bias and categorization errors. An array of standard data-augmentation techniques is applied to the collected audio, diversifying and expanding the original dataset. Multiple neural networks are designed and trained using the TensorFlow 2.20.0 and Keras 3.13.1 frameworks, resulting in a custom architecture layout based on research and iterative improvements. The pre-trained model MobileNetV2 is also included for further analysis. Model performance demonstrates a strong dependence on both feature representation and network architecture. Mel spectrogram inputs consistently outperformed MFCC (Mel-Frequency Cepstral Coefficients) features across all model types. The highest performance was achieved by the pretrained MobileNetV2 using mel spectrograms without augmentation, reaching a test accuracy of 99.01% with balanced precision and recall of 99% and a Matthews correlation coefficient of 0.98. The custom CNN with mel spectrograms also achieved strong performance, with 98.92% accuracy and a false negative rate of only 0.75%. In contrast, models trained with MFCC representations exhibited consistently lower robustness and higher false negative rates. These results highlight the comparative strengths of the evaluated feature representations and network architectures for humpback whale detection. Full article
(This article belongs to the Section Sensor Networks)
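
The snippet below computes the two feature representations compared in the paper with librosa; it is not the authors' pipeline, and the FFT, hop, and mel settings (and the example filename) are illustrative assumptions.

```python
# Log-mel spectrogram (used as a CNN "image") versus MFCCs (a compact
# cepstral alternative) for an audio clip.
import numpy as np
import librosa


def extract_features(path, sr=16_000, n_mels=64, n_mfcc=20):
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=512, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)           # CNN input
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # compact alternative
    return log_mel, mfcc


# log_mel, mfcc = extract_features("humpback_clip.wav")   # hypothetical file
# print(log_mel.shape, mfcc.shape)                         # e.g. (64, T) and (20, T)
```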
19 pages, 2984 KB  
Article
Development and Field Testing of an Acoustic Sensor Unit for Smart Crossroads as Part of V2X Infrastructure
by Yury Furletov, Dinara Aptinova, Mekan Mededov, Andrey Keller, Sergey S. Shadrin and Daria A. Makarova
Smart Cities 2026, 9(1), 17; https://doi.org/10.3390/smartcities9010017 - 21 Jan 2026
Abstract
Improving city crossroads safety is a critical problem for modern smart transportation systems (STS). This article presents the results of developing, upgrading, and comprehensively testing an acoustic monitoring system prototype designed for rapid accident detection. Unlike conventional camera- or lidar-based approaches, the proposed solution uses passive sound source localization to operate effectively with no direct visibility and in adverse weather conditions. Generalized Cross-Correlation with Phase Transform (GCC-PHAT) algorithms were used to develop a hardware–software complex featuring four microphones, a multichannel audio interface, and a computation module. This study focuses on the gradual upgrading of the algorithm to reduce the mean localization error in real-life urban conditions. Laboratory and field tests were conducted on an open-air testing ground of a university campus. During these tests, the system demonstrated that it can accurately determine the coordinates of a sound source imitating accidents (sirens, collisions). The analysis confirmed that the system satisfies the V2X infrastructure integration response time requirement (<200 ms). The results suggest that the system can be used as part of smart transportation systems. Full article
(This article belongs to the Section Physical Infrastructures and Networks in Smart Cities)
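
GCC-PHAT itself is a textbook algorithm; a minimal NumPy version of the time-delay estimate between two microphones is sketched below (the sample rate and test delay are arbitrary, and this is not the authors' full localization stack).

```python
# Generalized Cross-Correlation with Phase Transform: estimate the delay of
# `sig` relative to `ref` from the peak of the PHAT-weighted cross-correlation.
import numpy as np


def gcc_phat(sig, ref, fs, max_tau=None):
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)   # phase transform weighting
    max_shift = n // 2 if max_tau is None else min(n // 2, int(fs * max_tau))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # delay in seconds


fs = 48_000
rng = np.random.default_rng(1)
ref = rng.standard_normal(fs)                    # 1 s of noise at the reference mic
sig = np.concatenate((np.zeros(24), ref))[:fs]   # same signal arriving 24 samples later
print(round(gcc_phat(sig, ref, fs) * fs))        # ~24
```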
28 pages, 435 KB  
Review
Advances in Audio Classification and Artificial Intelligence for Respiratory Health and Welfare Monitoring in Swine
by Md Sharifuzzaman, Hong-Seok Mun, Eddiemar B. Lagua, Md Kamrul Hasan, Jin-Gu Kang, Young-Hwa Kim, Ahsan Mehtab, Hae-Rang Park and Chul-Ju Yang
Biology 2026, 15(2), 177; https://doi.org/10.3390/biology15020177 - 18 Jan 2026
Abstract
Respiratory diseases remain one of the most significant health challenges in modern swine production, leading to substantial economic losses, compromised animal welfare, and increased antimicrobial use. In recent years, advances in artificial intelligence (AI), particularly machine learning and deep learning, have enabled the development of non-invasive, continuous monitoring systems based on pig vocalizations. Among these, audio-based technologies have emerged as especially promising tools for early detection and monitoring of respiratory disorders under real farm conditions. This review provides a comprehensive synthesis of AI-driven audio classification approaches applied to pig farming, with focus on respiratory health and welfare monitoring. First, the biological and acoustic foundations of pig vocalizations and their relevance to health and welfare assessment are outlined. The review then systematically examines sound acquisition technologies, feature engineering strategies, machine learning and deep learning models, and evaluation methodologies reported in the literature. Commercially available systems and recent advances in real-time, edge, and on-farm deployment are also discussed. Finally, key challenges related to data scarcity, generalization, environmental noise, and practical deployment are identified, and emerging opportunities for future research including multimodal sensing, standardized datasets, and explainable AI are highlighted. This review aims to provide researchers, engineers, and industry stakeholders with a consolidated reference to guide the development and adoption of robust AI-based acoustic monitoring systems for respiratory health management in swine. Full article
(This article belongs to the Section Zoology)
22 pages, 8300 KB  
Article
Sign2Story: A Multimodal Framework for Near-Real-Time Hand Gestures via Smartphone Sensors to AI-Generated Audio-Comics
by Gul Faraz, Lei Jing and Xiang Li
Sensors 2026, 26(2), 596; https://doi.org/10.3390/s26020596 - 15 Jan 2026
Abstract
This study presents a multimodal framework that uses smartphone motion sensors and generative AI to create audio comics from live news headlines. The system operates without direct touch or voice input, instead responding to simple hand-wave gestures. The system demonstrates potential as an alternative input method, which may benefit users who find traditional touch or voice interaction challenging. In the experiments, we investigated the generation of comics based on the latest tech-related news headlines, retrieved via Really Simple Syndication (RSS) and triggered by a simple hand-wave gesture. The proposed framework demonstrates extensibility beyond comic generation, as various other tasks utilizing large language models and multimodal AI could be integrated by mapping them to different hand gestures. Our experiments with open-source models like LLaMA, LLaVA, Gemma, and Qwen revealed that LLaVA delivers superior results in generating panel-aligned stories compared to Qwen3-VL, both in terms of inference speed and output quality, relative to the source image. These large language models (LLMs) collectively contribute imaginative and conversational narrative elements that enhance diversity in storytelling within the comic format. Additionally, we implement an AI-in-the-loop mechanism to iteratively improve output quality without human intervention. Finally, AI-generated audio narration is incorporated into the comics to create an immersive, multimodal reading experience. Full article
(This article belongs to the Special Issue Body Area Networks: Intelligence, Sensing and Communication)
34 pages, 4760 KB  
Article
Design, Implementation, and Evaluation of a Low-Complexity Yelp Siren Detector Based on Frequency Modulation Symmetry
by Elena-Valentina Dumitrascu, Radu-Alexandru Badea, Răzvan Rughiniș and Robert Alexandru Dobre
Symmetry 2026, 18(1), 152; https://doi.org/10.3390/sym18010152 - 14 Jan 2026
Abstract
Robust detection of emergency vehicle sirens remains difficult due to modern soundproofing, competing audio, and variable traffic noise. Although many simulation-based studies have been reported, relatively few systems have been realized in hardware, and many proposed approaches rely on complex or artificial intelligence-based processing with limited interpretability. This work presents a physical implementation of a low-complexity yelp siren detector that leverages the symmetries of the yelp signal, together with its characterization under realistic conditions. The design is not based on conventional signal processing or machine learning pipelines. Instead, it uses a simple analog envelope-based principle with threshold-crossing rate analysis and a fixed comparator threshold. Its performance was evaluated using an open dataset of more than 1000 real-world audio recordings spanning different road conditions. Detection accuracy, false-positive behavior, and robustness were systematically evaluated on a real hardware implementation using multiple deployable decision rules. Among the evaluated detection rules, a representative operating point achieved a true positive rate of 0.881 at a false positive rate of 0.01, corresponding to a Matthews correlation coefficient of 0.899. The results indicate that a fixed-threshold realization can provide reliable yelp detection with very low computational requirements while preserving transparency and ease of implementation. The study establishes a pathway from conceptual detection principle to deployable embedded hardware. Full article
(This article belongs to the Section Engineering and Materials)
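
The abstract's envelope-plus-threshold-crossing principle can be sketched digitally as below; the real detector is analog hardware, and every numeric value here (smoothing constant, comparator threshold, crossing-rate band, test signal) is an illustrative assumption.

```python
# Rectify-and-smooth the signal to get its envelope, count upward crossings of
# a fixed threshold, and flag a yelp when the crossing rate falls in the
# expected band. The synthetic test signal pulses its envelope directly rather
# than sweeping a tone through a band-pass filter.
import numpy as np


def yelp_detected(x, fs, threshold=0.3, rate_band=(4.0, 16.0), tau=0.01):
    # envelope via full-wave rectification + one-pole low-pass smoothing
    alpha = 1.0 / (tau * fs)
    env = np.zeros_like(x)
    for i in range(1, len(x)):
        env[i] = env[i - 1] + alpha * (abs(x[i]) - env[i - 1])
    # count upward crossings of the fixed comparator threshold
    above = env > threshold
    crossings = np.count_nonzero(~above[:-1] & above[1:])
    rate = crossings / (len(x) / fs)              # crossings per second
    return rate_band[0] <= rate <= rate_band[1]


# Example: a tone whose amplitude envelope pulses about 8 times per second
fs = 16_000
t = np.arange(fs * 2) / fs
x = np.sin(2 * np.pi * 800 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 8 * t))
print(yelp_detected(x, fs))   # True
```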
19 pages, 1607 KB  
Article
Real-Time Bird Audio Detection with a CNN-RNN Model on a SoC-FPGA
by Rodrigo Lopes da Silva, Gustavo Jacinto, Mário Véstias and Rui Policarpo Duarte
Electronics 2026, 15(2), 354; https://doi.org/10.3390/electronics15020354 - 13 Jan 2026
Abstract
Monitoring wildlife has become increasingly important for understanding the evolution of species and ecosystem health. Acoustic monitoring offers several advantages over video-based approaches, enabling continuous 24/7 observation and robust detection under challenging environmental conditions. Deep learning models have demonstrated strong performance in audio classification. However, their computational complexity poses significant challenges for deployment on low-power embedded platforms. This paper presents a low-power embedded system for real-time bird audio detection. A hybrid CNN–RNN architecture is adopted, redesigned, and quantized to significantly reduce model complexity while preserving classification accuracy. To support efficient execution, a custom hardware accelerator was developed and integrated into a Zynq UltraScale+ ZU3CG FPGA. The proposed system achieves an accuracy of 87.4%, processes up to 5 audio samples per second, and operates at only 1.4 W, demonstrating its suitability for autonomous, energy-efficient wildlife monitoring applications. Full article
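
As a rough software analogue of the kind of model described, the sketch below defines a compact CNN-RNN over log-mel spectrogram frames and applies post-training quantization via TFLite; the layer sizes, input shape, and quantization path are assumptions, not the authors' FPGA design.

```python
# Compact CNN-RNN bird/no-bird classifier plus post-training quantization to
# shrink the model for embedded deployment.
import tensorflow as tf

N_MELS, N_FRAMES = 40, 100          # assumed spectrogram shape

inp = tf.keras.Input((N_FRAMES, N_MELS, 1))
x = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
x = tf.keras.layers.MaxPool2D((1, 2))(x)
x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.MaxPool2D((1, 2))(x)
x = tf.keras.layers.Reshape((N_FRAMES, -1))(x)   # keep the time axis for the RNN
x = tf.keras.layers.LSTM(64)(x)
out = tf.keras.layers.Dense(1, activation="sigmoid")(x)   # bird / no bird
model = tf.keras.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Post-training (dynamic-range) quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
print(model.count_params(), len(tflite_model))
```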