Search Results (739)

Search Parameters:
Keywords = audio signal

26 pages, 4595 KB  
Article
Combination of Audio Segmentation and Recurrent Neural Networks for Improved Alcohol Intoxication Detection in Speech Signals
by Pavel U. Laptev, Aleksey Sabanov, Alexander A. Shelupanov, Anton A. Konev and Alexander N. Kornetov
Symmetry 2026, 18(2), 262; https://doi.org/10.3390/sym18020262 - 30 Jan 2026
Abstract
This study proposes an approach for detecting alcohol intoxication from speech based on a combination of audio segmentation and a hybrid neural network architecture that integrates convolutional neural network (CNN) and long short-term memory (LSTM) layers. The proposed design enables effective modeling of both local spectral patterns and long-term temporal dependencies in speech signals. By operating on relatively long audio segments, the approach allows the simultaneous analysis of complex speech constructions and pause patterns, which are known to be sensitive to alcohol-induced speech impairments. Each audio signal is divided into two equal-duration segments that are processed sequentially by the model, which helps reduce the impact of the asymmetrical distribution of intoxication-related speech artifacts. The approach was evaluated using the GradusSpeech-v1 corpus, which contains more than 1300 recordings of Russian tongue twisters collected from 31 speakers under controlled conditions in both sober and intoxicated states. Experimental results demonstrate that the proposed method achieves high performance. When full recordings are analyzed using median aggregation of segment-level predictions, the model reaches Accuracy, Recall, and F1-score values close to 0.93, indicating the effectiveness of the approach for alcohol intoxication detection in speech. Full article
(This article belongs to the Special Issue Symmetry: Feature Papers 2025)
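As a rough illustration of the segment-then-aggregate idea described in this abstract (not the authors' implementation), the sketch below splits a recording into two equal-duration halves, scores each half with a placeholder segment_score function standing in for the CNN-LSTM classifier, and fuses the scores with the median.

# Minimal sketch of segment-level scoring with median aggregation.
# segment_score is a hypothetical stand-in for the paper's CNN-LSTM model.
import numpy as np

def segment_score(segment: np.ndarray) -> float:
    # Placeholder: a real system would return P(intoxicated) from a trained model.
    return float(np.clip(np.abs(segment).mean() * 10.0, 0.0, 1.0))

def predict_recording(signal: np.ndarray) -> float:
    half = len(signal) // 2
    segments = [signal[:half], signal[half:2 * half]]   # two equal-duration parts
    scores = [segment_score(s) for s in segments]       # segment-level predictions
    return float(np.median(scores))                     # median aggregation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_recording = rng.standard_normal(16000 * 4) * 0.05   # 4 s of noise at 16 kHz
    print(f"P(intoxicated) ~ {predict_recording(fake_recording):.2f}")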

34 pages, 1776 KB  
Article
Interpretable Acoustic Features from Wakefulness Tracheal Breathing for OSA Severity Assessment
by Ali Mohammad Alqudah, Walid Ashraf, Brian Lithgow and Zahra Moussavi
J. Clin. Med. 2026, 15(3), 1081; https://doi.org/10.3390/jcm15031081 - 29 Jan 2026
Abstract
Background: Obstructive Sleep Apnea (OSA) is one of the most prevalent sleep disorders and is associated with cardiovascular complications, cognitive impairments, and reduced quality of life. Early and accurate diagnosis is essential. The present gold standard, polysomnography (PSG), is expensive and resource-intensive. This work develops a non-invasive machine-learning-based framework to classify four OSA severity groups (non, mild, moderate, and severe) using tracheal breathing sounds (TBSs) and anthropometric variables. Methods: A total of 199 participants were recruited, and TBSs were recorded during wakefulness using a suprasternal microphone. The workflow included the following steps: signal preprocessing (segmentation, filtering, and normalization), multi-domain feature extraction covering spectral, temporal, nonlinear, and morphological features, adaptive feature normalization, and a three-stage feature selection that combined univariate filtering, Shapley Additive Explanations (SHAP)-based ranking, and recursive feature elimination (RFE). Classification involved training ensemble learning models via bootstrap aggregation and validating them using stratified k-fold cross-validation (CV), while preserving the OSA severity and anthropometric distributions. Results: The proposed framework performed well in discriminating among OSA severity groups. TBS features, combined with anthropometric ones, increased classification performance and reliability across all severity classes, providing evidence of the efficacy of non-invasive audio biomarkers for OSA screening. Conclusions: Features derived from TBSs, coupled with anthropometric information, offer a promising alternative or supplement to PSG for OSA severity detection. The approach is scalable and accessible, extending screening and potentially enabling earlier detection of OSA in cases that might otherwise remain undiagnosed. Full article
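A minimal scikit-learn sketch of the kind of pipeline this abstract outlines (univariate filtering, recursive feature elimination, bagged ensemble, stratified k-fold CV). This is not the authors' code: the SHAP-based ranking stage is omitted, the data are synthetic, and all parameter values are arbitrary.

# Univariate filter -> RFE -> bagged ensemble, evaluated with stratified k-fold CV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=80, n_informative=12,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

pipe = Pipeline([
    ("filter", SelectKBest(f_classif, k=40)),                                   # stage 1: univariate filter
    ("rfe", RFE(LogisticRegression(max_iter=2000), n_features_to_select=15)),   # stage 3: RFE
    ("clf", BaggingClassifier(                                                  # bootstrap aggregation
        estimator=RandomForestClassifier(n_estimators=50, random_state=0),      # sklearn >= 1.2 keyword
        n_estimators=10, random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # preserves class proportions
print("CV accuracy:", cross_val_score(pipe, X, y, cv=cv).mean())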

20 pages, 5360 KB  
Article
Experimental Investigation of Deviations in Sound Reproduction
by Paul Oomen, Bashar Farran, Luka Nadiradze, Máté Csanád and Amira Val Baker
Acoustics 2026, 8(1), 7; https://doi.org/10.3390/acoustics8010007 - 28 Jan 2026
Abstract
Sound reproduction is the electro-mechanical re-creation of sound waves using analogue and digital audio equipment. Although sound reproduction implies that repeated acoustical events are close to identical, numerous fixed and variable conditions affect the acoustic result. To arrive at a better understanding of the magnitude of deviations in sound reproduction, amplitude deviation and phase distortion of a sound signal were measured at various reproduction stages and compared under a set of controlled acoustical conditions, one condition being the presence of a human subject in the acoustic test environment. Deviations in electroacoustic reproduction were smaller than ±0.2 dB amplitude and ±3 degrees phase shift when comparing trials recorded on the same day (Δt < 8 h, mean uncertainty u = 1.58%). Deviations increased significantly with greater than two times the amplitude and three times the phase shift when comparing trials recorded on different days (Δt > 16 h, u = 4.63%). Deviations further increased significantly with greater than 15 times the amplitude and the phase shift when a human subject was present in the acoustic environment (u = 24.64%). For the first time, this study shows that the human body does not merely absorb but can also cause amplification of sound energy. The degree of attenuation or amplification per frequency shows complex variance depending on the type of reproduction and the subject, indicating a nonlinear dynamic interaction. The findings of this study may serve as a reference to update acoustical standards and improve accuracy and reliability of sound reproduction and its application in measurements, diagnostics and therapeutic methods. Full article
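For readers who want to see how the reported quantities are defined, the toy sketch below (not the paper's measurement chain; signals and deviations are synthetic) compares two repeated trials of a test tone and reports amplitude deviation in dB and phase shift in degrees at the tone frequency.

# Amplitude deviation (dB) and phase shift (degrees) between two repeated trials.
import numpy as np

fs = 48_000
t = np.arange(fs) / fs
trial_a = np.sin(2 * np.pi * 1000 * t)
trial_b = 0.98 * np.sin(2 * np.pi * 1000 * t + np.deg2rad(2.0))   # slightly altered repeat

spec_a, spec_b = np.fft.rfft(trial_a), np.fft.rfft(trial_b)
k = np.argmax(np.abs(spec_a))                                     # bin of the test tone
amp_dev_db = 20 * np.log10(np.abs(spec_b[k]) / np.abs(spec_a[k]))
phase_dev_deg = np.rad2deg(np.angle(spec_b[k]) - np.angle(spec_a[k]))
print(f"amplitude deviation: {amp_dev_db:+.2f} dB, phase shift: {phase_dev_deg:+.2f} deg")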

40 pages, 2047 KB  
Review
A Comparative Study of Emotion Recognition Systems: From Classical Approaches to Multimodal Large Language Models
by Mirela-Magdalena Grosu (Marinescu), Octaviana Datcu, Ruxandra Tapu and Bogdan Mocanu
Appl. Sci. 2026, 16(3), 1289; https://doi.org/10.3390/app16031289 - 27 Jan 2026
Abstract
Emotion recognition in video (ERV) aims to infer human affect from visual, audio, and contextual signals and is increasingly important for interactive and intelligent systems. Over the past decade, ERV has evolved from handcrafted features and task-specific deep learning models toward transformer-based vision–language models and multimodal large language models (MLLMs). This review surveys this evolution, with an emphasis on engineering considerations relevant to real-world deployment. We analyze multimodal fusion strategies, dataset characteristics, and evaluation protocols, highlighting limitations in robustness, bias, and annotation quality under unconstrained conditions. Emerging MLLM-based approaches are examined in terms of performance, reasoning capability, computational cost, and interaction potential. By comparing task-specific models with foundation model approaches, we clarify their respective strengths for resource-constrained versus context-aware applications. Finally, we outline practical research directions toward building robust, efficient, and deployable ERV systems for applied scenarios such as assistive technologies and human–AI interaction. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
20 pages, 1908 KB  
Article
Research on Real-Time Rainfall Intensity Monitoring Methods Based on Deep Learning and Audio Signals in the Semi-Arid Region of Northwest China
by Yishu Wang, Hongtao Jiang, Guangtong Liu, Qiangqiang Chen and Mengping Ni
Atmosphere 2026, 17(2), 131; https://doi.org/10.3390/atmos17020131 - 26 Jan 2026
Abstract
With the increasing frequency of extreme weather events associated with climate change, real-time monitoring of rainfall intensity is critical for water resource management, disaster warning, and other applications. Traditional methods, such as ground-based rain gauges, radar, and satellites, face challenges like high costs, low resolution, and monitoring gaps. This study proposes a novel real-time rainfall intensity monitoring method based on deep learning and audio signal processing, using acoustic features from rainfall to predict intensity. Conducted in the semi-arid region of Northwest China, the study employed a custom-designed sound collection device to capture acoustic signals from raindrop-surface interactions. The method, combining multi-feature extraction and regression modeling, accurately predicted rainfall intensity. Experimental results revealed a strong linear relationship between sound pressure and rainfall intensity (r = 0.916, R2 = 0.838), with clear nonlinear enhancement of acoustic energy during heavy rainfall. Compared to traditional methods such as commercial microwave link (CML) and radio link techniques, the acoustic approach offers advantages in cost, high-density deployment, and adaptability to complex terrain. Despite some limitations, including regional and seasonal biases, the study lays the foundation for future improvements, such as expanding sample coverage, optimizing sensor design, and incorporating multi-source data. This method holds significant potential for applications in urban drainage, agricultural irrigation, and disaster early warning. Full article
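To make the regression idea concrete, here is a toy sketch with entirely synthetic data (not the study's measurements or model) that fits a linear relation between a simple sound-pressure feature and gauge-measured intensity and reports the correlation, mirroring the r and R2 figures quoted above.

# Linear fit between a sound-pressure feature and rainfall intensity (synthetic data).
import numpy as np

rng = np.random.default_rng(1)
intensity_mm_h = rng.uniform(0.5, 50.0, size=200)                      # gauge "ground truth"
sound_pressure = 0.02 * intensity_mm_h + rng.normal(0.0, 0.1, 200)     # synthetic acoustic feature

slope, intercept = np.polyfit(sound_pressure, intensity_mm_h, 1)
r = np.corrcoef(sound_pressure, intensity_mm_h)[0, 1]
predicted = slope * sound_pressure + intercept
print(f"r = {r:.3f}, R^2 = {r**2:.3f}, example prediction = {predicted[0]:.1f} mm/h")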

18 pages, 10692 KB  
Article
Short-Time Homomorphic Deconvolution (STHD): A Novel 2D Feature for Robust Indoor Direction of Arrival Estimation
by Yeonseok Park and Jun-Hwa Kim
Sensors 2026, 26(2), 722; https://doi.org/10.3390/s26020722 - 21 Jan 2026
Abstract
Accurate indoor positioning and navigation remain significant challenges, with audio sensor-based sound source localization emerging as a promising sensing modality. Conventional methods, often reliant on multi-channel processing or time-delay estimation techniques such as Generalized Cross-Correlation, encounter difficulties regarding computational complexity, hardware synchronization, and reverberant environments where time-difference-of-arrival cues are masked. While machine learning approaches have shown potential, their performance depends heavily on the discriminative power of input features. This paper proposes a novel feature extraction method named Short-Time Homomorphic Deconvolution, which transforms multi-channel audio signals into a 2D Time × Time-of-Flight representation. Unlike prior 1D methods, this feature effectively captures the temporal evolution and stability of time-of-flight differences between microphone pairs, offering a rich and robust input for deep learning models. We validate this feature using a lightweight Convolutional Neural Network integrated with a dual-stage channel attention mechanism, designed to prioritize reliable spatial cues. The system was trained on a large-scale dataset generated via simulations and rigorously tested using real-world data acquired in an ISO-certified anechoic chamber. Experimental results demonstrate that the proposed model achieves precise Direction of Arrival estimation with a Mean Absolute Error of 1.99 degrees in real-world scenarios. Notably, the system exhibits remarkable consistency between simulation and physical experiments, proving its effectiveness for robust indoor navigation and positioning systems. Full article
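For context, the sketch below implements the conventional GCC-PHAT time-delay baseline that this abstract contrasts with; the proposed STHD feature itself is not reproduced here, and the two-channel signal is synthetic.

# GCC-PHAT time-delay estimation between two microphone channels.
import numpy as np

def gcc_phat(sig, ref, fs):
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12                      # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs     # delay of sig relative to ref

fs = 16_000
rng = np.random.default_rng(0)
src = rng.standard_normal(fs)
mic1 = src
mic2 = np.roll(src, 8)                                  # 8-sample inter-channel delay (0.5 ms)
print(f"estimated delay: {gcc_phat(mic2, mic1, fs) * 1e3:.3f} ms")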

19 pages, 922 KB  
Article
The Greek Vocative-Based Marker Moré in Contexts of Disagreement
by Angeliki Alvanoudi
Languages 2026, 11(1), 18; https://doi.org/10.3390/languages11010018 - 20 Jan 2026
Abstract
This study examines the functions of the vocative-based marker moré in contexts of disagreement in Greek conversation, drawing on interactional linguistics. The analysis of audio-recorded informal face-to-face conversations and telephone calls from the Corpus of Spoken Greek shows that, in such contexts, moré functions as an interpersonal marker, signaling solidarity and friendliness and thereby mitigating the potential face threat posed by disagreement. It also functions as a cognitive marker, conveying counterexpectation to the addressee. The study compares moré with its grammaticalized form, vre. Both moré and vre appear in contexts of ‘friendly’ disagreement with similar discourse functions. However, unlike vre, moré occurs in a broader range of disagreement types, from the most to the least face-aggravating, including challenges, contradictions and counterclaims, and it also appears in contexts of impoliteness. This suggests that the two forms have different affordances, with vre displaying a higher level of solidarity than moré. Full article
(This article belongs to the Special Issue Greek Speakers and Pragmatics)
24 pages, 5019 KB  
Article
A Dual Stream Deep Learning Framework for Alzheimer’s Disease Detection Using MRI Sonification
by Nadia A. Mohsin and Mohammed H. Abdul Ameer
J. Imaging 2026, 12(1), 46; https://doi.org/10.3390/jimaging12010046 - 15 Jan 2026
Abstract
Alzheimer’s Disease (AD) is a progressive brain disease that affects millions of individuals across the world. It causes gradual damage to brain cells, leading to memory loss and cognitive dysfunction. Although Magnetic Resonance Imaging (MRI) is widely used in AD diagnosis, existing studies rely solely on visual representations, leaving alternative features unexplored. The objective of this study is to explore whether MRI sonification can provide complementary diagnostic information when combined with conventional image-based methods. In this study, we propose a novel dual-stream multimodal framework that integrates 2D MRI slices with their corresponding audio representations. MRI images are transformed into audio signals using multi-scale, multi-orientation Gabor filtering, followed by a Hilbert space-filling curve to preserve spatial locality. The image and sound modalities are processed using a lightweight CNN and YAMNet, respectively, and then fused via logistic regression. Experimentally, the multimodal framework achieved its highest accuracy of 98.2% in distinguishing AD from Cognitively Normal (CN) subjects, with 94% for AD vs. Mild Cognitive Impairment (MCI) and 93.2% for MCI vs. CN. This work provides a new perspective and highlights the potential of audio transformation of imaging data for feature extraction and classification. Full article
(This article belongs to the Section AI in Imaging)
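The space-filling-curve step mentioned in this abstract can be illustrated with the standard Hilbert d2xy mapping shown below (a generic sketch, not the authors' pipeline); the Gabor filtering and YAMNet stages are omitted and the input image is random.

# Flatten a 2D image into a 1D "audio" sequence along a Hilbert curve so that
# spatially close pixels stay close in time. Image side must be a power of two.
import numpy as np

def hilbert_d2xy(n: int, d: int):
    # Standard iterative Hilbert-curve index-to-coordinate conversion.
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate/flip the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

def image_to_signal(img: np.ndarray) -> np.ndarray:
    n = img.shape[0]
    return np.array([img[hilbert_d2xy(n, d)] for d in range(n * n)], dtype=float)

img = np.random.default_rng(0).random((64, 64))    # stand-in for a filtered MRI slice
signal = image_to_signal(img)                      # 4096-sample sequence
print(signal.shape)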

34 pages, 4760 KB  
Article
Design, Implementation, and Evaluation of a Low-Complexity Yelp Siren Detector Based on Frequency Modulation Symmetry
by Elena-Valentina Dumitrascu, Radu-Alexandru Badea, Răzvan Rughiniș and Robert Alexandru Dobre
Symmetry 2026, 18(1), 152; https://doi.org/10.3390/sym18010152 - 14 Jan 2026
Abstract
Robust detection of emergency vehicle sirens remains difficult due to modern soundproofing, competing audio, and variable traffic noise. Although many simulation-based studies have been reported, relatively few systems have been realized in hardware, and many proposed approaches rely on complex or artificial intelligence-based processing with limited interpretability. This work presents a physical implementation of a low-complexity yelp siren detector that leverages the symmetries of the yelp signal, together with its characterization under realistic conditions. The design is not based on conventional signal processing or machine learning pipelines. Instead, it uses a simple analog envelope-based principle with threshold-crossing rate analysis and a fixed comparator threshold. Its performance was evaluated using an open dataset of more than 1000 real-world audio recordings spanning different road conditions. Detection accuracy, false-positive behavior, and robustness were systematically evaluated on a real hardware implementation using multiple deployable decision rules. Among the evaluated detection rules, a representative operating point achieved a true positive rate of 0.881 at a false positive rate of 0.01, corresponding to a Matthews correlation coefficient of 0.899. The results indicate that a fixed-threshold realization can provide reliable yelp detection with very low computational requirements while preserving transparency and ease of implementation. The study establishes a pathway from conceptual detection principle to deployable embedded hardware. Full article
(This article belongs to the Section Engineering and Materials)
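As a purely digital sketch of the envelope and threshold-crossing-rate principle described here (the paper's detector is analog hardware, and the filter band, thresholds, and decision band below are arbitrary): a band-pass filter turns the yelp's rapid frequency sweeps into amplitude bursts, and counting fixed-threshold crossings of the envelope estimates the modulation rate.

# Band-pass -> envelope -> fixed-threshold crossing rate for a synthetic yelp-like sweep.
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

fs = 16_000
t = np.arange(2 * fs) / fs
f_inst = 800 + 600 * (1 + np.sin(2 * np.pi * 4 * t)) / 2      # yelp-like sweep, 4 Hz modulation
siren = np.sin(2 * np.pi * np.cumsum(f_inst) / fs)

sos = butter(4, [1250, 1450], btype="bandpass", fs=fs, output="sos")
envelope = np.abs(hilbert(sosfilt(sos, siren)))               # bursts once per sweep cycle
thresh = 0.5 * envelope.max()                                 # fixed comparator threshold
above = envelope > thresh
crossings = np.sum(~above[:-1] & above[1:])                   # upward threshold crossings
rate_hz = crossings / (len(t) / fs)
print(f"threshold-crossing rate ~ {rate_hz:.1f} per second")
print("yelp-like" if 2.0 <= rate_hz <= 8.0 else "not yelp-like")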

19 pages, 8336 KB  
Article
Dendritic Spiking Neural Networks with Combined Membrane Potential Decay and Dynamic Threshold for Sequential Recognition
by Qian Zhou, Wenjie Wang and Mengting Qiao
Appl. Sci. 2026, 16(2), 748; https://doi.org/10.3390/app16020748 - 11 Jan 2026
Abstract
Spiking neural networks (SNNs) aim to simulate human neural networks with biologically plausible neurons. However, conventional SNNs based on point neurons ignore the inherent dendritic computation of biological neurons. Additionally, these point neurons usually employ a single membrane potential decay mechanism and a fixed firing threshold, in contrast to the heterogeneity of real neural networks, which limits the neuronal dynamic diversity needed when dealing with multi-scale sequential tasks. In this work, we propose a dendritic spiking neuron model with combined membrane potential decay and a dynamic firing threshold. We then extend the neuron model to the feedforward network level, termed the dendritic spiking neural network with combined membrane potential decay and dynamic threshold (CD-DT-DSNN). By learning heterogeneous neuronal decay factors, which combine two different membrane potential decay mechanisms, and learning adaptive factors, our networks can rapidly respond to input signals and dynamically regulate neuronal firing rates, helping the extraction of multi-scale spatio-temporal features. Experiments on four spike-based audio and image sequential datasets demonstrate that our CD-DT-DSNN outperformed state-of-the-art heterogeneous SNNs and dendritic compartment SNNs, achieving higher classification accuracy with fewer parameters. This work suggests that heterogeneity in neuronal membrane potential decay and firing thresholds is a critical component in learning multi-timescale temporal dynamics and maintaining long-term memory, providing a novel perspective for constructing highly biologically plausible neuromorphic computing models. It offers a solution for multi-timescale sequential tasks such as speech recognition, EEG signal recognition, and robot place recognition. Full article
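A generic leaky integrate-and-fire neuron with an adaptive firing threshold, sketched below, illustrates the "membrane decay plus dynamic threshold" ingredients this abstract builds on; it is not the authors' CD-DT-DSNN model and all constants are arbitrary.

# Leaky integrate-and-fire neuron whose threshold rises after each spike and
# relaxes back toward a baseline, giving a simple dynamic-threshold behaviour.
import numpy as np

def lif_adaptive(inputs, decay=0.9, v_th0=1.0, th_decay=0.95, th_jump=0.5):
    v, th, spikes = 0.0, v_th0, []
    for x in inputs:
        v = decay * v + x                      # membrane potential decay plus input
        th = v_th0 + th_decay * (th - v_th0)   # threshold relaxes toward baseline
        if v >= th:
            spikes.append(1)
            v = 0.0                            # reset after spike
            th += th_jump                      # dynamic threshold: harder to fire again
        else:
            spikes.append(0)
    return np.array(spikes)

rng = np.random.default_rng(0)
spike_train = lif_adaptive(rng.uniform(0.0, 0.6, size=100))
print("firing rate:", spike_train.mean())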

28 pages, 3179 KB  
Article
FakeVoiceFinder: An Open-Source Framework for Synthetic and Deepfake Audio Detection
by Cesar Pachon and Dora Ballesteros
Big Data Cogn. Comput. 2026, 10(1), 25; https://doi.org/10.3390/bdcc10010025 - 7 Jan 2026
Abstract
AI-based audio generation has advanced rapidly, enabling deepfake audio to reach levels of naturalness that closely resemble real recordings and complicate the distinction between authentic and synthetic signals. While numerous CNN- and Transformer-based detection approaches have been proposed, most adopt a model-centric perspective in which the spectral representation remains fixed. Parallel data-centric efforts have explored alternative representations such as scalograms and CQT, yet the field still lacks a unified framework that jointly evaluates the influence of model architecture, its hyperparameters (e.g., learning rate, number of epochs), and the spectral representation along with its own parameters (e.g., representation type, window size). Moreover, there is no standardized approach for benchmarking custom architectures against established baselines under consistent experimental conditions. FakeVoiceFinder addresses this gap by providing a systematic framework that enables direct comparison of model-centric, data-centric, and hybrid evaluation strategies. It supports controlled experimentation, flexible configuration of models and representations, and comprehensive performance reporting tailored to the detection task. This framework enhances reproducibility and helps clarify how architectural and representational choices interact in synthetic audio detection. Full article
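The data-centric knob the framework exposes, the choice of spectral representation and its parameters, can be previewed with librosa as below; the clip is a synthetic test tone and the parameter values are arbitrary, not defaults taken from FakeVoiceFinder.

# Same clip rendered as two front-end representations with explicit parameters.
import numpy as np
import librosa

sr = 16_000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)   # 1 s test tone

mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
)
cqt = librosa.amplitude_to_db(
    np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12))
)
print("log-mel shape:", mel.shape, "| CQT shape:", cqt.shape)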

29 pages, 808 KB  
Review
Spectrogram Features for Audio and Speech Analysis
by Ian McLoughlin, Lam Pham, Yan Song, Xiaoxiao Miao, Huy Phan, Pengfei Cai, Qing Gu, Jiang Nan, Haoyu Song and Donny Soh
Appl. Sci. 2026, 16(2), 572; https://doi.org/10.3390/app16020572 - 6 Jan 2026
Abstract
Spectrogram-based representations have grown to dominate the feature space for deep learning audio analysis systems, and are often adopted for speech analysis also. Initially, the primary motivation behind spectrogram-based representations was their ability to present sound as a two-dimensional signal in the time–frequency plane, which not only provides an interpretable physical basis for analysing sound, but also unlocks the use of a range of machine learning techniques such as convolutional neural networks, which had been developed for image processing. A spectrogram is a matrix characterised by the resolution and span of its dimensions, as well as by the representation and scaling of each element. Many possibilities for these three characteristics have been explored by researchers across numerous application areas, with different settings showing affinity for various tasks. This paper reviews the use of spectrogram-based representations and surveys the state-of-the-art to question how front-end feature representation choice allies with back-end classifier architecture for different tasks. Full article
(This article belongs to the Special Issue AI in Audio Analysis: Spectrogram-Based Recognition)
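The review's point about spectrogram resolution and element scaling can be seen directly in code: the same signal analysed with two window lengths yields matrices with opposite time/frequency trade-offs. This is a generic SciPy sketch with illustrative parameter values, not an example from the paper.

# Time/frequency resolution trade-off of the STFT, with log-magnitude scaling.
import numpy as np
from scipy.signal import stft

fs = 16_000
y = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)

for nperseg in (256, 2048):
    f, t, Z = stft(y, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
    log_mag = 20 * np.log10(np.abs(Z) + 1e-12)      # element scaling: log magnitude
    print(f"nperseg={nperseg}: {log_mag.shape[0]} freq bins x {log_mag.shape[1]} frames")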

24 pages, 2626 KB  
Article
Markov Chain Wave Generative Adversarial Network for Bee Bioacoustic Signal Synthesis
by Kumudu Samarappuli, Iman Ardekani, Mahsa Mohaghegh and Abdolhossein Sarrafzadeh
Sensors 2026, 26(2), 371; https://doi.org/10.3390/s26020371 - 6 Jan 2026
Abstract
This paper presents a framework for synthesizing bee bioacoustic signals associated with hive events. While existing approaches like WaveGAN have shown promise in audio generation, they often fail to preserve the subtle temporal and spectral features of bioacoustic signals critical for event-specific classification. The proposed method, MCWaveGAN, extends WaveGAN with a Markov Chain refinement stage, producing synthetic signals that more closely match the distribution of real bioacoustic data. Experimental results show that this method captures signal characteristics more effectively than WaveGAN alone. Furthermore, when integrated into a classifier, synthesized signals improved hive status prediction accuracy. These results highlight the potential of the proposed method to alleviate data scarcity in bioacoustics and support intelligent monitoring in smart beekeeping, with broader applicability to other ecological and agricultural domains. Full article
(This article belongs to the Special Issue AI, Sensors and Algorithms for Bioacoustic Applications)
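To illustrate the Markov-chain ingredient in generic form (this is not the MCWaveGAN refinement stage, and the reference waveform is synthetic), the sketch below quantizes a waveform into amplitude states, estimates a first-order transition matrix, and samples a new sequence from it.

# First-order Markov chain over quantized waveform amplitudes.
import numpy as np

rng = np.random.default_rng(0)
reference = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 4000)) + 0.05 * rng.standard_normal(4000)

n_states = 32
edges = np.linspace(-1.2, 1.2, n_states + 1)
states = np.clip(np.digitize(reference, edges) - 1, 0, n_states - 1)

T = np.full((n_states, n_states), 1e-6)                  # smoothed transition counts
np.add.at(T, (states[:-1], states[1:]), 1.0)
T /= T.sum(axis=1, keepdims=True)                        # row-normalize to probabilities

seq = [states[0]]
for _ in range(999):
    seq.append(rng.choice(n_states, p=T[seq[-1]]))
centers = (edges[:-1] + edges[1:]) / 2
synthetic = centers[np.array(seq)]                       # back to amplitudes
print("synthetic segment:", synthetic[:8].round(2))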

23 pages, 1037 KB  
Article
Acoustic Side-Channel Vulnerabilities in Keyboard Input Explored Through Convolutional Neural Network Modeling: A Pilot Study
by Michał Rzemieniuk, Artur Niewiarowski and Wojciech Książek
Appl. Sci. 2026, 16(2), 563; https://doi.org/10.3390/app16020563 - 6 Jan 2026
Abstract
This paper presents the findings of a pilot study investigating the feasibility of recognizing keyboard keystroke sounds using Convolutional Neural Networks (CNNs) as a means of simulating an acoustic side-channel attack aimed at recovering typed text. A dedicated dataset of keyboard audio recordings was collected and preprocessed using signal-processing techniques, including Fourier-transform-based feature extraction and mel-spectrogram analysis. Data augmentation methods were applied to improve model robustness, and a CNN-based prediction architecture was developed and trained. A series of experiments was performed under multiple conditions, including controlled laboratory settings, scenarios with background noise interference, tests involving a different keyboard model, and evaluations following model quantization. The results indicate that CNN-based models can achieve high keystroke-prediction accuracy, demonstrating that this class of acoustic side-channel attacks is technically viable. Additionally, the study outlines potential mitigation strategies designed to reduce exposure to such threats. Overall, the findings highlight the need for increased awareness of acoustic side-channel vulnerabilities and underscore the importance of further research to more comprehensively understand, evaluate, and prevent attacks of this nature. Full article
(This article belongs to the Special Issue Artificial Neural Network and Deep Learning in Cybersecurity)
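A minimal PyTorch sketch of the classification stage (not the authors' architecture): a tiny CNN over log-mel patches, one patch per detected keystroke. The class count, patch shape, and layer sizes below are placeholders.

# Tiny CNN mapping a log-mel keystroke patch to key logits.
import torch
import torch.nn as nn

class KeystrokeCNN(nn.Module):
    def __init__(self, n_keys: int = 26):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(n_keys))

    def forward(self, x):                          # x: (batch, 1, n_mels, frames)
        return self.head(self.features(x))

model = KeystrokeCNN()
logits = model(torch.randn(8, 1, 64, 32))          # 8 keystroke patches, 64 mel bins, 32 frames
print(logits.shape)                                # torch.Size([8, 26])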

14 pages, 1392 KB  
Article
AirSpeech: Lightweight Speech Synthesis Framework for Home Intelligent Space Service Robots
by Xiugong Qin, Fenghu Pan, Jing Gao, Shilong Huang, Yichen Sun and Xiao Zhong
Electronics 2026, 15(1), 239; https://doi.org/10.3390/electronics15010239 - 5 Jan 2026
Abstract
Text-to-Speech (TTS) methods typically employ a sequential approach with an Acoustic Model (AM) and a vocoder, using a Mel spectrogram as an intermediate representation. However, in home environments, TTS systems often struggle with issues such as inadequate robustness against environmental noise and limited adaptability to diverse speaker characteristics. The quality of the Mel spectrogram directly affects the performance of TTS systems, yet existing methods overlook the potential of enhancing Mel spectrogram quality through more comprehensive speech features. To address the complex acoustic characteristics of home environments, this paper introduces AirSpeech, a post-processing model for Mel-spectrogram synthesis. We adopt a Generative Adversarial Network (GAN) to improve the accuracy of Mel spectrogram prediction and enhance the expressiveness of synthesized speech. By incorporating additional conditioning extracted from synthesized audio using specified speech feature parameters, our method significantly enhances the expressiveness and emotional adaptability of synthesized speech in home environments. Furthermore, we propose a global normalization strategy to stabilize the GAN training process. Through extensive evaluations, we demonstrate that the proposed method significantly improves the signal quality and naturalness of synthesized speech, providing a more user-friendly speech interaction solution for smart home applications. Full article
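The global normalization idea mentioned here can be sketched as follows (synthetic mel spectrograms, not AirSpeech's code): statistics are computed once over the whole corpus and reused for every utterance, rather than recomputed per utterance.

# Corpus-level (global) mean/std normalization of mel spectrograms.
import numpy as np

rng = np.random.default_rng(0)
corpus = [rng.normal(-5.0, 2.0, size=(80, rng.integers(50, 200))) for _ in range(32)]

all_frames = np.concatenate(corpus, axis=1)              # (n_mels, total_frames)
mel_mean = all_frames.mean(axis=1, keepdims=True)        # one global mean per mel bin
mel_std = all_frames.std(axis=1, keepdims=True) + 1e-8

normalized = [(m - mel_mean) / mel_std for m in corpus]
print("per-bin mean after normalization ~", np.concatenate(normalized, 1).mean().round(3))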
