Search Results (445)

Search Parameters:
Keywords = voice detection

20 pages, 1129 KB  
Article
Enhancing Early Detection of Alzheimer’s Disease: An Ensemble Model for Multi-Domain Cognitive Assessment Using Voice and Video
by Shinwoo Ham, Donghun Min, Hyo Jin Jon, Jung Eun Shin and Eun Yi Kim
Sensors 2026, 26(12), 3833; https://doi.org/10.3390/s26123833 (registering DOI) - 16 Jun 2026
Abstract
Accurate early screening of Alzheimer’s disease (AD) is crucial, yet traditional diagnostic methods are often limited by invasiveness or high costs. Therefore, there is a critical need for non-invasive biomarkers that enable precise and accessible screening. In this study, we propose a multi-modal digital biomarker framework designed to accurately detect AD by evaluating impairments across multiple cognitive domains, such as language, working memory, and visuospatial attention. By leveraging voice and video data, our approach significantly enhances user accessibility and real-world applicability. We validated the proposed framework using a dataset of 128 participants, comprising 77 healthy controls (HCs) and 51 patients with AD. While individual cognitive tasks yielded F1-scores ranging from 69.23% to 77.78% and sensitivities from 69.23% to 80.77%, our ensemble strategy significantly enhanced detection performance, achieving an F1-score of 83.64% and a sensitivity of 88.46%. These findings confirm that the proposed multi-modal digital biomarker framework, enhanced via ensembling, provides a highly accurate, scalable, and practical solution for the non-invasive screening and detection of AD.
(This article belongs to the Section Intelligent Sensors)
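To make the relationship between the two reported metrics concrete, here is a minimal sketch that computes sensitivity and F1-score from confusion-matrix counts. The counts `TP`, `FP`, and `FN` are illustrative assumptions: the abstract does not report a confusion matrix, and these values were chosen only because they happen to reproduce the reported ensemble figures.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Recall on the positive (AD) class: TP / (TP + FN)."""
    return tp / (tp + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical counts, NOT taken from the paper.
TP, FP, FN = 23, 6, 3

print(f"sensitivity = {sensitivity(TP, FN):.2%}")
print(f"F1-score    = {f1_score(TP, FP, FN):.2%}")
```

Because F1 also penalizes false positives, it can sit below sensitivity, as it does in the reported results.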
26 pages, 5864 KB  
Article
An Electrophysiological Study on the Neural Responses of Speaker Discrimination
by Puyang Geng, Xingui Wang, Hong Guo and Weibei Dou
Behav. Sci. 2026, 16(6), 1011; https://doi.org/10.3390/bs16061011 (registering DOI) - 16 Jun 2026
Abstract
The ability to distinguish speakers based on speech signals is a fundamental human ability essential for social communication, yet the neural mechanisms underlying this process remain poorly understood. The present study investigated the temporal dynamics of neural activity during speaker discrimination using event-related potentials (ERPs). Twenty-four native Mandarin speakers completed two tasks: an oddball session, in which participants passively listened to speech stimuli from standard and deviant speakers, and a voice line-up session, in which participants explicitly judged whether two consecutively presented speech stimuli were produced by the same or different speakers. In the oddball session, deviant stimuli elicited robust mismatch negativity (MMN) and P3a components compared to standard stimuli, indicating pre-attentive detection of speaker changes. In the voice line-up session, the different-speaker condition elicited more negative N1 and N400 amplitudes and more positive P2 amplitudes than the same-speaker condition, suggesting that speaker discrimination engages both early sensory processing and later cognitive integration. No significant differences were observed between the P300 and P600 components. These findings reveal distinct neural signatures associated with speaker-related processing across multiple temporal stages, with the MMN and P3a reflecting automatic detection of speaker-related acoustic changes, and the N1, P2, and N400 reflecting explicit speaker discrimination processes. While the present paradigm cannot fully isolate identity-level representations from low-level acoustic discrimination, the results provide novel ERP evidence on the temporal architecture engaged when listeners process speaker-specific information, contributing to a deeper understanding of speaker-related processing in the broader context of speaker identification research.
25 pages, 4330 KB  
Article
Extreme Edge Computing for Secure and Private Multimodal Biometric Identification in Intelligent IoT Systems
by José Antonio de la Torre, Fernando Rincón, Soledad Escolar, Antonio Caruso, Julián Caba and Jesús Barba
Sensors 2026, 26(12), 3756; https://doi.org/10.3390/s26123756 (registering DOI) - 12 Jun 2026
Viewed by 108
Abstract
The exponential growth of Internet of Things (IoT) ecosystems is driving a paradigm shift from centralized cloud computing towards decentralized architectures to mitigate latency and bandwidth constraints. While edge computing addresses some of these challenges, data transmission to local gateways still raises critical security and privacy concerns. This study explores the Compute Continuum by pushing intelligence to the extreme edge using TinyML. We propose a secure, privacy-preserving multimodal biometric authentication system designed for resource-constrained embedded devices. Our solution implements a hierarchical processing chain: an ultra-lightweight person-detection filter acts as an intelligent wake-up mechanism, followed by robust facial and voice authentication modules. Operating as a strict hierarchical pipeline, the system achieves a combined False Acceptance Rate (FAR) of just 0.12%. Experimental results on an ESP32 microcontroller demonstrate exceptional energy efficiency, requiring only 0.15 J per inference cycle. This allows the system to operate autonomously for over 39 h of continuous inference on a standard 600 mAh battery, proving the viability of standalone, privacy-by-design biometric sensors in intelligent IoT environments.
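The energy figures in this abstract can be related with a back-of-the-envelope budget. The sketch below is only an illustration: the battery capacity, energy per inference, and runtime come from the abstract, while the nominal cell voltage of 3.7 V is an assumption (the abstract does not state it), and idle draw and converter losses are ignored.

```python
BATTERY_MAH = 600.0             # from the abstract
ENERGY_PER_INFERENCE_J = 0.15   # from the abstract
RUNTIME_H = 39.0                # from the abstract ("over 39 h")
NOMINAL_VOLTAGE_V = 3.7         # ASSUMPTION: typical single Li-ion/LiPo cell


def energy_budget(capacity_mah: float = BATTERY_MAH,
                  voltage_v: float = NOMINAL_VOLTAGE_V,
                  joules_per_inference: float = ENERGY_PER_INFERENCE_J,
                  runtime_h: float = RUNTIME_H) -> dict:
    """Relate battery capacity, per-inference energy, and runtime."""
    # mAh -> Ah -> Wh -> J
    energy_j = capacity_mah / 1000.0 * voltage_v * 3600.0
    # Upper bound on inference cycles if all stored energy went to inference.
    inferences = energy_j / joules_per_inference
    # Mean power needed to drain the battery over the stated runtime.
    mean_power_w = energy_j / (runtime_h * 3600.0)
    # Cycle period at which 0.15 J per cycle matches that mean power.
    seconds_per_cycle = joules_per_inference / mean_power_w
    return {
        "energy_j": energy_j,
        "inferences": inferences,
        "mean_power_w": mean_power_w,
        "seconds_per_cycle": seconds_per_cycle,
    }


for name, value in energy_budget().items():
    print(f"{name}: {value:.4g}")
```

Under the 3.7 V assumption, the battery stores roughly 8 kJ, which corresponds to about 53,000 inference cycles and a mean draw of roughly 0.057 W over 39 h, or one cycle about every 2.6 s.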
26 pages, 3315 KB  
Article
Remote Tower Air Traffic Controller Fatigue Detection Based on Eye-Tracking and EEG Fusion
by Dajiang Song, Weijun Pan, Zirui Yin, Boyuan Han and Huafei Gao
Aerospace 2026, 13(6), 549; https://doi.org/10.3390/aerospace13060549 (registering DOI) - 12 Jun 2026
Viewed by 136
Abstract
Remote tower operations require air traffic controllers to maintain continuous visual monitoring and integrate information from panoramic displays, radar data, flight strips, and voice communication. Such screen-mediated and sustained surveillance tasks may lead to covert fatigue, which is difficult to capture using a single physiological or behavioral signal. To address this issue, this study proposes a Gated EEG–Eye Fusion Network (GEEF-Net) for window-level fatigue detection in remote tower controllers. EEG and eye-tracking signals were synchronously collected during simulated remote tower tasks and segmented into 5 s windows with a 2 s step. For each window, 53 EEG features and 47 eye-tracking features were extracted to construct a 100-dimensional multimodal representation. GEEF-Net adopts a lightweight modality-gating mechanism to adaptively weight EEG and eye-tracking representations before fatigue classification. Under the main subject-dependent validation setting, GEEF-Net achieved an Accuracy of 0.883, an F1-score of 0.788, and a ROC-AUC of 0.944, outperforming EEG-only, eye-only, and early-fusion baselines in most overall metrics. The gating analysis indicated that eye-tracking features received a higher average weight than EEG features, suggesting the importance of visual behavior in remote tower fatigue detection. Cross-subject validation showed that individual differences remain a major challenge, while few-shot subject-specific calibration improved model adaptation when limited target-subject samples were available. These findings suggest that EEG–eye-tracking fusion with lightweight modality gating is a feasible approach for fatigue detection in simulated remote tower tasks. However, larger datasets and operationally realistic validation considering shift work, circadian effects, and operational pressure are still required before the approach can be considered operationally reliable.
(This article belongs to the Section Air Traffic and Transportation)
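The windowing described in this abstract (5 s windows advanced by a 2 s step) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `segment_windows` and the sampling rate used in the example are assumptions, since the abstract gives neither.

```python
from typing import List, Sequence


def segment_windows(signal: Sequence[float], fs_hz: float,
                    win_s: float = 5.0, step_s: float = 2.0) -> List[List[float]]:
    """Split a 1-D signal into overlapping fixed-length windows.

    Windows that would run past the end of the signal are dropped.
    """
    win = int(round(win_s * fs_hz))    # window length in samples
    step = int(round(step_s * fs_hz))  # hop between window starts in samples
    if win <= 0 or step <= 0:
        raise ValueError("window and step must cover at least one sample")
    return [list(signal[start:start + win])
            for start in range(0, len(signal) - win + 1, step)]


# Example with a hypothetical 10 Hz sampling rate and a 20 s recording.
fs = 10.0
recording = [float(i) for i in range(200)]
windows = segment_windows(recording, fs)
print(len(windows), len(windows[0]))
```

With a 2 s step and 5 s windows, consecutive windows share 3 s of signal. Per-window EEG and eye-tracking feature vectors (53 and 47 values in the abstract) would then be concatenated into the 100-dimensional representation.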
18 pages, 1777 KB  
Article
DeepFakeX: A Comprehensive Multimodal Deepfake Dataset for Research and Analysis
by Sonia Salman, Jawwad Ahmed Shamsi and Rizwan Qureshi
Data 2026, 11(6), 141; https://doi.org/10.3390/data11060141 - 11 Jun 2026
Viewed by 124
Abstract
The expanding capabilities of deep learning-based media synthesis have intensified concerns regarding the authenticity of digital content and the reliability of forensic analysis tools. In response to these challenges, this work introduces DeepFakeX, a collection of 800 synthetically generated videos available under controlled access for research purposes. The dataset encompasses four distinct categories of AI-driven synthesis: facial identity replacement, audio track substitution, neural voice cloning, and combined audiovisual alteration. Unlike existing deepfake datasets that predominantly focus on facial synthesis, DeepFakeX covers a broader range of manipulation modalities, reflecting the diversity of synthetic media encountered in real-world settings. All deepfakes were generated using state-of-the-art, publicly available tools. Standardized post-processing procedures were applied to each video to ensure uniformity in terms of quality, duration and encoding format. DeepFakeX also emphasizes diversity in gender, age, ethnicity, and language. Video contexts span speeches, informational videos, movie clips, news broadcasts, and interviews that reflect content scenarios commonly encountered in real-world online environments. The dataset includes videos in both English and Urdu. The dataset’s quality and structural variability were assessed through visual and audio analyses using the Structural Similarity Index Measure (SSIM), Mel-Frequency Cepstral Coefficients (MFCCs), and Principal Component Analysis (PCA). The evaluation results revealed substantial variability within each manipulation category, along with clearly distinguishable patterns specific to each modality. DeepFakeX has been developed to facilitate rigorous and transparent research in deepfake detection, cross-modal forensic analysis, and AI-driven media forensics. It is hosted on Zenodo under controlled access for research use.
36 pages, 12426 KB  
Article
Explainable Hybrid Deep Learning for Microscopic Dust Defect Inspection on Voice Coil Motor Assembly Components
by Veena Phunpeng, Kreetiwat Chaiyasin, Kitsana Khodcharad, Wipada Boransan, Watcharapong Patangtalo and Attaphon Chaimanatsakun
Appl. Syst. Innov. 2026, 9(6), 120; https://doi.org/10.3390/asi9060120 - 2 Jun 2026
Viewed by 324
Abstract
Ensuring the cleanliness of precision components is critical in Hard Disk Drive (HDD) manufacturing, where microscopic dust contamination on the Voice Coil Motor Assembly (VCMA) can lead to positioning errors, unstable head movement, and long-term reliability failures. However, automated inspection of such contamination remains challenging because dust particles are extremely small, visually irregular, and often appear under complex microscopic backgrounds. This study presents an explainable hybrid deep learning framework for microscopic dust inspection by integrating object detection for precise localization and image classification for defect confirmation. Three YOLO architectures, namely YOLOv5, YOLOv8, and YOLOv11, were comparatively evaluated for dust detection, while three convolutional neural network (CNN) models, ResNet50, EfficientNetB0, and MobileNetV2, were implemented using transfer learning with frozen feature extraction layers for Good (G) and Not Good (NG) image-level classification. The experimental dataset consisted of annotated microscopic VCMA images, with data augmentation applied to the training subset to mitigate limited sample size and class imbalance. Experimental results showed that YOLOv8 achieved the strongest overall aggregate detection performance, whereas YOLOv5 was selected as the preferred detector for subsequent hybrid integration because it produced fewer false positives under reflective and textured microscopic backgrounds. YOLOv11 exhibited lower detection performance in the present setting, likely due to its architectural characteristics being less suited to the limited-data and high-background-complexity conditions of this study. In the present experimental setting, YOLOv5 achieved mAP@0.5 = 0.62, precision = 0.75, and recall = 0.69. For image-level classification, EfficientNetB0 achieved the highest classification accuracy of 93.10%, with F1-score = 0.932 and AUC = 0.986. 
In addition, Grad-CAM visualizations demonstrated that EfficientNetB0 consistently focused on physically meaningful dust-contaminated regions, thereby enhancing the interpretability of the classification results. Overall, the proposed hybrid framework integrating YOLOv5-based localization with EfficientNetB0-based defect confirmation showed promising potential for improving inspection reliability, false-alarm control, and explainability in automated VCMA quality inspection. These findings support the feasibility of explainable deep learning for microscopic defect inspection in HDD manufacturing and suggest its potential applicability to other precision manufacturing environments.
20 pages, 2019 KB  
Review
Diagnostic Accuracy of Artificial Intelligence in Laryngeal Disorders: An Integrative Review
by Samantha Mairesse, Antonino Maniaci, Giovanni Briganti and Jerome R. Lechien
J. Pers. Med. 2026, 16(6), 301; https://doi.org/10.3390/jpm16060301 - 1 Jun 2026
Viewed by 553
Abstract
Background/Objectives: Laryngeal disorders are among the most prevalent conditions in otolaryngology, yet they remain challenging to diagnose without specialized expertise. Artificial intelligence (AI) systems leveraging machine learning (ML) and deep learning (DL) have demonstrated promising performance for the automatic detection and classification of voice disorders and laryngeal lesions. Methods: This review synthesizes findings from 88 studies published between 2015 and 2025 on AI-based laryngeal disorder detection, considering physioacoustic mechanisms, databases and acquisition protocols, AI architectures and validation strategies, and diagnostic performance. Results: The current literature supports high internal accuracies for binary healthy versus pathological detection (88–99%); meanwhile, performance decreases for higher-level tasks such as pathophysiological category classification and identification, particularly under external validation. From a clinical perspective, clinicians do not infer specific diagnoses from isolated acoustic parameters such as percent jitter or shimmer. Instead, they rely on how these perturbation patterns dynamically evolve during connected speech, where alterations guide perceptual differentiation between underlying disorders. Recurrent sources of bias include dependence on a limited number of historical vowel-based databases, class and demographic imbalance, and limited ecological validity of recording protocols. Additional concerns involve the predominant use of internal cross-validation and insufficient reproducibility or code sharing. Conclusions: Drawing on the literature, an integrative three-level clinical recognition framework is proposed, delineating realistic use cases for AI as a decision-support tool rather than an autonomous diagnostic system. 
Key priorities for future personalized medicine and research are also identified, including diversified multi-center datasets, standardized methodological reporting, rigorous external validation, and compliance with regulatory and ethical requirements for medical AI deployment.
35 pages, 3033 KB  
Article
Exploring Pre-Service Teachers’ Reflection on Nonverbal Behavior in Microteaching Through Three-Point Comparison Feedback
by Shota Shirasaka, Takahisa Imagawa and Shuichi Enokida
Educ. Sci. 2026, 16(5), 760; https://doi.org/10.3390/educsci16050760 - 11 May 2026
Cited by 1 | Viewed by 278
Abstract
Discrepancies among feedback sources are typically treated as measurement errors, yet they may serve as catalysts for deeper professional reflection. This exploratory, single-group mixed-methods study, conducted in one Japanese teacher education context, examined how three-point comparison feedback (3PCF)—the simultaneous presentation of automated video-based evaluation, peer evaluation, and self-evaluation—relates to pre-service teachers’ reflection on nonverbal teaching behavior in microteaching. Drawing on Hattie and Timperley’s feedback model and the concept of cognitive conflict, 27 participants received 3PCF on multiple nonverbal behaviors and completed written reflections analyzed using an ordinal coding scheme, keyword detection, and text mining. Quantitative analysis revealed that agreement between automated and peer evaluation was strongly item-dependent (e.g., voice volume: r = 0.853; facial expression: r = 0.164). Qualitative analysis showed that discrepancies were associated with multi-layered reflection; as exploratory, prompt-sensitive indicators, keyword detection suggested that 67% of participants recognized gaps between self-perception and external evaluations, 41% reasoned about why sources diverged, and 70% formulated specific behavioral improvement plans. Text mining further identified distinct reflection patterns, suggesting multiple cognitive pathways. These findings, based on a single cohort, suggest that structured comparison across feedback sources can reframe evaluation discrepancies as educational resources associated with reflective and actionable responses in teacher education.
(This article belongs to the Section Teacher Education)
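The agreement values quoted in this abstract are reported as r, which is conventionally the Pearson correlation coefficient. Assuming that is the statistic used, the sketch below shows how such a value would be computed between two raters. The score lists are hypothetical and only demonstrate the call; they are not data from the study.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally sized samples with n >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-participant scores for one item (e.g., voice volume).
automated = [3.0, 4.5, 2.0, 5.0, 3.5]
peer = [3.2, 4.4, 2.5, 4.8, 3.0]
print(f"r = {pearson_r(automated, peer):.3f}")
```

A value near 1 means the two sources rank participants almost identically, while a value near 0 (as reported for facial expression) means the automated and peer scores carry little shared information.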
34 pages, 2094 KB  
Review
Sensor-Driven Deep Learning for Smart Home Intelligence: Signal Analysis, Multimodal Perception, and System-Level Applications
by Chenchen Wu, Ziqian Yang and Tao Sun
Sensors 2026, 26(10), 2993; https://doi.org/10.3390/s26102993 - 9 May 2026
Viewed by 761
Abstract
Smart home environments are evolving toward context-aware intelligent systems with the rapid integration of the Internet of Things (IoT), edge computing, and artificial intelligence. In such settings, large volumes of heterogeneous sensor data must be continuously processed to support perception, behavior understanding, and autonomous decision-making. Deep learning has emerged as a key approach for transforming raw sensor signals into structured representations that enable these functions. This review examines recent advances in deep learning for smart home applications from a sensor-driven perspective. Existing studies are organized into five major domains: human activity recognition, health monitoring and assisted living, smart energy management, security monitoring and anomaly detection, and voice interaction and intelligent control. Representative methodological paradigms—including convolutional and recurrent neural networks, Transformers, graph-based learning, multimodal fusion, and deep reinforcement learning—are discussed with emphasis on their roles in signal representation, multimodal integration, and decision-oriented modeling. Despite notable progress, several challenges continue to limit real-world deployment. These include the scarcity of high-quality labeled data, privacy and security concerns associated with continuous sensing, limited generalization across environments and users, constraints of edge devices, and the limited interpretability of model output. Addressing these issues requires advances not only in model design but also in data-efficient learning, privacy-preserving architectures, and system-level integration. Future research is expected to focus on multimodal perception, distributed and edge intelligence, knowledge-enhanced modeling, and human-centered explainable systems. 
By synthesizing current developments and highlighting open challenges, this review aims to support the development of robust and deployable deep learning solutions for next-generation smart home systems.
20 pages, 935 KB  
Article
FDE-Mamba: Selective State Space Modeling for Personal Voice Activity Detection
by Chien-Chia Chiu, Tai-You Chen, Tzu-Wei Wang, Berlin Chen and Jeih-Weih Hung
Appl. Sci. 2026, 16(10), 4688; https://doi.org/10.3390/app16104688 - 9 May 2026
Viewed by 240
Abstract
Voice Activity Detection (VAD) and Personal Voice Activity Detection (PVAD) are fundamental components in modern voice-based human–machine interaction systems. While VAD distinguishes speech from non-speech segments, PVAD further identifies whether the detected speech belongs to a specific target speaker, enabling more robust performance in multi-speaker environments. Recently, the Flexible Dynamic Encoder RNN (FDE-RNN) has demonstrated state-of-the-art performance on PVAD tasks by leveraging a detachable Personalization module (P-module) built upon a Dynamic Encoder RNN backbone. However, the Long Short-Term Memory (LSTM) networks employed throughout FDE-RNN inherently suffer from sequential processing constraints that prevent parallelization across time steps, and their fixed-size hidden state may restrict representational capacity for fine-grained speaker discrimination. In this paper, we propose FDE-Mamba, which replaces all three LSTM components in FDE-RNN—the Prediction RNN, the Encoder RNN, and the P-module temporal model—with independent Mamba blocks, each equipped with a selective state space mechanism and an expansion layer for enriched feature representation. The proposed architecture retains the weighted residual connection, FiLM-based speaker embedding fusion, and parallel training strategy of the original FDE-RNN without modification. Experimental results on the LibriSpeech corpus demonstrate that FDE-Mamba achieves a PVAD mAP of 0.9605, representing a 1.97% improvement over the reproduced FDE-RNN baseline (0.9419), along with an accuracy improvement from 86.85% to 89.87% and a 3.16× reduction in real-time factor owing to the memory-efficient linear recurrences of the Mamba selective scan during inference, alongside its inherent parallelizability during training. 
Ablation studies further confirm that both the D skip connection and the expansion layer within each Mamba block contribute meaningfully to the observed performance gains, validating the effectiveness of each architectural design choice. These results suggest that Mamba is a compelling alternative to LSTM for temporal modeling in PVAD systems, and that the proposed integration provides a design blueprint for future selective SSM applications in gated PVAD architectures.
(This article belongs to the Special Issue Application of Deep Learning in Speech Enhancement Technology)
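The abstract mentions FiLM-based speaker embedding fusion. As a rough illustration of the general FiLM idea (feature-wise linear modulation, where a conditioning vector produces a per-feature scale and shift), here is a minimal dependency-free sketch. The layer shapes, parameter names, and the use of plain linear projections are assumptions; the abstract does not describe how FDE-RNN or FDE-Mamba wire this in.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> List[float]:
    """Dense layer y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features: Vector, speaker_emb: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> List[float]:
    """FiLM: scale and shift each feature using the speaker embedding.

    gamma and beta are projected from the embedding, then applied as
    out[i] = gamma[i] * features[i] + beta[i].
    """
    gamma = linear(w_gamma, b_gamma, speaker_emb)
    beta = linear(w_beta, b_beta, speaker_emb)
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


# Hypothetical toy sizes: 3 acoustic features, 2-dimensional speaker embedding.
frame = [0.5, -1.0, 2.0]
embedding = [1.0, 0.0]
w_g = [[1.0, 0.0], [2.0, 0.0], [0.5, 0.0]]
w_b = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
print(film(frame, embedding, w_g, [0.0] * 3, w_b, [0.0] * 3))
```

Conditioning by scale and shift keeps the acoustic pathway unchanged in shape, so the same frame-level features can be modulated differently for each enrolled target speaker.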
17 pages, 10399 KB  
Article
Postoperative Hypoglossal Nerve Palsy in Breast Reconstruction Surgery
by Gil Joon Lee, Woosung Jang, Joon Suk Moon, Byeongju Kang, Jeeyeon Lee, Ho Yong Park, Jeong Yeop Ryu, Kang Young Choi, Jung Dug Yang, Ho Yun Chung and Joon Seok Lee
Medicina 2026, 62(5), 912; https://doi.org/10.3390/medicina62050912 - 8 May 2026
Viewed by 367
Abstract
Background/Objectives: Hypoglossal nerve palsy is a rare but disabling complication of general anesthesia, typically associated with tracheal intubation and head and neck surgery. This study evaluated the incidence, clinical characteristics, and potential mechanisms of postoperative tongue deviation after breast reconstruction and other surgeries performed under general anesthesia with orotracheal intubation. Methods: We retrospectively reviewed 240,646 consecutive general anesthetic procedures with orotracheal intubation performed at two tertiary hospitals between September 2011 and October 2025. Eighteen patients who developed new-onset postoperative tongue deviation were identified, and demographic features, surgical department, breast reconstruction status, anesthetic details, patient positioning, laterality of deviation, symptom duration, and recovery outcomes were analyzed. Results: Postoperative tongue deviation was documented in 18 patients, corresponding to an overall incidence of approximately 0.01%, most frequently after breast reconstruction (7/18, 38.9%), followed by vascular (27.8%), head and neck tumor (16.7%), neurosurgical (11.1%), and hepatobiliary–pancreatic surgery (5.6%). All seven breast-reconstruction cases occurred at the breast-cancer center hospital, corresponding to 0.31% of 2256 breast reconstructions. The median age was 58.0 years; 66.7% patients were female. Most patients (77.8%) achieved complete recovery, whereas 16.7% had residual deviation. Conclusions: Postoperative hypoglossal nerve palsy with tongue deviation is an exceptionally rare event after general anesthesia. In our two-center cohort, it was observed most frequently among patients undergoing breast reconstruction at one participating center; this pattern is confounded by institution-specific anesthetic and positioning practices and should not be interpreted as evidence that the procedure itself carries inherent risk. 
The findings are hypothesis-generating and suggest that prolonged operating time, repeated intraoperative position changes, and specific head-fixation and tube-fixation practices warrant prospective investigation. Meticulous head–neck alignment, careful tube fixation, and a structured postoperative cranial-nerve check (tongue-protrusion and voice-quality assessment in the recovery room and on postoperative day 1) may aid the early detection of this complication.
(This article belongs to the Section Surgery)
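The incidence figures in this abstract follow directly from the reported counts, and a short calculation makes the rounding explicit. The counts below are taken from the abstract; the helper name `percent` is just for illustration.

```python
def percent(events: int, total: int) -> float:
    """Return events / total expressed as a percentage."""
    return 100.0 * events / total


overall = percent(18, 240_646)      # all intubated general anesthetics
breast = percent(7, 2_256)          # breast reconstructions at one center
share_of_cases = percent(7, 18)     # breast cases among all affected patients

print(f"overall incidence:      {overall:.4f}%")
print(f"breast reconstruction:  {breast:.2f}%")
print(f"share of affected:      {share_of_cases:.1f}%")
```

The overall rate is about 0.0075%, which the abstract rounds to approximately 0.01%. The single-center breast reconstruction rate of 0.31% is roughly forty times higher, which is the contrast the authors flag as confounded by institution-specific practices.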
15 pages, 747 KB  
Article
Multimodal Recognition of Out-of-Distribution Individuals Using Contrastive Learning
by Sergio Garcia, Francisco Gomez-Donoso and Miguel Cazorla
AI 2026, 7(5), 162; https://doi.org/10.3390/ai7050162 - 6 May 2026
Viewed by 742
Abstract
This paper presents an innovative methodology for detecting out-of-distribution individuals based on a multimodal contrastive learning approach. The system combines voice and facial image data by projecting them into a shared representation in the embedding space, enabling accurate identification of previously unseen individuals. This approach overcomes the limitations of traditional methods by providing more robust and consistent detection in dynamic scenarios, using advanced neural networks and optimized contrastive losses. Specifically, the main contribution of this work is the introduction of a multimodal contrastive framework that performs cross-modal consistency verification between facial and vocal representations, enabling reliable detection of out-of-distribution individuals without the need for identity gallery retrieval. Experimental results on multiple datasets highlight the effectiveness of the system, with accuracy above 90% in detecting in-distribution samples in all evaluated cases. Regarding the identification of out-of-distribution cases, the system maintains outstanding performance, achieving values close to 90% on average, with some datasets exceeding 95%. These results underscore its ability to recognize both known identities and handle unknown data, even under challenging conditions. This approach represents a significant advancement in the multimodal recognition of individuals, with potential applications in critical areas such as security, surveillance, and human–computer interaction.
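One simple way to picture a cross-modal consistency check in a shared embedding space is to compare the face and voice embeddings of a sample and threshold their cosine similarity. The sketch below is an assumption-laden illustration, not the paper's method: the decision rule, the threshold value, and the function names are all hypothetical, and the abstract does not state how the consistency score is computed.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def is_in_distribution(face_emb: Sequence[float], voice_emb: Sequence[float],
                       threshold: float = 0.5) -> bool:
    """Hypothetical rule: embeddings that agree across modalities are
    treated as a known (in-distribution) individual."""
    return cosine_similarity(face_emb, voice_emb) >= threshold


# Toy 3-dimensional embeddings, made up for demonstration.
print(is_in_distribution([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]))
print(is_in_distribution([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

A rule of this kind needs no gallery lookup, because the decision depends only on whether the two modalities of the same sample agree; in practice the threshold would be tuned on held-out data.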

29 pages, 1725 KB  
Article
A User Recognition Methodology Based on Voice Biometrics and Dynamic Clustering for Social Robots
by Arecia Segura-Bencomo, Marcos Maroto-Gómez, Juan José Gamboa-Montero and José Carlos Castillo
Appl. Sci. 2026, 16(9), 4548; https://doi.org/10.3390/app16094548 - 5 May 2026
Viewed by 373
Abstract
Social robots are systems designed to assist people across different fields. During their operation, they have to interact with people with different characteristics and necessities. Consequently, correctly recognising the user interacting with the robot facilitates the generation of a personalised experience that satisfies the user’s needs. In robotics, user recognition is typically based on face recognition from image processing and datasets that require retraining the network to include new users. However, some robots, such as pet-like companions, often lack a camera due to reduced dimensions, limited computational resources, or privacy constraints. Additionally, robots can occasionally encounter new users, requiring online recognition to provide a personalised interaction experience. To address these limitations, this article presents a user recognition system based on voice biometrics and dynamic clustering for adaptive social robots. We evaluate a set of open-source models for voice biometric extraction using different clustering algorithms to identify the best combination for our application. The resulting system is implemented in a pet-like robot companion that is used for the affective support of older adults, demonstrating its capacities in a real-world scenario. The system achieves more than 73% accuracy in recognising users who had previously spoken to the robot and more than 71% success in recognising new users who had not previously interacted with the robot and creating a personal profile for them. However, the system still detects noise, especially when the speaker has never interacted with the robot. Full article
(This article belongs to the Section Robotics and Automation)
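
As a rough illustration of what recognising known speakers and enrolling new ones online can look like, the sketch below assigns each voice embedding to the nearest existing cluster by cosine similarity and opens a new cluster when no match is close enough. This is not the authors' method: the class name `OnlineSpeakerClusters`, the 0.8 threshold, and the two-dimensional toy embeddings are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class OnlineSpeakerClusters:
    """Hypothetical threshold-based online clustering of voice embeddings."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed value, would need tuning
        self.centroids = []         # one running-mean embedding per user
        self.counts = []            # number of utterances per user

    def assign(self, emb):
        """Return (user_id, is_new_user) for one embedding."""
        best_id, best_sim = -1, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(emb, centroid)
            if sim > best_sim:
                best_id, best_sim = i, sim
        if best_id >= 0 and best_sim >= self.threshold:
            # Known user: fold the embedding into the running mean.
            k = self.counts[best_id]
            old = self.centroids[best_id]
            self.centroids[best_id] = [(c * k + e) / (k + 1) for c, e in zip(old, emb)]
            self.counts[best_id] = k + 1
            return best_id, False
        # No sufficiently similar cluster: create a profile for a new user.
        self.centroids.append(list(emb))
        self.counts.append(1)
        return len(self.centroids) - 1, True


clusters = OnlineSpeakerClusters()
first = clusters.assign([1.0, 0.0])    # first speaker ever heard
second = clusters.assign([0.9, 0.1])   # similar voice, same speaker
third = clusters.assign([0.0, 1.0])    # dissimilar voice, new speaker
```

In a real system the embeddings would come from a speaker-embedding model and would have far more dimensions; the toy vectors here only exercise the control flow.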

13 pages, 2849 KB  
Article
Statistical Disturbance Detection Algorithm for Control of Camera Module Miniature Actuators
by Junseok Oh and Changsoo Eun
Electronics 2026, 15(9), 1925; https://doi.org/10.3390/electronics15091925 - 2 May 2026
Viewed by 368
Abstract
This paper proposes disturbance detection algorithms to mitigate the oscillations in smartphone camera module actuators induced by external shocks (e.g., drop events). Smartphone camera modules operate under volumetric constraints with inter-component trade-offs. Specifically, the limited space leads to insufficient performance because actuators are unstable under external disturbances. To optimize actuator function, we define the dynamic model of a voice coil motor (VCM) actuator, a controller model, and a shock disturbance model and perform worst-case operational analysis with MATLAB/Simulink (R2015a) simulations. Moreover, we propose two disturbance detection techniques: a phase-based detection algorithm that statistically analyzes the phase difference between the control input and the position feedback signal to detect disturbances and a frequency-based detection algorithm that uses discrete Fourier transform (DFT) to identify the characteristic spectral component of disturbances at 500 Hz. According to the simulation results, both methods reduce recovery time upon disturbance. Furthermore, the frequency-based algorithm achieves faster recovery performance than the phase-based detection algorithm. The phase-based detection method offers low computational complexity but increased processing latency, while the frequency-based detection method requires more memory capacity. The proposed techniques are anticipated to improve the recovery time of smartphone camera modules under disturbances, thereby enhancing system robustness and contributing to a more stable user imaging experience by mitigating image blur. Full article
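
The frequency-based idea in this abstract, checking for a characteristic spectral component at 500 Hz with a DFT, can be sketched as a single-bin DFT followed by a threshold test. The 500 Hz target comes from the abstract; the 8 kHz sample rate, the 160-sample window, the 0.1 threshold, and the function names are assumptions for illustration and do not reflect the authors' implementation.

```python
import cmath
import math


def dft_bin_magnitude(samples, sample_rate_hz, target_hz):
    """Amplitude estimate of the component at target_hz (single-bin DFT)."""
    n = len(samples)
    acc = 0j
    for i, x in enumerate(samples):
        acc += x * cmath.exp(-2j * math.pi * target_hz * i / sample_rate_hz)
    # Scale so a pure sine of amplitude A over whole cycles yields about A.
    return 2.0 * abs(acc) / n


def disturbance_detected(samples, sample_rate_hz=8000.0, target_hz=500.0, threshold=0.1):
    """Flag a disturbance when the 500 Hz component exceeds an assumed threshold."""
    return dft_bin_magnitude(samples, sample_rate_hz, target_hz) > threshold


fs = 8000.0  # assumed sample rate
n = 160      # 20 ms window: exactly 10 cycles of 500 Hz, 1 cycle of 50 Hz

# Normal operation: slow 50 Hz position ripple only.
normal = [0.2 * math.sin(2 * math.pi * 50.0 * i / fs) for i in range(n)]
# After a shock: the same ripple plus a 500 Hz oscillation of amplitude 0.5.
shocked = [v + 0.5 * math.sin(2 * math.pi * 500.0 * i / fs) for i, v in enumerate(normal)]
```

Evaluating one bin costs a single pass over the window, but the window itself has to be buffered, which is in line with the abstract's remark that the frequency-based method needs more memory.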

17 pages, 247 KB  
Article
Human vs. LLM-Generated Speech Transcripts: Psycholinguistic Proxies and Discourse Dynamics
by Alaa Alsaeedi, Amal Almansour and Amani Jamal
Appl. Sci. 2026, 16(9), 4176; https://doi.org/10.3390/app16094176 - 24 Apr 2026
Viewed by 292
Abstract
Voice cloning enables realistic fake speech in which a speaker’s identity is preserved while the spoken message is semantically altered. This paper asks whether such meaning-level manipulation leaves detectable traces in transcripts alone. To study this problem, we introduce FakeSpeech+, a paired real–fake dataset built from authentic speech clips and their matched semantically altered counterparts, re-embedded into cloned voices while preserving speaker identity. Using this dataset, we conduct a transcript-first analysis based on interpretable text-only features from two groups: (i) linguistic content organization and discourse dynamics, and (ii) compact production-related proxy cues, including hesitation and disfluency markers. We evaluate these cues under transcript-length control through residualization and compare authentic and manipulated transcripts using statistical and experimental analyses. The results show that only a limited subset of features retains strong separation after length control, with coordination-related structure and emotion anchoring emerging as the clearest cues, while several production-related and discourse-variability features show weaker but still informative differences. In contrast, a number of syntactic, lexical-diversity, and other discourse-level features show substantial overlap after residualization. These findings indicate that transcript-level structure and selected production-related cues remain informative under realistic content-manipulation threats, supporting the value of transcript-based analysis for identity-preserving fake speech. Full article
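
Length control through residualization, as mentioned in this abstract, is commonly done by regressing each feature on transcript length and keeping only the residuals. The sketch below assumes a simple linear least-squares fit with transcript length as the single predictor; the abstract does not state which regression form the authors used, and the function name and toy data are hypothetical.

```python
def residualize(feature, length):
    """Remove the linear effect of transcript length from a feature.

    Fits feature ~ intercept + slope * length by ordinary least squares
    and returns the residuals (assumed form of length control).
    """
    n = len(feature)
    mean_x = sum(length) / n
    mean_y = sum(feature) / n
    sxx = sum((x - mean_x) ** 2 for x in length)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(length, feature))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(length, feature)]


# Toy data: a feature that is driven entirely by transcript length.
lengths = [50.0, 100.0, 150.0, 200.0]
length_driven = [2.0 * x + 1.0 for x in lengths]
residuals = residualize(length_driven, lengths)

# A feature with variation that length does not explain.
mixed = [101.0, 205.0, 299.0, 401.0]
mixed_residuals = residualize(mixed, lengths)
```

A feature that only tracks length leaves residuals near zero, so any separation between authentic and manipulated transcripts that survives this step cannot be attributed to length alone.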