Search Results (448)

Search Parameters:
Keywords = voice-recognition

24 pages, 5665 KB  
Article
Munir: A Multimodal Smart-Glasses System for Enhancing Human–Computer Interaction for Visually Impaired Individuals
by Nora Alhammad, Aljawharah Alsubaie, Rama Alomair, Fajer Alamro and Mashael Alammar
Sensors 2026, 26(12), 3950; https://doi.org/10.3390/s26123950 - 22 Jun 2026
Viewed by 273
Abstract
Visual impairment affects approximately 2.2 billion people worldwide, yet existing assistive technologies remain fragmented and prohibitively expensive. This paper presents Munir, an integrated multimodal assistive system designed to enhance human–computer interaction through a combination of a mobile application and Bluetooth-enabled smart glasses. Munir leverages a hybrid machine learning architecture to provide inclusive, real-time support for daily living activities. The system integrates ten core capabilities—including face recognition, optical character recognition, and scene description—all accessible through a unified bilingual (Arabic/English) voice interface. By employing on-device processing for biometric tasks, Munir ensures user privacy and trust while maintaining high responsiveness. End-to-end system evaluation on the SCface dataset achieves a 96.69% recognition rate with 0% False Accept Rate. At an estimated first-year total cost of $806, Munir demonstrates a 4–5× cost advantage over commercial alternatives, providing a scalable and affordable multimodal solution for global digital inclusion. Full article
(This article belongs to the Special Issue Human–Computer Interaction in Sensor Systems)

33 pages, 17208 KB  
Article
Reliability-Aware Dynamic Score Fusion for Robust Face–Voice Biometric Identification Under Mask and Transparent Shield Conditions
by Kamal Abuqaaud, Ali Bou Nassif and Ismail Shahin
Electronics 2026, 15(12), 2612; https://doi.org/10.3390/electronics15122612 - 12 Jun 2026
Viewed by 154
Abstract
Multimodal biometric systems have become essential components of modern electronic identity and authentication platforms where robustness under real-world degradation is critical. However, opaque face masks impose severe facial occlusion and attenuate high-frequency spectral components. Conversely, transparent face shields introduce complex specular reflections and act as an acoustic channel distortion source. Addressing these asymmetric degradation challenges, this paper proposes a reliability-aware Dynamic Score Fusion (DSF) for multimodal biometric identification. The proposed method performs sample-level reliability estimation for both face and voice modalities at the input stage. This enables sample-wise adaptive weighting of modality scores based on their estimated reliability. The framework integrates an ElasticFace-Arc backbone for face recognition with an Emphasized Channel Attention, Propagation and Aggregation—Time Delay Neural Network (ECAPA-TDNN) for speaker identification. The proposed approach is evaluated on the FaciaVox dataset, comprising face images and voice recordings acquired under multiple face-covering conditions. Experiments under the Standard to Cross-Condition Protocol (SCCP) and Multi-Condition Protocol (MCP) demonstrate that the proposed DSF consistently outperforms conventional score-level fusion methods, including Weighted Sum Fusion (WSF) and Logistic Regression Fusion (LRF). It achieves average Rank-1 accuracies of 89.6% (SCCP) and 93.7% (MCP), with gains of up to 9.3 percentage points over these baselines. The reliability estimators further demonstrate strong predictive capability, yielding Area Under the Curve (AUC) values above 0.95 for both modalities in distinguishing correctly and incorrectly identified samples under the closed-set identification setting. 
These findings confirm that sample-wise reliability modeling provides an effective mechanism for enhancing multimodal biometric performance under challenging mask and shield conditions, supporting the deployment of robust AI-driven electronic identification systems. Full article
(This article belongs to the Section Artificial Intelligence)
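The contrast the abstract draws between fixed-weight score fusion and sample-wise adaptive weighting can be illustrated with a minimal sketch. This is not the authors' DSF implementation: the normalisation of the two reliability estimates into a weight, and all function names, are assumptions made for illustration only.

```python
from typing import List, Sequence


def weighted_sum_fusion(face_scores: Sequence[float],
                        voice_scores: Sequence[float],
                        face_weight: float = 0.5) -> List[float]:
    """Fixed-weight baseline: one weight is shared by every sample."""
    return [face_weight * f + (1.0 - face_weight) * v
            for f, v in zip(face_scores, voice_scores)]


def dynamic_score_fusion(face_scores: Sequence[float],
                         voice_scores: Sequence[float],
                         face_reliability: float,
                         voice_reliability: float) -> List[float]:
    """Per-sample weighting: the weight follows the estimated reliabilities.

    The simple ratio used here is a hypothetical choice, not the paper's rule.
    """
    total = face_reliability + voice_reliability
    # Fall back to equal weights when neither modality is judged reliable.
    face_weight = 0.5 if total <= 0 else face_reliability / total
    return weighted_sum_fusion(face_scores, voice_scores, face_weight)


def rank1_identity(fused_scores: Sequence[float]) -> int:
    """Index of the enrolled identity with the highest fused score."""
    return max(range(len(fused_scores)), key=lambda i: fused_scores[i])
```

In this sketch, a probe whose face sample is judged unreliable (for example, heavily occluded) lets the voice scores dominate the decision, and vice versa.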

24 pages, 4330 KB  
Article
Extreme Edge Computing for Secure and Private Multimodal Biometric Identification in Intelligent IoT Systems
by José Antonio de la Torre, Fernando Rincón, Soledad Escolar, Antonio Caruso, Julián Caba and Jesús Barba
Sensors 2026, 26(12), 3756; https://doi.org/10.3390/s26123756 - 12 Jun 2026
Viewed by 227
Abstract
The exponential growth of Internet of Things (IoT) ecosystems is driving a paradigm shift from centralized cloud computing towards decentralized architectures to mitigate latency and bandwidth constraints. While edge computing addresses some of these challenges, data transmission to local gateways still raises critical security and privacy concerns. This study explores the Compute Continuum by pushing intelligence to the extreme edge using TinyML. We propose a secure, privacy-preserving multimodal biometric authentication system designed for resource-constrained embedded devices. Our solution implements a hierarchical processing chain: an ultra-lightweight person-detection filter acts as an intelligent wake-up mechanism, followed by robust facial and voice authentication modules. Operating as a strict hierarchical pipeline, the system achieves a combined False Acceptance Rate (FAR) of just 0.12%. Experimental results on an ESP32 microcontroller demonstrate exceptional energy efficiency, requiring only 0.15 J per inference cycle. This allows the system to operate autonomously for over 39 h of continuous inference on a standard 600 mAh battery, proving the viability of standalone, privacy-by-design biometric sensors in intelligent IoT environments. Full article
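The battery figures in the abstract can be put in context with a rough energy budget. The sketch below assumes a nominal cell voltage of 3.7 V, which the abstract does not state, and ignores idle draw and regulator losses, so the numbers it yields are only an order-of-magnitude check rather than a reproduction of the authors' measurement.

```python
def battery_energy_joules(capacity_mah: float, voltage_v: float) -> float:
    """Stored energy in joules: (Ah) x (V) x 3600 s/h."""
    return capacity_mah / 1000.0 * voltage_v * 3600.0


def inferences_per_charge(capacity_mah: float, voltage_v: float,
                          joules_per_inference: float) -> float:
    """Upper bound on inference cycles if all energy went to inference."""
    return battery_energy_joules(capacity_mah, voltage_v) / joules_per_inference


def mean_seconds_per_inference(runtime_hours: float, n_inferences: float) -> float:
    """Average spacing between inference cycles over the stated runtime."""
    return runtime_hours * 3600.0 / n_inferences


# ASSUMPTION: 3.7 V nominal voltage (typical for a single Li-ion/LiPo cell).
ASSUMED_VOLTAGE_V = 3.7
cycles = inferences_per_charge(600, ASSUMED_VOLTAGE_V, 0.15)
spacing = mean_seconds_per_inference(39, cycles)
```

Under these assumptions, a 600 mAh cell would support on the order of tens of thousands of 0.15 J cycles, spaced a few seconds apart over 39 h.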

20 pages, 906 KB  
Project Report
Design, Development, and Evaluation of Multimodal Conversational Agents for Health Data Registration and Monitoring: Framework Proposal and Pilot Exploratory Study
by Mateus Klein Roman, Luan Zanatta, Jeangrei Emanoelli Veiga, Ericles Andrei Bellei and Ana Carolina Bertoletti De Marchi
Healthcare 2026, 14(12), 1641; https://doi.org/10.3390/healthcare14121641 - 10 Jun 2026
Viewed by 192
Abstract
Objectives: This study proposes an implementation-oriented design framework for multimodal conversational agents handling patient-generated health data and reports an exploratory experiment evaluating its instantiation in hypertension self-monitoring, focusing on user experience of conversational data-entry workflows. Methods: The framework operationalizes four complementary dimensions (social intelligence, communication style, anthropomorphic characteristics, and technological mapping) and was instantiated in two agents integrated into an eHealth platform. Each agent supports users by providing prompts, interpreting responses, checking data plausibility, and confirming submission. A three-arm, single-session feasibility experiment (n=18, n=6 per group) compared a conventional app interface with text-based and voice-based conversational agents. Evaluation triangulated three sources of evidence: open-ended qualitative responses analyzed through descriptive content analysis, session-level researcher observation notes, and the User Experience Questionnaire (UEQ) reported descriptively with one-way ANOVA and η2 effect sizes. Results: All three modalities were acceptable to participants and produced UEQ scores in the positive range. Hesitation was observed in 2 of 6 Control participants, 1 of 6 Text participants, and 3 of 6 Voice participants, with self-reports indicating that voice-related difficulties were modality-specific (diction, command phrasing) and resolved within the session. Qualitative themes of acceptability and innovation, perceived effort, and modality-specific facilitators emerged across the corpus. Between-group ANOVAs did not reach statistical significance (p>0.05), as expected for an underpowered design, yet η2 values were medium for Attractiveness, Efficiency, Dependability, and Pragmatic Quality and large for Stimulation and Hedonic Quality, converging with the qualitative innovation and engagement signal in the conversational conditions. 
Conclusions: The framework and feasibility experiment provide preliminary, hypothesis-generating evidence on the potential of multimodal conversational interfaces in healthcare. However, no clinical, behavioral, or longitudinal outcomes were assessed. The four design dimensions can be tentatively associated with themes recognizable in user discourse, and the observed effect-size pattern motivates adequately powered longitudinal studies that incorporate behavioral and clinical endpoints alongside user experience measures. Full article
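The η2 effect size reported alongside the one-way ANOVA is the share of total variance explained by group membership, i.e. the between-group sum of squares divided by the total sum of squares. A minimal sketch of that calculation follows; the function name is illustrative and no data from the study are used.

```python
from typing import Sequence


def eta_squared(groups: Sequence[Sequence[float]]) -> float:
    """Eta squared for a one-way design: SS_between / SS_total."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)

    # Between-group sum of squares: group size times squared mean offset.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    # Total sum of squares around the grand mean.
    ss_total = sum((value - grand_mean) ** 2 for value in all_values)
    if ss_total == 0:
        raise ValueError("all observations are identical; eta squared is undefined")
    return ss_between / ss_total
```

Because η2 does not depend on a significance threshold, it can be sizeable even when a small-sample ANOVA is non-significant, which is the pattern the abstract describes.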

26 pages, 3422 KB  
Article
Voice-Driven Support System for Speech Practice in Older Adults: An Accessible Web–Mobile Approach
by Lucrecia Llerena, Nancy Rodríguez, Bertha Vásquez, John W. Castro and Alexander Herrera
Algorithms 2026, 19(6), 469; https://doi.org/10.3390/a19060469 - 9 Jun 2026
Viewed by 194
Abstract
Population aging poses significant challenges to oral communication due to age-related changes in articulation, verbal fluency, and speech pacing, even among older adults without neurodegenerative conditions. Despite advances in voice-based assistive technologies, there remains a lack of integrated engineering solutions that support structured, autonomous speech practice in non-clinical environments. This study proposes a deterministic, rule-based speech evaluation workflow implemented within a hybrid web–mobile assistive system. The workflow integrates audio capture, cloud-based automatic speech recognition (ASR), rule-based pronunciation evaluation, immediate multimodal feedback, and progress monitoring within a unified system architecture. The proposed architecture includes a mobile application for older adults and a web platform for configuration and monitoring by caregivers. A prototyping-oriented methodology was applied, including requirements elicitation, system design, implementation, and usability evaluation using the Thinking Aloud method and the System Usability Scale (SUS). Results showed stable system behavior under controlled evaluation conditions, an average recognition accuracy of 90% during preliminary evaluation sessions, and a response latency of 1.82 s, supporting stable real-time interaction during guided speech exercises. These findings demonstrate the feasibility of the proposed assistive architecture as an accessible and reproducible solution for guided speech support in older adults. Full article
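For readers unfamiliar with the System Usability Scale mentioned in the abstract, the standard scoring rule can be sketched as follows: ten items rated 1 to 5, odd-numbered items contribute (rating - 1), even-numbered items contribute (5 - rating), and the sum is multiplied by 2.5 to give a 0 to 100 score. The function name is illustrative; the abstract does not report the study's SUS value, and none is assumed here.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Score one participant's ten SUS ratings (each 1..5) on a 0..100 scale."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item ratings")
    if any(rating < 1 or rating > 5 for rating in responses):
        raise ValueError("each rating must be between 1 and 5")

    total = 0
    for index, rating in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded.
            total += rating - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded.
            total += 5 - rating
    return total * 2.5
```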

17 pages, 276 KB  
Article
Light Against Darkness: Rhetoric and the Struggle over LGBTQ+ in Israel
by Dolly Eliyahu-Levi and Avi Gvura
Soc. Sci. 2026, 15(6), 373; https://doi.org/10.3390/socsci15060373 - 8 Jun 2026
Viewed by 331
Abstract
The article examines conservative rhetoric and discourse in Israel toward the LGBTQ+ community from a sociolinguistic perspective that conceptualizes language as an arena of socio-cultural struggle over identity, power, and normativity. Drawing on queer linguistics theory and identity politics, the study explores how language constructs reality through metaphors of illness, sin, and existential threat, as well as through theological framing and appeals to family and national values. These rhetorical strategies produce a social hierarchy in which heteronormativity is positioned as a “natural truth” while queer identities are labelled as deviant or threatening. From a sociological perspective, the study reveals how conservative discourse establishes social boundaries and reinforces collective identity through the exclusion of the Other, thereby reproducing power relations and hierarchies. The article calls for the development of an alternative public discourse grounded in pluralism, inclusion, and the recognition of diverse identities as a means of strengthening democracy and social justice. While existing studies have examined conservative discourse toward LGBTQ+ communities primarily in Western contexts, this study contributes to the field by centering the Israeli case as a distinctive site of analysis, where conservative voices emerge from multiple and ideologically heterogeneous traditions: national-religious, ultra-Orthodox, and Muslim-Arab. By examining how rhetorically divergent speakers converge around shared mechanisms of exclusion, the study reveals that heteronormative discourse is not the product of a single ideological source, but a cross-sectoral phenomenon embedded in the specific political and cultural tensions of Israeli society. Full article
18 pages, 10628 KB  
Article
From Speech to Summary in Turkmen: A Parameter-Efficient Neural Pipeline
by Ualsher Tukeyev and Maksim Ocheretin
Appl. Sci. 2026, 16(12), 5734; https://doi.org/10.3390/app16125734 - 6 Jun 2026
Viewed by 271
Abstract
This paper presents the development of a neural model pipeline for automatic speech recognition (ASR) and text summarization in Turkmen, a low-resource language with agglutinative morphology. For the ASR task, the MMS-1b-all model (Meta) was employed with LoRA adaptation and CTC decoding, fine-tuned on the Common Voice corpus (2733 samples). For summarization, the mBART-50-large model was used with Turkmen-specific tokenization and was trained on a news text corpus (10,248 samples). The following results were achieved: WER = 17.59% for ASR (baseline model: 107.33%) and ROUGE-L = 0.4255 for summarization (zero-shot baseline: 0.2294). The scientific contribution is the creation of a parameter-efficient neural pipeline for speech-to-summary for Turkmen. The developed system can be applied to automated meeting transcription and text data processing in the Turkmen language. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
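The word error rate (WER) quoted in the abstract is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. Because insertions are counted, WER can exceed 100%, which is consistent with a baseline figure above 100%. The sketch below is a generic implementation of that definition, not the authors' evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edit distance between the reference prefix seen so
    # far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            ))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)
```

A hypothesis that is much longer than the reference, for instance, accumulates insertion errors and pushes the ratio past 1.0.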

26 pages, 3204 KB  
Article
An Ergonomic Approach to Medical Safety Training Using Augmented Reality Glasses: System Design, Cognitive–Neuroscientific Theoretical Framework, and Preliminary Outcomes
by Kohei Tanaka, Kurumi Asaumi, Ryosuke Kasai, Hirotaka Sato, Ryosuke Uchibayashi and Motoki Shigenaga
Theor. Appl. Ergon. 2026, 2(2), 10; https://doi.org/10.3390/tae2020010 - 5 Jun 2026
Viewed by 234
Abstract
Healthcare professionals must acquire and maintain both declarative knowledge and fine psychomotor skills across a wide range of clinical procedures. Human working memory is physiologically limited, and the high cognitive demands of clinical environments frequently contribute to medical errors and adverse events. Intra-individual performance variability—driven by fatigue, stress, and motivation—represents a further challenge that conventional medical safety education has not adequately addressed. According to the World Health Organization, patient harm ranks fourteenth in the global burden of disease, with approximately 10% of hospitalised patients in high-income countries experiencing harm within healthcare facilities. This study reports the design, theoretical rationale, and preliminary outcomes of an augmented reality (AR) glasses system for hands-free, self-directed medical procedural training, developed from a human factors and ergonomics (HFE) perspective. The system integrates a see-through head-mounted display (HMD; Epson Moverio BT-40S), bone-conduction earphones (Shokz OpenComm), and an industrial-grade voice recognition application (NEC Solution Innovators), achieving fully hands-free operation compatible with aseptic technique. Content design is grounded in cognitive load theory (CLT) and the cognitive theory of multimedia learning (CTML), extended by neuroscientific evidence on multisensory integration and memory consolidation. More than 40 procedure-specific modules have been developed in-house at Tokyo University of Technology, spanning airway management, vascular access, respiratory therapy, dialysis, and cardiac support. In a four-year longitudinal survey (virtual reality (VR) simulator; n = 286), major satisfaction items consistently exceeded the scale midpoint. In an AR endotracheal suctioning cohort (n = 38/22), procedural flow understanding was rated 3.95/5.0. 
A peer-reviewed randomised controlled trial (Clinical Simulation in Nursing, n = 36) demonstrated significantly superior skill improvement (p < 0.001) and learning motivation (p = 0.001) in the AR group versus textbook self-practice. Principal ergonomic limitations of current HMD hardware—excessive weight, narrow field of view, and absence of medical-grade certification—are documented, and AI-based real-time procedural assessment is identified as a priority for the next research phase. Full article

20 pages, 2019 KB  
Review
Diagnostic Accuracy of Artificial Intelligence in Laryngeal Disorders: An Integrative Review
by Samantha Mairesse, Antonino Maniaci, Giovanni Briganti and Jerome R. Lechien
J. Pers. Med. 2026, 16(6), 301; https://doi.org/10.3390/jpm16060301 - 1 Jun 2026
Viewed by 611
Abstract
Background/Objectives: Laryngeal disorders are among the most prevalent conditions in otolaryngology, yet they remain challenging to diagnose without specialized expertise. Artificial intelligence (AI) systems leveraging machine learning (ML) and deep learning (DL) have demonstrated promising performance for the automatic detection and classification of voice disorders and laryngeal lesions. Methods: This review synthesizes findings from 88 studies published between 2015 and 2025 on AI-based laryngeal disorder detection, considering physioacoustic mechanisms, databases and acquisition protocols, AI architectures and validation strategies, and diagnostic performance. Results: The current literature supports high internal accuracies for binary healthy versus pathological detection (88–99%); meanwhile, performance decreases for higher-level tasks such as pathophysiological category classification and identification, particularly under external validation. From a clinical perspective, clinicians do not infer specific diagnoses from isolated acoustic parameters such as percent jitter or shimmer. Instead, they rely on how these perturbation patterns dynamically evolve during connected speech, where alterations guide perceptual differentiation between underlying disorders. Recurrent sources of bias include dependence on a limited number of historical vowel-based databases, class and demographic imbalance, and limited ecological validity of recording protocols. Additional concerns involve the predominant use of internal cross-validation and insufficient reproducibility or code sharing. Conclusions: Drawing on the literature, an integrative three-level clinical recognition framework is proposed, delineating realistic use cases for AI as a decision-support tool rather than an autonomous diagnostic system. 
Key priorities for future personalized medicine and research are also identified, including diversified multi-center datasets, standardized methodological reporting, rigorous external validation, and compliance with regulatory and ethical requirements for medical AI deployment. Full article

22 pages, 692 KB  
Article
Negotiating (Mis-)Recognition in Physical Education: Interactions Between Teachers and Students with Special Educational Needs in the Area of Emotional and Social Development
by Leefke Brunssen and Valerie Kastrup
Behav. Sci. 2026, 16(5), 793; https://doi.org/10.3390/bs16050793 - 16 May 2026
Cited by 1 | Viewed by 406
Abstract
For students with Special Educational Needs in their Emotional and Social Development (SEN-ESD), school interactions can intensify distrust in adults or foster corrective relational experiences. Physical Education (PE) presents a dual-natured context for this group: while curricula promote social–emotional skill development, students are particularly dependent on sensitive teacher interactions. Yet, no study has examined how recognition, as the prerequisite for inclusion, is negotiated in these teacher–student interactions. This Grounded Theory study reconstructed these negotiation processes and explains them through a Honneth–Prengel recognition framework. Using an iterative design, we conducted and analysed semi-structured interviews with 18 PE teachers and 22 students with SEN-ESD in German regular secondary schools until theoretical saturation. Constant comparative analysis and iterative open and axial coding revealed the dimension of interactional dignity (property: level of affirmation; ranging from low ↔ high). Five patterns detail its constitution through three core domains: relational security, fairness and voice, and valuing individual skills. Interactions are strained by perceptual discrepancies, one concerning what counts as just and the other whose reality is recognised. Furthermore, a systemic grading paradox emerged, which may function as institutional misrecognition and may risk double marginalization for students with SEN-ESD, who are assessed on their very area of need in PE. Findings suggest that addressing this requires structural reform beyond teacher practice. Inclusive PE needs resources for individualised pedagogy, teachers who acknowledge individual needs and realities, and systemic reform of assessment practices. Full article
(This article belongs to the Special Issue Self-Determination and Motivation in Physical Education)

34 pages, 2094 KB  
Review
Sensor-Driven Deep Learning for Smart Home Intelligence: Signal Analysis, Multimodal Perception, and System-Level Applications
by Chenchen Wu, Ziqian Yang and Tao Sun
Sensors 2026, 26(10), 2993; https://doi.org/10.3390/s26102993 - 9 May 2026
Viewed by 833
Abstract
Smart home environments are evolving toward context-aware intelligent systems with the rapid integration of the Internet of Things (IoT), edge computing, and artificial intelligence. In such settings, large volumes of heterogeneous sensor data must be continuously processed to support perception, behavior understanding, and autonomous decision-making. Deep learning has emerged as a key approach for transforming raw sensor signals into structured representations that enable these functions. This review examines recent advances in deep learning for smart home applications from a sensor-driven perspective. Existing studies are organized into five major domains: human activity recognition, health monitoring and assisted living, smart energy management, security monitoring and anomaly detection, and voice interaction and intelligent control. Representative methodological paradigms—including convolutional and recurrent neural networks, Transformers, graph-based learning, multimodal fusion, and deep reinforcement learning—are discussed with emphasis on their roles in signal representation, multimodal integration, and decision-oriented modeling. Despite notable progress, several challenges continue to limit real-world deployment. These include the scarcity of high-quality labeled data, privacy and security concerns associated with continuous sensing, limited generalization across environments and users, constraints of edge devices, and the limited interpretability of model output. Addressing these issues requires advances not only in model design but also in data-efficient learning, privacy-preserving architectures, and system-level integration. Future research is expected to focus on multimodal perception, distributed and edge intelligence, knowledge-enhanced modeling, and human-centered explainable systems. 
By synthesizing current developments and highlighting open challenges, this review aims to support the development of robust and deployable deep learning solutions for next-generation smart home systems. Full article

15 pages, 747 KB  
Article
Multimodal Recognition of Out-of-Distribution Individuals Using Contrastive Learning
by Sergio Garcia, Francisco Gomez-Donoso and Miguel Cazorla
AI 2026, 7(5), 162; https://doi.org/10.3390/ai7050162 - 6 May 2026
Viewed by 761
Abstract
This paper presents an innovative methodology for detecting out-of-distribution individuals based on a multimodal contrastive learning approach. The system combines voice and facial image data by projecting them into a shared representation in the embedding space, enabling accurate identification of previously unseen individuals. This approach overcomes the limitations of traditional methods by providing more robust and consistent detection in dynamic scenarios, using advanced neural networks and optimized contrastive losses. Specifically, the main contribution of this work is the introduction of a multimodal contrastive framework that performs cross-modal consistency verification between facial and vocal representations, enabling reliable detection of out-of-distribution individuals without the need for identity gallery retrieval. Experimental results on multiple datasets highlight the effectiveness of the system, with accuracy above 90% in detecting in-distribution samples in all evaluated cases. Regarding the identification of out-of-distribution cases, the system maintains outstanding performance, achieving values close to 90% on average, with some datasets exceeding 95%. These results underscore its ability to recognize both known identities and handle unknown data, even under challenging conditions. This approach represents a significant advancement in the multimodal recognition of individuals, with potential applications in critical areas such as security, surveillance, and human–computer interaction. Full article
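One way to picture cross-modal consistency verification in a shared embedding space is a similarity check between the face and voice embeddings of the same probe. The sketch below assumes that both embeddings already come from trained encoders, that low cosine similarity signals an out-of-distribution individual, and that a fixed threshold is used; all three are illustrative assumptions, not details confirmed by the abstract.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def is_out_of_distribution(face_embedding: Sequence[float],
                           voice_embedding: Sequence[float],
                           threshold: float = 0.5) -> bool:
    """Flag a probe whose two modalities disagree in the shared space.

    ASSUMPTION: the threshold value and the 'low similarity means OOD'
    rule are hypothetical; a real system would calibrate both on data.
    """
    return cosine_similarity(face_embedding, voice_embedding) < threshold
```

Note that this check needs no stored gallery of enrolled identities, only the two embeddings of the probe itself.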

7 pages, 151 KB  
Proceeding Paper
PhotoVoice and Visual Narrative: A Pedagogical Perspective on Inclusion and Intellectual Disability
by Letizia Pistone, Daniela Pasqualetto and Alessandra Lo Piccolo
Proceedings 2026, 139(1), 16; https://doi.org/10.3390/proceedings2026139016 - 5 May 2026
Viewed by 305
Abstract
The growing interest in visual methodologies within the educational field reflects the need to rethink teaching–learning processes from a participatory, multimodal, and inclusive perspective. Among these approaches, PhotoVoice emerges as a research–action and training strategy that combines photography and autobiographical narration, activating accessible expressive practices centred on subjectivity and lived experience. This contribution presents a theoretical–methodological analysis grounded in pedagogical and visual research literature, aiming to outline an operational framework for the educational application of PhotoVoice in inclusive pathways addressed to individuals with intellectual disabilities. Framed within the paradigm of Visual Education and a pedagogy oriented toward recognition and relationality, PhotoVoice is examined as a pedagogical device capable of fostering symbolic mediation, identity construction, and narrative agency. The photographic image, conceived as an embodied, situated, and relational language, enables access to forms of knowledge often excluded from dominant verbal codes, restoring visibility and epistemic dignity to marginalised subjectivities. The paper delineates key operational phases of the method and identifies core educational objectives, including the strengthening of narrative agency, self-determination, and reflective participation. From this perspective, visual narration is configured as a situated pedagogical practice integrating aesthetics, ethics, and social transformation, capable of generating equitable and meaning-generative learning environments. Within this framework, PhotoVoice shifts inclusion from an abstract principle to a concrete educational process, enabling participants to narrate, interpret, and actively reshape their own learning contexts. Full article
29 pages, 1725 KB  
Article
A User Recognition Methodology Based on Voice Biometrics and Dynamic Clustering for Social Robots
by Arecia Segura-Bencomo, Marcos Maroto-Gómez, Juan José Gamboa-Montero and José Carlos Castillo
Appl. Sci. 2026, 16(9), 4548; https://doi.org/10.3390/app16094548 - 5 May 2026
Viewed by 548
Abstract
Social robots are systems designed to assist people across different fields. During their operation, they have to interact with people with different characteristics and needs. Consequently, correctly recognising the user interacting with the robot facilitates the generation of a personalised experience that satisfies the user’s needs. In robotics, user recognition is typically based on face recognition from image processing and datasets that require retraining the network to include new users. However, some robots, such as pet-like companions, often lack a camera due to reduced dimensions, limited computational resources, or privacy constraints. Additionally, robots can occasionally encounter new users, requiring online recognition to provide a personalised interaction experience. To address these limitations, this article presents a user recognition system based on voice biometrics and dynamic clustering for adaptive social robots. We evaluate a set of open-source models for voice biometric extraction using different clustering algorithms to identify the best combination for our application. The resulting system is implemented in a pet-like robot companion that is used for the affective support of older adults, demonstrating its capabilities in a real-world scenario. The system achieves more than 73% accuracy in recognising users who had previously spoken to the robot and more than 71% success in recognising new users who had not previously interacted with the robot and creating a personal profile for them. However, the system still detects noise, especially when the speaker has never interacted with the robot.
(This article belongs to the Section Robotics and Automation)
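The abstract above describes recognising returning speakers and creating profiles for new ones through dynamic clustering of voice biometrics. The following is a minimal sketch of one common way to do this, not the authors' method: it assumes each utterance has already been converted to a fixed-length embedding vector, and it uses a cosine-similarity threshold with running-mean centroids. The class name `OnlineSpeakerClusters`, the `threshold` value, and the two-dimensional toy embeddings are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class OnlineSpeakerClusters:
    """Hypothetical online clustering of speaker embeddings.

    Each cluster stands for one user profile and is summarised by the
    running mean (centroid) of the embeddings assigned to it.
    """

    def __init__(self, threshold=0.8):
        # Assumed similarity cut-off; a real system would tune this on data.
        self.threshold = threshold
        self.centroids = []
        self.counts = []

    def assign(self, embedding):
        """Return (user_id, is_new) for one utterance embedding."""
        best_id, best_sim = None, -1.0
        for user_id, centroid in enumerate(self.centroids):
            sim = cosine_similarity(embedding, centroid)
            if sim > best_sim:
                best_id, best_sim = user_id, sim

        if best_id is not None and best_sim >= self.threshold:
            # Known user: fold the new embedding into the running mean.
            n = self.counts[best_id]
            self.centroids[best_id] = [
                (c * n + e) / (n + 1)
                for c, e in zip(self.centroids[best_id], embedding)
            ]
            self.counts[best_id] = n + 1
            return best_id, False

        # No sufficiently similar cluster: open a new profile.
        self.centroids.append(list(embedding))
        self.counts.append(1)
        return len(self.centroids) - 1, True
```

Calling `assign` once per utterance yields either an existing profile index or a freshly created one. The threshold governs the trade-off between splitting one speaker across several profiles and merging different speakers into one.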
27 pages, 2923 KB  
Article
An Assistant System for Speaker and Sentiment Recognition Using RAM and a Hybrid AI Model
by Fatma Bozyiğit, İrfan Aygün, Oğuzhan Sağlam, Eren Özcan, Emin Borandağ and Bahadır Karasulu
Electronics 2026, 15(8), 1731; https://doi.org/10.3390/electronics15081731 - 19 Apr 2026
Viewed by 894
Abstract
In the age of remote communication and digital archiving, automated analysis of voice data has become increasingly important in various application areas. Despite significant advances in the field of Automatic Speech Recognition, integrating speaker recognition, textual sentiment analysis, and acoustic sentiment detection within a unified real-time processing pipeline remains a challenging task. Current approaches are often limited to monolithic designs or operate in batch processing modes, which restricts their scalability and real-time applicability. To address this gap, this work proposes a novel feature selection method called RAM, along with a hybrid decision-level merging approach combining Conv1D CNN and AutoML-based models. The proposed hybrid framework enables independent model training and integrates the models' probabilistic outputs through a weighted merging strategy for performance improvement. Furthermore, a scalable microservice-based software architecture has been developed to support real-time processing, feature selection, and model deployment. This design enhances system modularity, flexibility, and integration capability in practical applications. Experimental results show that when the proposed RAM method is used in conjunction with a hybrid AI model, it achieves over 97% accuracy in speaker recognition and over 82% accuracy in emotion classification, even with short audio samples. These findings demonstrate that the proposed approach provides a robust and efficient solution for real-time speech analysis tasks.
(This article belongs to the Special Issue Techniques and Applications of Multimodal Data Fusion)
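The weighted decision-level merging mentioned in the abstract above can be illustrated with a small sketch. This is an assumption-laden example rather than the paper's implementation: it supposes that two independently trained models each emit a class-probability vector for the same audio sample, and that the vectors are combined with a single convex weight. The function name `fuse_probabilities` and the parameter `w_cnn` are hypothetical, and the abstract does not say how the weights are chosen.

```python
def fuse_probabilities(p_cnn, p_automl, w_cnn=0.5):
    """Weighted late fusion of two class-probability vectors.

    p_cnn and p_automl are assumed to be per-class probabilities from two
    independently trained models; w_cnn is a hypothetical weight in [0, 1]
    given to the first model, with (1 - w_cnn) given to the second.
    Returns (predicted_class_index, fused_probabilities).
    """
    if not 0.0 <= w_cnn <= 1.0:
        raise ValueError("w_cnn must lie in [0, 1]")
    if len(p_cnn) != len(p_automl) or not p_cnn:
        raise ValueError("probability vectors must be non-empty and equal in length")

    fused = [w_cnn * a + (1.0 - w_cnn) * b for a, b in zip(p_cnn, p_automl)]

    # Renormalise so the result still sums to 1 even if the inputs are slightly off.
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused probabilities must have a positive sum")
    fused = [p / total for p in fused]

    # The predicted class is the index with the highest fused probability.
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused
```

Because each model is trained separately and only the output probabilities are combined, either model can be retrained or replaced without touching the other. In practice the weight would typically be selected on a validation set.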