Search Results (229)

Search Parameters:
Keywords = audio-visual systems

24 pages, 8344 KiB  
Article
Research and Implementation of Travel Aids for Blind and Visually Impaired People
by Jun Xu, Shilong Xu, Mingyu Ma, Jing Ma and Chuanlong Li
Sensors 2025, 25(14), 4518; https://doi.org/10.3390/s25144518 - 21 Jul 2025
Viewed by 145
Abstract
Blind and visually impaired (BVI) people face significant challenges in perception, navigation, and safety during travel. Existing infrastructure (e.g., blind lanes) and traditional aids (e.g., walking sticks, basic audio feedback) provide limited flexibility and interactivity for complex environments. To solve this problem, we propose a real-time travel assistance system based on deep learning. The hardware comprises an NVIDIA Jetson Nano controller, an Intel D435i depth camera for environmental sensing, and SG90 servo motors for feedback. To address embedded device computational constraints, we developed a lightweight object detection and segmentation algorithm. Key innovations include a multi-scale attention feature extraction backbone, a dual-stream fusion module incorporating the Mamba architecture, and adaptive context-aware detection/segmentation heads. This design ensures high computational efficiency and real-time performance. The system workflow is as follows: (1) the D435i captures real-time environmental data; (2) the processor analyzes this data, converting obstacle distances and path deviations into electrical signals; (3) servo motors deliver vibratory feedback for guidance and alerts. Preliminary tests confirm that the system can effectively detect obstacles and correct path deviations in real time, suggesting its potential to assist BVI users. However, as this is a work in progress, comprehensive field trials with BVI participants are required to fully validate its efficacy. Full article
(This article belongs to the Section Intelligent Sensors)
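
The feedback step of the workflow above (converting obstacle distances and path deviations into servo commands) can be illustrated with a short sketch. This is not the authors' implementation: the distance thresholds, steering gain, and the SG90 angle mapping are assumptions chosen for demonstration.

```python
# Hypothetical mapping from depth-camera readings to servo feedback.
def obstacle_alert_level(distance_m: float) -> int:
    """Map an obstacle distance (metres) to a discrete alert level."""
    if distance_m < 0.5:
        return 3   # imminent: strong, continuous vibration
    if distance_m < 1.5:
        return 2   # near: fast pulsed vibration
    if distance_m < 3.0:
        return 1   # ahead: slow pulsed vibration
    return 0       # clear: no feedback


def deviation_to_servo_angle(deviation_deg: float, gain: float = 1.5) -> float:
    """Convert path deviation (degrees, + right / - left) into a corrective
    servo angle, clamped to the SG90's 0-180 degree range (90 = neutral)."""
    angle = 90.0 + gain * deviation_deg
    return max(0.0, min(180.0, angle))


if __name__ == "__main__":
    for d in (0.4, 1.0, 2.5, 5.0):
        print(f"distance {d} m -> alert level {obstacle_alert_level(d)}")
    print("deviation +20 deg -> servo angle", deviation_to_servo_angle(20.0))
```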

18 pages, 1150 KiB  
Article
Navigating by Design: Effects of Individual Differences and Navigation Modality on Spatial Memory Acquisition
by Xianyun Liu, Yanan Zhang and Baihu Sun
Behav. Sci. 2025, 15(7), 959; https://doi.org/10.3390/bs15070959 - 15 Jul 2025
Viewed by 214
Abstract
Spatial memory is a critical component of spatial cognition, particularly in unfamiliar environments. As navigation systems become integral to daily life, understanding how individuals with varying spatial abilities respond to different navigation modes is increasingly important. This study employed a virtual driving environment to examine how participants with good or poor spatial abilities performed under three navigation modes: visual, audio, and combined audio–visual. A total of 78 participants were divided into a good sense of direction (G-SOD) group and a poor sense of direction (P-SOD) group according to their Santa Barbara Sense of Direction (SBSOD) scores and were randomly assigned to one of the three navigation modes (visual, audio, audio–visual). Participants followed navigation cues and simulated driving to the end point twice during the learning phase, then completed a route-retracing task, a scene-recognition task, and an order-recognition task. Significant main effects were found for both SOD group and navigation mode, with no interaction. G-SOD participants outperformed P-SOD participants in the route-retracing task. The audio navigation mode led to better performance in tasks involving complex spatial decisions, such as those at turn intersections and in the order-recognition task. The accuracy of scene recognition did not differ significantly across SOD groups or navigation modes. These findings suggest that audio navigation may reduce visual distraction and support more effective spatial encoding, and that individual spatial abilities influence navigation performance independently of guidance type. They highlight the importance of aligning navigation modalities with users’ cognitive profiles and support the development of adaptive navigation systems that accommodate individual differences in spatial ability. Full article
(This article belongs to the Section Cognition)
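
A two-way ANOVA on synthetic scores illustrates the 2 (SOD group) × 3 (navigation mode) between-subjects analysis implied above. The column names, effect sizes, and generated data are assumptions; only the design mirrors the study description.

```python
# Sketch of a 2 x 3 between-subjects analysis on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
groups = np.repeat(["G-SOD", "P-SOD"], 39)                 # 78 participants
modes = np.tile(["visual", "audio", "audio_visual"], 26)   # balanced assignment
score = (rng.normal(0.6, 0.1, 78)
         + 0.08 * (groups == "G-SOD")                      # assumed SOD effect
         + 0.05 * (modes == "audio"))                      # assumed mode effect

df = pd.DataFrame({"sod": groups, "mode": modes, "score": score})
model = ols("score ~ C(sod) * C(mode)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # main effects and interaction terms
```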

20 pages, 5700 KiB  
Article
Multimodal Personality Recognition Using Self-Attention-Based Fusion of Audio, Visual, and Text Features
by Hyeonuk Bhin and Jongsuk Choi
Electronics 2025, 14(14), 2837; https://doi.org/10.3390/electronics14142837 - 15 Jul 2025
Viewed by 313
Abstract
Personality is a fundamental psychological trait that exerts a long-term influence on human behavior patterns and social interactions. Automatic personality recognition (APR) has exhibited increasing importance across various domains, including Human–Robot Interaction (HRI), personalized services, and psychological assessments. In this study, we propose a multimodal personality recognition model that classifies the Big Five personality traits by extracting features from three heterogeneous sources: audio processed using Wav2Vec2, video represented as Skeleton Landmark time series, and text encoded through Bidirectional Encoder Representations from Transformers (BERT) and Doc2Vec embeddings. Each modality is handled through an independent Self-Attention block that highlights salient temporal information, and these representations are then summarized and integrated using a late fusion approach to effectively reflect both the inter-modal complementarity and cross-modal interactions. Compared to traditional recurrent neural network (RNN)-based multimodal models and unimodal classifiers, the proposed model achieves an improvement of up to 12 percent in the F1-score. It also maintains a high prediction accuracy and robustness under limited input conditions. Furthermore, a visualization based on t-distributed Stochastic Neighbor Embedding (t-SNE) demonstrates clear distributional separation across the personality classes, enhancing the interpretability of the model and providing insights into the structural characteristics of its latent representations. To support real-time deployment, a lightweight thread-based processing architecture is implemented, ensuring computational efficiency. By leveraging deep learning-based feature extraction and the Self-Attention mechanism, we present a novel personality recognition framework that balances performance with interpretability. The proposed approach establishes a strong foundation for practical applications in HRI, counseling, education, and other interactive systems that require personalized adaptation. Full article
(This article belongs to the Special Issue Explainable Machine Learning and Data Mining)
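
A minimal PyTorch sketch of the fusion idea described above: one self-attention block per modality, temporal pooling, then concatenation into a shared classification head. The feature dimensions, the single attention layer per modality, and the head size are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Project a modality's frame sequence, apply self-attention, and pool."""
    def __init__(self, in_dim: int, embed_dim: int = 128, heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)

    def forward(self, x):                      # x: (batch, time, in_dim)
        h = self.proj(x)
        h, _ = self.attn(h, h, h)              # self-attention over time
        return h.mean(dim=1)                   # summarise the sequence


class LateFusionBigFive(nn.Module):
    def __init__(self, audio_dim=768, video_dim=99, text_dim=768):
        super().__init__()
        self.audio = ModalityEncoder(audio_dim)   # e.g. Wav2Vec2 frame features
        self.video = ModalityEncoder(video_dim)   # e.g. skeleton landmark series
        self.text = ModalityEncoder(text_dim)     # e.g. BERT token states
        self.head = nn.Linear(3 * 128, 5)         # one logit per Big Five trait

    def forward(self, a, v, t):
        fused = torch.cat([self.audio(a), self.video(v), self.text(t)], dim=-1)
        return self.head(fused)


model = LateFusionBigFive()
logits = model(torch.randn(2, 200, 768), torch.randn(2, 200, 99),
               torch.randn(2, 50, 768))
print(logits.shape)   # torch.Size([2, 5])
```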

19 pages, 1779 KiB  
Article
Through the Eyes of the Viewer: The Cognitive Load of LLM-Generated vs. Professional Arabic Subtitles
by Hussein Abu-Rayyash and Isabel Lacruz
J. Eye Mov. Res. 2025, 18(4), 29; https://doi.org/10.3390/jemr18040029 - 14 Jul 2025
Viewed by 255
Abstract
As streaming platforms adopt artificial intelligence (AI)-powered subtitle systems to satisfy global demand for instant localization, the cognitive impact of these automated translations on viewers remains largely unexplored. This study used a web-based eye-tracking protocol to compare the cognitive load that GPT-4o-generated Arabic subtitles impose with that of professional human translations among 82 native Arabic speakers who viewed a 10 min episode (“Syria”) from the BBC comedy drama series State of the Union. Participants were randomly assigned to view the same episode with either professionally produced Arabic subtitles (Amazon Prime’s human translations) or machine-generated GPT-4o Arabic subtitles. In a between-subjects design, with English proficiency entered as a moderator, we collected fixation count, mean fixation duration, gaze distribution, and attention concentration (K-coefficient) as indices of cognitive processing. GPT-4o subtitles raised cognitive load on every metric; viewers produced 48% more fixations in the subtitle area, recorded 56% longer fixation durations, and spent 81.5% more time reading the automated subtitles than the professional subtitles. The subtitle area K-coefficient tripled (0.10 to 0.30), a shift from ambient scanning to focal processing. Viewers with advanced English proficiency showed the largest disruptions, which indicates that higher linguistic competence increases sensitivity to subtle translation shortcomings. These results challenge claims that large language models (LLMs) lighten viewer burden; despite fluent surface quality, GPT-4o subtitles demand far more cognitive resources than expert human subtitles and therefore reinforce the need for human oversight in audiovisual translation (AVT) and media accessibility. Full article
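
The attention-concentration index reported above (the K-coefficient rising from 0.10 to 0.30) is the widely used ambient/focal coefficient: the mean difference between z-scored fixation durations and z-scored amplitudes of the following saccades, with positive values indicating focal processing. The sketch below computes it on made-up sample data.

```python
import numpy as np


def k_coefficient(fix_durations, saccade_amplitudes):
    """Mean of z-scored fixation duration minus z-scored amplitude of the
    following saccade (one fewer saccade than fixations)."""
    d = np.asarray(fix_durations, dtype=float)
    a = np.asarray(saccade_amplitudes, dtype=float)
    n = min(len(d) - 1, len(a))
    zd = (d - d.mean()) / d.std(ddof=1)
    za = (a - a.mean()) / a.std(ddof=1)
    return float(np.mean(zd[:n] - za[:n]))


durations = [180, 220, 260, 300, 240, 210]   # fixation durations, ms (made up)
amplitudes = [4.0, 2.5, 1.2, 1.0, 3.5]       # saccade amplitudes, degrees (made up)
print(round(k_coefficient(durations, amplitudes), 3))
```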

21 pages, 2816 KiB  
Article
AutoStageMix: Fully Automated Stage Cross-Editing System Utilizing Facial Features
by Minjun Oh, Howon Jang and Daeho Lee
Appl. Sci. 2025, 15(13), 7613; https://doi.org/10.3390/app15137613 - 7 Jul 2025
Viewed by 278
Abstract
StageMix is a video compilation of multiple stage performances of the same song, edited together seamlessly at appropriate editing points. However, generating a StageMix requires specialized editing techniques and is a considerably time-consuming process. To address this challenge, we introduce AutoStageMix, an automated StageMix generation system designed to perform all processes automatically. The system is structured into five principal stages: preprocessing, feature extraction, transition point identification, editing path determination, and StageMix generation. The process begins with audio analysis to synchronize the sequences across all input videos, followed by frame extraction. Facial features are then extracted from each video frame. Next, transition points are identified, which form the basis for face-based transitions, inter-stage cuts, and intra-stage cuts. Subsequently, a cost function is defined to facilitate the creation of cross-edited sequences, and the optimal editing path is computed using Dijkstra’s algorithm to minimize the total editing cost. Finally, the StageMix is generated by applying editing effects tailored to each transition type, aiming to maximize visual appeal. Experimental results suggest that our method generally achieves lower NME scores than existing StageMix generation approaches across multiple test songs. In a user study with 21 participants, AutoStageMix achieved viewer satisfaction comparable to that of professionally edited StageMixes, with no statistically significant difference between the two. AutoStageMix enables users to produce StageMixes effortlessly and efficiently by eliminating the need for manual editing. Full article
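
The editing-path step above reduces to a shortest-path search: transition candidates form a graph whose edge weights come from the cost function, and Dijkstra's algorithm returns the cheapest sequence of cuts. The toy graph and costs below are assumptions for illustration; the paper's actual cost terms are not reproduced.

```python
import heapq


def cheapest_edit_path(graph, start, goal):
    """graph: {node: [(neighbour, cost), ...]}; returns (total_cost, path)."""
    queue = [(0.0, start, [start])]
    best = {start: 0.0}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        for nxt, c in graph.get(node, []):
            new_cost = cost + c
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt, path + [nxt]))
    return float("inf"), []


# Toy transition graph: nodes are (stage, segment) pairs, edges are candidate cuts.
transitions = {
    ("A", 0): [(("A", 1), 0.1), (("B", 1), 0.4)],
    ("A", 1): [(("B", 2), 0.2), (("A", 2), 0.3)],
    ("B", 1): [(("B", 2), 0.1)],
    ("A", 2): [(("END", 3), 0.0)],
    ("B", 2): [(("END", 3), 0.0)],
}
print(cheapest_edit_path(transitions, ("A", 0), ("END", 3)))
```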

16 pages, 1166 KiB  
Article
Research on Acoustic Scene Classification Based on Time–Frequency–Wavelet Fusion Network
by Fengzheng Bi and Lidong Yang
Sensors 2025, 25(13), 3930; https://doi.org/10.3390/s25133930 - 24 Jun 2025
Viewed by 358
Abstract
Acoustic scene classification aims to recognize the scenes corresponding to sound signals in the environment, but audio differences from different cities and devices can affect the model’s accuracy. In this paper, a time–frequency–wavelet fusion network is proposed to improve model performance by focusing on three dimensions: the time dimension of the spectrogram, the frequency dimension, and the high- and low-frequency information extracted by a wavelet transform through a time–frequency–wavelet module. Multidimensional information was fused through the gated temporal–spatial attention unit, and the visual state space module was introduced to enhance the contextual modeling capability of audio sequences. In addition, Kolmogorov–Arnold network layers were used in place of multilayer perceptrons in the classifier part. The experimental results show that the proposed method achieves a 56.16% average accuracy on the TAU Urban Acoustic Scenes 2022 mobile development dataset, which is an improvement of 6.53% compared to the official baseline system. This performance improvement demonstrates the effectiveness of the model in complex scenarios. In addition, the accuracy of the proposed method on the UrbanSound8K dataset reached 97.60%, which is significantly better than the existing methods, further verifying the generalization ability of the proposed model in the acoustic scene classification task. Full article
(This article belongs to the Section Intelligent Sensors)
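
A rough sketch of the three input views named above: a time profile, a frequency profile, and low/high-frequency sub-bands obtained with a wavelet transform of the spectrogram. The Haar wavelet, the single decomposition level, and the random stand-in spectrogram are assumptions for demonstration, not the paper's configuration.

```python
import numpy as np
import pywt   # PyWavelets

spec = np.random.rand(128, 431)          # (mel bins, time frames), stand-in data

# Time and frequency "views": energy summaries along each axis.
time_profile = spec.mean(axis=0)         # -> (431,)
freq_profile = spec.mean(axis=1)         # -> (128,)

# Wavelet view: one-level 2-D DWT splits the spectrogram into an approximation
# (low-frequency) band and three detail (high-frequency) bands.
low, (detail_h, detail_v, detail_d) = pywt.dwt2(spec, "haar")
print(time_profile.shape, freq_profile.shape, low.shape, detail_d.shape)
```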

34 pages, 9431 KiB  
Article
Gait Recognition via Enhanced Visual–Audio Ensemble Learning with Decision Support Methods
by Ruixiang Kan, Mei Wang, Tian Luo and Hongbing Qiu
Sensors 2025, 25(12), 3794; https://doi.org/10.3390/s25123794 - 18 Jun 2025
Viewed by 403
Abstract
Gait is considered a valuable biometric feature, and it is essential for uncovering the latent information embedded within gait patterns. Gait recognition methods are expected to serve as significant components in numerous applications. However, existing gait recognition methods exhibit limitations in complex scenarios. To address these, we construct a dual-Kinect V2 system that focuses more on gait skeleton joint data and related acoustic signals. This setup lays a solid foundation for subsequent methods and updating strategies. The core framework consists of enhanced ensemble learning methods and Dempster–Shafer Evidence Theory (D-SET). Our recognition methods serve as the foundation, and the decision support mechanism is used to evaluate the compatibility of various modules within our system. On this basis, our main contributions are as follows: (1) an improved gait skeleton joint AdaBoost recognition method based on Circle Chaotic Mapping and Gramian Angular Field (GAF) representations; (2) a data-adaptive gait-related acoustic signal AdaBoost recognition method based on GAF and a Parallel Convolutional Neural Network (PCNN); and (3) an amalgamation of the Triangulation Topology Aggregation Optimizer (TTAO) and D-SET, providing a robust and innovative decision support mechanism. These collaborations improve the overall recognition accuracy and demonstrate their considerable application values. Full article
(This article belongs to the Section Intelligent Sensors)
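
Both recognition branches above rely on Gramian Angular Field (GAF) images. The sketch below shows the standard GAF (summation) encoding of a 1-D signal; the toy sinusoid standing in for a gait-related signal is an assumption for illustration.

```python
import numpy as np


def gramian_angular_summation_field(x):
    """Encode a 1-D series as a 2-D image of pairwise angular sums."""
    x = np.asarray(x, dtype=float)
    x = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0   # rescale to [-1, 1]
    x = np.clip(x, -1.0, 1.0)
    # GASF[i, j] = cos(phi_i + phi_j) = x_i*x_j - sqrt(1-x_i^2)*sqrt(1-x_j^2)
    comp = np.sqrt(1.0 - x ** 2)
    return np.outer(x, x) - np.outer(comp, comp)


signal = np.sin(np.linspace(0, 4 * np.pi, 64))   # stand-in gait-related signal
gaf_image = gramian_angular_summation_field(signal)
print(gaf_image.shape)   # (64, 64), usable as input to an image classifier
```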

24 pages, 9841 KiB  
Article
The Audiovisual Assessment of Monocultural Vegetation Based on Facial Expressions
by Mary Nwankwo, Qi Meng, Da Yang and Mengmeng Li
Forests 2025, 16(6), 937; https://doi.org/10.3390/f16060937 - 3 Jun 2025
Viewed by 473
Abstract
Plant vegetation is nature’s symphony, offering sensory experiences that influence ecological systems, human well-being, and emotional states and significantly impact human societal progress. This study investigated the emotional and perceptual impacts of specific monocultural vegetation (palm and rubber) in Nigeria, through audiovisual interactions using facial expression analysis, soundscape, and visual perception assessments. The findings reveal three key outcomes: (1) Facial expressions varied significantly by vegetation type and time of day, with higher “happy” valence values recorded for palm vegetation in the morning (mean = 0.39), and for rubber vegetation in the afternoon (mean = 0.37). (2) Gender differences in emotional response were observed, as male participants exhibited higher positive expressions (mean = 0.40) compared to females (mean = 0.33). (3) Perceptual ratings indicated that palm vegetation was perceived as more visually beautiful (mean = 4.05), whereas rubber vegetation was rated as having a more pleasant soundscape (mean = 4.10). However, facial expressions showed weak correlations with soundscape and visual perceptions, suggesting that other cognitive or sensory factors may be more influential. This study addresses a critical gap in soundscape research for monocultural vegetation and offers valuable insights for urban planners, environmental psychologists, and restorative landscape designs. Full article
(This article belongs to the Special Issue Soundscape in Urban Forests—2nd Edition)

25 pages, 5837 KiB  
Article
Analysis of Facial Cues for Cognitive Decline Detection Using In-the-Wild Data
by Fatimah Alzahrani, Steve Maddock and Heidi Christensen
Appl. Sci. 2025, 15(11), 6267; https://doi.org/10.3390/app15116267 - 3 Jun 2025
Viewed by 470
Abstract
The development of automatic methods for early cognitive impairment (CI) detection has a crucial role to play in helping people obtain suitable treatment and care. Video-based analysis offers a promising, low-cost alternative to resource-intensive clinical assessments. This paper investigates visual features (eye blink rate (EBR), head turn rate (HTR), and head movement statistical features (HMSFs)) for distinguishing between neurodegenerative disorders (NDs), mild cognitive impairment (MCI), functional memory disorders (FMDs), and healthy controls (HCs). Following prior work, we improve the multiple thresholds (MTs) approach specifically for EBR calculation to enhance performance and robustness, while the HTR and HMSFs are extracted using methods from previous work. The EBR, HTR, and HMSFs are evaluated using an in-the-wild video dataset captured in challenging environments. This method leverages clinically validated cues and automatically extracts features to enable classification. Experiments show that the proposed approach achieves competitive performance in distinguishing between ND, MCI, FMD, and HCs on in-the-wild datasets, with results comparable to audiovisual-based methods conducted in a lab-controlled environment. The findings highlight the potential of visual-based approaches to complement existing diagnostic tools and provide an efficient home-based monitoring system. This work advances the field by addressing traditional limitations and offering a scalable, cost-effective solution for early detection. Full article
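
As a baseline for the eye blink rate (EBR) feature above, the sketch below uses the common eye-aspect-ratio heuristic with a single threshold; the paper's improved multiple-thresholds (MTs) approach is not reproduced, and the threshold value, landmark layout, and synthetic signal are assumptions.

```python
import numpy as np


def eye_aspect_ratio(eye):
    """eye: (6, 2) array of landmarks p1..p6 around one eye; low values = closed."""
    p1, p2, p3, p4, p5, p6 = eye
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = 2.0 * np.linalg.norm(p1 - p4)
    return vertical / horizontal


def blinks_per_minute(ear_series, fps, threshold=0.21):
    """Count falling-edge crossings of the EAR series below the threshold."""
    below = np.asarray(ear_series) < threshold
    blinks = int(np.sum(below[1:] & ~below[:-1]))
    return 60.0 * blinks / (len(ear_series) / fps)


ears = 0.3 + 0.02 * np.random.randn(300)   # 10 s of synthetic EAR values at 30 fps
ears[100:104] = 0.1                        # one simulated blink
print(round(blinks_per_minute(ears, fps=30), 2))
```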

21 pages, 813 KiB  
Review
Light, Sound, and Melatonin: Investigating Multisensory Pathways for Visual Restoration
by Dario Rusciano
Medicina 2025, 61(6), 1009; https://doi.org/10.3390/medicina61061009 - 28 May 2025
Cited by 1 | Viewed by 808
Abstract
Multisensory integration is fundamental for coherent perception and interaction with the environment. While cortical mechanisms of multisensory convergence are well studied, emerging evidence implicates specialized retinal ganglion cells—particularly melanopsin-expressing intrinsically photosensitive retinal ganglion cells (ipRGCs)—in crossmodal processing. This review explores how hierarchical brain networks (e.g., superior colliculus, parietal cortex) and ipRGCs jointly shape perception and behavior, focusing on their convergence in multisensory plasticity. We highlight ipRGCs as gatekeepers of environmental light cues. Their anatomical projections to multisensory areas like the superior colliculus are well established, although direct evidence for their role in human audiovisual integration remains limited. Through melanopsin signaling and subcortical projections, they may modulate downstream multisensory processing, potentially enhancing the salience of crossmodal inputs. A key theme is the spatiotemporal synergy between melanopsin and melatonin: melanopsin encodes light, while melatonin fine-tunes ipRGC activity and synaptic plasticity, potentially creating time-sensitive rehabilitation windows. However, direct evidence linking ipRGCs to audiovisual rehabilitation remains limited, with their role primarily inferred from anatomical and functional studies. Future implementations should prioritize quantitative optical metrics (e.g., melanopic irradiance, spectral composition) to standardize light-based interventions and enhance reproducibility. Nonetheless, we propose a translational framework combining multisensory stimuli (e.g., audiovisual cues) with circadian-timed melatonin to enhance recovery in visual disorders like hemianopia and spatial neglect. By bridging retinal biology with systems neuroscience, this review redefines the retina’s role in multisensory processing and offers novel, mechanistically grounded strategies for neurorehabilitation. Full article
(This article belongs to the Section Ophthalmology)

39 pages, 13529 KiB  
Article
Intelligent Monitoring of BECS Conveyors via Vision and the IoT for Safety and Separation Efficiency
by Shohreh Kia and Benjamin Leiding
Appl. Sci. 2025, 15(11), 5891; https://doi.org/10.3390/app15115891 - 23 May 2025
Viewed by 641
Abstract
Conveyor belts are critical in various industries, particularly in the barrier eddy current separator systems used in recycling processes. However, hidden issues, such as belt misalignment, excessive heat that can lead to fire hazards, and the presence of sharp or irregularly shaped materials, reduce operational efficiency and pose serious threats to the health and safety of personnel on the production floor. This study presents an intelligent monitoring and protection system for barrier eddy current separator conveyor belts designed to safeguard machinery and human workers simultaneously. In this system, a thermal camera continuously monitors the surface temperature of the conveyor belt, especially in the area above the magnetic drum—where unwanted ferromagnetic materials can lead to abnormal heating and potential fire risks. The system detects temperature anomalies in this critical zone. The early detection of these risks triggers audio–visual alerts and IoT-based warning messages that are sent to technicians, which is vital in preventing fire-related injuries and minimizing emergency response time. Simultaneously, a machine vision module autonomously detects and corrects belt misalignment, eliminating the need for manual intervention and reducing the risk of worker exposure to moving mechanical parts. Additionally, a line-scan camera integrated with the YOLOv11 AI model analyses the shape of materials on the conveyor belt, distinguishing between rounded and sharp-edged objects. This system enhances the accuracy of material separation and reduces the likelihood of injuries caused by the impact or ejection of sharp fragments during maintenance or handling. The YOLOv11n-seg model implemented in this system achieved a segmentation mask precision of 84.8 percent and a recall of 84.5 percent in industry evaluations. Based on this high segmentation accuracy and consistent detection of sharp particles, the system is expected to substantially reduce the frequency of sharp object collisions with the BECS conveyor belt, thereby minimizing mechanical wear and potential safety hazards. By integrating these intelligent capabilities into a compact, cost-effective solution suitable for real-world recycling environments, the proposed system contributes significantly to improving workplace safety and equipment longevity. This project demonstrates how digital transformation and artificial intelligence can play a pivotal role in advancing occupational health and safety in modern industrial production. Full article
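
The thermal-monitoring step described above reduces to flagging pixels in the zone above the magnetic drum that exceed a critical temperature. In the sketch below, the threshold, zone coordinates, and the printed status (standing in for the audio-visual and IoT alerts) are assumptions for illustration.

```python
import numpy as np

ALERT_THRESHOLD_C = 75.0                      # assumed critical belt temperature
DRUM_ZONE = (slice(40, 80), slice(0, 160))    # assumed (rows, cols) above the drum


def check_thermal_frame(frame_c: np.ndarray) -> dict:
    """Return the hottest temperature and hotspot count inside the drum zone."""
    zone = frame_c[DRUM_ZONE]
    hotspots = np.argwhere(zone > ALERT_THRESHOLD_C)
    return {
        "max_temp_c": float(zone.max()),
        "hotspot_pixels": int(len(hotspots)),
        "alert": bool(len(hotspots) > 0),
    }


frame = 35.0 + 2.0 * np.random.randn(120, 160)   # synthetic thermal frame (deg C)
frame[50:55, 30:35] = 90.0                       # simulated overheating patch
status = check_thermal_frame(frame)
if status["alert"]:
    # In the described system this would trigger audio-visual alarms and an
    # IoT message to technicians; here we only print the status.
    print("ALERT:", status)
```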

30 pages, 1008 KiB  
Article
Early and Late Fusion for Multimodal Aggression Prediction in Dementia Patients: A Comparative Analysis
by Ioannis Galanakis, Rigas Filippos Soldatos, Nikitas Karanikolas, Athanasios Voulodimos, Ioannis Voyiatzis and Maria Samarakou
Appl. Sci. 2025, 15(11), 5823; https://doi.org/10.3390/app15115823 - 22 May 2025
Viewed by 640
Abstract
Aggression in patients with dementia poses significant caregiving and clinical challenges. In this work, two fusion approaches, Early Fusion and Late Fusion, were compared for classifying aggression from audio and visual signals. Early Fusion integrates the extracted features of the two modalities into one dataset before classification, while Late Fusion combines the prediction probabilities of standalone audio and visual classifiers with a meta-classifier. Both models were tested using a Random Forest classifier with five-fold cross-validation, and performance was compared on accuracy, precision, recall, F1-score, ROC-AUC, and inference time. The results show that Late Fusion is superior to Early Fusion in accuracy (0.876 vs. 0.828), recall (0.914 vs. 0.818), F1-score (0.867 vs. 0.835), and ROC-AUC (0.970 vs. 0.922), making it more suitable for high-sensitivity use cases such as healthcare and security. However, Early Fusion exhibited higher precision (0.852 vs. 0.824), indicating that it is preferable when minimizing false positives is the priority. Paired t-tests indicate that only the difference in precision is statistically significant, in favor of Early Fusion. Late Fusion is also slightly slower at inference but remains suitable for use in real-time systems. These findings provide useful guidance on multimodal fusion strategies and their applicability to detecting aggressive behavior, and can contribute to the development of efficient monitoring systems for dementia care. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
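
A minimal sketch of the Late Fusion scheme compared above: standalone audio and visual Random Forests produce out-of-fold class probabilities, which are stacked as inputs to a meta-classifier. The synthetic features and forest sizes are assumptions; the paper's feature extraction is not reproduced.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, cross_val_score

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)                          # 1 = aggressive episode (synthetic)
X_audio = rng.normal(size=(n, 20)) + 0.5 * y[:, None]
X_video = rng.normal(size=(n, 30)) + 0.3 * y[:, None]

audio_clf = RandomForestClassifier(n_estimators=100, random_state=0)
video_clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Out-of-fold probabilities avoid leaking training labels into the meta-classifier.
p_audio = cross_val_predict(audio_clf, X_audio, y, cv=5, method="predict_proba")
p_video = cross_val_predict(video_clf, X_video, y, cv=5, method="predict_proba")

meta_X = np.hstack([p_audio, p_video])             # Late Fusion input
meta_clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(meta_clf, meta_X, y, cv=5).mean())   # mean fold accuracy
```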

33 pages, 10073 KiB  
Article
A Versatile Tool for Haptic Feedback Design Towards Enhancing User Experience in Virtual Reality Applications
by Vasilije Bursać and Dragan Ivetić
Appl. Sci. 2025, 15(10), 5419; https://doi.org/10.3390/app15105419 - 13 May 2025
Viewed by 859
Abstract
Fifteen years of experience teaching VR system development have taught us that haptic feedback must be integrated into VR systems with greater sophistication, alongside the already realistic high-fidelity visual and audio feedback. The third generation of students is enhancing VR interactive experiences by incorporating haptic feedback through traditional, proven, commercially available gamepad controllers. Insights gained through this process contributed to the development of a versatile Unity custom editor tool, which is the focus of this article. The tool supports a wide range of use cases, enabling the visual, parametric, and descriptive creation of reusable haptic effects. To enhance productivity in commercial development, it supports the creation of haptic and haptic/audio stimulus libraries, which can be further expanded and combined based on object-oriented principles. Additionally, the tool allows for the definition of specific areas within the virtual space where these stimuli can be experienced, depending on the virtual object the avatar holds and the activities it performs. This intuitive platform allows reusable haptic effects to be designed through a graphical editor, audio conversion, programmatic scripting, and AI-powered guidance. The sophistication and usability of the tool have been demonstrated through several student VR projects across various application areas. Full article

15 pages, 4273 KiB  
Article
Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models
by Jamsher Bhanbhro, Asif Aziz Memon, Bharat Lal, Shahnawaz Talpur and Madeha Memon
Signals 2025, 6(2), 22; https://doi.org/10.3390/signals6020022 - 9 May 2025
Cited by 1 | Viewed by 1583
Abstract
Speech Emotion Recognition (SER) technology helps computers understand human emotions in speech, which fills a critical niche in advancing human–computer interaction and mental health diagnostics. The primary objective of this study is to enhance SER accuracy and generalization through innovative deep learning models. Despite its importance in various fields like human–computer interaction and mental health diagnosis, accurately identifying emotions from speech can be challenging due to differences in speakers, accents, and background noise. The work proposes two innovative deep learning models to improve SER accuracy: a CNN-LSTM model and an Attention-Enhanced CNN-LSTM model. These models were tested on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), collected between 2015 and 2018, which comprises 1440 audio files of male and female actors expressing eight emotions. Both models achieved impressive accuracy rates of over 96% in classifying emotions into eight categories. By comparing the CNN-LSTM and Attention-Enhanced CNN-LSTM models, this study offers comparative insights into modeling techniques, contributes to the development of more effective emotion recognition systems, and offers practical implications for real-time applications in healthcare and customer service. Full article
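
A condensed PyTorch sketch of the Attention-Enhanced CNN-LSTM idea: a 1-D convolutional block over acoustic features, a bidirectional LSTM, and learned attention pooling before the 8-way emotion classifier. The feature dimensionality and layer sizes are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class AttnCNNLSTM(nn.Module):
    def __init__(self, n_feats=40, n_classes=8):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_feats, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(64, 128, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(256, 1)          # scores each time step
        self.out = nn.Linear(256, n_classes)

    def forward(self, x):                      # x: (batch, time, n_feats)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)                    # (batch, time/2, 256)
        w = torch.softmax(self.attn(h), dim=1)
        context = (w * h).sum(dim=1)           # attention-weighted pooling
        return self.out(context)


model = AttnCNNLSTM()
print(model(torch.randn(4, 300, 40)).shape)    # torch.Size([4, 8])
```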

23 pages, 1213 KiB  
Article
Mobile-AI-Based Docent System: Navigation and Localization for Visually Impaired Gallery Visitors
by Hyeyoung An, Woojin Park, Philip Liu and Soochang Park
Appl. Sci. 2025, 15(9), 5161; https://doi.org/10.3390/app15095161 - 6 May 2025
Viewed by 487
Abstract
Smart guidance systems in museums and galleries are now essential for delivering quality user experiences. Visually impaired visitors face significant barriers when navigating galleries due to existing smart guidance systems’ dependence on visual cues like QR codes, manual numbering, or static beacon positioning. These traditional methods often fail to provide adaptive navigation and meaningful content delivery tailored to their needs. In this paper, we propose a novel Mobile-AI-based Smart Docent System that seamlessly integrates real-time navigation and depth of guide services to enrich gallery experiences for visually impaired users. Our system leverages camera-based on-device processing and adaptive BLE-based localization to ensure accurate path guidance and real-time obstacle avoidance. An on-device object detection model reduces delays from large visual data processing, while BLE beacons, fixed across the gallery, dynamically update location IDs for better accuracy. The system further refines positioning by analyzing movement history and direction to minimize navigation errors. By intelligently modulating audio content based on user movement—whether passing by, approaching for more details, or leaving mid-description—the system offers personalized, context-sensitive interpretations while eliminating unnecessary audio clutter. Experimental validation conducted in an authentic gallery environment yielded empirical evidence of user satisfaction, affirming the efficacy of our methodological approach in facilitating enhanced navigational experiences for visually impaired individuals. These findings substantiate the system’s capacity to enable more autonomous, secure, and enriched cultural engagement for visually impaired individuals within complex indoor environments. Full article
(This article belongs to the Special Issue IoT in Smart Cities and Homes, 2nd Edition)
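
A simplified sketch of the BLE-based localization idea above: pick the strongest fixed beacon in each scan and smooth the decision over a short movement history to estimate the visitor's current zone while suppressing jitter. The beacon IDs, window size, and majority-vote rule are illustrative assumptions.

```python
from collections import Counter, deque


class ZoneEstimator:
    def __init__(self, window: int = 5):
        self.recent_zones = deque(maxlen=window)

    def update(self, rssi_by_beacon: dict) -> str:
        """rssi_by_beacon: {'artwork-07': -62, ...} in dBm (higher = closer)."""
        nearest = max(rssi_by_beacon, key=rssi_by_beacon.get)
        self.recent_zones.append(nearest)
        # Majority vote over recent scans reduces spurious zone flips.
        return Counter(self.recent_zones).most_common(1)[0][0]


est = ZoneEstimator(window=3)
readings = [
    {"artwork-07": -60, "artwork-08": -75},
    {"artwork-07": -64, "artwork-08": -70},
    {"artwork-07": -80, "artwork-08": -58},   # visitor moves toward artwork 08
]
for scan in readings:
    print(est.update(scan))   # stays in artwork-07 until 08 wins the vote
```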
