Search Results (76)

Search Parameters:
Keywords = voice activity detection

15 pages, 290 KiB  
Article
Body Weight Loss Experience Among Adults from Saudi Arabia and Assessment of Factors Associated with Weight Regain: A Cross-Sectional Study
by Ibrahim M. Gosadi
Nutrients 2025, 17(14), 2341; https://doi.org/10.3390/nu17142341 - 17 Jul 2025
Viewed by 579
Abstract
Background/Objectives: Weight loss and its subsequent regain pose significant challenges for those dealing with overweight and obesity. This study explores weight loss strategies among adults in Saudi Arabia and evaluates factors linked to weight regain. Methods: This cross-sectional study focused on adults residing in Jazan, located in southwest Saudi Arabia. Data collection was conducted using a self-administered questionnaire that assessed participants’ demographics, medical history, perceptions of body weight, weight loss methods, and the incidence of weight regain. Logistic regression was used to determine whether there were statistically significant differences related to the occurrence of weight regain. Results: A total of 368 participants reported efforts to lose weight over the past 3 years. The average age of these participants was 32.7 years (standard deviation: 11.3), and the gender distribution was almost equal. The majority of the sample (65%) voiced dissatisfaction with their body weight. Some participants employed a combination of weight loss methods, with exercise, reduced food intake, and intermittent fasting being the most frequently mentioned. The findings also indicate that a minority sought professional help, whether from a physician or a nutritionist. Over 90% claimed to have successfully lost weight at least once during their attempts, but more than half (139 individuals) experienced weight regain following their weight loss efforts. Within the univariate logistic regression, higher odds ratios of weight regain were detected among men, older participants, those living in rural areas, individuals with higher levels of education, employed persons or business owners, those with higher monthly incomes, smokers, khat chewers, and those diagnosed with a chronic condition (p values < 0.05). However, the multivariate logistic regression revealed that only residence, monthly income, smoking status, and being diagnosed with a chronic disease remained statistically significant as predictors of weight regain after adjusting for other variables (p values < 0.05). Conclusions: These findings highlight the significance of incorporating weight regain prevention into body weight management for individuals dealing with overweight and obesity. Further research is needed to evaluate specific dietary, physical activity, and psychological factors that may increase the risk of weight regain in certain participants. Full article
(This article belongs to the Special Issue The Role of Physical Activity and Diet on Weight Management)
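
As a concrete illustration of the statistical approach described in the abstract, the sketch below fits univariate and multivariate logistic regression models for a binary weight-regain outcome. The variable names and the synthetic data are assumptions for illustration only, not the study's dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the survey data (hypothetical variable names)
rng = np.random.default_rng(1)
n = 368
df = pd.DataFrame({
    "regain": rng.integers(0, 2, n),
    "residence": rng.choice(["urban", "rural"], n),
    "monthly_income": rng.normal(8000, 3000, n),
    "smoker": rng.integers(0, 2, n),
    "chronic_disease": rng.integers(0, 2, n),
})

# Univariate model: a single candidate predictor of weight regain
uni = smf.logit("regain ~ C(residence)", data=df).fit(disp=0)
print(np.exp(uni.params))    # unadjusted odds ratio

# Multivariate model: adjust for the covariates reported as significant
multi = smf.logit(
    "regain ~ C(residence) + monthly_income + C(smoker) + C(chronic_disease)",
    data=df,
).fit(disp=0)
print(np.exp(multi.params))  # adjusted odds ratios
```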

26 pages, 32088 KiB  
Article
Fall Detection Algorithm Using Enhanced HRNet Combined with YOLO
by Huan Shi, Xiaopeng Wang and Jia Shi
Sensors 2025, 25(13), 4128; https://doi.org/10.3390/s25134128 - 2 Jul 2025
Viewed by 551
Abstract
To address the issues of insufficient feature extraction, reliance on a single fall-judgment criterion, and poor real-time performance of traditional fall detection algorithms in occluded scenes, a top-down fall detection algorithm based on improved YOLOv8 combined with BAM-HRNet is proposed. First, the Shufflenetv2 network is used to make the backbone of YOLOv8 lightweight, and a mixed attention mechanism network is connected stage-wise at the neck to enable the network to better obtain human body position information. Second, the HRNet network integrated with the channel attention mechanism can effectively extract the position information of key points. Then, by analyzing the position information of skeletal key points, the descent speed of the center of mass, the angular velocity between the trunk and the ground, and the human body height-to-width ratio are jointly used as the discriminant basis for identifying fall behaviors. In addition, when a suspected fall is detected, the system automatically activates a voice inquiry mechanism to improve the accuracy of fall judgment. The results show that the accuracy of the object detection module on the COCO and Pascal VOC datasets is 64.1% and 61.7%, respectively. The accuracy of the key point detection module on the COCO and OCHuman datasets reaches 73.49% and 70.11%, respectively. On the fall detection datasets, the accuracy of the proposed algorithm exceeds 95% and the frame rate reaches 18.1 fps. Compared with traditional algorithms, it demonstrates superior ability to distinguish between normal and fall behaviors. Full article
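
The joint fall criterion the abstract describes can be sketched as follows: the descent speed of the center of mass, the trunk-ground angular velocity, and the body height-to-width ratio are computed from 2D skeletal keypoints and combined into a single decision. Keypoint indices follow the COCO convention; all thresholds are illustrative values, not the authors'.

```python
import numpy as np

def fall_suspected(keypoints_t0, keypoints_t1, dt,
                   v_thresh=0.9, omega_thresh=1.2, ratio_thresh=1.0):
    """keypoints_*: (N, 2) arrays of pixel coordinates for two consecutive frames."""
    # Descent speed of the center of mass (image y grows downward, so positive = falling)
    v_centroid = (keypoints_t1[:, 1].mean() - keypoints_t0[:, 1].mean()) / dt

    # Trunk angle relative to the ground, from shoulder and hip midpoints (COCO indices)
    def trunk_angle(kp, shoulder_idx=(5, 6), hip_idx=(11, 12)):
        shoulder = kp[list(shoulder_idx)].mean(axis=0)
        hip = kp[list(hip_idx)].mean(axis=0)
        dx, dy = hip - shoulder
        return np.arctan2(abs(dy), abs(dx))   # radians from horizontal

    omega = abs(trunk_angle(keypoints_t1) - trunk_angle(keypoints_t0)) / dt

    # Height-to-width ratio of the keypoint bounding box (a "flat" pose has a small ratio)
    w, h = np.ptp(keypoints_t1[:, 0]), np.ptp(keypoints_t1[:, 1])
    ratio = h / max(w, 1e-6)

    # Joint decision: fast descent, fast trunk rotation, and a flat pose together
    return v_centroid > v_thresh and omega > omega_thresh and ratio < ratio_thresh
```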

14 pages, 12231 KiB  
Article
Habitat Requirements of the Grey-Headed Woodpecker in Lowland Areas of NE Poland: Evidence from the Playback Experiment
by Grzegorz Zawadzki and Dorota Zawadzka
Birds 2025, 6(3), 32; https://doi.org/10.3390/birds6030032 - 20 Jun 2025
Viewed by 522
Abstract
The grey-headed woodpecker (Picus canus) (GHW) is one of the least-studied European woodpeckers, listed in Annex I of the Birds Directive. We examined the key environmental characteristics that determine the likelihood of GHW occurrence in vast forests in northeast Poland. Woodpeckers were inventoried in spring on 54 study plots (4 km2) covering 20% of the forest area. Active territories were detected and mapped using playback of territorial calls and drumming. Generalized linear models (GLM), random forest (RF), and boosting were used for modeling. GLM was used to indicate the most critical factors affecting the abundance of GHW. The number of territories in a single study plot ranged from 0 to 3; the most frequent were areas without woodpeckers. The probability of GHW nesting increased in plots with watercourses, a larger share of mixed forest, and a higher proportion of stands over 120 years old. The calculation for all 400 quadrats allowed us to estimate the population size at approximately 180–200 breeding pairs. The overall density of GHW in the study area was assessed at 0.13/km2, while in optimal quadrats it increased to about 0.75/km2. The preference for watercourses was linked to alders growing along water banks; near the water, there are often small meadows where the GHW can prey on ants. In turn, old-growth stands above 120 years old increased the probability of GHW presence, as older forests contain more dead and dying trees, in which the GHW excavates its cavities. To effectively protect the habitats of the GHW, it is necessary to maintain a larger area of stands over 120 years old, mainly on wet sites. Full article
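
A minimal sketch of the GLM step mentioned in the abstract, modeling the count of GHW territories per plot from habitat covariates. The variable names, the synthetic data, and the Poisson family are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic stand-in for the 54 surveyed 4 km2 plots (hypothetical variable names)
rng = np.random.default_rng(7)
plots = pd.DataFrame({
    "territories": rng.integers(0, 4, 54),        # 0-3 territories per plot
    "has_watercourse": rng.integers(0, 2, 54),
    "mixed_forest_share": rng.uniform(0, 1, 54),
    "share_over_120y": rng.uniform(0, 0.4, 54),
})

# Poisson GLM for territory counts as a function of habitat covariates
model = smf.glm(
    "territories ~ has_watercourse + mixed_forest_share + share_over_120y",
    data=plots,
    family=sm.families.Poisson(),
).fit()
print(model.summary())
```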

17 pages, 1071 KiB  
Article
Empirical Analysis of Learning Improvements in Personal Voice Activity Detection Frameworks
by Yu-Tseng Yeh, Chia-Chi Chang and Jeih-Weih Hung
Electronics 2025, 14(12), 2372; https://doi.org/10.3390/electronics14122372 - 10 Jun 2025
Viewed by 626
Abstract
Personal Voice Activity Detection (PVAD) has emerged as a critical technology for enabling speaker-specific detection in multi-speaker environments, surpassing the limitations of conventional Voice Activity Detection (VAD) systems that merely distinguish speech from non-speech. PVAD systems are essential for applications such as personalized voice assistants and robust speech recognition, where accurately identifying a target speaker’s voice amidst background speech and noise is crucial for both user experience and computational efficiency. Despite significant progress, PVAD frameworks still face challenges related to temporal modeling, integration of speaker information, class imbalance, and deployment on resource-constrained devices. In this study, we present a systematic enhancement of the PVAD framework through four key innovations: (1) a Bi-GRU (Bidirectional Gated Recurrent Unit) layer for improved temporal modeling of speech dynamics, (2) a cross-attention mechanism for context-aware speaker embedding integration, (3) a hybrid CE-AUROC (Cross-Entropy and Area Under Receiver Operating Characteristic) loss function to address class imbalance, and (4) Cosine Annealing Learning Rate (CALR) for optimized training convergence. Evaluated on LibriSpeech datasets under varied acoustic conditions, the proposed modifications demonstrate significant performance gains over the baseline PVAD framework, achieving 87.59% accuracy (vs. 86.18%) and 0.9481 mean Average Precision (vs. 0.9378) while maintaining real-time processing capabilities. These advancements address critical challenges in PVAD deployment, including robustness to noisy environments, with the hybrid loss function reducing false negatives by 12% in imbalanced scenarios. The work provides practical insights for implementing personalized voice interfaces on resource-constrained devices. Future extensions will explore quantized inference and multi-modal sensor fusion to further bridge the gap between laboratory performance and real-world deployment requirements. Full article
(This article belongs to the Special Issue Emerging Trends in Generative-AI Based Audio Processing)
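
One plausible reading of the hybrid CE-AUROC loss and the CALR schedule is sketched below: a cross-entropy term combined with a pairwise squared-hinge surrogate for AUROC, plus PyTorch's cosine annealing scheduler. The surrogate form, the weighting alpha, and the placeholder model are assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def hybrid_ce_auroc_loss(logits, targets, alpha=0.5, margin=1.0):
    """logits, targets: (N,) frame-level scores and {0, 1} target-speaker labels."""
    ce = F.binary_cross_entropy_with_logits(logits, targets.float())

    pos, neg = logits[targets == 1], logits[targets == 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return ce                                # degenerate batch: CE alone

    # Pairwise squared-hinge surrogate for AUROC: push positive-frame scores
    # above negative-frame scores by at least `margin`
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)   # (num_pos, num_neg)
    auroc_term = torch.clamp(margin - diff, min=0).pow(2).mean()

    return alpha * ce + (1 - alpha) * auroc_term

# Cosine Annealing Learning Rate, as mentioned in the abstract
model = torch.nn.Linear(40, 1)                   # placeholder for the PVAD network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```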

18 pages, 3278 KiB  
Article
Efficient Detection of Mind Wandering During Reading Aloud Using Blinks, Pitch Frequency, and Reading Rate
by Amir Rabinovitch, Eden Ben Baruch, Maor Siton, Nuphar Avital, Menahem Yeari and Dror Malka
AI 2025, 6(4), 83; https://doi.org/10.3390/ai6040083 - 18 Apr 2025
Cited by 2 | Viewed by 1049
Abstract
Mind wandering is a common issue among schoolchildren and academic students, often undermining the quality of learning and teaching effectiveness. Current detection methods mainly rely on eye trackers and electrodermal activity (EDA) sensors, focusing on external indicators such as facial movements but neglecting voice detection. These methods are often cumbersome, uncomfortable for participants, and invasive, requiring specialized, expensive equipment that disrupts the natural learning environment. To overcome these challenges, a new algorithm has been developed to detect mind wandering during reading aloud. Based on external indicators like the blink rate, pitch frequency, and reading rate, the algorithm integrates these three criteria to ensure the accurate detection of mind wandering using only a standard computer camera and microphone, making it easy to implement and widely accessible. An experiment with ten participants validated this approach. Participants read aloud a text of 1304 words while the algorithm, incorporating the Viola–Jones model for face and eye detection and pitch-frequency analysis, monitored for signs of mind wandering. A voice activity detection (VAD) technique was also used to recognize human speech. The algorithm achieved 76% accuracy in predicting mind wandering during specific text segments, demonstrating the feasibility of using noninvasive physiological indicators. This method offers a practical, non-intrusive solution for detecting mind wandering through video and audio data, making it suitable for educational settings. Its ability to integrate seamlessly into classrooms holds promise for enhancing student concentration, improving the teacher–student dynamic, and boosting overall teaching effectiveness. By leveraging standard, accessible technology, this approach could pave the way for more personalized, technology-enhanced education systems. Full article
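
A sketch of how the three indicators might be combined per text segment is shown below; the baseline comparison and the thresholds are hypothetical stand-ins for the paper's actual decision rule.

```python
from dataclasses import dataclass

@dataclass
class SegmentFeatures:
    blink_rate: float     # blinks per minute, e.g. from Viola-Jones eye detection
    pitch_std: float      # pitch-frequency variability (Hz)
    reading_rate: float   # words per minute over VAD-detected speech

def mind_wandering_score(seg: SegmentFeatures, baseline: SegmentFeatures) -> int:
    """Count how many of the three indicators drift from the reader's baseline."""
    score = 0
    score += seg.blink_rate > 1.3 * baseline.blink_rate      # more blinking
    score += seg.pitch_std < 0.7 * baseline.pitch_std        # flatter prosody
    score += seg.reading_rate < 0.8 * baseline.reading_rate  # slower reading
    return score

def is_mind_wandering(seg, baseline, min_indicators=2):
    return mind_wandering_score(seg, baseline) >= min_indicators
```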

43 pages, 2542 KiB  
Article
Mathematical Background and Algorithms of a Collection of Android Apps for a Google Play Store Page
by Roland Szabo
Appl. Sci. 2025, 15(8), 4431; https://doi.org/10.3390/app15084431 - 17 Apr 2025
Viewed by 436
Abstract
This paper discusses three algorithmic strategies tailored for distinct applications, each aiming to tackle specific operational challenges. The first application unveils an innovative SMS messaging system that substitutes manual typing with voice interaction. The key algorithm facilitates real-time conversion from speech to text for message creation and from text to speech for message playback, thus turning SMS communication into an audio-focused exchange while preserving conventional messaging standards. The second application suggests a secure file management system for Android, utilizing encryption and access control algorithms to safeguard user privacy. Its mathematical framework centers on cryptographic methods for file security and authentication processes to prevent unauthorized access. The third application redefines flashlight functionality using an optimized touch interface algorithm. By employing a screen-wide double-tap gesture recognition system, this approach removes the reliance on a physical button, depending instead on advanced event detection and hardware control logic to activate the device’s flash. All applications are fundamentally based on mathematical modeling and algorithmic effectiveness, emphasizing computational approaches over implementation specifics. Full article
(This article belongs to the Section Computing and Artificial Intelligence)

20 pages, 20407 KiB  
Article
VAD-CLVA: Integrating CLIP with LLaVA for Voice Activity Detection
by Andrea Appiani and Cigdem Beyan
Information 2025, 16(3), 233; https://doi.org/10.3390/info16030233 - 16 Mar 2025
Viewed by 1721
Abstract
Voice activity detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in visual-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments focusing on the upper body of an individual, while the text encoder processes textual descriptions generated by a Generative Large Multimodal Model, i.e., the Large Language and Vision Assistant (LLaVA). Subsequently, embeddings from these encoders are fused through a deep neural network to perform VAD. Our experimental analysis across three VAD benchmarks showcases the superior performance of our method compared to existing visual VAD approaches. Notably, our approach outperforms several audio-visual methods despite its simplicity and without requiring pretraining on extensive audio-visual datasets. Full article
(This article belongs to the Special Issue Application of Machine Learning in Human Activity Recognition)
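
The fusion step described in the abstract can be sketched as a small classification head over pre-computed CLIP visual and LLaVA-derived text embeddings. The embedding sizes, layer widths, and head architecture are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VADFusionHead(nn.Module):
    def __init__(self, visual_dim=512, text_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, 1),      # speaking / not-speaking logit
        )

    def forward(self, clip_visual_emb, llava_text_emb):
        # Concatenate the two modality embeddings and classify the segment
        fused = torch.cat([clip_visual_emb, llava_text_emb], dim=-1)
        return self.mlp(fused)

# Usage with pre-computed embeddings for a batch of eight video segments
head = VADFusionHead()
logits = head(torch.randn(8, 512), torch.randn(8, 512))
probs = torch.sigmoid(logits)          # per-segment speaking probability
```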

15 pages, 4108 KiB  
Article
Vocal Emotion Perception and Musicality—Insights from EEG Decoding
by Johannes M. Lehnen, Stefan R. Schweinberger and Christine Nussbaum
Sensors 2025, 25(6), 1669; https://doi.org/10.3390/s25061669 - 8 Mar 2025
Viewed by 1047
Abstract
Musicians have an advantage in recognizing vocal emotions compared to non-musicians, a performance advantage often attributed to enhanced early auditory sensitivity to pitch. Yet a previous ERP study only detected group differences from 500 ms onward, suggesting that conventional ERP analyses might not be sensitive enough to detect early neural effects. To address this, we re-analyzed EEG data from 38 musicians and 39 non-musicians engaged in a vocal emotion perception task. Stimuli were generated using parameter-specific voice morphing to preserve emotional cues in either the pitch contour (F0) or timbre. By employing a neural decoding framework with a Linear Discriminant Analysis classifier, we tracked the evolution of emotion representations over time in the EEG signal. Converging with the previous ERP study, our findings reveal that musicians—but not non-musicians—exhibited significant emotion decoding between 500 and 900 ms after stimulus onset, a pattern observed for F0-Morphs only. These results suggest that musicians’ superior vocal emotion recognition arises from more effective integration of pitch information during later processing stages rather than from enhanced early sensory encoding. Our study also demonstrates the potential of neural decoding approaches using EEG brain activity as a biological sensor for unraveling the temporal dynamics of voice perception. Full article
(This article belongs to the Special Issue Sensing Technologies in Neuroscience and Brain Research)
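
The time-resolved decoding framework the abstract describes can be sketched as a per-timepoint Linear Discriminant Analysis with cross-validation; the array shapes, number of emotion classes, and random data below are assumptions standing in for the EEG recordings.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def decode_over_time(X, y, cv=5):
    """X: (n_trials, n_channels, n_times) EEG; y: (n_trials,) emotion labels."""
    n_times = X.shape[2]
    scores = np.zeros(n_times)
    for t in range(n_times):
        clf = LinearDiscriminantAnalysis()
        scores[t] = cross_val_score(clf, X[:, :, t], y, cv=cv).mean()
    return scores   # decoding accuracy as a function of time

# Example with random data standing in for one participant's epochs
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 64, 250))   # 120 trials, 64 channels, 250 samples
y = rng.integers(0, 4, size=120)          # four emotion categories (assumed)
accuracy_curve = decode_over_time(X, y)
```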

17 pages, 4873 KiB  
Article
An Ensemble Approach for Speaker Identification from Audio Files in Noisy Environments
by Syed Shahab Zarin, Ehzaz Mustafa, Sardar Khaliq uz Zaman, Abdallah Namoun and Meshari Huwaytim Alanazi
Appl. Sci. 2024, 14(22), 10426; https://doi.org/10.3390/app142210426 - 13 Nov 2024
Viewed by 1237
Abstract
Automatic noise-robust speaker identification is essential in various applications, including forensic analysis, e-commerce, smartphones, and security systems. Audio files containing suspect speech often include background noise, as they are typically not recorded in soundproof environments. To this end, we address the challenges of noise robustness and accuracy in speaker identification systems. An ensemble approach is proposed that combines two different neural network architectures, an RNN and a DNN, using softmax. This approach enhances the system’s ability to identify speakers accurately, even in noisy environments. Using softmax, we combine voice activity detection (VAD) with a multilayer perceptron (MLP). The VAD component aims to remove noisy frames from the recording, and the softmax function addresses the residual noise traces by assigning a higher probability to the speaker’s voice than to the noise. We tested our proposed solution on the Kaggle speaker recognition dataset and compared it to two baseline systems. Experimental results show that our approach outperforms the baseline systems, achieving 3.6% and 5.8% increases in test accuracy. Additionally, we compared the proposed MLP system with Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) classifiers. The results demonstrate that the MLP with VAD and softmax outperforms the LSTM by 23.2% and the BiLSTM by 6.6% in test accuracy. Full article
(This article belongs to the Special Issue Advances in Intelligent Information Systems and AI Applications)
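
A sketch of the pipeline the abstract outlines: an energy-based VAD drops silent or noisy frames, and an MLP with a softmax output assigns the remaining frames to speakers. The VAD rule, feature dimensionality, and network sizes are illustrative stand-ins for the paper's components.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def simple_vad(frames, energy_quantile=0.3):
    """frames: (n_frames, n_features); keep frames above an energy threshold."""
    energy = (frames ** 2).mean(axis=1)
    return frames[energy > np.quantile(energy, energy_quantile)]

# MLP speaker classifier; softmax is the standard multi-class output of an MLP
clf = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=300)

# Hypothetical training data: MFCC-like frame features with speaker labels
X_train = np.random.randn(2000, 40)
y_train = np.random.randint(0, 5, size=2000)          # five speakers (assumed)
clf.fit(X_train, y_train)

# Identify the speaker of a recording by voting over its voiced frames
test_frames = simple_vad(np.random.randn(200, 40))
speaker_posteriors = clf.predict_proba(test_frames)    # softmax probabilities
predicted_speaker = np.bincount(speaker_posteriors.argmax(axis=1)).argmax()
```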

14 pages, 1413 KiB  
Article
Enhanced Speech Emotion Recognition Using Conditional-DCGAN-Based Data Augmentation
by Kyung-Min Roh and Seok-Pil Lee
Appl. Sci. 2024, 14(21), 9890; https://doi.org/10.3390/app14219890 - 29 Oct 2024
Cited by 3 | Viewed by 1451
Abstract
With the advancement of Artificial Intelligence (AI) and the Internet of Things (IoT), research in the field of emotion detection and recognition has been actively conducted worldwide in modern society. Among this research, speech emotion recognition has gained increasing importance in various areas of application such as personalized services, enhanced security, and the medical field. However, subjective emotional expressions in voice data can be perceived differently by individuals, and issues such as data imbalance and limited datasets fail to provide the diverse situations necessary for model training, thus limiting performance. To overcome these challenges, this paper proposes a novel data augmentation technique using Conditional-DCGAN, which combines CGAN and DCGAN. This study analyzes the temporal signal changes using Mel-spectrograms extracted from the Emo-DB dataset and applies a loss function calculation method borrowed from reinforcement learning to generate data that accurately reflects emotional characteristics. To validate the proposed method, experiments were conducted using a model combining CNN and Bi-LSTM. The results, including augmented data, achieved significant performance improvements, reaching WA 91.46% and UAR 91.61%, compared to using only the original data (WA 79.31%, UAR 78.16%). These results outperform similar previous studies, such as those reporting WA 84.49% and UAR 83.33%, demonstrating the positive effects of the proposed data augmentation technique. This study presents a new data augmentation method that enables effective learning even in situations with limited data, offering a progressive direction for research in speech emotion recognition. Full article
(This article belongs to the Special Issue Human Activity Recognition (HAR) in Healthcare, 2nd Edition)
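
A label-conditioned DCGAN-style generator for mel-spectrogram patches, in the spirit of the Conditional-DCGAN augmentation described above, might look like the sketch below. The spectrogram size, latent dimension, and layer widths are assumptions; only the seven Emo-DB emotion classes come from the dataset.

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    def __init__(self, n_emotions=7, latent_dim=100, emb_dim=50):
        super().__init__()
        self.label_emb = nn.Embedding(n_emotions, emb_dim)
        self.net = nn.Sequential(
            # Project noise + label embedding to a 4x4 map, then upsample to 64x64
            nn.ConvTranspose2d(latent_dim + emb_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(True),
            nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Tanh(),   # 1 x 64 x 64 "mel" patch
        )

    def forward(self, z, labels):
        cond = torch.cat([z, self.label_emb(labels)], dim=1)
        return self.net(cond.unsqueeze(-1).unsqueeze(-1))

# Generate a batch of label-conditioned spectrogram patches for augmentation
g = CondGenerator()
fake_mels = g(torch.randn(16, 100), torch.randint(0, 7, (16,)))
```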

21 pages, 9368 KiB  
Article
Advanced Neural Classifier-Based Effective Human Assistance Robots Using Comparable Interactive Input Assessment Technique
by Mohammed Albekairi, Khaled Kaaniche, Ghulam Abbas, Paolo Mercorelli, Meshari D. Alanazi and Ahmad Almadhor
Mathematics 2024, 12(16), 2500; https://doi.org/10.3390/math12162500 - 13 Aug 2024
Cited by 2 | Viewed by 1226
Abstract
The role of robotic systems in human assistance is inevitable with the bots that assist with interactive and voice commands. For cooperative and precise assistance, the understandability of these bots needs better input analysis. This article introduces a Comparable Input Assessment Technique (CIAT) to improve the bot system’s understandability. This research introduces a novel approach for HRI that uses optimized algorithms for input detection, analysis, and response generation in conjunction with advanced neural classifiers. This approach employs deep learning models to enhance the accuracy of input identification and processing efficiency, in contrast to previous approaches that often depended on conventional detection techniques and basic analytical methods. Regardless of the input type, this technique defines cooperative control for assistance from previous histories. The inputs are cooperatively validated for the instruction responses for human assistance through defined classifications. For this purpose, a neural classifier is used; the maximum possibilities for assistance using self-detected instructions are recommended for the user. The neural classifier is divided into two categories according to its maximum comparable limits: precise instruction and least assessment inputs. For this purpose, the robot system is trained using previous histories and new assistance activities. The learning process performs comparable validations between detected and unrecognizable inputs with a classification that reduces understandability errors. Therefore, the proposed technique was found to reduce response time by 6.81%, improve input detection by 8.73%, and provide assistance by 12.23% under varying inputs. Full article

30 pages, 909 KiB  
Article
Emotion Detection from EEG Signals Using Machine Deep Learning Models
by João Vitor Marques Rabelo Fernandes, Auzuir Ripardo de Alexandria, João Alexandre Lobo Marques, Débora Ferreira de Assis, Pedro Crosara Motta and Bruno Riccelli dos Santos Silva
Bioengineering 2024, 11(8), 782; https://doi.org/10.3390/bioengineering11080782 - 2 Aug 2024
Cited by 15 | Viewed by 7280
Abstract
Detecting emotions is a growing field aiming to comprehend and interpret human emotions from various data sources, including text, voice, and physiological signals. Electroencephalogram (EEG) is a unique and promising approach among these sources. EEG is a non-invasive monitoring technique that records the brain’s electrical activity through electrodes placed on the scalp’s surface. It is used in clinical and research contexts to explore how the human brain responds to emotions and cognitive stimuli. Recently, its use has gained interest in real-time emotion detection, offering a direct approach independent of facial expressions or voice. This is particularly useful in resource-limited scenarios, such as brain–computer interfaces supporting mental health. The objective of this work is to evaluate the classification of emotions (positive, negative, and neutral) in EEG signals using machine learning and deep learning, focusing on Graph Convolutional Neural Networks (GCNN), based on the analysis of critical attributes of the EEG signal (Differential Entropy (DE), Power Spectral Density (PSD), Differential Asymmetry (DASM), Rational Asymmetry (RASM), Asymmetry (ASM), Differential Causality (DCAU)). The electroencephalography dataset used in the research was the public SEED dataset (SJTU Emotion EEG Dataset), obtained through auditory and visual stimuli in segments from Chinese emotional movies. The experiment employed to evaluate the model results was “subject-dependent”. In this method, the Deep Neural Network (DNN) achieved an accuracy of 86.08%, surpassing SVM, albeit with significant processing time due to the optimization characteristics inherent to the algorithm. The GCNN algorithm achieved an average accuracy of 89.97% in the subject-dependent experiment. This work contributes to emotion detection in EEG, emphasizing the effectiveness of different models and underscoring the importance of selecting appropriate features and the ethical use of these technologies in practical applications. The GCNN emerges as the most promising methodology for future research. Full article
(This article belongs to the Special Issue Monitoring and Analysis of Human Biosignals, Volume II)
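
The Differential Entropy (DE) feature mentioned in the abstract has a closed form for band-passed EEG assumed Gaussian: DE = 0.5 * log(2πe σ²) per channel and band. The band edges and sampling rate below are assumptions for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 31), "gamma": (31, 50)}

def differential_entropy(segment, fs=200):
    """segment: (n_channels, n_samples) EEG; returns (n_bands, n_channels) DE features."""
    feats = []
    for low, high in BANDS.values():
        # Band-pass filter, then apply the Gaussian closed form per channel
        b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
        band = filtfilt(b, a, segment, axis=1)
        var = band.var(axis=1)
        feats.append(0.5 * np.log(2 * np.pi * np.e * var))
    return np.vstack(feats)

# Example on random data standing in for a 62-channel, 4-second SEED segment
de = differential_entropy(np.random.randn(62, 800))
```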

19 pages, 3360 KiB  
Article
ATC-SD Net: Radiotelephone Communications Speaker Diarization Network
by Weijun Pan, Yidi Wang, Yumei Zhang and Boyuan Han
Aerospace 2024, 11(7), 599; https://doi.org/10.3390/aerospace11070599 - 22 Jul 2024
Cited by 1 | Viewed by 2173
Abstract
This study addresses the challenges that high-noise environments and complex multi-speaker scenarios present in civil aviation radio communications. A novel radiotelephone communications speaker diarization network is developed specifically for these circumstances. To improve the precision of the speaker diarization network, three core modules are designed: voice activity detection (VAD), end-to-end speaker separation for air–ground communication (EESS), and probabilistic knowledge-based text clustering (PKTC). First, the VAD module uses attention mechanisms to separate out silence and irrelevant noise, resulting in clean dialogue commands. Subsequently, the EESS module distinguishes between controllers and pilots by leveraging voiceprint differences, resulting in effective speaker segmentation. Finally, the PKTC module addresses the issue of pilot voiceprint ambiguity using text clustering, introducing a novel flight prior knowledge-based text-related clustering model. To achieve robust speaker diarization in multi-pilot scenarios, this model uses prior knowledge-based graph construction, radar data-based graph correction, and probabilistic optimization. This study also includes the development of the specialized ATCSPEECH dataset, which demonstrates significant performance improvements over both the AMI and ATCO2 PROJECT datasets. Full article

24 pages, 5034 KiB  
Perspective
AI Detection of Human Understanding in a Gen-AI Tutor
by Earl Woodruff
AI 2024, 5(2), 898-921; https://doi.org/10.3390/ai5020045 - 18 Jun 2024
Cited by 5 | Viewed by 4610
Abstract
Subjective understanding is a complex process that involves the interplay of feelings and cognition. This paper explores how computers can monitor a user’s sympathetic and parasympathetic nervous system activity in real-time to detect the nature of the understanding the user is experiencing as they engage with study materials. By leveraging advancements in facial expression analysis, transdermal optical imaging, and voice analysis, I demonstrate how one can identify the physiological feelings that indicate a user’s mental state and level of understanding. The mental state model, which views understandings as composed of assembled beliefs, values, emotions, and feelings, provides a framework for understanding the multifaceted nature of the emotion–cognition relationship. As learners progress through the phases of nascent understanding, misunderstanding, confusion, emergent understanding, and deep understanding, they experience a range of cognitive processes, emotions, and physiological responses that can be detected and analyzed by AI-driven assessments. Based on the above approach, I further propose the development of Abel Tutor. This AI-driven system uses real-time monitoring of physiological feelings to provide individualized, adaptive tutoring support designed to guide learners toward deep understanding. By identifying the feelings associated with each phase of understanding, Abel Tutor can offer targeted interventions, such as clarifying explanations, guiding questions, or additional resources, to help students navigate the challenges they encounter and promote engagement. The ability to detect and respond to a student’s emotional state in real-time can revolutionize the learning experience, creating emotionally resonant learning environments that adapt to individual needs and optimize educational outcomes. As we continue to explore the potential of AI-driven assessments of subjective understanding, it is crucial to ensure that these technologies are grounded in sound pedagogical principles and ethical considerations, ultimately empowering learners and facilitating the attainment of deep understanding and lifelong learning for advantaged and disadvantaged students. Full article

35 pages, 13690 KiB  
Article
An Audio-Based SLAM for Indoor Environments: A Robotic Mixed Reality Presentation
by Elfituri S. F. Lahemer and Ahmad Rad
Sensors 2024, 24(9), 2796; https://doi.org/10.3390/s24092796 - 27 Apr 2024
Cited by 2 | Viewed by 2799
Abstract
In this paper, we present a novel approach referred to as the audio-based virtual landmark-based HoloSLAM. This innovative method leverages a single sound source and microphone arrays to estimate the voice-printed speaker’s direction. The system allows an autonomous robot equipped with a single microphone array to navigate within indoor environments, interact with specific sound sources, and simultaneously determine its own location while mapping the environment. The proposed method does not require multiple audio sources in the environment nor sensor fusion to extract pertinent information and make accurate sound source estimations. Furthermore, the approach incorporates Robotic Mixed Reality using Microsoft HoloLens to superimpose landmarks, effectively mitigating the audio landmark-related issues of conventional audio-based landmark SLAM, particularly in situations where audio landmarks cannot be discerned, are limited in number, or are completely missing. The paper also evaluates an active speaker detection method, demonstrating its ability to achieve high accuracy in scenarios where audio data are the sole input. Real-time experiments validate the effectiveness of this method, emphasizing its precision and comprehensive mapping capabilities. The results of these experiments showcase the accuracy and efficiency of the proposed system, surpassing the constraints associated with traditional audio-based SLAM techniques, ultimately leading to a more detailed and precise mapping of the robot’s surroundings. Full article
(This article belongs to the Section Navigation and Positioning)
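
One building block behind microphone-array direction estimation of the kind used here is GCC-PHAT time-delay estimation, from which a bearing to the sound source can be derived. The sketch below illustrates that general technique under an assumed microphone spacing; it is not the paper's specific HoloSLAM pipeline.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Return the estimated time delay (seconds) of `sig` relative to `ref`."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=n)      # PHAT weighting
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def bearing(tau, d=0.1, c=343.0):
    """Bearing (degrees) from the delay, given mic spacing d (m) and speed of sound c."""
    return np.degrees(np.arcsin(np.clip(tau * c / d, -1.0, 1.0)))

# Example: a source delayed by 5 samples at one microphone of a two-mic pair
fs = 16000
src = np.random.randn(fs)
tau = gcc_phat(np.roll(src, 5), src, fs)
print(bearing(tau))
```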