MDPI - Publisher of Open Access Journals

21 pages, 698 KB

Open Access Article

Automatic Diacritization Models for a High-Population Low-Resource African Language (Yorùbá)

by Joshua I. Ayoola and Peter O. Olukanmi

Appl. Sci. 2026, 16(12), 6195; https://doi.org/10.3390/app16126195 (registering DOI) - 18 Jun 2026

Diacritization is an essential part of the reading and writing of text in Yorùbá, a widely-spoken tonal language in West Africa and some parts of the American continent. Unfortunately, typical computer-typed texts are not diacritized. Thus, automatic diacritization is a critical issue in Yorùbá natural language processing (NLP), since missing tone marks and underdots affect text comprehension, translation and speech technology. This paper begins by reviewing the state of the art. While there is a paucity of Yorùbá diacritization models, four models found were studied to explore their performances using the standardised Yorùbá Automatic Diacritization Dataset: the 2018 Volta Baseline, the mT5_base_yoruba_adr, GPT-5.2 and Gemini 3.1 Pro. We measured the performance based on a set of metrics: Word Error Rate (WER), Character Error Rate (CER), Diacritization Error Rate (DER), Word Diacritization Error Rate (WDER), BLEU and ChrF, using the complete diacritic removal condition of the YAD test set. To ensure reproducibility, the LLM evaluations were conducted via the respective official APIs and AI Studio with pinned snapshots and deterministic settings, with each model evaluated across three independent full-dataset runs. The findings showed that the specialised mT5_base_yoruba_adr model slightly outperforms the LLMs, achieving the lowest error rates of 34.85% CER, 18.34% WER, 43.37% DER and 18.33% WDER, as well as a BLEU of 0.6872 and ChrF of 0.8436. Gemini 3.1 Pro ranked second across all error rate metrics with 35.68% CER, 18.96% WER, and 44.84% DER but outperformed mT5 by a small margin on ChrF (0.8469), followed by GPT-5.2 with 54.01% CER, 38.05% WER, and 62.64% DER. The Volta Baseline built on the early seq2seq showed the weakest performance with 92.37% CER and 94.42% DER. 
These results challenge the assumption that large parameter count and massive pre-training guarantee superior performance in low-resource language tasks and show that targeted fine-tuning on Yorùbá-specific data remains important. Our work serves as a reference for researchers seeking an overview of the state of the art, as well as a detailed and reproducible evaluation of existing models. The results highlight methodological progress and gaps in current systems. Addressing these gaps will require domain-adaptive fine-tuning, improved algorithms, and robust datasets to advance the state-of-the-art in African-language automatic diacritization research. Full article

(This article belongs to the Special Issue Natural Language Processing (NLP): Technologies and Applications)
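
The word and character error rates reported in the abstract above are commonly computed as an edit distance divided by the reference length. The following is a minimal sketch under that assumption; the function names and the NFC normalization step are illustrative choices, not the authors' evaluation code, and the paper's DER and WDER definitions are not reproduced here.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _nfc(text):
    # Assumption: compare NFC-normalized text so that precomposed and
    # decomposed spellings of the same diacritic are treated alike.
    return unicodedata.normalize("NFC", text)


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = _nfc(reference).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, _nfc(hypothesis).split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: code-point edit distance / reference length."""
    ref = _nfc(reference)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, _nfc(hypothesis)) / len(ref)


# An undiacritized hypothesis scored against a diacritized reference.
print(wer("bá mi lọ", "ba mi lo"))  # 2 of 3 words differ
print(cer("bá mi lọ", "ba mi lo"))  # 2 of 8 characters differ
```

Some Yorùbá characters that carry both an underdot and a tone mark have no single precomposed code point, so a code-point CER counts such a character as two units. A grapheme-level variant would give different figures.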

23 pages, 3410 KB

Open Access Article

Human Detection of Voice-Cloned Speech Under GSM, VoLTE and VoIP Conditions

by Jakub Warzych, Michał Łuczyński and Janusz Klink

Acoustics 2026, 8(2), 41; https://doi.org/10.3390/acoustics8020041 - 17 Jun 2026

Viewed by 6

Abstract

The rapid progress of generative speech synthesis and voice-cloning technologies has enabled the creation of highly natural synthetic voices that pose a serious threat to telecommunication security. While most prior studies evaluate human ability to detect audio deepfakes using high-quality, studio-grade recordings, little is known about how real-world telecommunication channels affect perceptual detection. This study investigates the influence of three transmission scenarios—GSM (AMR-NB), VoLTE (AMR-WB), and VoIP with packet-loss modeling—on the human ability to distinguish natural speech from AI-generated speech. A custom speech corpus was developed, consisting of natural recordings from nine speakers and corresponding synthetic utterances generated using a state-of-the-art voice cloning system (ElevenLabs). All samples were processed through simulated telecommunication channels using real codec implementations. A listening test with 95 participants was conducted, involving binary classification (human vs. synthetic) and confidence ratings. Results show an overall detection accuracy of 54.8%, confirming that humans are poorly equipped to identify synthetic speech. Surprisingly, the highest accuracy was achieved for the narrowband GSM channel (63.7%), while VoLTE yielded the lowest performance (44.0%). The findings suggest that restricted bandwidth may emphasize prosodic irregularities typical of generative models, whereas high-quality channels mask synthetic artifacts, increasing susceptibility to voice spoofing. The results highlight the necessity of deploying additional security mechanisms in telecommunication systems relying on voice identity verification. Full article

17 pages, 2502 KB

Open Access Article

Child- and Adult-Centered Toy Play Across Languages in Thai–English Bilingual Mother–Child Interactions

by Sirada Rochanavibhata and Viorica Marian

Behav. Sci. 2026, 16(6), 1017; https://doi.org/10.3390/bs16061017 - 17 Jun 2026

Viewed by 42

Abstract

Play is a universal activity. Yet there are cultural and linguistic differences in how families engage in adult–child play. In the present study, Thai–English bilingual mother–child dyads completed a toy play task in both languages. The results revealed cross-linguistic differences in bilingual mothers’ and children’s conversation styles. When speaking Thai, the nature of bilinguals’ dyadic play was more adult-centered, characterized by the use of directives by the mothers and use of repetitions by the children, which was congruent with parent–child interpersonal dynamics in high-power-distance Asian cultures. When speaking English, the play session was more child-centered, evidenced by children’s use of directives and encouragements, which was congruent with behavioral norms in low-power-distance Western cultures. Bilingual mothers and children exhibited positive associations in their narrative styles during both the Thai and English sessions. Additionally, the preliminary results provided evidence that cross-linguistic differences in mother–child speech patterns may be moderated by child gender. These findings suggest that the communicative and interactional patterns that bilingual caregivers modeled for bilingual children varied across languages and that preschoolers aligned their behaviors with those exemplified by their mothers. We conclude that bilingualism influences early social communication, with theoretical and applied implications for researchers, educators, and clinicians. Full article

(This article belongs to the Special Issue Language and Cognitive Development in Bilingual Children)

21 pages, 466 KB

Open Access Review

Artificial Intelligence for Patient-Reported Outcomes in Oncology: Current Applications and Future Directions Toward Multimodal Monitoring

by Sebastian Gorecki, Aleksandra Tatka and Malgorzata Osmola

Cancers 2026, 18(12), 1905; https://doi.org/10.3390/cancers18121905 - 11 Jun 2026

Viewed by 292

Abstract

Patient-reported outcomes (PROs) are an integral component of contemporary oncology. They provide direct insight into symptom severity, treatment tolerability, and health-related quality of life. Despite their clinical relevance, routine implementation faces several hurdles. Key limitations include patient survey fatigue, challenges in real-time interpretation of complex symptom trajectories, and incomplete longitudinal data that limit reliable analysis. This narrative review summarizes recent advances (2020–2026) in applying artificial intelligence (AI) to structured questionnaires, including EORTC QLQ-C30, PROMIS, and PRO-CTCAE, as well as to unstructured clinical text. Machine learning and natural language processing may enhance the clinical utility of PROs through automated analysis, symptom extraction, and predictive modeling. Current studies suggest that AI-based approaches can support the prediction of symptom deterioration, treatment-related toxicity, and healthcare utilization, including unplanned hospitalizations and emergency department visits. Furthermore, NLP models can extract clinically meaningful information from free-text narratives. We also discuss emerging non-invasive digital biomarkers derived from speech and facial expressions. Multimodal approaches suggest that these features may provide complementary indicators of pain, fatigue, and affective state. Overall, AI has the potential to transform PROs from static assessment tools into dynamic clinical instruments. This shift may enable more continuous and proactive symptom monitoring and support the integration of multimodal patient data into oncology decision-making workflows. Full article

(This article belongs to the Special Issue Machine Learning and Artificial Intelligence in Cancer Diagnostic and Monitoring)

24 pages, 1730 KB

Open Access Article

An Unsupervised Subspace Weighting Co-Clustering Framework for Hate Speech Detection Patterns in Social Media

by Maya Sultan ALGhafri, Imran Khan and Abdelhamid Abdesselam

AI 2026, 7(6), 204; https://doi.org/10.3390/ai7060204 - 4 Jun 2026

Viewed by 399

Abstract

The exponential growth of social media has revolutionized global communication, enabling instant idea exchange and transforming information sharing into a worldwide phenomenon while simultaneously accelerating the spread of abusive and hateful content that threatens online harmony and poses a serious risk to online community integrity and public trust. Although supervised deep learning approaches achieve impressive accuracy for hate speech detection, they remain fundamentally reliant on extensive annotated corpora, and their lack of interpretability makes them insufficient for transparent and scalable real-world hate speech detection. This study presents a category-oriented unsupervised architecture for English hate-speech detection and classification that substantially reduces reliance on large labeled datasets by requiring only minimal supervision (10% of labels for post hoc cluster interpretation), ensuring transparency and a high degree of semantic interpretability. We introduce an unsupervised Subspace Weighting Co-Clustering framework that uses HateBERT-driven contextual embeddings, enabling simultaneous interpretable feature weighting and semantic understanding for robust hate-speech detection. The obtained embeddings are further structured using the Subspace Weighting Co-Clustering approach, which enables the unsupervised discovery of latent subspaces and the organization of tweets into semantically coherent hate categories. The comprehensive evaluation shows that the framework achieves superior accuracy over existing methods, providing a more robust and effective mechanism for digital platforms to identify and mitigate hate speech and promote safer online interactions. Full article

(This article belongs to the Special Issue The Digital Immune System: AI-Driven Detection and Mitigation of Online Harms)

22 pages, 982 KB

Open Access Article

Context-Oriented Method for Resolving Lexical Ambiguities in Speech Synthesis for a Low-Resource Language

by Elisa Izrailova, Andrey Ronzhin, Salaudin Umarkhadzhiev, Arslanbek Astemirov, Aleksandra Figurek and Zelimkhan Sultanov

Big Data Cogn. Comput. 2026, 10(6), 181; https://doi.org/10.3390/bdcc10060181 - 1 Jun 2026

Viewed by 283

Abstract

Ambiguity resolution is one of the main challenges in text-to-speech conversion. Machine learning methods and artificial neural networks have been successfully applied to this problem in synthesis systems for English, Spanish, and other common languages. For low-resource languages, the available data are insufficient to train artificial neural networks, so heuristic methods for context analysis and selection of the correct homonym for polysemantic words should be used. The purpose of this study is to develop a word sense disambiguation (WSD) method for the low-resource Chechen language and to integrate it into a speech synthesis system. The study presents the developed method and three WSD algorithms: AWEN (based on Euclidean distance), AWA (weighted average), and AWN (weighted normalized distance). A corpus of Chechen texts, CheWSData, was compiled, containing 15,035 manually selected sentences derived from 5 million annotated words and reflecting the natural frequency of polysemy across grammatical categories. Experimental results show that the proposed AWN method achieves the best performance, with an F1-score of 0.78 and an accuracy of 0.80, outperforming AWA (F1: 0.74) and AWEN (F1: 0.40). For specific parts of speech, AWN reaches F1-scores of 0.82 for nouns, 0.83 for verbs, and 0.85 for adverbs. Comparative analysis with existing WSD methods for low-resource languages (Kashmiri, Hausa, Assamese, Urdu, and Vietnamese) demonstrates that AWN is competitive, ranking second after ViConBERT (F1: 0.87) and ahead of XLM-R for Hausa (F1: 0.79). The developed software module for homonym recognition was integrated into the Chechen speech synthesis system, contributing to more natural synthesized speech. Full article

(This article belongs to the Special Issue Natural Language Processing Applications in Big Data)

30 pages, 1349 KB

Open Access Article

A Lightweight Multimodal Architecture for Punctuation Restoration in Kazakh ASR

by Aidana Karibayeva, Oleg Myssov, Balzhan Abduali, Dina Amirova and Adina Karybayeva

Computers 2026, 15(6), 345; https://doi.org/10.3390/computers15060345 - 28 May 2026

Viewed by 498

Abstract

In this paper, we first present a multimodal architecture called CrossAttn-v1. This model is designed to recover punctuation marks in Kazakh and combines contextual XLM-RoBERTa-large text embeddings with the Whisper large-v3 encoder states via a cross-attention mechanism. In addition, a 4-dimensional prosodic vector and a CRF output layer are used. The model was trained using an adapted Whisper ASR model on 33,332 utterances from the KazakhTTS2 corpus. After adaptation, the word error rate decreased from 45.7% to 4.25%. On the in-domain test set (56,396 tokens), CrossAttn-v1 achieved F1-macro = 0.8485 for recovering five-class punctuation marks. Furthermore, CrossAttn-v1 outperformed the GPT-4o zero-shot model by +0.294 F1 and the M3 Hybrid model based on prosody alone by +0.070 F1. The class analysis showed that the Whisper encoder states were particularly useful for prosody-dependent punctuation. For example, it outperformed M3 Hybrid by +9.5 percentage points on the QUESTION mark and by +20.2 percentage points on the EXCLAIM mark. On 883 out-of-domain natural speech recordings, the model performed similarly to the text-only baseline model (Δ = −0.041, not significant), suggesting that domain mismatch in the Whisper training corpus was a major factor limiting generalization. Full article

(This article belongs to the Special Issue Advances in Multimodal Learning and Representation)

16 pages, 693 KB

Open Access Review

Presbycusis Across the Lifespan: Genetic, Molecular, and Multi-Omics Contributions

by Anna Morgan, Paolo Gasparini and Giorgia Girotto

Audiol. Res. 2026, 16(3), 81; https://doi.org/10.3390/audiolres16030081 - 26 May 2026

Viewed by 281

Abstract

Presbycusis, or age-related hearing loss (ARHL), is a multifactorial disorder characterized by a gradual, bilateral sensorineural decline in hearing sensitivity, predominantly affecting high-frequency sounds. It is one of the most common chronic conditions in the aging population and represents a major public health concern due to its high prevalence and progressive nature. Presbycusis significantly impairs speech perception, especially in noisy environments, leading to communication difficulties, reduced social participation, increased risk of social isolation, and a decline in quality of life. Moreover, growing evidence highlights a strong association between ARHL and cognitive impairment, dementia, depression, and increased frailty in older adults. The etiology of presbycusis is complex and involves the interplay between genetic predisposition and cumulative environmental and lifestyle-related factors. Genetic susceptibility influences cochlear aging, neural degeneration, and vulnerability to external insults. Non-genetic contributors include chronic noise exposure, cardiovascular and metabolic disorders such as diabetes and dyslipidemia, ototoxic medications, smoking, and other lifestyle factors that may accelerate cochlear damage through oxidative stress and microvascular dysfunction. This narrative review aims to provide an updated overview of the genetic and environmental determinants involved in the development and progression of presbycusis. Furthermore, it discusses the clinical implications of these factors for early identification, audiological evaluation, prevention strategies, and personalized management approaches. A better understanding of the multifactorial nature of presbycusis may support the development of targeted interventions to preserve hearing function and improve overall health outcomes in the aging population. Full article

(This article belongs to the Special Issue The Aging Ear)

28 pages, 4453 KB

Open Access Article

Layered Network Diffusion of Misinformation on YouTube: A Multi-Level Analysis of Video and Channel Interactions

by Md Irfanuzzaman Khan, Benedict Sheehy and Bruce Baer Arnold

Platforms 2026, 4(2), 9; https://doi.org/10.3390/platforms4020009 - 25 May 2026

Viewed by 215

Abstract

Misinformation has become a persistent feature of contemporary digital information environments. Platform designs and business models often privilege attention, engagement, and repeated exposure over epistemic quality. However, misinformation does not diffuse uniformly across platform structures. This study examines how contested claims in a South Korean social policy controversy circulate on YouTube. The analysis focuses on unfounded allegations regarding permanent employment offers to part-time workers at Incheon International Airport across two analytic levels: (1) a videoclip network, in which video-to-video ties are formed through shared commenters over time, and (2) a channel network, in which channel-to-channel ties are formed through shared commenters over time. Drawing on YouTube Data API records, we employ a mixed computational approach that integrates social network analysis, speech-to-text transcription, natural language processing, semantic network analysis, and automated content classification. Videos are classified as misinformation or non-misinformation based on the presence of demonstrably incorrect claims or corrective content. We compare network structure, diffusion patterns, and engagement dynamics across these two layers. The results reveal pronounced layer-specific differences. Misinformation diffuses more extensively within the channel network, which exhibits higher density and stronger cross-channel interconnectedness, suggesting that creator-level infrastructures function as stabilising conduits for the circulation of false claims. By contrast, diffusion pathways at the videoclip level show comparatively weaker differentiation between misinformation and non-misinformation content. Engagement patterns also diverge: misinformation videos attract significantly more likes, while message format and channel attributes are less consistently distinguishing.
From a theoretical standpoint, this study advances a multi-layer diffusion perspective on platform-mediated misinformation by demonstrating how platform architectures shape the visibility, persistence, and amplification of false claims. The findings highlight the importance of intervention strategies that move beyond individual content moderation toward creator- and network-level governance mechanisms, with implications for the design of platform features, recommendation systems, and misinformation mitigation tools. Full article
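
The shared-commenter ties and the network density mentioned in the abstract above can be illustrated with a small sketch. It assumes comment records reduced to hypothetical `(node_id, commenter_id)` pairs, where a node is a video or a channel, and it ignores the temporal dimension the study describes. It is not the authors' pipeline.

```python
from itertools import combinations


def shared_commenter_edges(comments):
    """Build undirected ties between nodes that share at least one commenter.

    comments: iterable of (node_id, commenter_id) pairs (hypothetical format).
    Returns (sorted node list, {(node_a, node_b): number of shared commenters}).
    """
    commenters = {}
    for node, user in comments:
        commenters.setdefault(node, set()).add(user)

    edges = {}
    for a, b in combinations(sorted(commenters), 2):
        shared = len(commenters[a] & commenters[b])
        if shared > 0:
            edges[(a, b)] = shared
    return sorted(commenters), edges


def density(n_nodes, n_edges):
    """Share of possible undirected ties that are actually present."""
    if n_nodes < 2:
        return 0.0
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# Toy data: v1 and v2 share two commenters, v3 is isolated.
comments = [("v1", "u1"), ("v1", "u2"), ("v2", "u1"), ("v2", "u2"), ("v3", "u3")]
nodes, edges = shared_commenter_edges(comments)
print(edges)
print(density(len(nodes), len(edges)))
```

Running the same function once with video identifiers and once with channel identifiers gives the two layers, whose densities can then be compared.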

18 pages, 1839 KB

Open Access Review

Deep Learning in Medical Speech to Text: Methods and Challenges

by Maciej Sztabinski and Pawel Weichbroth

Symmetry 2026, 18(6), 885; https://doi.org/10.3390/sym18060885 - 23 May 2026

Viewed by 368

Abstract

Automated clinical documentation based on clinician-patient conversations is an emerging application of deep learning, driven by advances in medical speech recognition and natural language processing. Despite technological progress, real-world adoption remains limited. This review analyzes deep learning–based medical speech-to-text systems, focusing on methodologies, evaluation strategies, and barriers to clinical implementation. A systematic review of 31 studies was conducted, covering automatic speech recognition, clinical dialogue processing, and large language model-based documentation pipelines. Speech recognition accuracy varies considerably in noisy, multi-speaker, and spontaneous clinical environments. Downstream tasks such as entity extraction and summarization are highly sensitive to transcription errors and constrained by limited real-world datasets. Most systems lack external clinical validation and are tested in controlled settings. Key challenges include speaker diarization, domain adaptation, privacy protection, and the need for standardized evaluation frameworks. Although LLMs demonstrate strong potential, concerns remain regarding hallucinations and factual reliability, necessitating improved robustness and clinician oversight. Full article

(This article belongs to the Special Issue Optimal Control and Symmetry: From Theoretical Foundations to Real-World Applications)

27 pages, 4438 KB

Open Access Article

DOM-MUSE: A Deformable Omnidirectional State Space Architecture for Efficient Speech Enhancement

by Tsung-Jung Li, Bo-Yu Su, Jung-Shan Lin and Jeih-Weih Hung

Electronics 2026, 15(10), 2159; https://doi.org/10.3390/electronics15102159 - 18 May 2026

Viewed by 268

Abstract

Transformer-based speech enhancement (SE) architectures suffer from high computational complexity, while existing lightweight state space model (SSM) approaches are constrained to fixed one-dimensional scanning that cannot fully exploit the two-dimensional time–frequency structure of speech spectrograms. To address these limitations, we propose DOM-MUSE, a lightweight U-Net-style SE framework built upon the Mamba-2 SSM with four targeted innovations. First, a Deformable Feature Extractor (DFE) predicts per location spatial offsets that warp the feature sampling grid to align with speech formant trajectories and harmonic structures, providing geometrically coherent inputs to the state space model. Second, a DOM Mamba Block with Cross-Dimensional Gated Fusion (CDGF) deploys two parallel Mamba-2 instances scanning the time and frequency axes independently, and uses Taylor Channel Attention (TCA) to derive semantic gates that modulate each SSM output before fusion. Third, a Phase-Guided Feature Conditioner (PGFC) computes local phase-gradient gates that suppress noise-dominated activations prior to the SSM stage, making the feature extraction pathway implicitly phase-aware. Fourth, an Attention-Based Skip Connection (ABSC) replaces conventional concatenation skip connections with a learned channel gate, adaptively controlling the information flow from the encoder to the decoder. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DOM-MUSE outperforms the reproduced MUSE baseline on all five evaluation metrics—including PESQ (+0.077), CSIG (+0.058), CBAK (+0.026), COVL (+0.070), and STOI (+0.002)—while reducing the parameter count by 24% (0.51 M to 0.39 M). 
Notably, DOM-MUSE also surpasses MUSE++ on perceptual quality metrics (PESQ +0.061, COVL +0.032) despite MUSE++ employing dynamic SNR augmentation and an augmented multi-objective loss that DOM-MUSE deliberately omits, demonstrating that the proposed architectural innovations yield genuine improvements independent of training strategy. When DOM-MUSE is additionally trained under the same augmented protocol as MUSE++, it achieves PESQ of 3.46 and COVL of 4.22, further confirming the complementary nature of architectural and training improvements. Full article

(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis, 2nd Edition)

21 pages, 5576 KB

Open Access Article

“Are You Okay, Honey?”: Recognizing Emotions Among Couples Managing Diabetes in Daily Life Using Multimodal Real-World Smartwatch Data

by George Boateng, Xiangyu Zhao, Malgorzata Speichert, Elgar Fleisch, Janina Lüscher, Theresa Pauly, Urte Scholz, Guy Bodenmann and Tobias Kowatsch

Sensors 2026, 26(10), 3141; https://doi.org/10.3390/s26103141 - 15 May 2026

Viewed by 524

Abstract

Couples generally manage chronic diseases together and the management takes an emotional toll on both patients and their romantic partners. Consequently, recognizing the emotions of each partner in daily life could provide insight into their emotional well-being in chronic disease management. Currently, the process of assessing each partner’s emotions is manual, time-intensive, and costly. Despite the existence of works on emotion recognition among couples, none of these works have used data collected from couples’ interactions in daily life. In this work, we collected 85 h (1021 5-min samples) of real-world multimodal smartwatch sensor data (speech, heart rate, accelerometer, and gyroscope) and self-reported emotion data (n = 612) from 26 partners (13 couples) managing diabetes mellitus type 2 in daily life. We extracted physiological, movement, acoustic, and linguistic features, and trained machine learning models (support vector machine and random forest) to recognize each partner’s self-reported emotions (valence and arousal). Our results from the best models—balanced accuracies of 63.8% and 78.1% for arousal and valence respectively—are better than the results from (1) chance, (2) prior work that also used data from German-speaking, Swiss-based couples, and (3) partners’ perceptions of each other’s emotions. This work contributes toward building automated emotion recognition systems that would eventually enable partners to monitor their emotions in daily life and enable the delivery of interventions to improve their emotional well-being. Full article

(This article belongs to the Special Issue Emotion Recognition Based on Sensors (3rd Edition))
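
The emotion recognition abstract above reports balanced accuracy, which is usually defined as the mean of per-class recall. The sketch below assumes that definition; the labels are made up for illustration and are not taken from the study's data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# A classifier that always predicts the majority class: plain accuracy
# is 0.8, but recall is 1.0 for class 1 and 0.0 for class 0.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]
print(balanced_accuracy(y_true, y_pred))
```

With imbalanced self-report labels, this metric makes always predicting the majority class score at chance level, which is why it is the more informative comparison against a chance baseline.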

36 pages, 10012 KB

Open Access Review

Long Short-Term Memory Networks Since Their Inception: Mapping 25 Years of Scientific Development via Bibliometric Analysis

by Subhashree Mohapatra, Jai Govind Singh, Subham Pankaj Samantaray and Manohar Mishra

Algorithms 2026, 19(5), 390; https://doi.org/10.3390/a19050390 - 14 May 2026

Viewed by 352

Abstract

In 1997, Long Short-Term Memory (LSTM) networks were proposed, which significantly changed the landscape of sequential data analysis by resolving the critical issue of the vanishing gradient problem in recurrent neural networks (RNNs). Over the last 25 years, LSTM has advanced from its inception as an innovative solution to its widespread adoption as an essential tool in various fields, including natural language processing (NLP), speech recognition, financial prediction, and healthcare analytics. The present study is a bibliometric review of the evolution of LSTMs. The evolution of LSTM is discussed in terms of its theoretical advancements, architectural developments, and its applications. The study is based on data obtained from the Scopus database, which is then analyzed to identify publication patterns, prominent authors, prominent institutions, and global contributions to the field. The present study is an insightful review of the evolution of LSTM, highlighting its developments and advancements, as well as its applications, to identify its future scope. Full article

26 pages, 21948 KB

Open AccessArticle

AI-Assisted Vision Alarming System for Blind and Vision-Impaired People

by Le Chung Tran, Sinh Khai Ly, Rhys Blacklidge, Jonathan Shemmell, Nathan Difford, Daniel Edward Cox and Theresa Harada

Sensors 2026, 26(10), 2929; https://doi.org/10.3390/s26102929 - 7 May 2026

Viewed by 910

Abstract

Navigating everyday environments, such as walking down a sidewalk, is something many people take for granted, yet it is a difficult task for millions of people with vision impairments because it involves sophisticated object detection, depth perception, and situational awareness, all working seamlessly to guide a person through complex surroundings. Many current assistive devices for vision-impaired people are expensive, overload the user with information, or omit critical information. This paper details our Vision Alarming System (VAS), which can improve safety for blind and vision-impaired people by providing awareness of both the positions and the nature of nearby obstacles, thereby helping users make decisions that avoid collisions and reduce accidents and casualties while enhancing their experience, independence, and confidence when participating in traffic. VAS is an Artificial Intelligence/Internet-of-Things (AI/IoT)-powered system built on a Raspberry Pi 5, a Light Detection and Ranging (LiDAR) sensor, and an AI depth camera, whose components run as separate containers in a Docker architecture on a Robot Operating System 2 (ROS 2) backbone. VAS communicates obstacle detections to users via a Bluetooth interface, using the neural Text-To-Speech (TTS) system Piper and the Sound eXchange (SoX) utility. Our proof-of-concept system shows that VAS can be a standalone, open-source, extremely low-cost, low-power assistive device that combines current AI/IoT technologies to provide blind and vision-impaired users with an appropriate amount of critical information about their surroundings. Full article

(This article belongs to the Special Issue IoT Technologies in Smart Cities: Challenges and Sensor Applications)

33 pages, 837 KB

Open AccessArticle

Acquiring the Pragmatics of a Heritage Language: A Case of Study Abroad Experience in Greece

by Jill C. Murray

Languages 2026, 11(5), 88; https://doi.org/10.3390/languages11050088 - 5 May 2026

Viewed by 572

Abstract

Throughout the English-speaking world, there are numerous Greek-speaking diaspora communities whose language is simultaneously influenced by English and local varieties of Greek. This study builds on the body of knowledge in cross-cultural and interlanguage pragmatics to explore a case of pragmatic acquisition in a study abroad context by one member of such a community. Data were collected from a third-generation young adult Greek Australian student prior to commencement of a 6-week Greek language programme in Athens, and on three other occasions. She described her experiences and responded to a set of scenarios involving Greek requests, refusals and apologies. The responses were analysed using established frameworks and subjectively evaluated for appropriateness by a matched Greek native speaker. The student showed evidence of a shift towards documented Standard Modern Greek pragmatic norms in some but not all speech acts, and change appeared to be loosely linked to opportunities for use. There was also some evidence of reversion to diaspora variants after her return. This study contributes to our understanding of the interaction between learning outcomes, individual learner variables, prior exposure, the nature of communicative events and levels of pragmatic awareness. It is argued that Greek and diaspora contexts involve subtly distinct pragmatic varieties of Greek and that learners can benefit from explicit awareness-raising regarding the nature of these differences. Full article

(This article belongs to the Special Issue Greek Speakers and Pragmatics)

Search Results (664)
