Search Results (1,080)

Search Parameters:
Keywords = speech features

17 pages, 1340 KiB  
Article
Enhanced Respiratory Sound Classification Using Deep Learning and Multi-Channel Auscultation
by Yeonkyeong Kim, Kyu Bom Kim, Ah Young Leem, Kyuseok Kim and Su Hwan Lee
J. Clin. Med. 2025, 14(15), 5437; https://doi.org/10.3390/jcm14155437 - 1 Aug 2025
Abstract
 Background/Objectives: Identifying and classifying abnormal lung sounds is essential for diagnosing patients with respiratory disorders. In particular, the simultaneous recording of auscultation signals from multiple clinically relevant positions offers greater diagnostic potential compared to traditional single-channel measurements. This study aims to improve the accuracy of respiratory sound classification by leveraging multichannel signals and capturing positional characteristics from multiple sites in the same patient. Methods: We evaluated the performance of respiratory sound classification using multichannel lung sound data with a deep learning model that combines a convolutional neural network (CNN) and long short-term memory (LSTM), based on mel-frequency cepstral coefficients (MFCCs). We analyzed the impact of the number and placement of channels on classification performance. Results: The results demonstrated that using four-channel recordings improved accuracy, sensitivity, specificity, precision, and F1-score by approximately 1.11, 1.15, 1.05, 1.08, and 1.13 times, respectively, compared to using three, two, or single-channel recordings. Conclusion: This study confirms that multichannel data capture a richer set of features corresponding to various respiratory sound characteristics, leading to significantly improved classification performance. The proposed method holds promise for enhancing sound classification accuracy not only in clinical applications but also in broader domains such as speech and audio processing.  Full article
(This article belongs to the Section Respiratory Medicine)
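As a rough illustration of the kind of pipeline this abstract describes (a sketch, not the authors' implementation), the snippet below stacks per-channel MFCCs and feeds them to a small CNN-LSTM classifier; the sampling rate, channel count, MFCC size, model dimensions, and class count are all assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
import librosa

def multichannel_mfcc(waveforms, sr=4000, n_mfcc=20):
    """Stack MFCCs from each auscultation channel into a (C, n_mfcc, T) array."""
    return np.stack([librosa.feature.mfcc(y=w, sr=sr, n_mfcc=n_mfcc) for w in waveforms])

class CNNLSTMClassifier(nn.Module):
    def __init__(self, n_channels=4, n_mfcc=20, n_classes=4, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(n_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),   # pool over the MFCC axis only, keep the time axis
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.lstm = nn.LSTM(input_size=32 * (n_mfcc // 4), hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                      # x: (batch, channels, n_mfcc, time)
        f = self.cnn(x)                        # (batch, 32, n_mfcc // 4, time)
        f = f.permute(0, 3, 1, 2).flatten(2)   # (batch, time, 32 * n_mfcc // 4)
        out, _ = self.lstm(f)
        return self.fc(out[:, -1])             # classify from the last time step

# Toy example: four synthetic 5-second channels (assumed classes: normal/crackle/wheeze/both).
waves = [np.random.randn(4000 * 5).astype(np.float32) for _ in range(4)]
x = torch.from_numpy(multichannel_mfcc(waves)).unsqueeze(0).float()
print(CNNLSTMClassifier()(x).shape)            # torch.Size([1, 4])
```
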
25 pages, 2082 KiB  
Article
XTTS-Based Data Augmentation for Profanity Keyword Recognition in Low-Resource Speech Scenarios
by Shin-Chi Lai, Yi-Chang Zhu, Szu-Ting Wang, Yen-Ching Chang, Ying-Hsiu Hung, Jhen-Kai Tang and Wen-Kai Tsai
Appl. Syst. Innov. 2025, 8(4), 108; https://doi.org/10.3390/asi8040108 - 31 Jul 2025
Abstract
As voice cloning technology rapidly advances, the risk of personal voices being misused by malicious actors for fraud or other illegal activities has significantly increased, making the collection of speech data increasingly challenging. To address this issue, this study proposes a data augmentation method based on XText-to-Speech (XTTS) synthesis to tackle the challenges of small-sample, multi-class speech recognition, using profanity as a case study to achieve high-accuracy keyword recognition. Two models were therefore evaluated: a CNN model (Proposed-I) and a CNN-Transformer hybrid model (Proposed-II). Proposed-I leverages local feature extraction, improving accuracy on a real human speech (RHS) test set from 55.35% without augmentation to 80.36% with XTTS-enhanced data. Proposed-II integrates CNN’s local feature extraction with Transformer’s long-range dependency modeling, further boosting test set accuracy to 88.90% while reducing the parameter count by approximately 41%, significantly enhancing computational efficiency. Compared to a previously proposed incremental architecture, the Proposed-II model achieves an 8.49% higher accuracy while reducing parameters by about 98.81% and MACs by about 98.97%, demonstrating exceptional resource efficiency. By utilizing XTTS and public corpora to generate a novel keyword speech dataset, this study enhances sample diversity and reduces reliance on large-scale original speech data. Experimental analysis reveals that an optimal synthetic-to-real speech ratio of 1:5 significantly improves the overall system accuracy, effectively addressing data scarcity. Additionally, the Proposed-I and Proposed-II models achieve accuracies of 97.54% and 98.66%, respectively, in distinguishing real from synthetic speech, demonstrating their strong potential for speech security and anti-spoofing applications. Full article
(This article belongs to the Special Issue Advancements in Deep Learning and Its Applications)
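The 1:5 synthetic-to-real ratio reported as optimal is easy to restate in code. The sketch below only assembles a training list at a chosen ratio from hypothetical file names; it is not tied to any particular TTS toolkit.

```python
import random

def mix_real_and_synthetic(real_items, synth_items, synth_to_real=1 / 5, seed=0):
    """Return a shuffled training list with roughly one synthetic clip per five real clips."""
    rng = random.Random(seed)
    n_synth = min(len(synth_items), round(len(real_items) * synth_to_real))
    mixed = list(real_items) + rng.sample(list(synth_items), n_synth)
    rng.shuffle(mixed)
    return mixed

# Hypothetical file names standing in for real recordings and XTTS-generated clips.
real = [f"real_{i}.wav" for i in range(500)]
synth = [f"xtts_{i}.wav" for i in range(300)]
train = mix_real_and_synthetic(real, synth)
print(len(train), sum(name.startswith("xtts") for name in train))   # 600 100
```
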
20 pages, 1536 KiB  
Article
Graph Convolution-Based Decoupling and Consistency-Driven Fusion for Multimodal Emotion Recognition
by Yingmin Deng, Chenyu Li, Yu Gu, He Zhang, Linsong Liu, Haixiang Lin, Shuang Wang and Hanlin Mo
Electronics 2025, 14(15), 3047; https://doi.org/10.3390/electronics14153047 - 30 Jul 2025
Abstract
Multimodal emotion recognition (MER) is essential for understanding human emotions from diverse sources such as speech, text, and video. However, modality heterogeneity and inconsistent expression pose challenges for effective feature fusion. To address this, we propose a novel MER framework combining a Dynamic Weighted Graph Convolutional Network (DW-GCN) for feature disentanglement and a Cross-Attention Consistency-Gated Fusion (CACG-Fusion) module for robust integration. DW-GCN models complex inter-modal relationships, enabling the extraction of both common and private features. The CACG-Fusion module subsequently enhances classification performance through dynamic alignment of cross-modal cues, employing attention-based coordination and consistency-preserving gating mechanisms to optimize feature integration. Experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that our method achieves state-of-the-art performance, significantly improving the ACC7, ACC2, and F1 scores. Full article
(This article belongs to the Section Computer Science & Engineering)
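A minimal sketch of cross-attention followed by a consistency-style gate, loosely inspired by the CACG-Fusion idea rather than reproducing the paper's module; the feature dimension, head count, and gating form are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionGatedFusion(nn.Module):
    """Fuse two modality sequences: each attends to the other, then a gate blends the results."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, a, b):                 # a, b: (batch, time, dim)
        a_att, _ = self.attn_ab(a, b, b)     # modality a attends to modality b
        b_att, _ = self.attn_ba(b, a, a)     # modality b attends to modality a
        g = self.gate(torch.cat([a_att, b_att], dim=-1))   # per-position blending weights
        return g * a_att + (1 - g) * b_att

fusion = CrossAttentionGatedFusion()
speech, text = torch.randn(2, 50, 128), torch.randn(2, 50, 128)
print(fusion(speech, text).shape)            # torch.Size([2, 50, 128])
```
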
23 pages, 3741 KiB  
Article
Multi-Corpus Benchmarking of CNN and LSTM Models for Speaker Gender and Age Profiling
by Jorge Jorrin-Coz, Mariko Nakano, Hector Perez-Meana and Leobardo Hernandez-Gonzalez
Computation 2025, 13(8), 177; https://doi.org/10.3390/computation13080177 - 23 Jul 2025
Abstract
Speaker profiling systems are often evaluated on a single corpus, which complicates reliable comparison. We present a fully reproducible evaluation pipeline that trains Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) models independently on three speech corpora representing distinct recording conditions—studio-quality TIMIT, crowdsourced Mozilla Common Voice, and in-the-wild VoxCeleb1. All models share the same architecture, optimizer, and data preprocessing; no corpus-specific hyperparameter tuning is applied. We perform a detailed preprocessing and feature extraction procedure, evaluating multiple configurations and validating their applicability and effectiveness in improving the obtained results. A feature analysis shows that Mel spectrograms benefit CNNs, whereas Mel Frequency Cepstral Coefficients (MFCCs) suit LSTMs, and that the optimal Mel-bin count grows with corpus signal-to-noise ratio (SNR). With this fixed recipe, EfficientNet achieves 99.82% gender accuracy on Common Voice (+1.25 pp over the previous best) and 98.86% on VoxCeleb1 (+0.57 pp). MobileNet attains 99.86% age-group accuracy on Common Voice (+2.86 pp) and a 5.35-year MAE for age estimation on TIMIT using a lightweight configuration. The consistent, near-state-of-the-art results across three acoustically diverse datasets substantiate the robustness and versatility of the proposed pipeline. Code and pre-trained weights are released to facilitate downstream research. Full article
(This article belongs to the Section Computational Engineering)
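The feature split highlighted by the analysis (Mel spectrograms for CNNs, MFCCs for LSTMs) can be sketched with librosa; the sample rate, Mel-bin count, and chirp test signal below are assumptions, and the paper's own finding is that the best Mel-bin count depends on corpus SNR.

```python
import numpy as np
import librosa

def cnn_and_lstm_features(y, sr, n_mels=64, n_mfcc=20):
    """Log-Mel spectrogram for a CNN front end, MFCCs for an LSTM front end."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)             # (n_mels, T), image-like input
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)     # (n_mfcc, T), sequence input
    return log_mel, mfcc

sr = 16000
y = librosa.chirp(fmin=80, fmax=6000, sr=sr, duration=2.0)     # stand-in for a speech clip
log_mel, mfcc = cnn_and_lstm_features(y, sr)
print(log_mel.shape, mfcc.shape)                                # e.g. (64, 63) (20, 63)
```
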
26 pages, 6051 KiB  
Article
A Novel Sound Coding Strategy for Cochlear Implants Based on Spectral Feature and Temporal Event Extraction
by Behnam Molaee-Ardekani, Rafael Attili Chiea, Yue Zhang, Julian Felding, Aswin Adris Wijetillake, Peter T. Johannesen, Enrique A. Lopez-Poveda and Manuel Segovia-Martínez
Technologies 2025, 13(8), 318; https://doi.org/10.3390/technologies13080318 - 23 Jul 2025
Abstract
This paper presents a novel cochlear implant (CI) sound coding strategy called Spectral Feature Extraction (SFE). The SFE is a novel Fast Fourier Transform (FFT)-based Continuous Interleaved Sampling (CIS) strategy that provides less-smeared spectral cues to CI patients compared to Crystalis, a predecessor strategy used in Oticon Medical devices. The study also explores how the SFE can be enhanced into a Temporal Fine Structure (TFS)-based strategy named Spectral Event Extraction (SEE), combining spectral sharpness with temporal cues. Background/Objectives: Many CI recipients understand speech in quiet settings but struggle with music and complex environments, increasing cognitive effort. De-smearing the power spectrum and extracting spectral peak features can reduce this load. The SFE targets feature extraction from spectral peaks, while the SEE enhances TFS-based coding by tracking these features across frames. Methods: The SFE strategy extracts spectral peaks and models them with synthetic pure tone spectra characterized by instantaneous frequency, phase, energy, and peak resemblance. This deblurs input peaks by estimating their center frequency. In SEE, synthetic peaks are tracked across frames to yield reliable temporal cues (e.g., zero-crossings) aligned with stimulation pulses. Strategy characteristics are analyzed using electrodograms. Results: A flexible Frequency Allocation Map (FAM) can be applied to both SFE and SEE strategies without being limited by FFT bandwidth constraints. Electrodograms of Crystalis and SFE strategies showed that SFE reduces spectral blurring and provides detailed temporal information of harmonics in speech and music. Conclusions: SFE and SEE are expected to enhance speech understanding, lower listening effort, and improve temporal feature coding. These strategies could benefit CI users, especially in challenging acoustic environments. Full article
(This article belongs to the Special Issue The Challenges and Prospects in Cochlear Implantation)
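As a toy analogue of picking spectral peaks and refining their center frequencies (in the spirit of, but not equivalent to, the SFE strategy), the sketch below finds the strongest FFT peaks of one frame and refines them by parabolic interpolation; frame length, window, and peak count are assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def spectral_peaks(frame, sr, n_peaks=8):
    """Return (frequency, magnitude) pairs for the strongest spectral peaks of one frame."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    bin_width = sr / len(frame)
    idx, _ = find_peaks(spec)
    idx = np.sort(idx[np.argsort(spec[idx])[::-1][:n_peaks]])   # keep the n_peaks largest
    peaks = []
    for k in idx:
        if 0 < k < len(spec) - 1:
            # Parabolic interpolation on the log magnitude refines the center frequency.
            a, b, c = np.log(spec[k - 1:k + 2] + 1e-12)
            delta = 0.5 * (a - c) / (a - 2 * b + c)
            peaks.append(((k + delta) * bin_width, spec[k]))
    return peaks

sr = 16000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1330 * t)
print([round(f) for f, _ in spectral_peaks(frame, sr, n_peaks=2)])   # approximately [440, 1330]
```
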
19 pages, 1711 KiB  
Article
TSDCA-BA: An Ultra-Lightweight Speech Enhancement Model for Real-Time Hearing Aids with Multi-Scale STFT Fusion
by Zujie Fan, Zikun Guo, Yanxing Lai and Jaesoo Kim
Appl. Sci. 2025, 15(15), 8183; https://doi.org/10.3390/app15158183 - 23 Jul 2025
Abstract
Lightweight speech denoising models have made remarkable progress in improving both speech quality and computational efficiency. However, most models rely on long temporal windows as input, limiting their applicability in low-latency, real-time scenarios on edge devices. To address this challenge, we propose a lightweight hybrid module, Temporal Statistics Enhancement, Squeeze-and-Excitation-based Dual Convolutional Attention, and Band-wise Attention (TSE, SDCA, BA) Module. The TSE module enhances single-frame spectral features by concatenating statistical descriptors—mean, standard deviation, maximum, and minimum—thereby capturing richer local information without relying on temporal context. The SDCA and BA module integrates a simplified residual structure and channel attention, while the BA component further strengthens the representation of critical frequency bands through band-wise partitioning and differentiated weighting. The proposed model requires only 0.22 million multiply–accumulate operations (MMACs) and contains a total of 112.3 K parameters, making it well suited for low-latency, real-time speech enhancement applications. Experimental results demonstrate that among lightweight models with fewer than 200K parameters, the proposed approach outperforms most existing methods in both denoising performance and computational efficiency, significantly reducing processing overhead. Furthermore, real-device deployment on an improved hearing aid confirms an inference latency as low as 2 milliseconds, validating its practical potential for real-time edge applications. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
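A guess at what concatenating per-frame statistical descriptors might look like in code; the tensor shapes are assumptions and this is not the authors' TSE module.

```python
import torch

def temporal_statistics_enhancement(frame_spec):
    """Append mean, std, max and min descriptors to each single-frame spectral feature."""
    # frame_spec: (batch, freq_bins), one STFT frame per item
    stats = torch.stack([frame_spec.mean(dim=1),
                         frame_spec.std(dim=1),
                         frame_spec.amax(dim=1),
                         frame_spec.amin(dim=1)], dim=1)         # (batch, 4)
    return torch.cat([frame_spec, stats], dim=1)                 # (batch, freq_bins + 4)

x = torch.rand(8, 257)                            # e.g. magnitudes of a 512-point STFT frame
print(temporal_statistics_enhancement(x).shape)   # torch.Size([8, 261])
```
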
10 pages, 857 KiB  
Proceeding Paper
Implementation of a Prototype-Based Parkinson’s Disease Detection System Using a RISC-V Processor
by Krishna Dharavathu, Pavan Kumar Sankula, Uma Maheswari Vullanki, Subhan Khan Mohammad, Sai Priya Kesapatnapu and Sameer Shaik
Eng. Proc. 2025, 87(1), 97; https://doi.org/10.3390/engproc2025087097 - 21 Jul 2025
Abstract
Parkinson’s disease (PD) has a high incidence among human diseases, according to a recent survey by the World Health Organization (WHO). According to WHO records, this chronic disease has affected approximately 10 million people worldwide. Patients who do not receive an early diagnosis may develop an incurable neurological disorder. PD is a degenerative disorder of the brain characterized by impairment of the nigrostriatal system and accompanied by a wide range of motor and non-motor symptoms. In this work, PD is detected from patients’ speech signals using a reduced instruction set computing, 5th version (RISC-V) processor. The RISC-V microcontroller unit (MCU) was designed for a voice-controlled human-machine interface (HMI). Signal processing and feature extraction methods are applied to the digital speech signal, which reflects the impairment of the nigrostriatal system, and classifier modules then label the signal as normal or abnormal to identify PD. We use Matrix Laboratory (MATLAB R2021a_v9.10.0.1602886) to analyze the data, develop algorithms, create modules, and develop the RISC-V processor for embedded implementation. Machine learning (ML) techniques are also used to extract features such as pitch, tremor, and Mel-frequency cepstral coefficients (MFCCs). Full article
(This article belongs to the Proceedings of The 5th International Electronic Conference on Applied Sciences)
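The feature extraction step can be sketched in Python with librosa (the authors work in MATLAB on a RISC-V target, so this is only an illustration); the pitch range and the synthetic test tone are assumptions.

```python
import numpy as np
import librosa

def voice_features(y, sr):
    """Pitch statistics and MFCC means from one recording, as a crude feature vector."""
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    feats = {"pitch_mean": float(np.nanmean(f0)),
             "pitch_std": float(np.nanstd(f0))}                # rough proxy for vocal instability
    feats.update({f"mfcc{i}_mean": float(m) for i, m in enumerate(mfcc.mean(axis=1))})
    return feats

sr = 16000
t = np.arange(2 * sr) / sr
y = 0.5 * np.sin(2 * np.pi * 120 * t)                          # stand-in for a sustained vowel at ~120 Hz
print(round(voice_features(y, sr)["pitch_mean"]))              # close to 120
```
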
16 pages, 317 KiB  
Perspective
Listening to the Mind: Integrating Vocal Biomarkers into Digital Health
by Irene Rodrigo and Jon Andoni Duñabeitia
Brain Sci. 2025, 15(7), 762; https://doi.org/10.3390/brainsci15070762 - 18 Jul 2025
Abstract
The human voice is an invaluable tool for communication, carrying information about a speaker’s emotional state and cognitive health. Recent research highlights the potential of acoustic biomarkers to detect early signs of mental health and neurodegenerative conditions. Despite their promise, vocal biomarkers remain underutilized in clinical settings, with limited standardized protocols for assessment. This Perspective article argues for the integration of acoustic biomarkers into digital health solutions to improve the detection and monitoring of cognitive impairment and emotional disturbances. Advances in speech analysis and machine learning have demonstrated the feasibility of using voice features such as pitch, jitter, shimmer, and speech rate to assess these conditions. Moreover, we propose that singing, particularly simple melodic structures, could be an effective and accessible means of gathering vocal biomarkers, offering additional insights into cognitive and emotional states. Given its potential to engage multiple neural networks, singing could function as an assessment tool and an intervention strategy for individuals with cognitive decline. We highlight the necessity of further research to establish robust, reproducible methodologies for analyzing vocal biomarkers and standardizing voice-based diagnostic approaches. By integrating vocal analysis into routine health assessments, clinicians and researchers could significantly advance early detection and personalized interventions for cognitive and emotional disorders. Full article
(This article belongs to the Topic Language: From Hearing to Speech and Writing)
49 pages, 3444 KiB  
Article
A Design-Based Research Approach to Streamline the Integration of High-Tech Assistive Technologies in Speech and Language Therapy
by Anna Lekova, Paulina Tsvetkova, Anna Andreeva, Georgi Dimitrov, Tanio Tanev, Miglena Simonska, Tsvetelin Stefanov, Vaska Stancheva-Popkostadinova, Gergana Padareva, Katia Rasheva, Adelina Kremenska and Detelina Vitanova
Technologies 2025, 13(7), 306; https://doi.org/10.3390/technologies13070306 - 16 Jul 2025
Abstract
Currently, high-tech assistive technologies (ATs), particularly Socially Assistive Robots (SARs), virtual reality (VR) and conversational AI (ConvAI), are considered very useful in supporting professionals in Speech and Language Therapy (SLT) for children with communication disorders. However, despite a positive public perception, therapists face difficulties when integrating these technologies into practice due to technical challenges and a lack of user-friendly interfaces. To address this gap, a design-based research approach has been employed to streamline the integration of SARs, VR and ConvAI in SLT, and a new software platform called “ATLog” has been developed for designing interactive and playful learning scenarios with ATs. ATLog’s main features include visual-based programming with graphical interface, enabling therapists to intuitively create personalized interactive scenarios without advanced programming skills. The platform follows a subprocess-oriented design, breaking down SAR skills and VR scenarios into microskills represented by pre-programmed graphical blocks, tailored to specific treatment domains, therapy goals, and language skill levels. The ATLog platform was evaluated by 27 SLT experts using the Technology Acceptance Model (TAM) and System Usability Scale (SUS) questionnaires, extended with additional questions specifically focused on ATLog structure and functionalities. According to the SUS results, most of the experts (74%) evaluated ATLog with grades over 70, indicating high acceptance of its usability. Over half (52%) of the experts rated the additional questions focused on ATLog’s structure and functionalities in the A range (90–100), while 26% rated them in the B range (80–89), showing strong acceptance of the platform for creating and running personalized interactive scenarios with ATs. According to the TAM results, experts gave high grades for both perceived usefulness (44% in the A range) and perceived ease of use (63% in the A range). Full article
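For reference, the SUS grades quoted above come from the standard scoring rule, which is short enough to restate; the response vector in the example is invented.

```python
def sus_score(responses):
    """Standard System Usability Scale scoring: ten 1-5 Likert items mapped to 0-100."""
    assert len(responses) == 10
    odd = sum(r - 1 for r in responses[0::2])     # positively worded items (1, 3, 5, 7, 9)
    even = sum(5 - r for r in responses[1::2])    # negatively worded items (2, 4, 6, 8, 10)
    return (odd + even) * 2.5

print(sus_score([5, 1, 5, 2, 4, 1, 5, 1, 4, 2]))  # 90.0
```
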
18 pages, 957 KiB  
Article
CHTopo: A Multi-Source Large-Scale Chinese Toponym Annotation Corpus
by Peng Ye, Yujin Jiang and Yadi Wang
Information 2025, 16(7), 610; https://doi.org/10.3390/info16070610 - 16 Jul 2025
Abstract
Toponyms are fundamental geographical resources characterized by their spatial attributes, distinct from general nouns. While natural language provides rich toponymic data beyond traditional surveying methods, its qualitative ambiguity and inherent uncertainty challenge systematic extraction. Traditional toponym recognition methods based on part-of-speech tagging only focus on the surface-level features of words, failing to effectively handle complex scenarios such as alias nesting, metonymy ambiguity, and mixed punctuation. This leads to the loss of toponym semantic integrity and deviations in geographic entity recognition. This study proposes a set of Chinese toponym annotation specifications that integrate spatial semantics. By leveraging the XML markup language, it deeply combines the spatial location characteristics of toponyms with linguistic features, and designs fine-grained annotation rules to address the limitations of traditional methods in semantic integrity and geographic entity recognition. On this basis, by integrating multi-source corpora from the Encyclopedia of China: Chinese Geography and People’s Daily, a large-scale Chinese toponym annotation corpus (CHTopo) covering five major categories of toponyms has been constructed. The performance of this annotated corpus was evaluated through toponym recognition, exploring the construction methods of a large-scale, diversified, and high-coverage Chinese toponym annotated corpus from the perspectives of applicability and practicality. CHTopo is conducive to providing foundational support for geographic information extraction, spatial knowledge graphs, and geoparsing research, bridging linguistic and geospatial intelligence. Full article
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)
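To make the idea of XML toponym markup concrete, here is a tiny illustrative annotation built with ElementTree; the element and attribute names are hypothetical and do not reflect CHTopo's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical tags and attributes; CHTopo's real annotation scheme is not shown in the abstract.
sentence = ET.Element("s")
sentence.text = "The train departs from "
city = ET.SubElement(sentence, "toponym", {"category": "city", "lat": "30.57", "lon": "114.31"})
city.text = "Wuhan"
city.tail = " and follows the "
river = ET.SubElement(sentence, "toponym", {"category": "river"})
river.text = "Yangtze River"
river.tail = " eastward."
print(ET.tostring(sentence, encoding="unicode"))
```
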
27 pages, 1817 KiB  
Article
A Large Language Model-Based Approach for Multilingual Hate Speech Detection on Social Media
by Muhammad Usman, Muhammad Ahmad, Grigori Sidorov, Irina Gelbukh and Rolando Quintero Tellez
Computers 2025, 14(7), 279; https://doi.org/10.3390/computers14070279 - 15 Jul 2025
Abstract
The proliferation of hate speech on social media platforms poses significant threats to digital safety, social cohesion, and freedom of expression. Detecting such content—especially across diverse languages—remains a challenging task due to linguistic complexity, cultural context, and resource limitations. To address these challenges, this study introduces a comprehensive approach for multilingual hate speech detection. To facilitate robust hate speech detection across diverse languages, this study makes several key contributions. First, we created a novel trilingual hate speech dataset consisting of 10,193 manually annotated tweets in English, Spanish, and Urdu. Second, we applied two innovative techniques—joint multilingual and translation-based approaches—for cross-lingual hate speech detection that have not been previously explored for these languages. Third, we developed detailed hate speech annotation guidelines tailored specifically to all three languages to ensure consistent and high-quality labeling. Finally, we conducted 41 experiments employing machine learning models with TF–IDF features, deep learning models utilizing FastText and GloVe embeddings, and transformer-based models leveraging advanced contextual embeddings to comprehensively evaluate our approach. Additionally, we employed a large language model with advanced contextual embeddings to identify the best solution for the hate speech detection task. The experimental results showed that our GPT-3.5-turbo model significantly outperforms strong baselines, achieving up to an 8% improvement over XLM-R in Urdu hate speech detection and an average gain of 4% across all three languages. This research not only contributes a high-quality multilingual dataset but also offers a scalable and inclusive framework for hate speech detection in underrepresented languages. Full article
(This article belongs to the Special Issue Recent Advances in Social Networks and Social Media)
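A minimal sketch of the TF-IDF machine-learning branch of such a pipeline (the study's strongest results come from GPT-3.5-turbo, which is not shown here); the four toy texts and their labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus; the study itself uses 10,193 annotated tweets in English, Spanish, and Urdu.
texts = ["I respect everyone here", "you people are worthless",
         "what a lovely day", "those people should disappear"]
labels = [0, 1, 0, 1]          # 1 = hateful, 0 = not hateful

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["you are all worthless people"]))
```
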
19 pages, 1039 KiB  
Article
Prediction of Parkinson Disease Using Long-Term, Short-Term Acoustic Features Based on Machine Learning
by Mehdi Rashidi, Serena Arima, Andrea Claudio Stetco, Chiara Coppola, Debora Musarò, Marco Greco, Marina Damato, Filomena My, Angela Lupo, Marta Lorenzo, Antonio Danieli, Giuseppe Maruccio, Alberto Argentiero, Andrea Buccoliero, Marcello Dorian Donzella and Michele Maffia
Brain Sci. 2025, 15(7), 739; https://doi.org/10.3390/brainsci15070739 - 10 Jul 2025
Abstract
Background: Parkinson’s disease (PD) is the second most common neurodegenerative disorder after Alzheimer’s disease, affecting countless individuals worldwide. PD is characterized by the onset of a marked motor symptomatology in association with several non-motor manifestations. The clinical phase of the disease is usually preceded by a long prodromal phase, devoid of overt motor symptomatology but often showing some conditions such as sleep disturbance, constipation, anosmia, and phonatory changes. To date, speech analysis appears to be a promising digital biomarker, able to anticipate clinical PD by as much as 10 years before onset, as well as serving as a useful prognostic tool for patient follow-up. For this reason, the voice can be regarded as a non-invasive means of distinguishing PD patients from healthy subjects (HS). Methods: Our study was a cross-sectional analysis of voice impairment. A dataset comprising 81 voice samples (41 from healthy individuals and 40 from PD patients) was utilized to train and evaluate common machine learning (ML) models using various types of features, including long-term (jitter, shimmer, and cepstral peak prominence (CPP)), short-term features (Mel-frequency cepstral coefficient (MFCC)), and non-standard measurements (pitch period entropy (PPE) and recurrence period density entropy (RPDE)). The study adopted multiple ML algorithms, including random forest (RF), K-nearest neighbors (KNN), decision tree (DT), naïve Bayes (NB), support vector machines (SVM), and logistic regression (LR). A cross-validation technique was applied to ensure the reliability of performance metrics on train and test subsets. These metrics (accuracy, recall, and precision) help determine the most effective models for distinguishing PD from healthy subjects. Results: Among all the algorithms used in this research, random forest (RF) was the best-performing model, achieving an accuracy of 82.72% with a ROC-AUC score of 89.65%. Although other models, such as support vector machine (SVM), could be considered with an accuracy of 75.29% and a ROC-AUC score of 82.63%, RF was by far the best one when evaluated across all metrics. The K-nearest neighbor (KNN) and decision tree (DT) performed the worst. Notably, by combining a comprehensive set of long-term, short-term, and non-standard acoustic features, unlike previous studies that typically focused on only a subset, our study achieved higher predictive performance, offering a more robust model for early PD detection. Conclusions: This study highlights the potential of combining advanced acoustic analysis with ML algorithms to develop non-invasive and reliable tools for early PD detection, offering substantial benefits for the healthcare sector. Full article
(This article belongs to the Section Neurodegenerative Diseases)
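The Methods describe a cross-validated comparison of several classifiers over acoustic features; a scikit-learn sketch of that loop is shown below, with a random placeholder matrix standing in for the jitter/shimmer/CPP/MFCC features (sizes and hyperparameters are assumptions).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Placeholder feature matrix: rows = recordings, columns = acoustic features
# (e.g. jitter, shimmer, CPP, MFCC summaries); labels 1 = PD, 0 = healthy.
rng = np.random.default_rng(0)
X = rng.normal(size=(81, 30))
y = rng.integers(0, 2, size=81)

models = {
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```
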
19 pages, 1271 KiB  
Article
Reformulation in Early 20th Century Substandard Italian
by Giulio Scivoletto
Languages 2025, 10(7), 165; https://doi.org/10.3390/languages10070165 - 3 Jul 2025
Abstract
This study investigates reformulation in a substandard variety of Italian, italiano popolare, from the early 20th Century, focusing on a collection of letters and postcards from semi-literate Sicilian peasants during World War I. The analysis identifies three reformulation markers: cioè, anzi, and vuol dire. These markers are affected by hypercorrection, interference, and structural simplification, reflecting the sociolinguistic dynamics of italiano popolare. Additionally, the study of these markers sheds light on the relationships between reformulation and related discourse functions, namely paraphrase, correction, addition, and motivation. By positioning occurrences of reformulation along a continuum between the spoken and written mode, the findings suggest that this discourse function is employed more as a rhetorical strategy that characterizes planned written texts, rather than as a feature of disfluency that is typical of unplanned speech. Ultimately, examining reformulation in italiano popolare provides valuable insights into the relationship between sociolinguistic variation and language change in the beginning of the 20th Century, a key phase in the spread of Italian as a national language. Full article
(This article belongs to the Special Issue Pragmatic Diachronic Study of the 20th Century)
27 pages, 715 KiB  
Article
Developing Comprehensive e-Game Design Guidelines to Support Children with Language Delay: A Step-by-Step Approach with Initial Validation
by Noha Badkook, Doaa Sinnari and Abeer Almakky
Multimodal Technol. Interact. 2025, 9(7), 68; https://doi.org/10.3390/mti9070068 - 3 Jul 2025
Abstract
e-Games have become increasingly important in supporting the development of children with language delays. However, most existing educational games were not designed using usability guidelines tailored to the specific needs of this group. While various general and game-specific guidelines exist, they often have limitations. Some are too broad, others only address limited features of e-Games, and many fail to consider needs relevant to children with speech and language challenges. Therefore, this paper introduced a new collection of usability guidelines, called eGLD (e-Game for Language Delay), specifically designed for evaluating and improving educational games for children with language delays. The guidelines were created based on Quinones et al.’s methodology, which involves seven stages from the exploratory phase to the refining phase. eGLD consists of 19 guidelines and 131 checklist items that are user-friendly and applicable, addressing diverse features of e-Games for treating language delay in children. To conduct the first validation of eGLD, an experiment was carried out on two popular e-Games, “MITA” and “Speech Blubs”, by comparing the usability issues identified using eGLD with those identified by Nielsen and GUESS (Game User Experience Satisfaction Scale) guidelines. The experiment revealed that eGLD detected a greater number of usability issues, including critical ones, demonstrating its potential effectiveness in assessing and enhancing the usability of e-Games for children with language delay. Based on this validation, the guidelines were refined, and a second round of validation is planned to further ensure their reliability and applicability. Full article
(This article belongs to the Special Issue Video Games: Learning, Emotions, and Motivation)
22 pages, 4293 KiB  
Article
Speech-Based Parkinson’s Detection Using Pre-Trained Self-Supervised Automatic Speech Recognition (ASR) Models and Supervised Contrastive Learning
by Hadi Sedigh Malekroodi, Nuwan Madusanka, Byeong-il Lee and Myunggi Yi
Bioengineering 2025, 12(7), 728; https://doi.org/10.3390/bioengineering12070728 - 1 Jul 2025
Abstract
Diagnosing Parkinson’s disease (PD) through speech analysis is a promising area of research, as speech impairments are often one of the early signs of the disease. This study investigates the efficacy of fine-tuning pre-trained Automatic Speech Recognition (ASR) models, specifically Wav2Vec 2.0 and HuBERT, for PD detection using transfer learning. These models, pre-trained on large unlabeled datasets, are capable of learning rich speech representations that capture acoustic markers of PD. The study also proposes the integration of a supervised contrastive (SupCon) learning approach to enhance the models’ ability to distinguish PD-specific features. Additionally, the proposed ASR-based features were compared against two common acoustic feature sets: mel-frequency cepstral coefficients (MFCCs) and the extended Geneva minimalistic acoustic parameter set (eGeMAPS) as a baseline. We also employed a gradient-based method, Grad-CAM, to visualize important speech regions contributing to the models’ predictions. The experiments, conducted using the NeuroVoz dataset, demonstrated that features extracted from the pre-trained ASR models exhibited superior performance compared to the baseline features. The results also reveal that the method integrating SupCon consistently outperforms traditional cross-entropy (CE)-based models. Wav2Vec 2.0 and HuBERT with SupCon achieved the highest F1 scores of 90.0% and 88.99%, respectively. Additionally, their AUC scores in the ROC analysis surpassed those of the CE models, which had comparatively lower AUCs, ranging from 0.84 to 0.89. These results highlight the potential of ASR-based models as scalable, non-invasive tools for diagnosing and monitoring PD, offering a promising avenue for the early detection and management of this debilitating condition. Full article
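A compact sketch of a supervised contrastive (SupCon) loss over pooled utterance embeddings, following the general SupCon formulation rather than the paper's exact implementation; batch size, embedding size, and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss: same-label pairs are pulled together, others pushed apart."""
    z = F.normalize(embeddings, dim=1)                        # (N, D) unit-norm embeddings
    logits = z @ z.T / temperature                            # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    logits = logits.masked_fill(self_mask, float("-inf"))     # never contrast a sample with itself
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    mean_pos = (log_prob.masked_fill(~pos_mask, 0.0).sum(1)
                / pos_mask.sum(1).clamp(min=1))               # average log-probability of positives
    return -(mean_pos[pos_mask.any(1)]).mean()                # skip anchors with no positive pair

features = torch.randn(8, 128, requires_grad=True)            # e.g. pooled Wav2Vec 2.0 / HuBERT embeddings
labels = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1])               # 1 = PD, 0 = healthy control
loss = supcon_loss(features, labels)
loss.backward()
print(float(loss))
```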