Search Results (30)

Search Parameters:
Keywords = phoneme classification

12 pages, 3240 KiB  
Article
AI-Driven Data Analysis for Asthma Risk Prediction
by Meng-Han Chen, Guanling Lee and Lun-Ping Hung
Healthcare 2025, 13(7), 774; https://doi.org/10.3390/healthcare13070774 - 31 Mar 2025
Cited by 1 | Viewed by 901
Abstract
Background: Asthma is a well-known otolaryngological and immunological disorder that affects patients worldwide. Currently, the primary diagnosis relies on a combination of clinical history, physical examination findings consistent with asthma, and objective evidence of reversible airflow obstruction. However, the diagnostic process can be invasive and time-consuming, which limits clinical efficiency and accessibility. Objectives: In this study, an AI-based prediction system was developed, leveraging voice changes caused by respiratory contraction due to asthma to create a machine learning (ML)-based clinical decision support system. Methods: A total of 1500 speech samples—comprising high-pitch, normal-pitch, and low-pitch recitations of the phonemes [i, a, u]—were used. Long-Term Average Spectrum (LTAS) and Single-Frequency Filtering Cepstral Coefficients (SFCCs) were extracted as features for classification. Seven machine learning algorithms were employed to assess the feasibility of asthma prediction. Results: The Decision Tree, CNN, and LSTM models achieved average accuracies above 0.8, with results of 0.88, 0.80, and 0.84, respectively. Observational results indicate that the Decision Tree model performed best for high-pitch phonemes, whereas the LSTM model outperformed others in normal-pitch and low-pitch phonemes. Additionally, to validate model efficiency and enhance interpretability, feature importance analysis and overall average spectral analysis were applied. Conclusions: This study aims to provide medical clinicians with accurate and reliable decision-making support, improving the efficiency of asthma diagnosis through AI-driven acoustic analysis. Full article
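
A minimal sketch of the kind of pipeline the abstract describes, assuming WAV recordings of the recited phonemes and illustrative LTAS settings: average the magnitude spectrum of each recording into coarse bands and train a Decision Tree on the resulting vectors. The SFCC features and the CNN/LSTM models reported in the paper are not reproduced here.

# Minimal sketch: LTAS-style features plus a Decision Tree classifier.
# File paths, labels, band count, and window settings are illustrative assumptions.
import numpy as np
import librosa
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def ltas_features(path, sr=16000, n_fft=1024, hop=256, n_bands=40):
    """Long-Term Average Spectrum pooled into coarse bands (one vector per recording)."""
    y, _ = librosa.load(path, sr=sr)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))   # (freq, time)
    ltas_db = 20 * np.log10(spec.mean(axis=1) + 1e-10)            # average over time, in dB
    return np.array([band.mean() for band in np.array_split(ltas_db, n_bands)])

# paths: recordings of [i, a, u] at three pitches; labels: 1 = asthma, 0 = control (assumed).
paths, labels = ["phoneme_a_high_001.wav"], [1]                    # placeholder file list
X = np.stack([ltas_features(p) for p in paths])
y = np.array(labels)
clf = DecisionTreeClassifier(max_depth=5, random_state=0)
# scores = cross_val_score(clf, X, y, cv=5)                        # run once a real dataset is in place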

24 pages, 4425 KiB  
Brief Report
Transcranial Magnetic Stimulation Facilitates Neural Speech Decoding
by Lindy Comstock, Vinícius Rezende Carvalho, Claudia Lainscsek, Aria Fallah and Terrence J. Sejnowski
Brain Sci. 2024, 14(9), 895; https://doi.org/10.3390/brainsci14090895 - 2 Sep 2024
Cited by 1 | Viewed by 1615
Abstract
Transcranial magnetic stimulation (TMS) has been widely used to study the mechanisms that underlie motor output. Yet, the extent to which TMS acts upon the cortical neurons implicated in volitional motor commands and the focal limitations of TMS remain subject to debate. Previous research links TMS to improved subject performance in behavioral tasks, including a bias in phoneme discrimination. Our study replicates this result, which implies a causal relationship between electro-magnetic stimulation and psychomotor activity, and tests whether TMS-facilitated psychomotor activity recorded via electroencephalography (EEG) may thus serve as a superior input for neural decoding. First, we illustrate that site-specific TMS elicits a double dissociation in discrimination ability for two phoneme categories. Next, we perform a classification analysis on the EEG signals recorded during TMS and find a dissociation between the stimulation site and decoding accuracy that parallels the behavioral results. We observe weak to moderate evidence for the alternative hypothesis in a Bayesian analysis of group means, with more robust results upon stimulation to a brain region governing multiple phoneme features. Overall, task accuracy was a significant predictor of decoding accuracy for phoneme categories (F(1,135) = 11.51, p < 0.0009) and individual phonemes (F(1,119) = 13.56, p < 0.0003), providing new evidence for a causal link between TMS, neural function, and behavior. Full article
(This article belongs to the Special Issue Language, Communication and the Brain)

10 pages, 585 KiB  
Technical Note
Text-Independent Phone-to-Audio Alignment Leveraging SSL (TIPAA-SSL) Pre-Trained Model Latent Representation and Knowledge Transfer
by Noé Tits, Prernna Bhatnagar and Thierry Dutoit
Acoustics 2024, 6(3), 772-781; https://doi.org/10.3390/acoustics6030042 - 29 Aug 2024
Cited by 1 | Viewed by 1870
Abstract
In this paper, we present a novel approach for text-independent phone-to-audio alignment based on phoneme recognition, representation learning and knowledge transfer. Our method leverages a self-supervised model (Wav2Vec2) fine-tuned for phoneme recognition using a Connectionist Temporal Classification (CTC) loss, a dimension reduction model and a frame-level phoneme classifier trained using forced-alignment labels (using Montreal Forced Aligner) to produce multi-lingual phonetic representations, thus requiring minimal additional training. We evaluate our model using synthetic native data from the TIMIT dataset and the SCRIBE dataset for American and British English, respectively. Our proposed model outperforms the state-of-the-art (charsiu) in statistical metrics and has applications in language learning and speech processing systems. We leave experiments on other languages for future work but the design of the system makes it easily adaptable to other languages. Full article
(This article belongs to the Special Issue Developments in Acoustic Phonetic Research)
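
A rough sketch of the first stage described above: frame-level phoneme posteriors from a Wav2Vec2 model fine-tuned with a CTC loss. The checkpoint name is only an example of a public phoneme-recognition model; the paper's dimension-reduction and frame-level classifier stages are not reproduced.

# Sketch: per-frame phoneme posteriors from a CTC-fine-tuned Wav2Vec2 model.
# The checkpoint is an example; downstream alignment steps are omitted.
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

CKPT = "facebook/wav2vec2-lv-60-espeak-cv-ft"        # example public phoneme-recognition checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
model = Wav2Vec2ForCTC.from_pretrained(CKPT).eval()

def frame_posteriors(wav_path):
    """Return a (frames, vocab) matrix of phoneme probabilities at roughly 20 ms resolution."""
    wav, _ = librosa.load(wav_path, sr=16000)
    inputs = extractor(wav, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits    # (1, frames, vocab)
    return torch.softmax(logits, dim=-1).squeeze(0)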

17 pages, 3932 KiB  
Article
Wireless Mouth Motion Recognition System Based on EEG-EMG Sensors for Severe Speech Impairments
by Kee S. Moon, John S. Kang, Sung Q. Lee, Jeff Thompson and Nicholas Satterlee
Sensors 2024, 24(13), 4125; https://doi.org/10.3390/s24134125 - 25 Jun 2024
Cited by 4 | Viewed by 2805
Abstract
This study aims to demonstrate the feasibility of using a new wireless electroencephalography (EEG)–electromyography (EMG) wearable approach to generate characteristic EEG-EMG mixed patterns with mouth movements in order to detect distinct movement patterns for severe speech impairments. This paper describes a method for detecting mouth movement based on a new signal processing technology suitable for sensor integration and machine learning applications. It examines the relationship between mouth motion and brainwaves in an effort to develop nonverbal interfacing for people who have lost the ability to communicate, such as people with paralysis. A set of experiments was conducted to assess the efficacy of the proposed method for feature selection, and the classification of mouth movements was determined to be meaningful. EEG-EMG signals were also collected during silent mouthing of phonemes, and a few-shot neural network was trained to classify the phonemes from these signals, yielding a classification accuracy of 95%. This technique for collecting and processing bioelectrical signals for phoneme recognition is a promising avenue for future communication aids. Full article
(This article belongs to the Special Issue Advances in Mobile Sensing for Smart Healthcare)
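
The abstract does not detail the few-shot architecture; as a generic stand-in, the sketch below classifies a new EEG-EMG feature vector by the nearest class prototype computed from a handful of labelled support examples (a prototypical-network-style rule). Feature dimensionality and class names are invented.

# Generic nearest-prototype (few-shot) classification of EEG-EMG feature vectors.
# Shapes and labels are illustrative, not the paper's setup.
import numpy as np

def nearest_prototype(support_feats, support_labels, query_feat):
    """support_feats: (N, D) labelled examples; query_feat: (D,). Returns the predicted label."""
    classes = np.unique(support_labels)
    prototypes = np.stack([support_feats[support_labels == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(prototypes - query_feat, axis=1))]

rng = np.random.default_rng(0)
support = rng.normal(size=(10, 32))                   # five examples each of two mouthed phonemes
labels = np.array(["pa"] * 5 + ["ta"] * 5)
print(nearest_prototype(support, labels, rng.normal(size=32)))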

15 pages, 987 KiB  
Article
Differentiation between Normal Cognition and Subjective Cognitive Decline in Older Adults Using Discrepancy Scores Derived from Neuropsychological Tests
by Ramón López-Higes, Susana Rubio-Valdehita, Sara M. Fernandes and Pedro F. S. Rodrigues
Geriatrics 2024, 9(3), 83; https://doi.org/10.3390/geriatrics9030083 - 19 Jun 2024
Cited by 3 | Viewed by 1996
Abstract
Several studies have reported subtle differences in cognition between individuals with subjective cognitive decline (SCD) and those with normal cognition. This study aimed to (i) identify these differences using discrepancy scores (e.g., categorial–phonemic verbal fluency performance) derived from neuropsychological tests in three cognitive domains (memory: Wechsler’s Word List and Digits; executive functions: Stroop and verbal fluency; and language: BNT and ECCO_Senior) and (ii) determine which discrepancy scores are significant for classification. Seventy-five older adults were included: 32 who were labeled SCD+ (age 71.50 ± 5.29), meeting Jessen et al.’s criteria, and 43 in the normal cognition group (SCD−; age 69.81 ± 4.62). Both groups completed a protocol including screening and the specified neuropsychological tests. No differences were found between the groups in their age, education, episodic memory, global cognitive state, or mood. Significant differences between the groups were observed regarding the discrepancy scores derived from BNT (naming) and ECCO_Senior (sentence comprehension). These scores accurately classified participants (71.6%), with ECCO_Senior having a primary role. ROC curves indicated a poor-to-fair model quality or diagnostic accuracy (AUC_BNT = 0.690; AUC_ECCO = 0.722). In conclusion, discrepancy scores in the language domain are important for distinguishing between individuals with SCD and normal cognition, complementing previous findings in this domain. However, given their relatively poor diagnostic accuracy, they should be used with caution as part of a more detailed neuropsychological assessment. Full article
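
As a schematic of how a discrepancy score can be built and evaluated, the sketch below z-scores two test measures, takes their difference, and reports a ROC AUC for group separation; the variable names and simulated values are assumptions, not the study's scoring procedure.

# Schematic discrepancy score plus ROC analysis on simulated data (not the study's procedure).
import numpy as np
from scipy.stats import zscore
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
naming = rng.normal(50, 10, 75)           # simulated naming scores (BNT-like)
comprehension = rng.normal(30, 5, 75)     # simulated sentence-comprehension scores (ECCO-like)
group = rng.integers(0, 2, 75)            # simulated labels: 1 = SCD+, 0 = SCD-

discrepancy = zscore(naming) - zscore(comprehension)    # difference of standardized scores
print("AUC:", roc_auc_score(group, discrepancy))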

9 pages, 909 KiB  
Article
Hearing and Language Skills in Children Using Hearing Aids: Experimental Intervention Study
by Luana Speck Polli Burigo, Anna Quialheiro, Karina Mary de Paiva, Thaiana Vargas dos Santos, Luciele Kauana Woide, Luciana Berwanger Cigana, Janaina Massignani and Patricia Haas
J. Pers. Med. 2024, 14(4), 372; https://doi.org/10.3390/jpm14040372 - 30 Mar 2024
Viewed by 2140
Abstract
Introduction: Hearing loss in childhood compromises a child’s auditory, linguistic, and social skill development. Stimulation and early intervention through therapy and the use of personal sound amplification devices (PSAPs) are important for improving communication. Purpose: To verify the effectiveness of speech therapy intervention on the auditory and linguistic skills of Brazilian children aged between 6 and 8 years using PSAPs. Methods: Experimental study analyzing the intervention process in children aged between 6 and 8 years with mild to severe bilateral hearing loss and prelingual deafness who are PSAP users. Diagnostic information was analyzed, and assessments and interventions were carried out using the Glendonald Auditory Screening Procedure (GASP), a phoneme discrimination test with figures (TFDF), an expressive language category classification test, and an Infant-Toddler Meaningful Auditory Integration Scale (IT-MAIS) questionnaire. Results: Sixteen children participated in the study; they were divided into a control group (CG) of six children and an intervention group (IG) of ten children. All research subjects underwent two protocol application sessions, and the IG underwent six speech therapy intervention sessions. In the IT-MAIS, the CG had a 9% increase in score, and the IG had an increase of 3% after intervention. The TFDF obtained a 5% increase in the IG in terms of phonemic discrimination ability. The expressive language category classification tests and GASP were considered not sensitive enough to modify the parameters of auditory and linguistic skills. Conclusions: The study found a significant improvement amongst the IG in the TFDF protocol and an increase in IT-MAIS scores in both groups. Full article
(This article belongs to the Section Methodology, Drug and Device Discovery)

25 pages, 2228 KiB  
Article
Improving Text-Independent Forced Alignment to Support Speech-Language Pathologists with Phonetic Transcription
by Ying Li, Bryce Johannas Wohlan, Duc-Son Pham, Kit Yan Chan, Roslyn Ward, Neville Hennessey and Tele Tan
Sensors 2023, 23(24), 9650; https://doi.org/10.3390/s23249650 - 6 Dec 2023
Cited by 1 | Viewed by 3230
Abstract
Problem: Phonetic transcription is crucial in diagnosing speech sound disorders (SSDs) but is susceptible to transcriber experience and perceptual bias. Current forced alignment (FA) tools, which annotate audio files to determine spoken content and its placement, often require manual transcription, limiting their effectiveness. Method: We introduce a novel, text-independent forced alignment model that autonomously recognises individual phonemes and their boundaries, addressing these limitations. Our approach leverages an advanced, pre-trained wav2vec 2.0 model to segment speech into tokens and recognise them automatically. To accurately identify phoneme boundaries, we utilise an unsupervised segmentation tool, UnsupSeg. Labelling of segments employs nearest-neighbour classification with wav2vec 2.0 labels, before connectionist temporal classification (CTC) collapse, determining class labels based on maximum overlap. Additional post-processing, including overfitting cleaning and voice activity detection, is implemented to enhance segmentation. Results: We benchmarked our model against existing methods using the TIMIT dataset for normal speakers and, for the first time, evaluated its performance on the TORGO dataset containing SSD speakers. Our model demonstrated competitive performance, achieving a harmonic mean score of 76.88% on TIMIT and 70.31% on TORGO. Implications: This research presents a significant advancement in the assessment and diagnosis of SSDs, offering a more objective and less biased approach than traditional methods. Our model’s effectiveness, particularly with SSD speakers, opens new avenues for research and clinical application in speech pathology. Full article
(This article belongs to the Special Issue Artificial Intelligence in Medical Sensors II)
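
The maximum-overlap labelling step described above can be sketched as follows: each unsupervised segment receives the frame-level label (here, pre-collapse wav2vec 2.0 labels) that covers the largest share of its duration. The 20 ms frame step and the label format are assumptions.

# Sketch of max-overlap segment labelling: a segment takes the most frequent frame label inside it.
# The 20 ms frame step is an assumed wav2vec 2.0 frame rate.
from collections import Counter

FRAME_SEC = 0.02

def label_segments(frame_labels, boundaries_sec):
    """frame_labels: per-frame labels; boundaries_sec: [(start, end), ...] from the segmenter."""
    out = []
    for start, end in boundaries_sec:
        i = round(start / FRAME_SEC)
        j = max(round(end / FRAME_SEC), i + 1)
        out.append((start, end, Counter(frame_labels[i:j]).most_common(1)[0][0]))
    return out

frames = ["sil"] * 5 + ["t"] * 3 + ["ih"] * 7
print(label_segments(frames, [(0.0, 0.10), (0.10, 0.16), (0.16, 0.30)]))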

14 pages, 2523 KiB  
Article
AI Enhancements for Linguistic E-Learning System
by Jueting Liu, Sicheng Li, Chang Ren, Yibo Lyu, Tingting Xu, Zehua Wang and Wei Chen
Appl. Sci. 2023, 13(19), 10758; https://doi.org/10.3390/app131910758 - 27 Sep 2023
Cited by 3 | Viewed by 3140
Abstract
E-learning systems have developed considerably since the COVID-19 pandemic. In our previous work, we developed a linguistic interactive E-learning system for phonetic transcription learning. In this paper, we propose three artificial-intelligence-based enhancements to this system from different aspects. Compared with the original system, the first enhancement is a disordered speech classification module; this module is driven by the MFCC-CNN model, which aims to distinguish disordered speech from nondisordered speech. The accuracy of the classification is about 83%. The second enhancement is a grapheme-to-phoneme converter. This converter is based on the transformer model and is designed to help teachers generate IPA words from regular written text. Compared with other G2P models, our transformer-based G2P model provides outstanding PER and WER performance. The last part of this paper focuses on a Tacotron2-based IPA-to-speech synthesis system; this deep learning-based TTS system helps teachers generate high-quality speech sounds from IPA characters, which significantly improves the functionality of our original system. All three enhancements are related to the phonetic transcription process, and this work not only provides a better experience for the users of this system but also explores the use of artificial intelligence technologies in the E-learning and linguistic fields. Full article
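
A minimal sketch of an MFCC-plus-CNN classifier of the kind named for the disordered-speech module, with assumed MFCC settings, input size, and an untrained toy network; the paper's actual architecture and training data are not reproduced.

# Minimal MFCC + CNN binary classifier sketch (disordered vs. nondisordered speech).
# MFCC settings, fixed input width, and network shape are illustrative assumptions.
import librosa
import torch
import torch.nn as nn

def mfcc_image(path, sr=16000, n_mfcc=40, frames=128):
    """Load audio and return a fixed-size MFCC 'image' of shape (1, n_mfcc, frames)."""
    y, _ = librosa.load(path, sr=sr)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    m = librosa.util.fix_length(m, size=frames, axis=1)
    return torch.tensor(m, dtype=torch.float32).unsqueeze(0)

class MfccCnn(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, 2),            # two classes: disordered / nondisordered
        )

    def forward(self, x):                              # x: (batch, 1, n_mfcc, frames)
        return self.net(x)

logits = MfccCnn()(torch.randn(1, 1, 40, 128))         # stand-in for a real MFCC image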

19 pages, 4308 KiB  
Article
Speaker Profiling Based on the Short-Term Acoustic Features of Vowels
by Mohammad Ali Humayun, Junaid Shuja and Pg Emeroylariffion Abas
Technologies 2023, 11(5), 119; https://doi.org/10.3390/technologies11050119 - 7 Sep 2023
Cited by 3 | Viewed by 2348
Abstract
Speech samples can provide valuable information regarding speaker characteristics, including their social backgrounds. Accent variations with speaker backgrounds reflect corresponding acoustic features of speech, and these acoustic variations can be analyzed to assist in tracking down criminals from speech samples available as forensic evidence. Speech accent identification has recently received significant consideration in the speech forensics research community. However, most works have utilized long-term temporal modelling of acoustic features for accent classification and disregarded the stationary acoustic characteristics of particular phoneme articulations. This paper analyzes short-term acoustic features extracted from a central time window of English vowel speech segments for accent discrimination. Various feature computation techniques have been compared for the accent classification task. It has been found that using spectral features as an input gives better performance than using cepstral features, with the lower filters contributing more significantly to the classification task. Moreover, detailed analysis has been presented for time window durations and frequency bin resolution to compute short-term spectral features concerning accent discrimination. Using longer time durations generally requires higher frequency resolution to optimize classification performance. These results are significant, as they show the benefits of using spectral features for speaker profiling despite the popularity of cepstral features for other speech-related tasks. Full article
(This article belongs to the Section Information and Communication Technologies)
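
As a sketch of the short-term feature extraction described above, the code below takes a window centred in a vowel segment and returns its log-magnitude spectrum; the 25 ms duration and FFT size are fixed illustrative values, whereas the paper compares several window lengths and frequency resolutions.

# Sketch: log-magnitude spectrum of a short window centred in a vowel segment.
# Window duration and FFT size are illustrative; the paper compares several settings.
import numpy as np
import librosa

def central_spectrum(path, seg_start_s, seg_end_s, sr=16000, win_ms=25, n_fft=512):
    """One spectral feature vector per vowel token, taken at the segment midpoint."""
    y, _ = librosa.load(path, sr=sr)
    centre = int(((seg_start_s + seg_end_s) / 2) * sr)
    half = int(sr * win_ms / 1000) // 2
    frame = y[max(0, centre - half):centre + half]
    frame = frame * np.hanning(len(frame))             # taper the central window
    spectrum = np.abs(np.fft.rfft(frame, n=n_fft))
    return 20 * np.log10(spectrum + 1e-10)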

20 pages, 7377 KiB  
Article
Automatic Detection System for Velopharyngeal Insufficiency Based on Acoustic Signals from Nasal and Oral Channels
by Yu Zhang, Jing Zhang, Wen Li, Heng Yin and Ling He
Diagnostics 2023, 13(16), 2714; https://doi.org/10.3390/diagnostics13162714 - 21 Aug 2023
Cited by 3 | Viewed by 1841
Abstract
Velopharyngeal insufficiency (VPI) is a type of pharyngeal function dysfunction that causes speech impairment and swallowing disorder. Speech therapists play a key role in the diagnosis and treatment of speech disorders. However, there is a worldwide shortage of experienced speech therapists, and artificial intelligence-based computer-aided diagnostic technology could be a solution for this. This paper proposes an automatic system for VPI detection at the subject level. It is a non-invasive and convenient approach for VPI diagnosis. Based on the principle of impaired articulation of VPI patients, nasal- and oral-channel acoustic signals are collected as raw data. The system integrates the symptom discriminant results at the phoneme level. For consonants, relative prominent frequency description and relative frequency distribution features are proposed to discriminate nasal air emission caused by VPI. For hypernasality-sensitive vowels, a cross-attention residual Siamese network (CARS-Net) is proposed to perform automatic VPI/non-VPI classification at the phoneme level. CARS-Net embeds a cross-attention module between the two branches to improve the VPI/non-VPI classification model for vowels. We validate the proposed system on a self-built dataset, and the accuracy reaches 98.52%. This provides possibilities for implementing automatic VPI diagnosis. Full article
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
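
The abstract names a cross-attention module embedded between the two Siamese branches; the block below is a generic cross-attention layer in PyTorch (one branch's features attend to the other's) with invented embedding size and head count, not the published CARS-Net.

# Generic cross-attention block between two feature streams (not the published CARS-Net).
# Embedding size and head count are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, a, b):
        """a, b: (batch, time, dim) features from the two branches; a attends to b."""
        attended, _ = self.attn(query=a, key=b, value=b)
        return self.norm(a + attended)

out = CrossAttentionBlock()(torch.randn(2, 50, 128), torch.randn(2, 50, 128))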

12 pages, 306 KiB  
Article
Latent Regression Bayesian Network for Speech Representation
by Liang Xu, Yue Zhao, Xiaona Xu, Yigang Liu and Qiang Ji
Electronics 2023, 12(15), 3342; https://doi.org/10.3390/electronics12153342 - 4 Aug 2023
Viewed by 1378
Abstract
In this paper, we present a novel approach for speech representation using latent regression Bayesian networks (LRBN) to address the issue of poor performance in low-resource language speech systems. LRBN, a lightweight unsupervised learning model, learns data distribution and high-level features, unlike computationally expensive large models, such as Wav2vec 2.0. To evaluate the effectiveness of LRBN in learning speech representations, we conducted experiments on five different low-resource languages and applied them to two downstream tasks: phoneme classification and speech recognition. Our experimental results demonstrate that LRBN outperforms prevailing speech representation methods in both tasks, highlighting its potential in the realm of speech representation learning for low-resource languages. Full article
(This article belongs to the Special Issue Machine Learning and Deep Learning Based Pattern Recognition)

17 pages, 1125 KiB  
Article
Investigations on the Optimal Estimation of Speech Envelopes for the Two-Stage Speech Enhancement
by Yanjue Song and Nilesh Madhu
Sensors 2023, 23(14), 6438; https://doi.org/10.3390/s23146438 - 16 Jul 2023
Cited by 2 | Viewed by 1736
Abstract
Using the source-filter model of speech production, clean speech signals can be decomposed into an excitation component and an envelope component that is related to the phoneme being uttered. Therefore, restoring the envelope of degraded speech during speech enhancement can improve the intelligibility and quality of output. As the number of phonemes in spoken speech is limited, they can be adequately represented by a correspondingly limited number of envelopes. This can be exploited to improve the estimation of speech envelopes from a degraded signal in a data-driven manner. The improved envelopes are then used in a second stage to refine the final speech estimate. Envelopes are typically derived from the linear prediction coefficients (LPCs) or from the cepstral coefficients (CCs). The improved envelope is obtained either by mapping the degraded envelope onto pre-trained codebooks (classification approach) or by directly estimating it from the degraded envelope (regression approach). In this work, we first investigate the optimal features for envelope representation and codebook generation by a series of oracle tests. We demonstrate that CCs provide better envelope representation compared to using the LPCs. Further, we demonstrate that a unified speech codebook is advantageous compared to the typical codebook that manually splits speech and silence as separate entries. Next, we investigate low-complexity neural network architectures to map degraded envelopes to the optimal codebook entry in practical systems. We confirm that simple recurrent neural networks yield good performance with a low complexity and number of parameters. We also demonstrate that with a careful choice of the feature and architecture, a regression approach can further improve the performance at a lower computational cost. However, as also seen from the oracle tests, the benefit of the two-stage framework is now chiefly limited by the statistical noise floor estimate, leading to only a limited improvement in extremely adverse conditions. This highlights the need for further research on joint estimation of speech and noise for optimum enhancement. Full article
(This article belongs to the Special Issue Machine Learning and Signal Processing Based Acoustic Sensors)
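
The classification variant described above, mapping a degraded envelope onto the closest pre-trained codebook entry, can be sketched as a nearest-neighbour lookup over cepstral-coefficient vectors; the codebook size, coefficient order, and Euclidean distance are assumptions.

# Sketch of the codebook (classification) approach: return the trained envelope entry
# closest to the degraded cepstral envelope. Codebook size and distance are assumptions.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 20))      # stand-in for 256 trained envelope entries (20 CCs each)

def restore_envelope(degraded_cc):
    """Nearest codebook entry (Euclidean) to the degraded cepstral-coefficient vector."""
    return codebook[np.argmin(np.linalg.norm(codebook - degraded_cc, axis=1))]

clean_estimate = restore_envelope(rng.normal(size=20))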

27 pages, 7503 KiB  
Article
Covariants of Gemination in Eastern Andalusian Spanish: /t/ following Underlying /s/, /k/, /p/ and /ks/
by Alfredo Herrero de Haro and John Hajek
Languages 2023, 8(2), 99; https://doi.org/10.3390/languages8020099 - 29 Mar 2023
Cited by 1 | Viewed by 4183
Abstract
In Eastern Andalusian Spanish, consonants are deleted in syllable-final position, triggering regular gemination of a following consonant, even across word boundaries. This paper investigates five underlying phonemic contexts involving /t/, including singleton /t/ and four different underlying /C+t/ sequences that typically surface as [t:], by analyzing how durational and formant differences vary depending on the presence and identity of the preceding underlying consonant. Following the acoustic and statistical analyses of 444 instances of /ˈeta/, /ˈesta/, /ˈekta/, /ˈepta/ and /ˈeksta/, a Discriminant Function Analysis shows that differences in the total duration of /t/ and in the duration of the closure of /t/ are the strongest cues to distinguishing singletons from geminated consonants, with 91.9% and 90.6% accurate classifications, respectively. Cues indicating which specific consonants have been deleted before /t/ are much less robust and more varied in nature. It is unclear, however, whether this outcome is due to different compensation strategies in each case or whether they are affected by some kind of underlying coarticulatory effect. Given that gemination in this language variety is the result of regular /C1C2/ to [Cː] assimilation, and that its underlying phonemic status has not been demonstrated, Eastern Andalusian Spanish is unusual amongst languages studied with respect to gemination, making this study typologically interesting. Full article
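
The Discriminant Function Analysis reported above corresponds to fitting a linear discriminant on the durational cues; the sketch below does so with scikit-learn on simulated total-duration and closure-duration values, not the study's measurements.

# Sketch: linear discriminant analysis on durational cues (simulated values, not the study's data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
singletons = np.column_stack([rng.normal(70, 10, 100), rng.normal(45, 8, 100)])    # ms: total, closure
geminates = np.column_stack([rng.normal(120, 15, 100), rng.normal(85, 12, 100)])
X = np.vstack([singletons, geminates])
y = np.array([0] * 100 + [1] * 100)        # 0 = singleton /t/, 1 = geminated [t:]

lda = LinearDiscriminantAnalysis().fit(X, y)
print("classification accuracy on the simulated data:", lda.score(X, y))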

24 pages, 5307 KiB  
Article
Rule-Based Embedded HMMs Phoneme Classification to Improve Qur’anic Recitation Recognition
by Ammar Mohammed Ali Alqadasi, Mohd Shahrizal Sunar, Sherzod Turaev, Rawad Abdulghafor, Md Sah Hj Salam, Abdulaziz Ali Saleh Alashbi, Ali Ahmed Salem and Mohammed A. H. Ali
Electronics 2023, 12(1), 176; https://doi.org/10.3390/electronics12010176 - 30 Dec 2022
Cited by 8 | Viewed by 4080
Abstract
Phoneme classification performance is a critical factor for the successful implementation of a speech recognition system. A mispronunciation of Arabic short vowels or long vowels can change the meaning of a complete sentence. However, correctly distinguishing phonemes with vowels in Qur’anic recitation (recitation of the Holy book of Muslims) is still a challenging problem even for state-of-the-art classification methods. Phoneme duration is considered one of the important features of Qur’anic recitation: the lengthening of phonemes, called Medd, is governed by strict rules. These features of recitation call for an additional classification of phonemes in Qur’anic recitation, because phoneme classification based on Arabic language characteristics alone is insufficient to recognize Tajweed rules, including the rules of Medd. This paper introduces a Rule-Based Phoneme Duration Algorithm to improve phoneme classification in Qur’anic recitation. The Qur’anic dataset contains 21 Ayats collected from 30 reciters, whose phonemes are carefully analyzed against a baseline HMM-based speech recognition model. Using the Hidden Markov Model with tied-state triphones, a set of phoneme classification models optimized based on duration is constructed and integrated into a Qur’anic phoneme classification method. The proposed algorithm achieved outstanding accuracy, ranging from 99.87% to 100% according to the Medd type. The obtained results of the proposed algorithm will contribute significantly to Qur’anic recitation recognition models. Full article
(This article belongs to the Special Issue Signal and Image Processing Applications in Artificial Intelligence)
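
As a schematic of the kind of duration rule the algorithm implies, the check below converts a measured long-vowel duration into harakah units relative to a reciter's base vowel length and compares it with the expected Medd length; the unit convention and tolerance are assumptions, not the published Tajweed rule set.

# Schematic rule-based duration check for Medd (unit convention and tolerance are assumptions).
def medd_ok(phoneme_dur_sec, base_harakah_sec, expected_harakat, tolerance=0.5):
    """Compare a measured long-vowel duration, expressed in harakah units, to the expected length."""
    measured = phoneme_dur_sec / base_harakah_sec
    return abs(measured - expected_harakat) <= tolerance, measured

ok, units = medd_ok(phoneme_dur_sec=0.62, base_harakah_sec=0.15, expected_harakat=4)
print(ok, round(units, 2))                  # (True, 4.13) for this example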

20 pages, 1537 KiB  
Article
Imaginary Speech Recognition Using a Convolutional Network with Long-Short Memory
by Ana-Luiza Rusnac and Ovidiu Grigore
Appl. Sci. 2022, 12(22), 11873; https://doi.org/10.3390/app122211873 - 21 Nov 2022
Cited by 6 | Viewed by 3247
Abstract
In recent years, much research attention has been concentrated on imaginary speech understanding, decoding, and even recognition. Speech is a complex mechanism that involves multiple brain areas in the production, planning, and precise control of the large number of muscles and articulators involved in the actual utterance. This paper proposes an intelligent imaginary speech recognition system for eleven different utterances, seven phonemes, and four words from the Kara One database. We showed, during our research, that the feature space of the cross-covariance in the frequency domain offers a better perspective on the imaginary speech than the cross-covariance in the time domain or the raw signals without any processing, as assessed by computing LDA for a 2D representation of the feature space. In the classification stage, we used a CNN-LSTM neural network and obtained a performance of 43% accuracy for all eleven different utterances. The developed system was intended to be shared across subjects. We also showed that, using the channels corresponding to the anatomical structures of the brain involved in speech production, i.e., the Broca area, the primary motor cortex, and the secondary motor cortex, 93% of the information is preserved, obtaining 40% accuracy by using 29 electrodes out of the initial 62. Full article
(This article belongs to the Special Issue Applied Artificial Intelligence (AI))
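
A sketch of the frequency-domain cross-covariance feature mentioned above: take each channel's magnitude spectrum within a window, then compute the covariance between channels across frequency bins. Channel count, window length, and normalisation are assumptions.

# Sketch: cross-covariance of EEG channel spectra (channel count and window length are assumptions).
import numpy as np

def freq_cross_covariance(eeg_window):
    """eeg_window: (channels, samples). Returns a (channels, channels) covariance of magnitude spectra."""
    spectra = np.abs(np.fft.rfft(eeg_window, axis=1))     # per-channel magnitude spectrum
    spectra -= spectra.mean(axis=1, keepdims=True)        # remove each channel's mean over frequency
    return spectra @ spectra.T / spectra.shape[1]

rng = np.random.default_rng(3)
features = freq_cross_covariance(rng.normal(size=(62, 1000)))   # e.g., a 62-channel EEG window
print(features.shape)                                           # (62, 62)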
