Search Results (10)

Search Parameters:
Keywords = spoken term detection

41 pages, 849 KB  
Article
HEUXIVA: A Set of Heuristics for Evaluating User eXperience with Voice Assistants
by Daniela Quiñones, Luis Felipe Rojas, Camila Serrá, Jessica Ramírez, Viviana Barrientos and Sandra Cano
Appl. Sci. 2025, 15(20), 11178; https://doi.org/10.3390/app152011178 - 18 Oct 2025
Viewed by 782
Abstract
Voice assistants have become increasingly common in everyday devices such as smartphones and smart speakers. Improving their user experience (UX) is crucial to ensuring usability, acceptance, and long-term effectiveness. Heuristic evaluation is a widely used method for UX evaluation due to its efficiency in detecting problems quickly and at low cost. Nonetheless, existing usability/UX heuristics were not designed to address the specific challenges of voice-based interaction, which relies on spoken dialog and auditory feedback. To overcome this limitation, we developed HEUXIVA, a set of 13 heuristics specifically designed for evaluating UX with voice assistants. The proposal was created through a structured methodology and refined in two iterations. We validated HEUXIVA through heuristic evaluations, expert judgment, and user testing. The results offer preliminary but consistent evidence supporting the effectiveness of HEUXIVA in identifying UX issues specific to the voice assistant “Google Nest Mini”. Experts described the heuristics as clear, practical, and easy to use. They also highlighted their usefulness in evaluating interaction features and supporting the overall UX evaluation process. HEUXIVA therefore provides designers, researchers, and practitioners with a specialized tool to improve the quality of voice assistant interfaces and enhance user satisfaction.
(This article belongs to the Special Issue Emerging Technologies in Innovative Human–Computer Interactions)

17 pages, 4114 KB  
Article
Biomimetic Computing for Efficient Spoken Language Identification
by Gaurav Kumar and Saurabh Bhardwaj
Biomimetics 2025, 10(5), 316; https://doi.org/10.3390/biomimetics10050316 - 14 May 2025
Cited by 1 | Viewed by 1263
Abstract
Spoken Language Identification (SLID)-based applications have become increasingly important in everyday life, driven by advancements in artificial intelligence and machine learning. Multilingual countries utilize SLID to facilitate speech detection, which is accomplished by determining the language of the spoken parts using language recognizers. However, when working with multilingual datasets, the presence of multiple languages that share a common origin presents a significant challenge for accurately classifying languages with automatic techniques. A further challenge is the significant variance in speech signals caused by factors such as different speakers, content, acoustic settings, language differences, changes in voice modulation based on age and gender, and variations in speech patterns. In this study, we introduce the DBODL-MSLIS approach, which integrates biomimetic optimization techniques inspired by natural intelligence to enhance language classification. The proposed method employs Dung Beetle Optimization (DBO) with deep learning, simulating the beetle's foraging behavior to optimize feature selection and classification performance. The technique integrates speech preprocessing, which encompasses pre-emphasis, windowing, and frame blocking, followed by feature extraction utilizing pitch, energy, the Discrete Wavelet Transform (DWT), and the zero-crossing rate (ZCR). Feature selection is then performed by the DBO algorithm, which removes redundant features and improves efficiency and accuracy. Spoken languages are classified using Bayesian optimization (BO) in conjunction with a long short-term memory (LSTM) network. The DBODL-MSLIS technique has been experimentally validated on the IIIT Spoken Language dataset. The results indicate an average accuracy of 95.54% and an F-score of 84.31%. This technique surpasses various state-of-the-art models, such as SVM, MLP, LDA, DLA-ASLISS, HMHFS-IISLFAS, GA-based fusion, and VGG-16. We also evaluated the accuracy of our proposed technique against state-of-the-art biomimetic computing models such as GA, PSO, GWO, DE, and ACO. While ACO achieved up to 89.45% accuracy, our Bayesian optimization with LSTM outperformed all others, reaching a peak accuracy of 95.55%, demonstrating its effectiveness in enhancing spoken language identification. The suggested technique demonstrates promising potential for practical applications in multilingual voice processing.
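As a rough illustration of the signal-level front end described above (pre-emphasis, frame blocking, and per-frame energy and zero-crossing rate), the following sketch shows one way these features could be computed; the frame sizes, pre-emphasis coefficient, and function names are assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a minimal front end with pre-emphasis, frame blocking,
# short-time energy and zero-crossing rate. Parameters are assumptions.
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_blocking(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (e.g., 25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])

def frame_features(frames):
    """Per-frame short-time energy and zero-crossing rate."""
    window = np.hamming(frames.shape[1])
    energy = np.sum((frames * window) ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.column_stack([energy, zcr])

# Example: 1 second of dummy audio at 16 kHz
audio = np.random.randn(16000)
feats = frame_features(frame_blocking(pre_emphasis(audio)))
print(feats.shape)  # (n_frames, 2)
```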

22 pages, 2347 KB  
Article
Dementia Detection from Speech: What If Language Models Are Not the Answer?
by Mondher Bouazizi, Chuheng Zheng, Siyuan Yang and Tomoaki Ohtsuki
Information 2024, 15(1), 2; https://doi.org/10.3390/info15010002 - 19 Dec 2023
Cited by 13 | Viewed by 4428
Abstract
A growing focus among researchers has been on techniques for the automatic detection of dementia that can be applied to the speech samples of individuals with dementia. Leveraging rapid advancements in Deep Learning (DL) and Natural Language Processing (NLP), these techniques have shown great potential in dementia detection. In this context, this paper proposes a method for dementia detection from the transcribed speech of subjects. Unlike conventional methods that rely on advanced language models to assess the subject's ability to form coherent and meaningful sentences, our approach relies on the subject's center of focus and how it changes over time as the subject describes the content of the cookie theft image, an image commonly used to evaluate cognitive abilities. To do so, we divide the cookie theft image into regions of interest and identify, in each sentence spoken by the subject, which regions are being talked about. We employed a Long Short-Term Memory (LSTM) neural network to learn the differing patterns of dementia and control subjects and used it to perform 10-fold cross-validation-based classification. Our experiments on the Pitt corpus from DementiaBank resulted in an accuracy of 82.9% at the subject level and 81.0% at the sample level. Employing data-augmentation techniques increased the accuracy at the two levels to 83.6% and 82.1%, respectively. Our proposed method outperforms most conventional methods, which reach, at best, an accuracy of 81.5% at the subject level.
(This article belongs to the Section Information Applications)
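To make the region-of-focus idea concrete, here is a hedged sketch of how sentences from a transcript could be mapped to cookie-theft regions of interest, producing the per-sentence sequence an LSTM could then classify; the region names and keyword lists are purely illustrative assumptions, not the paper's definitions.

```python
# Hedged sketch: map each sentence to the cookie-theft regions of interest it mentions,
# producing a per-sentence multi-hot vector whose sequence an LSTM can classify.
import numpy as np

REGIONS = {
    "boy_on_stool":  {"boy", "stool", "falling"},
    "cookie_jar":    {"cookie", "cookies", "jar"},
    "mother_sink":   {"mother", "woman", "sink", "dishes", "water"},
    "window_garden": {"window", "curtain", "garden", "outside"},
}

def sentence_to_regions(sentence):
    """Return a multi-hot vector indicating which regions the sentence talks about."""
    tokens = set(sentence.lower().split())
    return np.array([float(bool(tokens & words)) for words in REGIONS.values()])

def transcript_to_sequence(sentences):
    """Stack per-sentence region vectors into an (n_sentences, n_regions) sequence."""
    return np.stack([sentence_to_regions(s) for s in sentences])

seq = transcript_to_sequence([
    "the boy is on the stool reaching for the cookie jar",
    "the mother is washing dishes and the sink is overflowing",
])
print(seq)  # shape (2, 4); each row marks the regions in focus for one sentence
```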

14 pages, 379 KB  
Article
Q8VaxStance: Dataset Labeling System for Stance Detection towards Vaccines in Kuwaiti Dialect
by Hana Alostad, Shoug Dawiek and Hasan Davulcu
Big Data Cogn. Comput. 2023, 7(3), 151; https://doi.org/10.3390/bdcc7030151 - 15 Sep 2023
Cited by 5 | Viewed by 2866
Abstract
The Kuwaiti dialect is a particular dialect of Arabic spoken in Kuwait; it differs significantly from standard Arabic and from the dialects of neighboring countries in the same region. Few research papers with a focus on the Kuwaiti dialect have been published in the field of NLP. In this study, we created Kuwaiti dialect language resources using Q8VaxStance, a vaccine stance labeling system for a large dataset of tweets. This dataset fills this gap and provides a valuable resource for researchers studying vaccine hesitancy in Kuwait. Furthermore, it contributes to the Arabic natural language processing field by providing a dataset for developing and evaluating machine learning models for stance detection in the Kuwaiti dialect. The proposed vaccine stance labeling system combines the benefits of weakly supervised learning and zero-shot learning; for this purpose, we implemented 52 experiments on 42,815 unlabeled tweets extracted between December 2020 and July 2022. The results show that using keyword detection in conjunction with zero-shot model labeling functions is significantly better than using only keyword detection labeling functions or only zero-shot model labeling functions. Furthermore, regarding the total number of generated labels, using Arabic for both the labels and the prompt, or mixing Arabic labels with an English prompt, generates significantly more labels than using English for both the labels and the prompt. The best performance in terms of macro-F1 was achieved when using keyword and hashtag detection labeling functions in conjunction with zero-shot model labeling functions, specifically in experiments KHZSLF-EE4 and KHZSLF-EA1, both with a value of 0.83. Experiment KHZSLF-EE4 labeled 42,270 tweets, while experiment KHZSLF-EA1 labeled 42,764 tweets. Finally, the average annotation agreement between the generated labels and human labels ranges between 0.61 and 0.64, which is considered a good level of agreement.
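The following sketch illustrates the general weak-supervision pattern the abstract describes, combining a keyword/hashtag labeling function with a zero-shot model vote; the keywords, label values, confidence threshold, and the `zero_shot` callable are illustrative assumptions, not the Q8VaxStance implementation.

```python
# Minimal illustration of combining a keyword-based labeling function with a
# zero-shot classifier vote. Keywords, labels and the zero_shot() stub are assumptions.
PRO, ANTI, ABSTAIN = 1, 0, -1

PRO_KEYWORDS  = {"تطعيم", "vaccinated", "booster"}   # illustrative keywords/hashtags
ANTI_KEYWORDS = {"ضد_التطعيم", "antivax"}

def lf_keywords(tweet):
    """Vote based on the presence of stance-bearing keywords or hashtags."""
    text = tweet.lower()
    if any(k in text for k in PRO_KEYWORDS):
        return PRO
    if any(k in text for k in ANTI_KEYWORDS):
        return ANTI
    return ABSTAIN

def lf_zero_shot(tweet, zero_shot):
    """Vote from a zero-shot model; `zero_shot` returns (label, confidence)."""
    label, confidence = zero_shot(tweet)
    return label if confidence >= 0.7 else ABSTAIN

def combine(votes):
    """Simple resolution: keep the majority non-abstain vote, else abstain."""
    counted = [v for v in votes if v != ABSTAIN]
    if not counted:
        return ABSTAIN
    return max(set(counted), key=counted.count)
```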

16 pages, 4781 KB  
Article
A Multi-Layer Holistic Approach for Cursive Text Recognition
by Muhammad Umair, Muhammad Zubair, Farhan Dawood, Sarim Ashfaq, Muhammad Shahid Bhatti, Mohammad Hijji and Abid Sohail
Appl. Sci. 2022, 12(24), 12652; https://doi.org/10.3390/app122412652 - 9 Dec 2022
Cited by 13 | Viewed by 3290
Abstract
Urdu is a widely spoken language in several South Asian countries and in communities worldwide. It is relatively hard to recognize Urdu text compared to other languages due to its cursive writing style. The Urdu script belongs to the family of non-Latin cursive scripts, like Arabic, Hindi, and Chinese. Urdu is written in several writing styles, among which ‘Nastaleeq’ is the most popular and widely used font style. Localization/detection and recognition of Urdu Nastaleeq text remain challenging, as it follows a modified version of the Arabic script. This research study presents a methodology to recognize and classify Urdu text in the Nastaleeq font, regardless of the text position in the image. The proposed solution comprises a two-step methodology. In the first step, text detection is performed using Connected Component Analysis (CCA) and a Long Short-Term Memory (LSTM) neural network. In the second step, a hybrid Convolutional Neural Network and Recurrent Neural Network (CNN-RNN) architecture is deployed to recognize the detected text. The image containing Urdu text is binarized and segmented to produce a single-line text image, which is fed to the hybrid CNN-RNN model; the model recognizes the text and saves it in a text file. The proposed technique outperforms existing ones by achieving an overall accuracy of 97.47%.
(This article belongs to the Section Computing and Artificial Intelligence)
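For readers unfamiliar with hybrid CNN-RNN recognizers, a generic sketch of such an architecture (a CRNN producing per-column logits suitable for CTC decoding) is shown below; the layer sizes, 32-pixel line height, and class count are assumptions rather than the paper's exact model.

```python
# Generic CRNN sketch: a small CNN feature extractor followed by a bidirectional LSTM
# that emits one prediction per image column, as used for line-level text recognition.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        self.cnn = nn.Sequential(                       # input: (B, 1, 32, W)
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )                                               # -> (B, 128, 8, W/4)
        self.rnn = nn.LSTM(128 * (img_height // 4), 256,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)           # num_classes includes a CTC blank

    def forward(self, x):
        f = self.cnn(x)                                 # (B, C, H', W')
        f = f.permute(0, 3, 1, 2).flatten(2)            # (B, W', C*H'): one step per column
        out, _ = self.rnn(f)
        return self.fc(out)                             # (B, W', num_classes) logits for CTC

model = CRNN(num_classes=60)
logits = model(torch.randn(2, 1, 32, 128))              # two 32x128 text-line images
print(logits.shape)                                     # torch.Size([2, 32, 60])
```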

22 pages, 1141 KB  
Article
Amharic Speech Search Using Text Word Query Based on Automatic Sentence-like Segmentation
by Getnet Mezgebu Brhanemeskel, Solomon Teferra Abate, Tewodros Alemu Ayall and Abegaz Mohammed Seid
Appl. Sci. 2022, 12(22), 11727; https://doi.org/10.3390/app122211727 - 18 Nov 2022
Cited by 3 | Viewed by 5002
Abstract
More than 7000 languages are spoken in the world today. Amharic is one of the languages spoken in the East African country of Ethiopia. A large amount of speech data is produced every day in different languages, as machines have become better at processing it and have greater storage capacity. However, searching for a particular word, together with its time frame, inside a given audio file remains a challenge. Since Amharic has its own distinguishing characteristics, such as glottal, palatal, and labialized consonants, it is not possible to directly use models developed for other languages. A popular approach to searching for particular information in speech involves an automatic speech recognition (ASR) module that generates the text version of the speech, in which the word or phrase is then searched based on a text query. However, it is not possible to transcribe a long audio file without segmentation, which in turn affects the performance of the ASR module. In this paper, we report our investigation of the effects of manual and automatic speech segmentation of Amharic audio files in a spiritual domain. We used manual segmentation as a baseline and found that sentence-like automatic segmentation resulted in a word error rate (WER) close to that achieved on the manually segmented test speech. Based on the experimental results, we propose Amharic speech search using text word query (ASSTWQ) based on automatic sentence-like segmentation. Since we achieved a lower WER using the previously developed speech corpus, which is in the broadcast news domain, together with the in-domain speech corpus, we recommend using both in-domain and out-of-domain speech corpora to develop the Amharic ASR module. The proposed ASR module achieves a WER of 53%, which needs further improvement. Combining two language models (LMs) developed using training text from the two domains (spiritual and broadcast news) allowed a WER reduction from 53% to 46%. We have therefore developed two ASSTWQ systems using the two ASR modules, with WERs of 53% and 46%.
(This article belongs to the Special Issue Natural Language Processing: Recent Development and Applications)
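To illustrate the search step that follows the ASR pass, here is a minimal sketch of matching a text word query against word-level transcripts with timestamps; the transcript structure and file names are assumptions for illustration only, not the ASSTWQ implementation.

```python
# Hedged sketch: once an ASR module has produced word-level transcripts with timestamps
# for each segment, a text query is matched against them to return the containing
# audio file and time frame.
def search_word(query, transcripts):
    """
    transcripts: {audio_file: [(word, start_sec, end_sec), ...]} from a segmented ASR pass.
    Returns every (audio_file, start, end) where the query word was recognized.
    """
    query = query.lower()
    hits = []
    for audio_file, words in transcripts.items():
        for word, start, end in words:
            if word.lower() == query:
                hits.append((audio_file, start, end))
    return hits

transcripts = {
    "sermon_01.wav": [("amharic", 3.2, 3.8), ("word", 4.0, 4.3)],
    "news_07.wav":   [("word", 10.5, 10.9)],
}
print(search_word("word", transcripts))
# [('sermon_01.wav', 4.0, 4.3), ('news_07.wav', 10.5, 10.9)]
```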

19 pages, 911 KB  
Article
Rethinking the Methods and Algorithms for Inner Speech Decoding and Making Them Reproducible
by Foteini Simistira Liwicki, Vibha Gupta, Rajkumar Saini, Kanjar De and Marcus Liwicki
NeuroSci 2022, 3(2), 226-244; https://doi.org/10.3390/neurosci3020017 - 19 Apr 2022
Cited by 13 | Viewed by 6107
Abstract
This study focuses on the automatic decoding of inner speech using noninvasive methods, such as electroencephalography (EEG). While inner speech has been a research topic in philosophy and psychology for half a century, recent attempts have been made to decode nonvoiced spoken words using various brain–computer interfaces. The main shortcomings of existing work are reproducibility and the availability of data and code. In this work, we investigate various methods (Convolutional Neural Networks (CNN), Gated Recurrent Units (GRU), and Long Short-Term Memory networks (LSTM)) for the task of detecting five vowels and six words on a publicly available EEG dataset. The main contributions of this work are (1) a comparison of subject-dependent vs. subject-independent approaches, (2) an analysis of the effect of different preprocessing steps (Independent Component Analysis (ICA), down-sampling, and filtering), and (3) word classification (where we achieve state-of-the-art performance on a publicly available dataset). Overall, we achieve accuracies of 35.20% and 29.21% when classifying five vowels and six words, respectively, in a publicly available dataset, using our tuned iSpeech-CNN architecture. All of our code and processed data are publicly available to ensure reproducibility. As such, this work contributes to a deeper understanding and improved reproducibility of experiments in the area of inner speech detection.
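As a hedged sketch of the kind of preprocessing the study compares, the following uses the MNE library to band-pass filter, down-sample, and apply ICA to an EEG recording; the file name, cut-off frequencies, and component count are illustrative assumptions, not the study's settings.

```python
# Illustrative EEG preprocessing sketch with MNE: filtering, down-sampling and ICA.
import mne

raw = mne.io.read_raw_fif("inner_speech_subject01_raw.fif", preload=True)  # hypothetical file

raw.filter(l_freq=0.5, h_freq=40.0)      # band-pass filter
raw.resample(128)                        # down-sample to 128 Hz

ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(raw)
# In practice, artifact components (eye blinks, muscle) would be identified and
# excluded here before reconstructing the signal:
ica.apply(raw)

# `raw` can now be epoched per trial and fed to a CNN/GRU/LSTM classifier.
```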

39 pages, 686 KB  
Article
The Multi-Domain International Search on Speech 2020 ALBAYZIN Evaluation: Overview, Systems, Results, Discussion and Post-Evaluation Analyses
by Javier Tejedor, Doroteo T. Toledano, Jose M. Ramirez, Ana R. Montalvo and Juan Ignacio Alvarez-Trejos
Appl. Sci. 2021, 11(18), 8519; https://doi.org/10.3390/app11188519 - 14 Sep 2021
Cited by 3 | Viewed by 3940
Abstract
The large amount of information stored in audio and video repositories makes search on speech (SoS) a challenging area that continues to receive much interest. Within SoS, spoken term detection (STD) aims to retrieve speech data given a text-based representation of a search query (which can include one or more words). On the other hand, query-by-example spoken term detection (QbE STD) aims to retrieve speech data given an acoustic representation of a search query. This is the first paper to present an internationally open multi-domain evaluation for SoS in Spanish that includes both STD and QbE STD tasks. The evaluation was carefully designed so that several post-evaluation analyses of the main results could be carried out. The evaluation tasks aim to retrieve the speech files that contain the queries, providing their start and end times and a score that reflects how likely it is that the query occurs within the given time interval of the speech file. Three different speech databases in Spanish, covering different domains, were employed in the evaluation: the MAVIR database, which comprises a set of talks from workshops; the RTVE database, which includes broadcast news programs; and the SPARL20 database, which contains Spanish parliament sessions. We present the evaluation itself, the three databases, the evaluation metric, the systems submitted to the evaluation, the evaluation results, and some detailed post-evaluation analyses based on specific query properties (in-vocabulary/out-of-vocabulary queries, single-word/multi-word queries, and native/foreign queries). The most novel features of the submitted systems are a data augmentation technique for the STD task and an end-to-end system for the QbE STD task. The results suggest that there is clearly room for improvement in the SoS task and that performance is highly sensitive to changes in the data domain.
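As a rough illustration of how STD output of this form can be scored, the sketch below matches hypothesized detections to reference occurrences within a time tolerance and counts hits, misses, and false alarms; the 0.5 s tolerance and tuple layout are assumptions, not the ALBAYZIN scoring protocol.

```python
# Hedged sketch: match each hypothesized detection (file, start, end) to a reference
# occurrence of the query if their midpoints fall within a tolerance; unmatched items
# count as false alarms or misses.
def score_detections(hyps, refs, tol=0.5):
    """hyps/refs: lists of (audio_file, start_sec, end_sec); returns (hits, misses, false_alarms)."""
    matched = set()
    hits = 0
    for h_file, h_start, h_end in hyps:
        h_mid = (h_start + h_end) / 2
        for i, (r_file, r_start, r_end) in enumerate(refs):
            r_mid = (r_start + r_end) / 2
            if i not in matched and h_file == r_file and abs(h_mid - r_mid) <= tol:
                matched.add(i)
                hits += 1
                break
    misses = len(refs) - hits
    false_alarms = len(hyps) - hits
    return hits, misses, false_alarms

print(score_detections(
    hyps=[("mavir_01.wav", 12.3, 12.9), ("mavir_01.wav", 40.0, 40.6)],
    refs=[("mavir_01.wav", 12.2, 12.8)],
))  # (1, 0, 1)
```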

18 pages, 451 KB  
Article
The Melody of Speech: What the Melodic Perception of Speech Reveals about Language Performance and Musical Abilities
by Markus Christiner, Christine Gross, Annemarie Seither-Preisler and Peter Schneider
Languages 2021, 6(3), 132; https://doi.org/10.3390/languages6030132 - 2 Aug 2021
Cited by 12 | Viewed by 7983
Abstract
Research has shown that melody plays a crucial role not only in music but also in language acquisition processes. Evidence has been provided that melody helps in retrieving, remembering, and memorizing new language material, while relatively little is known about whether individuals who perceive speech as more melodic than others also benefit in the acquisition of oral languages. In this investigation, we wanted to show what impact the subjective melodic perception of speech has on the pronunciation of unfamiliar foreign languages. We tested 86 participants on how melodic they perceived five unfamiliar languages to be, on their ability to repeat and pronounce utterances in those five languages, on their musical abilities, and on their short-term memory (STM). The results revealed that 59 percent of the variance in the language pronunciation tasks could be explained by five predictors: the number of foreign languages spoken, short-term memory capacity, tonal aptitude, melodic singing ability, and how melodic the languages appeared to the participants. Group comparisons showed that individuals who perceived languages as more melodic performed significantly better in all language tasks than those who did not. However, even though we expected musical measures to be related to the melodic perception of foreign languages, we could only detect some correlations with rhythmic and tonal musical aptitude. Overall, the findings of this investigation add a new dimension to language research, showing that individuals who perceive natural languages to be more melodic than others also retrieve and pronounce utterances more accurately.

32 pages, 755 KB  
Article
Quantization and Deployment of Deep Neural Networks on Microcontrollers
by Pierre-Emmanuel Novac, Ghouthi Boukli Hacene, Alain Pegatoquet, Benoît Miramond and Vincent Gripon
Sensors 2021, 21(9), 2984; https://doi.org/10.3390/s21092984 - 23 Apr 2021
Cited by 154 | Viewed by 28578
Abstract
Embedding Artificial Intelligence onto low-power devices is a challenging task that has been partly overcome with recent advances in machine learning and hardware design. Presently, deep neural networks can be deployed on embedded targets to perform different tasks such as speech recognition, object detection, or Human Activity Recognition. However, there is still room for optimization of deep neural networks on embedded devices. These optimizations mainly address power consumption, memory, and real-time constraints, but also ease of deployment at the edge. Moreover, there is still a need for a better understanding of what can be achieved for different use cases. This work focuses on the quantization and deployment of deep neural networks on low-power 32-bit microcontrollers. The quantization methods relevant in the context of embedded execution on a microcontroller are first outlined. Then, a new framework for end-to-end deep neural network training, quantization, and deployment is presented. This framework, called MicroAI, is designed as an alternative to existing inference engines (TensorFlow Lite for Microcontrollers and STM32Cube.AI), and can be easily adjusted and/or extended for specific use cases. Execution using single-precision 32-bit floating point as well as fixed point on 8- and 16-bit integers is supported. The proposed quantization method is evaluated on three different datasets (UCI-HAR, Spoken MNIST, and GTSRB). Finally, a comparative study between MicroAI and the two existing embedded inference engines is provided in terms of memory and power efficiency. On-device evaluation is done using ARM Cortex-M4F-based microcontrollers (Ambiq Apollo3 and STM32L452RE).
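As a minimal sketch of the fixed-point idea mentioned above, the following shows generic symmetric quantization of weights to 8- or 16-bit integers; this is an illustrative recipe, not MicroAI's exact scheme.

```python
# Illustrative symmetric quantization: map float weights to signed integers with a
# per-tensor scale, then dequantize to check the introduced error.
import numpy as np

def quantize_symmetric(weights, bits=8):
    """Map float weights to signed integers in [-(2^(bits-1)-1), 2^(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(weights)) / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for accuracy checks on the host."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q8, s8 = quantize_symmetric(w, bits=8)
print(np.max(np.abs(w - dequantize(q8, s8))))  # quantization error, bounded by ~scale/2
```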