Search Results (13)

Search Parameters:
Keywords = direct speech translation

20 pages, 1508 KB  
Article
Bidirectional Translation of ASL and English Using Machine Vision and CNN and Transformer Networks
by Stefanie Amiruzzaman, Md Amiruzzaman, Raga Mouni Batchu, James Dracup, Alexander Pham, Benjamin Crocker, Linh Ngo and M. Ali Akber Dewan
Computers 2026, 15(1), 20; https://doi.org/10.3390/computers15010020 - 4 Jan 2026
Viewed by 274
Abstract
This study presents a real-time, bidirectional system for translating American Sign Language (ASL) to and from English using computer vision and transformer-based models to enhance accessibility for deaf and hard-of-hearing users. Leveraging publicly available sign language and text-to-gloss datasets, the system integrates MediaPipe-based holistic landmark extraction with CNN- and transformer-based architectures to support translation across video, text, and speech modalities within a web-based interface. In the ASL-to-English direction, the sign-to-gloss model achieves a 25.17% word error rate (WER) on the RWTH-PHOENIX-Weather 2014T benchmark, which is competitive with recent continuous sign language recognition systems, and the gloss-level translation attains a ROUGE-L score of 79.89, indicating strong preservation of sign content and ordering. In the reverse English-to-ASL direction, the English-to-gloss transformer trained on ASLG-PC12 achieves a ROUGE-L score of 96.00, demonstrating high-fidelity gloss sequence generation suitable for landmark-based ASL animation. These results highlight a favorable accuracy-efficiency trade-off achieved through compact model architectures and low-latency decoding, supporting practical real-time deployment.
(This article belongs to the Section AI-Driven Innovations)
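As a concrete illustration of the landmark-extraction stage described in this abstract, the sketch below pulls per-frame pose and hand landmarks with the public MediaPipe Holistic API. The landmark subset, zero-filling, and feature layout here are assumptions for illustration; the paper's exact preprocessing is not reproduced.

```python
# Minimal sketch of per-frame holistic landmark extraction (assumed preprocessing;
# the paper's exact landmark selection and normalization are not specified here).
import cv2
import numpy as np
import mediapipe as mp

def landmarks_to_array(landmark_list, n_points):
    """Return an (n_points, 3) array of (x, y, z), zero-filled if undetected."""
    if landmark_list is None:
        return np.zeros((n_points, 3))
    return np.array([[p.x, p.y, p.z] for p in landmark_list.landmark])

def extract_features(video_path):
    """Yield one flattened feature vector per frame: 33 pose + 21 + 21 hand points."""
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            yield np.concatenate([
                landmarks_to_array(results.pose_landmarks, 33),
                landmarks_to_array(results.left_hand_landmarks, 21),
                landmarks_to_array(results.right_hand_landmarks, 21),
            ]).ravel()
    cap.release()
```

Sequences of such per-frame vectors would then feed a sign-to-gloss model of the kind the abstract describes.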

14 pages, 738 KB  
Opinion
Envisioning the Future of Machine Learning in the Early Detection of Neurodevelopmental and Neurodegenerative Disorders via Speech and Language Biomarkers
by Georgios P. Georgiou
Acoustics 2025, 7(4), 72; https://doi.org/10.3390/acoustics7040072 - 10 Nov 2025
Cited by 1 | Viewed by 1345
Abstract
Speech and language offer a rich, non-invasive window into brain health. Advances in machine learning (ML) have enabled increasingly accurate detection of neurodevelopmental and neurodegenerative disorders through these modalities. This paper envisions the future of ML in the early detection of neurodevelopmental disorders, such as autism spectrum disorder and attention-deficit/hyperactivity disorder, and neurodegenerative disorders, such as Parkinson’s disease and Alzheimer’s disease, through speech and language biomarkers. We explore the current landscape of ML techniques, including deep learning and multimodal approaches, and review their applications across various conditions, highlighting both successes and inherent limitations. Our core contribution lies in outlining future trends across several critical dimensions. These include the enhancement of data availability and quality, the evolution of models, the development of multilingual and cross-cultural models, the establishment of regulatory and clinical translation frameworks, and the creation of hybrid systems enabling human–artificial intelligence (AI) collaboration. Finally, we conclude with a vision for future directions, emphasizing the potential integration of ML-driven speech diagnostics into public health infrastructure, the development of patient-specific explainable AI, and its synergistic combination with genomics and brain imaging for holistic brain health assessment. If the substantial hurdles in validation, generalization, and clinical adoption can be overcome, the field is poised to shift toward ubiquitous, accessible, and highly personalized tools for early diagnosis.
(This article belongs to the Special Issue Artificial Intelligence in Acoustic Phonetics)

21 pages, 471 KB  
Review
Long Short-Term Memory Networks: A Comprehensive Survey
by Moez Krichen and Alaeddine Mihoub
AI 2025, 6(9), 215; https://doi.org/10.3390/ai6090215 - 5 Sep 2025
Cited by 8 | Viewed by 5752
Abstract
Long Short-Term Memory (LSTM) networks have revolutionized the field of deep learning, particularly in applications that require the modeling of sequential data. Originally designed to overcome the limitations of traditional recurrent neural networks (RNNs), LSTMs effectively capture long-range dependencies in sequences, making them suitable for a wide array of tasks. This survey aims to provide a comprehensive overview of LSTM architectures, detailing their unique components, such as cell states and gating mechanisms, which facilitate the retention and modulation of information over time. We delve into the various applications of LSTMs across multiple domains, including natural language processing (NLP), where they are employed for language modeling, machine translation, and sentiment analysis; time series analysis, where they play a critical role in forecasting tasks; and speech recognition, where they significantly enhance the accuracy of automated systems. By examining these applications, we illustrate the versatility and robustness of LSTMs in handling complex data types. Additionally, we explore several notable variants and improvements of the standard LSTM architecture, such as Bidirectional LSTMs, which enhance context understanding, and Stacked LSTMs, which increase model capacity. We also discuss the integration of attention mechanisms with LSTMs, which has further advanced their performance in various tasks. Despite their strengths, LSTMs face several challenges, including high computational complexity, extensive data requirements, and difficulties in training, which can hinder their practical implementation. This survey addresses these limitations and provides insights into ongoing research aimed at mitigating these issues. In conclusion, we highlight recent advances in LSTM research and propose potential future directions that could lead to enhanced performance and broader applicability of LSTM networks. This survey serves as a foundational resource for researchers and practitioners seeking to understand the current landscape of LSTM technology and its future trajectory.
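For reference, the gating mechanism surveyed here is commonly written as follows, where σ is the logistic sigmoid, ⊙ is element-wise multiplication, x_t is the input, h_t the hidden state, and c_t the cell state (weight and bias naming varies across papers):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```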

21 pages, 1118 KB  
Review
Integrating Large Language Models into Robotic Autonomy: A Review of Motion, Voice, and Training Pipelines
by Yutong Liu, Qingquan Sun and Dhruvi Rajeshkumar Kapadia
AI 2025, 6(7), 158; https://doi.org/10.3390/ai6070158 - 15 Jul 2025
Cited by 2 | Viewed by 9773
Abstract
This survey provides a comprehensive review of the integration of large language models (LLMs) into autonomous robotic systems, organized around four key pillars: locomotion, navigation, manipulation, and voice-based interaction. We examine how LLMs enhance robotic autonomy by translating high-level natural language commands into low-level control signals, supporting semantic planning and enabling adaptive execution. Systems like SayTap improve gait stability through LLM-generated contact patterns, while TrustNavGPT achieves a 5.7% word error rate (WER) under noisy voice-guided conditions by modeling user uncertainty. Frameworks such as MapGPT, LLM-Planner, and 3D-LOTUS++ integrate multi-modal data—including vision, speech, and proprioception—for robust planning and real-time recovery. We also highlight the use of physics-informed neural networks (PINNs) to model object deformation and support precision in contact-rich manipulation tasks. To bridge the gap between simulation and real-world deployment, we synthesize best practices from benchmark datasets (e.g., RH20T, Open X-Embodiment) and training pipelines designed for one-shot imitation learning and cross-embodiment generalization. Additionally, we analyze deployment trade-offs across cloud, edge, and hybrid architectures, emphasizing latency, scalability, and privacy. The survey concludes with a multi-dimensional taxonomy and cross-domain synthesis, offering design insights and future directions for building intelligent, human-aligned robotic systems powered by LLMs.

25 pages, 7813 KB  
Article
Deep Learning-Based Speech Recognition and LabVIEW Integration for Intelligent Mobile Robot Control
by Kai-Chao Yao, Wei-Tzer Huang, Hsi-Huang Hsieh, Teng-Yu Chen, Wei-Sho Ho, Jiunn-Shiou Fang and Wei-Lun Huang
Actuators 2025, 14(5), 249; https://doi.org/10.3390/act14050249 - 15 May 2025
Cited by 1 | Viewed by 2121
Abstract
This study implemented an innovative system that trains a speech recognition model based on the DeepSpeech2 architecture using Python for voice control of a robot on the LabVIEW platform. First, the model was trained on a large speech dataset, enabling it to accurately transcribe voice commands. It was then integrated with the LabVIEW graphical user interface and the myRIO controller. By leveraging LabVIEW’s graphical programming environment, the system processed voice commands, translated them into control signals, and directed the robot’s movements accordingly. Experimental results demonstrate that the system not only accurately recognizes various voice commands but also controls the robot’s behavior in real time, showing high practicality and reliability. This study addresses the limitations inherent in conventional voice control methods, demonstrates the potential of integrating deep learning technology with industrial control platforms, and presents a novel approach for robotic voice control.
(This article belongs to the Section Actuators for Robotics)
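To make the "voice command to control signal" step concrete, here is a minimal, purely illustrative mapping from a recognized transcript to wheel-speed commands. The command vocabulary, the speed convention, and the hand-off to LabVIEW/myRIO are assumptions, not the paper's actual implementation.

```python
# Hypothetical command table; the paper's real command set and the LabVIEW/myRIO
# interface are not described here.
COMMANDS = {
    "forward": (1.0, 1.0),    # (left wheel, right wheel) normalized speeds
    "backward": (-1.0, -1.0),
    "left": (-0.5, 0.5),
    "right": (0.5, -0.5),
    "stop": (0.0, 0.0),
}

def transcript_to_control(transcript: str):
    """Map a recognized utterance to a wheel-speed pair, or None if unrecognized."""
    text = transcript.lower()
    for keyword, speeds in COMMANDS.items():
        if keyword in text:
            return speeds
    return None

print(transcript_to_control("please move forward"))  # -> (1.0, 1.0)
```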

20 pages, 2690 KB  
Article
Creating a Parallel Corpus for the Kazakh Sign Language and Learning
by Aigerim Yerimbetova, Bakzhan Sakenov, Madina Sambetbayeva, Elmira Daiyrbayeva, Ulmeken Berzhanova and Mohamed Othman
Appl. Sci. 2025, 15(5), 2808; https://doi.org/10.3390/app15052808 - 5 Mar 2025
Cited by 2 | Viewed by 3183
Abstract
Kazakh Sign Language (KSL) is a crucial communication tool for individuals with hearing and speech impairments. Deep learning, particularly Transformer models, offers a promising approach to improving accessibility in education and communication. This study analyzes the syntactic structure of KSL, identifying its unique grammatical features and deviations from spoken Kazakh. A custom parser was developed to convert Kazakh text into KSL glosses, enabling the creation of a large-scale parallel corpus. Using this resource, a Transformer-based machine translation model was trained, achieving high translation accuracy and demonstrating the feasibility of this approach for enhancing communication accessibility. The research highlights key challenges in sign language processing, such as the limited availability of annotated data. Directions for future work include the integration of video data and the adoption of more comprehensive evaluation metrics. This paper presents a methodology for constructing a parallel corpus through gloss annotations, contributing to advancements in sign language translation technology.
(This article belongs to the Section Computing and Artificial Intelligence)

22 pages, 11349 KB  
Article
A Novel Bi-Dual Inference Approach for Detecting Six-Element Emotions
by Xiaoping Huang, Yujian Zhou and Yajun Du
Appl. Sci. 2023, 13(17), 9957; https://doi.org/10.3390/app13179957 - 3 Sep 2023
Cited by 1 | Viewed by 1481
Abstract
In recent years, there has been rapid development in machine learning for solving artificial intelligence tasks in various fields, including translation, speech, and image processing. These AI tasks are often interconnected rather than independent. One specific type of relationship is known as structural duality, which exists between multiple pairs of artificial intelligence tasks. The concept of dual learning has gained significant attention in the fields of machine learning, computer vision, and natural language processing. Dual learning involves using primitive tasks (mapping from domain X to domain Y) and dual tasks (mapping from domain Y to domain X) to enhance the performance of both tasks. In this study, we propose a general framework called Bi-Dual Inference by combining the principles of dual inference and dual learning. Our framework generates multiple dual models and a primal model by utilizing two dual tasks: sentiment analysis of input text and sentence generation from sentiment labels. We create these model pairs (primal model f, dual model g) by employing different initialization seeds and data access sequences. Each primal and dual model is a distinct LSTM model. By reasoning about a single task with multiple similar models in the same direction, our framework achieves improved classification results. To validate the effectiveness of our proposed model, we conduct experiments on two datasets, namely NLPCC2013 and NLPCC2014. The results demonstrate that our model outperforms the optimal baseline model in terms of the F1 score, achieving an improvement of approximately 5%. Additionally, we analyze the parameter settings of our proposed model, including the number of iterations, the α and λ parameters, the batch size, the training sentence length, and the hidden layer size. These experimental results further confirm the effectiveness of our proposed model.
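The combination of primal (text to sentiment) and dual (sentiment to text) models at inference time can be sketched as a weighted score, following the general dual-inference idea; the model interfaces, the α weighting, and the averaging over seed/ordering variants below are assumptions for illustration and may differ from the paper's exact formulation.

```python
import math

# Hypothetical model interfaces: each primal model f exposes P(label | text) and
# each dual model g exposes P(text | label); label_prior approximates P(label).
def bi_dual_score(text, label, primal_models, dual_models, label_prior, alpha=0.5):
    """Average log-scores across model variants and combine primal and dual views."""
    primal = sum(math.log(f.prob_label_given_text(text, label))
                 for f in primal_models) / len(primal_models)
    dual = sum(math.log(g.prob_text_given_label(text, label))
               for g in dual_models) / len(dual_models)
    return alpha * primal + (1.0 - alpha) * (dual + math.log(label_prior[label]))

def classify(text, labels, primal_models, dual_models, label_prior):
    """Pick the label with the highest combined bi-dual score."""
    return max(labels, key=lambda y: bi_dual_score(text, y, primal_models,
                                                   dual_models, label_prior))
```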

18 pages, 1102 KB  
Review
Changing the Tendency to Integrate the Senses
by Saul I. Quintero, Ladan Shams and Kimia Kamal
Brain Sci. 2022, 12(10), 1384; https://doi.org/10.3390/brainsci12101384 - 13 Oct 2022
Cited by 11 | Viewed by 3541 | Correction
Abstract
Integration of sensory signals that emanate from the same source, such as the sight of lip articulations and the sound of the voice of a speaking individual, can improve perception of the source signal (e.g., speech). Because momentary sensory inputs are typically corrupted with internal and external noise, there is almost always a discrepancy between the inputs, confronting the perceptual system with the problem of determining whether the two signals were caused by the same source or by different sources. Thus, whether or not multisensory stimuli are integrated and the degree to which they are bound is influenced by factors such as the prior expectation of a common source. We refer to this factor as the tendency to bind stimuli, or, for short, binding tendency. In theory, the tendency to bind sensory stimuli can be learned by experience through the acquisition of the probabilities of the co-occurrence of the stimuli. It can also be influenced by cognitive knowledge of the environment. The binding tendency varies across individuals and can also vary within an individual over time. Here, we review the studies that have investigated the plasticity of binding tendency. We discuss the protocols that have been reported to produce changes in binding tendency, the candidate learning mechanisms involved in this process, the possible neural correlates of binding tendency, and outstanding questions pertaining to binding tendency and its plasticity. We conclude by proposing directions for future research and argue that understanding mechanisms and recipes for increasing binding tendency can have important clinical and translational applications for populations or individuals with a deficiency in multisensory integration.
(This article belongs to the Special Issue The Neural Basis of Multisensory Plasticity)

16 pages, 25741 KB  
Article
Speech GAU: A Single Head Attention for Mandarin Speech Recognition for Air Traffic Control
by Shiyu Zhang, Jianguo Kong, Chao Chen, Yabin Li and Haijun Liang
Aerospace 2022, 9(8), 395; https://doi.org/10.3390/aerospace9080395 - 22 Jul 2022
Cited by 11 | Viewed by 3122
Abstract
The rise of end-to-end (E2E) speech recognition technology in recent years has overturned the design pattern of cascading multiple subtasks in classical speech recognition and achieved direct mapping of speech input signals to text labels. In this study, a new E2E framework, ResNet–GAU–CTC, is proposed to implement Mandarin speech recognition for air traffic control (ATC). A deep residual network (ResNet) utilizes the translation invariance and local correlation of a convolutional neural network (CNN) to extract the time-frequency domain information of speech signals. A gated attention unit (GAU) utilizes a gated single-head attention mechanism to better capture the long-range dependencies of sequences, thus attaining a larger receptive field and richer contextual information, as well as a faster training convergence rate. The connectionist temporal classification (CTC) criterion eliminates the need for forced frame-level alignments. To address the problems of scarce data resources and unique pronunciation norms and contexts in the ATC field, transfer learning and data augmentation techniques were applied to enhance the robustness of the network and improve the generalization ability of the model. The character error rate (CER) of our model was 11.1% on the expanded Aishell corpus, and it decreased to 8.0% on the ATC corpus.
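The reported character error rates are, in the usual definition, the character-level edit distance between hypothesis and reference divided by the reference length; a minimal reference implementation is sketched below (the authors' exact scoring script is not shown here).

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance over reference length."""
    r, h = reference, hypothesis
    # Dynamic-programming edit distance table.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

print(cer("塔台请求起飞", "塔台请求起二"))  # one substitution over six characters ≈ 0.17
```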

24 pages, 580 KB  
Article
Cascade or Direct Speech Translation? A Case Study
by Thierry Etchegoyhen, Haritz Arzelus, Harritxu Gete, Aitor Alvarez, Iván G. Torre, Juan Manuel Martín-Doñas, Ander González-Docasal and Edson Benites Fernandez
Appl. Sci. 2022, 12(3), 1097; https://doi.org/10.3390/app12031097 - 21 Jan 2022
Cited by 16 | Viewed by 6174
Abstract
Speech translation has been traditionally tackled under a cascade approach, chaining speech recognition and machine translation components to translate from an audio source in a given language into text or speech in a target language. Leveraging deep learning approaches to natural language processing, recent studies have explored the potential of direct end-to-end neural modelling to perform the speech translation task. Though several benefits may come from end-to-end modelling, such as a reduction in latency and error propagation, the comparative merits of each approach still deserve detailed evaluation and analysis. In this work, we compared state-of-the-art cascade and direct approaches on the under-resourced Basque–Spanish language pair, which features challenging phenomena such as marked differences in morphology and word order. This case study thus complements other studies in the field, which mostly revolve around the English language. We described and analysed in detail the mintzai-ST corpus, prepared from the sessions of the Basque Parliament, and evaluated the strengths and limitations of cascade and direct speech translation models trained on this corpus, with variants exploiting additional data as well. Our results indicated that, despite significant progress with end-to-end models, which may outperform alternatives in some cases in terms of automated metrics, a cascade approach proved optimal overall in our experiments and manual evaluations.
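Schematically, the two architectures compared in this case study differ as sketched below; the function and model names are hypothetical placeholders, not the authors' actual components.

```python
# Hypothetical interfaces standing in for an ASR system, an MT system, and an
# end-to-end speech translation model.
def cascade_translate(audio, asr_model, mt_model):
    """Cascade: transcribe the source speech, then translate the transcript."""
    transcript = asr_model.transcribe(audio)   # ASR errors propagate downstream
    return mt_model.translate(transcript)

def direct_translate(audio, st_model):
    """Direct: a single end-to-end model maps source speech to target-language text."""
    return st_model.translate(audio)           # no intermediate transcript
```

The end-to-end benefits named in the abstract (lower latency, no error propagation between components) correspond to removing the intermediate transcript in the second function.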

35 pages, 3387 KB  
Article
Gender in Unilingual and Mixed Speech of Spanish Heritage Speakers in The Netherlands
by Ivo Boers, Bo Sterken, Brechje van Osch, M. Carmen Parafita Couto, Janet Grijzenhout and Deniz Tat
Languages 2020, 5(4), 68; https://doi.org/10.3390/languages5040068 - 4 Dec 2020
Cited by 7 | Viewed by 5068
Abstract
This study examines heritage speakers of Spanish in The Netherlands regarding their production of gender in both their languages (Spanish and Dutch) as well as their gender assignment strategies in code-switched constructions. A director-matcher task was used to elicit unilingual and mixed speech from 21 participants (aged 8 to 52, mean = 17). The nominal domain, consisting of a determiner, noun, and adjective, was targeted in three modes: (i) unilingual Spanish mode, (ii) unilingual Dutch mode, and (iii) code-switched mode in both directions (Dutch to Spanish and Spanish to Dutch). The production of gender in both unilingual modes deviated from the respective monolingual norms, especially in Dutch, the dominant language of the society. In the code-switching mode, evidence was found for the gender default strategy (common in Dutch, masculine in Spanish), the analogical gender strategy (i.e., the preference to assign the gender of the translation equivalent), as well as two thus far unattested strategies involving a combination of a default gender and the use of a non-prototypical word order. External factors such as age of onset of bilingualism and the amount of exposure to and use of both languages had an effect on both gender accuracy in the unilingual modes and assignment strategies in the code-switching modes.
(This article belongs to the Special Issue Contemporary Advances in Linguistic Research on Heritage Spanish)

17 pages, 7225 KB  
Article
Improved Arabic–Chinese Machine Translation with Linguistic Input Features
by Fares Aqlan, Xiaoping Fan, Abdullah Alqwbani and Akram Al-Mansoub
Future Internet 2019, 11(1), 22; https://doi.org/10.3390/fi11010022 - 19 Jan 2019
Cited by 11 | Viewed by 6411
Abstract
This study presents linguistically augmented models of phrase-based statistical machine translation (PBSMT) using different linguistic features (factors) on top of the source surface form. The architecture addresses two major problems in machine translation, namely the poor performance of direct translation from a highly inflected and morphologically complex language into morphologically poor languages, and the data sparseness issue, which becomes a significant challenge under low-resource conditions. We use three factors (lemma, part-of-speech tags, and morphological features) to enrich the input side with additional information to improve the quality of direct translation from Arabic to Chinese, considering the importance and global presence of this language pair as well as the limited work on machine translation between these two languages. In an effort to deal with the issue of out-of-vocabulary (OOV) words and missing words, we propose the best combination of factors and models based on alternative paths. The proposed models were compared with the standard PBSMT model, which represents the baseline of this work, and with two enhanced approaches tokenized by a state-of-the-art external tool that has been proven useful for Arabic as a morphologically rich and complex language. The experiment was performed with the Moses decoder on freely available data extracted from a multilingual corpus of United Nations documents (MultiUN). Results of a preliminary evaluation in terms of BLEU scores show that the use of linguistic features on the Arabic side considerably outperforms the baseline and tokenized approaches, and the system consistently reduces the OOV rate as well.
(This article belongs to the Section Big Data and Augmented Intelligence)
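In Moses-style factored translation, factors such as those described here (lemma, part-of-speech tag, morphological features) are attached to each surface token with a separator. The sketch below shows the general input format; the transliterated tokens, lemmas, and tags are made-up placeholders, since the paper's factor inventory and tag sets are not reproduced here.

```python
# Illustrative only: the surface forms, lemmas, and morphological tags below are
# placeholders, not the paper's actual annotation scheme.
def to_factored_line(tokens):
    """Format one source sentence as factored input: surface|lemma|pos|morph."""
    return " ".join("|".join((t["surface"], t["lemma"], t["pos"], t["morph"]))
                    for t in tokens)

example = [
    {"surface": "yktbwn", "lemma": "ktb", "pos": "VERB", "morph": "3MP+IMPF"},
    {"surface": "rsAlp", "lemma": "rsAlp", "pos": "NOUN", "morph": "FS"},
]
print(to_factored_line(example))
# -> "yktbwn|ktb|VERB|3MP+IMPF rsAlp|rsAlp|NOUN|FS"
```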

17 pages, 597 KB  
Article
Patterns of Short-Term Phonetic Interference in Bilingual Speech
by Šárka Šimáčková and Václav Jonáš Podlipský
Languages 2018, 3(3), 34; https://doi.org/10.3390/languages3030034 - 24 Aug 2018
Cited by 4 | Viewed by 5539
Abstract
Previous research indicates that alternating between a bilingual’s languages during speech production can lead to short-term increases in cross-language phonetic interaction. However, discrepancies exist between the reported L1–L2 effects in terms of direction and magnitude, and sometimes the effects are not found at all. The present study focused on L1 interference in L2, examining Voice Onset Time (VOT) of English voiceless stops produced by L1-dominant Czech-English bilinguals—interpreter trainees highly proficient in L2-English. We tested two hypotheses: (1) switching between languages induces an immediate increase in L1 interference during code-switching; and (2) due to global language co-activation, an increase in L1-to-L2 interference occurs when bilinguals interpret (translate) a message from L1 into L2 even if they do not produce L1 speech. Fourteen bilinguals uttered L2-English sentences under three conditions: L2-only, code-switching into L2, and interpreting into L2. Against expectations, the results showed that English VOT in the bilingual tasks tended to be longer and less Czech-like than in the English-only task. This contradicts an earlier finding of L2 VOT converging temporarily towards L1 VOT values in comparable bilingual tasks performed by speakers from the same bilingual population. Participant-level inspection of our data suggests that, besides language-background differences, individual language-switching strategies contribute to discrepancies between studies.
(This article belongs to the Special Issue Interdisciplinary Perspectives on Code-Switching)
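Voice Onset Time itself is a simple interval measure: the time from the stop's release burst to the onset of voicing. A minimal sketch with hypothetical annotation times (for example, read from a Praat TextGrid) is below.

```python
def voice_onset_time(burst_release_s: float, voicing_onset_s: float) -> float:
    """VOT in seconds; positive values indicate a voicing lag after the burst."""
    return voicing_onset_s - burst_release_s

# Hypothetical annotation times for one English /t/ token.
print(voice_onset_time(1.240, 1.315) * 1000)  # ≈ 75 ms, a typical English long-lag value
```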
