
Modeling of Multimodal Speech Recognition and Language Processing


A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Electronic Multimedia".

Deadline for manuscript submissions: closed (15 April 2025) | Viewed by 6128


Special Issue Editors

Dr. Xiaoxue Gao

Guest Editor

Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576, Singapore
Interests: automatic lyrics transcription; speech recognition; speech-to-singing conversion; singing information processing; music information retrieval; multi-modal processing

Dr. Xinyuan Qian

Guest Editor

School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
Interests: multi-modal fusion; speaker localization and tracking; speech-related topics

Dr. Ruijie Tao

Guest Editor

Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576, Singapore
Interests: multi-modal processing; speaker recognition; active speaker detection; self-supervised learning

Prof. Dr. Malu Zhang

Guest Editor

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
Interests: artificial intelligence; artificial neural networks; brain-inspired computational intelligence

Dr. Zhaojie Luo

Guest Editor

The Institute of Scientific and Industrial Research, Osaka University, Suita 565-0871, Japan
Interests: voice conversion; speech synthesis; facial expression recognition; multimodal emotion recognition; statistical signal processing

Special Issue Information

Dear Colleagues,

This Special Issue, ‘Modeling of Multimodal Speech Recognition and Language Processing,’ aims to delve into the rapidly evolving landscape of automatic speech recognition (ASR) and language processing. It seeks to collate papers that explore innovative approaches to bridging the gap between human speech comprehension and computational interpretation, and that emphasize the development of novel techniques to enhance ASR and language modeling, particularly in challenging settings such as noisy environments, multimodal contexts, and multi-lingual speech recognition.

By concentrating on challenging real-world scenarios, we encourage researchers to push the boundaries of existing knowledge and contribute ground-breaking solutions to the field. Furthermore, this Special Issue is designed to provide a comprehensive resource for researchers, both newcomers and experts, by presenting cutting-edge research, methodologies, and insights that are directly applicable to real-world ASR and language processing challenges.

In relation to existing approaches, this Special Issue seeks to build upon the foundation laid by prior research in ASR and language processing, and to extend the existing literature by focusing on emerging challenges, such as multimodal recognition and security concerns, that have gained prominence in recent years. By addressing these gaps in the literature, we aim to offer a forward-looking perspective on ASR and language processing, showcasing practical solutions and insights that align with contemporary demands. Researchers can expect to find valuable references and inspiration to address the most pressing issues in the field, making this Special Issue a pivotal addition to the existing body of work.

Topics of interest include, but are not limited to, the following:

Robust speech recognition;
Language modeling;
Multi-lingual speech recognition;
Audio-visual speech recognition;
Fast decoding techniques;
Representation learning for audio, text, and/or vision;
Speaker recognition for speech recognition;
Audio security and adversarial attacks on speech recognition models;
Large speech models;
Large language models for speech recognition.

Dr. Xiaoxue Gao
Dr. Xinyuan Qian
Dr. Ruijie Tao
Dr. Malu Zhang
Dr. Zhaojie Luo
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

robust speech recognition
novel approaches for speech recognition
multi-lingual speech recognition
audio-visual speech recognition
language modelling
self-supervised learning for speech processing
representation learning for audio, language and vision

Benefits of Publishing in a Special Issue

Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (2 papers)


Research

17 pages, 1941 KB

Open Access | Editor’s Choice | Article

MMER-LMF: Multi-Modal Emotion Recognition in Lightweight Modality Fusion

by Eun-Hee Kim, Myung-Jin Lim and Ju-Hyun Shin

Electronics 2025, 14(11), 2139; https://doi.org/10.3390/electronics14112139 - 24 May 2025

Cited by 6 | Viewed by 2541

Abstract

Recently, multimodal approaches that combine various modalities have been attracting attention as a way to recognize emotions more accurately. Although multimodal fusion delivers strong performance, it is computationally intensive and difficult to handle in real time. In addition, there is a fundamental lack of large-scale emotional datasets for learning. In particular, Korean emotional datasets have fewer resources available than English-language datasets, thereby limiting the generalization capability of emotion recognition models. In this study, we propose a more lightweight modality fusion method, MMER-LMF, to overcome the lack of Korean emotional datasets and improve emotion recognition performance while reducing model training complexity. To this end, we suggest three algorithms that fuse emotion scores based on the reliability of each model, including text emotion scores extracted using a pre-trained large-scale language model and video emotion scores extracted based on a 3D CNN model. Each algorithm showed similar classification performance except for slight differences in disgust emotion performance with confidence-based weight adjustment, correlation coefficient utilization, and the Dempster–Shafer Theory-based combination method. The accuracy was 80% and the recall was 79%, which is higher than 58% using text modality and 72% using video modality. This is a superior result in terms of learning complexity and performance compared to previous studies using Korean datasets.

(This article belongs to the Special Issue Modeling of Multimodal Speech Recognition and Language Processing)
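
As a rough illustration of what score-level fusion of two modalities can look like, the sketch below fuses a text score vector and a video score vector in two ways: a confidence-weighted average and a Dempster–Shafer combination restricted to single-emotion hypotheses. It is not the authors' implementation. The function names, the use of each model's top score as its confidence, and the example numbers are all assumptions made for illustration.

```python
from typing import List, Sequence


def normalize(scores: Sequence[float]) -> List[float]:
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


def confidence_weighted_fusion(text_scores: Sequence[float],
                               video_scores: Sequence[float]) -> List[float]:
    """Weight each modality by its own top score (an assumed confidence proxy)."""
    t = normalize(text_scores)
    v = normalize(video_scores)
    w_text, w_video = max(t), max(v)
    fused = [w_text * a + w_video * b for a, b in zip(t, v)]
    return normalize(fused)


def dempster_shafer_singletons(text_scores: Sequence[float],
                               video_scores: Sequence[float]) -> List[float]:
    """Dempster's rule when all mass sits on single classes:
    multiply matching masses and renormalize by the non-conflicting mass."""
    t = normalize(text_scores)
    v = normalize(video_scores)
    agreement = [a * b for a, b in zip(t, v)]
    non_conflict = sum(agreement)
    if non_conflict == 0:
        raise ValueError("sources are in total conflict")
    return [a / non_conflict for a in agreement]


# Hypothetical three-class scores (not taken from the paper).
text = [0.5, 0.3, 0.2]
video = [0.1, 0.8, 0.1]
weighted = confidence_weighted_fusion(text, video)
combined = dempster_shafer_singletons(text, video)
print(weighted.index(max(weighted)), combined.index(max(combined)))
```

Both functions operate only on the per-class scores, so neither model has to be retrained to change the fusion rule. That is one reason score-level fusion is often described as lightweight.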


15 pages, 1256 KB

Open Access | Article

Testing in Noise Based on the First Adaptive Matrix Sentence Test in Slovak Language

by Eva Kiktová, Rudolph Sock and Peter Getlík

Electronics 2024, 13(3), 602; https://doi.org/10.3390/electronics13030602 - 1 Feb 2024

Cited by 1 | Viewed by 1992

Abstract

This study deals with an acoustic perceptual test performed on the basis of adaptive matrix tests, which represent a modern and reliable tool that can be used not only in perceptual phonetics but also for detecting problems related to hearing. The tests used, based on the first Slovak adaptive matrix, provided extensive test material, which was evaluated through a series of tests implemented according to ICRA (International Collegium of Rehabilitative Audiology) guidelines. Healthy listeners took part in the tests, and, during the tests, they listened to prepared sentence stimuli simultaneously with noise. Out of a total number of 30 tests, 15 tests met the demanding criteria. The tests were evaluated from the point of view of the word recognition score, the slope of the psychometric curve function, and also the threshold values corresponding to word recognition at the levels of 20%, 50%, and 80%. We also investigated and compared the impact of two different testing strategies (open and closed test format) and also the impact of experience or unfamiliarity with the test routine used. The created tests achieved SRT50 = −7.03 ± 0.79 dB and a slope of 13.13 ± 1.60%/dB.

(This article belongs to the Special Issue Modeling of Multimodal Speech Recognition and Language Processing)
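
To show how a 50% threshold and a slope relate to the 20% and 80% thresholds mentioned in the abstract, the sketch below uses a logistic psychometric function parameterized by its midpoint (SRT50) and its slope at that midpoint. The abstract does not state which curve the authors fitted, so the logistic form and the function names are assumptions. Only the SRT50 and slope values are taken from the abstract, and the printed levels are derived under that assumption rather than reported results.

```python
import math


def recognition_probability(level_db: float, srt50_db: float,
                            slope_per_db: float) -> float:
    """Logistic curve equal to 0.5 at srt50_db, with derivative slope_per_db there."""
    return 1.0 / (1.0 + math.exp(-4.0 * slope_per_db * (level_db - srt50_db)))


def threshold_level(target: float, srt50_db: float, slope_per_db: float) -> float:
    """Invert the logistic curve: level (dB) at which recognition equals target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    return srt50_db + math.log(target / (1.0 - target)) / (4.0 * slope_per_db)


SRT50_DB = -7.03       # from the abstract
SLOPE_PER_DB = 0.1313  # 13.13 %/dB from the abstract, expressed as a fraction

for target in (0.2, 0.5, 0.8):
    level = threshold_level(target, SRT50_DB, SLOPE_PER_DB)
    print(f"{target:.0%} word recognition at about {level:.2f} dB")
```

The factor of 4 comes from the logistic derivative at the midpoint, which is one quarter of the steepness parameter. Scaling by 4 makes `slope_per_db` equal to the curve's slope at SRT50.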
