
Advances in Automatic Speech Recognition, Audio and Underwater Acoustic Signal Analysis

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: 30 June 2025 | Viewed by 2612

Special Issue Editors


Dr. Kele Xu
Guest Editor
National Key Laboratory of Parallel and Distributed Processing, National University of Defense Technology, Changsha, China
Interests: acoustic signal processing; machine learning; intelligent software systems

Prof. Dr. Rui Liu
Guest Editor
College of Computer Science, Inner Mongolia University, Hohhot 010031, China
Interests: acoustic signal processing; speech synthesis; human–machine conversation

Prof. Dr. Maoshen Jia
Guest Editor
School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China
Interests: speech and audio coding; multichannel audio signal processing; array signal processing

Dr. Dawei Feng
Guest Editor
National Key Laboratory of Parallel and Distributed Processing, National University of Defense Technology, Changsha, China
Interests: distributed computing; machine learning; intelligent software systems

Special Issue Information

Dear Colleagues,

The intersection of technology and acoustics has ushered in a new era of innovation in signal processing. This Special Issue, "Advances in Automatic Speech Recognition, Audio, and Underwater Acoustic Signal Analysis", is dedicated to exploring the latest breakthroughs in these dynamic fields. With a focus on applying advanced algorithms, machine learning, and sensor technologies, we aim to present a comprehensive view of the current state of research and its potential impact on future developments.

We invite scholars, researchers, and industry experts to contribute their insights, fostering a multidisciplinary dialogue that propels the field forward. The themes include:

  • Automatic Speech Recognition;
  • Audio Signal Acquisition;
  • Audio Signal Processing;
  • Audio and Underwater Acoustic Signal Recognition and Classification;
  • Machine Learning and Deep Learning Algorithm Applications in Audio Signal Analysis;
  • Safety and Privacy;
  • Acceleration and Deployment of Audio Signal Processing Algorithms.

This Special Issue is inherently linked to Sensors, focusing on the critical role of sensor technology in capturing and processing acoustic signals. It explores the application of advanced algorithms and machine learning to enhance signal recognition and classification, emphasizing the importance of sensor data quality for effective audio analysis. By addressing safety, privacy, and algorithm deployment, the issue underscores the multidisciplinary innovation driven by sensor advancements in acoustic signal processing.

Dr. Kele Xu
Prof. Dr. Rui Liu
Prof. Dr. Maoshen Jia
Dr. Dawei Feng
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • automatic speech recognition
  • audio signal acquisition
  • audio signal processing
  • audio and underwater acoustic signal recognition and classification
  • machine learning and deep learning algorithm application in audio signal analysis
  • safety and privacy
  • acceleration and deployment of audio signal processing algorithms

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (2 papers)


Research

20 pages, 5438 KiB  
Article
Separation of Simultaneous Speakers with Acoustic Vector Sensor
by Józef Kotus and Grzegorz Szwoch
Sensors 2025, 25(5), 1509; https://doi.org/10.3390/s25051509 - 28 Feb 2025
Viewed by 381
Abstract
This paper presents a method of sound source separation in live audio signals, based on sound intensity analysis. Sound pressure signals recorded with an acoustic vector sensor are analyzed, and the spectral distribution of sound intensity in two dimensions is calculated. Spectral components of the analyzed signal are selected based on the calculated source direction, which leads to a spatial filtration of the sound. The experiments were performed with test signals convolved with impulse responses of a real sensor, recorded for varying sound source positions. The experiments evaluated the proposed method's ability to separate sound sources depending on their position, spectral content, and signal-to-noise ratio, especially when multiple sources are active at the same time. The proposed algorithm provided signal-to-distortion ratio (SDR) values of 10–12 dB and Short-Time Objective Intelligibility (STOI) values in the range 0.86–0.94, an increase of 0.15–0.30 compared with the unprocessed speech signal. The proposed method is intended for applications in automated speech recognition systems, speaker diarization, and separation in concurrent speech scenarios, using a small acoustic sensor.
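The intensity-based spatial filtering described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not the authors' code: it assumes the acoustic vector sensor provides a pressure signal and two particle-velocity components, estimates a direction of arrival per time-frequency bin from the active sound intensity, and keeps only the bins whose direction lies near a chosen target. All function and parameter names are hypothetical.

```python
import numpy as np

def directional_mask_separation(p, vx, vy, target_deg, tol_deg=20.0,
                                n_fft=512, hop=256):
    """Keep time-frequency bins whose intensity-based DOA is within
    tol_deg of target_deg; reconstruct by overlap-add. Illustrative
    simplification of intensity-based spatial filtering."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(p) - n_fft) // hop
    out = np.zeros(len(p))
    norm = np.zeros(len(p))
    for i in range(n_frames):
        s = i * hop
        P = np.fft.rfft(win * p[s:s + n_fft])
        Vx = np.fft.rfft(win * vx[s:s + n_fft])
        Vy = np.fft.rfft(win * vy[s:s + n_fft])
        # Active intensity components per frequency bin
        Ix = np.real(P * np.conj(Vx))
        Iy = np.real(P * np.conj(Vy))
        doa = np.degrees(np.arctan2(Iy, Ix))
        # Angular distance to the target direction, wrapped to [-180, 180)
        diff = np.abs((doa - target_deg + 180.0) % 360.0 - 180.0)
        mask = (diff <= tol_deg).astype(float)
        frame = np.fft.irfft(P * mask, n_fft)
        out[s:s + n_fft] += win * frame
        norm[s:s + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)
```

With two test tones arriving from 0° and 90° (velocity aligned with x and y, respectively), selecting `target_deg=0.0` passes the first tone and suppresses the second; the paper's actual method works with measured sensor impulse responses rather than this idealized plane-wave model.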

15 pages, 1603 KiB  
Article
Contrastive Speaker Representation Learning with Hard Negative Sampling for Speaker Recognition
by Changhwan Go, Young Han Lee, Taewoo Kim, Nam In Park and Chanjun Chun
Sensors 2024, 24(19), 6213; https://doi.org/10.3390/s24196213 - 25 Sep 2024
Viewed by 1655
Abstract
Speaker recognition is a technology that identifies the speaker in an input utterance by extracting speaker-distinguishable features from the speech signal. Speaker recognition is used for system security and authentication; therefore, it is crucial to extract unique features of the speaker to achieve high recognition rates. Representative methods for extracting these features include classification-based approaches, and contrastive learning approaches that learn the relationships between speaker representations and then use embeddings extracted from a specific layer of the model. This paper introduces a framework for developing robust speaker recognition models through contrastive learning. This approach aims to minimize the similarity to hard negative samples: those that are genuine negatives but have features extremely similar to the positives, leading to potential misidentification. Specifically, our proposed method trains the model by estimating hard negative samples within a mini-batch during contrastive learning, and then utilizes a cross-attention mechanism to determine speaker agreement for pairs of utterances. To demonstrate the effectiveness of our proposed method, we compared the performance of a deep learning model trained with a conventional loss function used in speaker recognition with that of a model trained using our proposed method, as measured by the equal error rate (EER), an objective performance metric. Our results indicate that, when trained on the VoxCeleb2 dataset, the proposed method achieved an EER of 0.98% on the VoxCeleb1-E dataset and 1.84% on the VoxCeleb1-H dataset.
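The hard-negative idea in the abstract above can be illustrated with a small NumPy sketch. This is a simplified InfoNCE-style loss, not the paper's exact objective: for each anchor embedding, the most similar in-batch negatives are treated as "hard" and counted a second time in the softmax denominator, which increases their contribution to the loss. The function name, the duplication-based weighting, and the temperature value are assumptions for illustration.

```python
import numpy as np

def hard_negative_contrastive_loss(emb_a, emb_b, temperature=0.07, top_k=2):
    """Contrastive loss over a mini-batch of paired speaker embeddings,
    where (emb_a[i], emb_b[i]) come from the same speaker. Off-diagonal
    pairs are negatives; the top_k most similar negatives per anchor
    ("hard" negatives) are emphasized by counting them twice."""
    # L2-normalize so dot products are cosine similarities
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T / temperature            # (N, N) similarity matrix
    n = sim.shape[0]
    losses = []
    for i in range(n):
        neg = np.delete(sim[i], i)         # negatives: other speakers
        hard = np.sort(neg)[-top_k:]       # hardest (most similar) negatives
        # Partition: positive + all negatives + hard negatives again
        logits = np.concatenate(([sim[i, i]], neg, hard))
        m = logits.max()
        log_z = np.log(np.sum(np.exp(logits - m))) + m
        losses.append(log_z - sim[i, i])   # -log softmax of the positive
    return float(np.mean(losses))
```

On well-separated embeddings with correctly matched pairs the loss is near zero, while mismatched pairs drive it up; the cross-attention agreement step from the paper is a separate component not shown here.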
