
Deep Learning for Speech, Image and Language Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 30 September 2025

Special Issue Editor


Prof. Dr. Dongsuk Yook
Guest Editor
Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
Interests: deep learning; machine learning; artificial intelligence; speech processing

Special Issue Information

Dear Colleagues,

Deep learning has become an essential technology in a wide range of application areas. It has begun to show performance comparable to that of humans in audio, image, video, and natural language processing applications. Recently, spoken language translation and multimodal large language model technologies have provided new interfaces and methods for inputting, manipulating, and generating text, sound, images, and video using computers. This Special Issue is dedicated to state-of-the-art research articles, as well as tutorials and reviews, in the field of deep learning for speech, image, and language processing.

Prof. Dr. Dongsuk Yook
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • speech processing
  • image processing
  • signal processing
  • language processing
  • applications and theories of deep learning

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (7 papers)


Research

24 pages, 5922 KiB  
Article
Age Prediction from Korean Speech Data Using Neural Networks with Diverse Voice Features
by Hayeon Ku, Jiho Lee, Minseo Lee, Seulgi Kim and Janghyeok Yoon
Appl. Sci. 2025, 15(3), 1337; https://doi.org/10.3390/app15031337 - 27 Jan 2025
Abstract
A person’s voice serves as an indicator of age, as it changes with anatomical and physiological influences throughout their life. Although age prediction is a subject of interest across various disciplines, age-prediction studies using Korean voices are limited. The few studies that have been conducted have limitations, such as the absence of specific age groups or detailed age categories. Therefore, this study proposes an optimal combination of speech features and deep-learning models to recognize detailed age groups using a large Korean-speech dataset. From the speech dataset, recorded by individuals ranging from their teens to their 50s, four speech features were extracted: the Mel spectrogram, log-Mel spectrogram, Mel-frequency cepstral coefficients (MFCCs), and ΔMFCCs. Using these speech features, four deep-learning models were trained: ResNet-50, 1D-CNN, 2D-CNN, and a vision transformer. A performance comparison of speech feature-extraction methods and models indicated that MFCCs + ΔMFCCs was the best for both sexes when trained on the 1D-CNN model; it achieved an accuracy of 88.16% for males and 81.95% for females. The results of this study are expected to contribute to the future development of Korean speaker-recognition systems. Full article
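As a rough illustration of the front end described above (not the authors' code), the sketch below extracts MFCCs and their deltas with librosa and stacks them into the kind of feature matrix a 1D-CNN classifier could consume; the file name and parameter values are assumptions.

# Minimal MFCC + ΔMFCC feature-extraction sketch; parameters are illustrative.
import numpy as np
import librosa

def mfcc_delta_features(wav_path, sr=16000, n_mfcc=40):
    """Return a (2 * n_mfcc, frames) array of MFCCs stacked with their deltas."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    delta = librosa.feature.delta(mfcc)                      # ΔMFCCs, same shape
    return np.vstack([mfcc, delta])

features = mfcc_delta_features("speaker_001.wav")  # hypothetical file name
print(features.shape)                              # e.g. (80, number_of_frames)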
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)

26 pages, 3823 KiB  
Article
Enhanced Conformer-Based Speech Recognition via Model Fusion and Adaptive Decoding with Dynamic Rescoring
by Junhao Geng, Dongyao Jia, Zihao He, Nengkai Wu and Ziqi Li
Appl. Sci. 2024, 14(24), 11583; https://doi.org/10.3390/app142411583 - 11 Dec 2024
Abstract
Speech recognition is widely applied in fields like security, education, and healthcare. While its development drives global information infrastructure and AI strategies, current models still face challenges such as overfitting, local optima, and inefficiencies in decoding accuracy and computational cost. These issues cause instability and long response times, hindering AI’s competitiveness. Therefore, addressing these technical bottlenecks is critical for advancing national scientific progress and global information infrastructure. In this paper, we propose improvements to the model structure fusion and decoding algorithms. First, based on the Conformer network and its variants, we introduce a weighted fusion method using training loss as an indicator, adjusting the weights, thresholds, and other related parameters of the fused models to balance the contributions of different model structures, thereby creating a more robust and generalized model that alleviates overfitting and local optima. Second, for the decoding phase, we design a dynamic adaptive decoding method that combines traditional decoding algorithms such as connectionist temporal classification and attention-based models. This ensemble approach enables the system to adapt to different acoustic environments, improving its robustness and overall performance. Additionally, to further optimize the decoding process, we introduce a penalty function mechanism as a regularization technique to reduce the model’s dependence on a single decoding approach. The penalty function limits the weights of decoding strategies to prevent over-reliance on any single decoder, thus enhancing the model’s generalization. Finally, we validate our model on the Librispeech dataset, a large-scale English speech corpus containing approximately 1000 h of audio data. Experimental results demonstrate that the proposed method achieves word error rates (WERs) of 3.92% and 4.07% on the development and test sets, respectively, significantly improving over single-model and traditional decoding methods. Notably, the method reduces WER by approximately 0.4% on complex datasets compared to several advanced mainstream models, underscoring its superior robustness and adaptability in challenging acoustic environments. The effectiveness of the proposed method in addressing overfitting and improving accuracy and efficiency during the decoding phase was validated, highlighting its significance in advancing speech recognition technology. Full article
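An illustrative sketch (not the paper's implementation) of two ideas named in the abstract: weighting fused models by their training loss, and interpolating CTC and attention-decoder scores when rescoring a hypothesis. The inverse-loss weighting and the variable names are assumptions; the paper's penalty-function regularization of the decoder weights is not reproduced here.

import numpy as np

def loss_based_weights(train_losses):
    """Lower training loss -> larger fusion weight; weights are normalized to sum to 1."""
    inv = 1.0 / np.asarray(train_losses, dtype=float)
    return inv / inv.sum()

def rescore(ctc_logprob, attn_logprob, w_ctc=0.5):
    """Hybrid rescoring: interpolate CTC and attention log-probabilities for one hypothesis."""
    return w_ctc * ctc_logprob + (1.0 - w_ctc) * attn_logprob

print(loss_based_weights([0.8, 1.2, 1.0]))   # fusion weights for three fused models
print(rescore(-4.2, -3.7, w_ctc=0.3))        # combined score for one hypothesis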
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)

17 pages, 896 KiB  
Article
Tibetan Speech Synthesis Based on Pre-Trained Mixture Alignment FastSpeech2
by Qing Zhou, Xiaona Xu and Yue Zhao
Appl. Sci. 2024, 14(15), 6834; https://doi.org/10.3390/app14156834 - 5 Aug 2024
Abstract
Most current research in Tibetan speech synthesis relies primarily on autoregressive models in deep learning. However, these models face challenges such as slow inference, skipped readings, and repetitions. To overcome these issues, we propose an enhanced non-autoregressive acoustic model combined with a vocoder for Tibetan speech synthesis. Specifically, we introduce the mixture alignment FastSpeech2 method to correct errors caused by hard alignment in the original FastSpeech2 method. This new method employs soft alignment at the level of Latin letters and hard alignment at the level of Tibetan characters, thereby improving alignment accuracy between text and speech and enhancing the naturalness and intelligibility of the synthesized speech. Additionally, we integrate pitch and energy information into the model, further enhancing overall synthesis quality. Furthermore, Tibetan has relatively small text-to-audio datasets compared to widely studied languages. To address these limited resources, we employ a transfer learning approach to pre-train the model with data from resource-rich languages. Subsequently, this pre-trained mixture alignment FastSpeech2 model is fine-tuned for Tibetan speech synthesis. Experimental results demonstrate that the mixture alignment FastSpeech2 model produces higher-quality speech compared to the original FastSpeech2 model, particularly when pre-trained on an English dataset, resulting in further improvements in clarity and naturalness. Full article
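The transfer-learning step described above can be pictured with a minimal, self-contained sketch (not the authors' code): weights pre-trained on a resource-rich language are loaded into an acoustic model, which is then fine-tuned on the target-language data. A tiny stand-in module and random tensors are used so the sketch runs; the real model would be mixture alignment FastSpeech2.

import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):          # stand-in for the real acoustic model
    def __init__(self, in_dim=80, out_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))
    def forward(self, x):
        return self.net(x)

model = TinyAcousticModel()
# Pretend these weights came from pre-training on an English corpus.
pretrained_state = TinyAcousticModel().state_dict()
model.load_state_dict(pretrained_state, strict=False)       # reuse whatever layers match

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # small learning rate for fine-tuning
criterion = nn.L1Loss()
for step in range(3):                                       # stand-in fine-tuning loop
    text_feats = torch.randn(8, 80)                         # placeholder "Tibetan" batch
    target_mel = torch.randn(8, 80)
    optimizer.zero_grad()
    loss = criterion(model(text_feats), target_mel)
    loss.backward()
    optimizer.step()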
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)

14 pages, 3407 KiB  
Article
An Audio Copy-Move Forgery Localization Model by CNN-Based Spectral Analysis
by Wei Zhao, Yujin Zhang, Yongqi Wang and Shiwen Zhang
Appl. Sci. 2024, 14(11), 4882; https://doi.org/10.3390/app14114882 - 4 Jun 2024
Abstract
In audio copy-move forgery forensics, existing traditional methods typically first segment audio into voiced and silent segments, then compute the similarity between voiced segments to detect and locate forged segments. However, audio collected in noisy environments is difficult to segment, and manually set heuristic similarity thresholds lack robustness. Existing deep learning methods extract features from audio and then use neural networks for binary classification, lacking the ability to locate forged segments. Therefore, for locating audio copy-move forgery segments, we have improved deep learning methods and propose a robust localization model using CNN-based spectral analysis. In the localization model, the Feature Extraction Module extracts deep features from Mel-spectrograms, while the Correlation Detection Module automatically determines the correlation between these deep features. Finally, the Mask Decoding Module visually locates the forged segments. Experimental results show that, compared to existing methods, the localization model improves the detection accuracy of audio copy-move forgery by 3.0–6.8% and improves the average detection accuracy of forged audio with post-processing attacks such as noise, filtering, resampling, and MP3 compression by over 7.0%. Full article
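As a simplified illustration of the model's input (not the paper's learned modules), the sketch below computes a log-Mel spectrogram and a frame-to-frame cosine-similarity matrix, the kind of correlation structure in which a copied-and-moved segment appears as an off-diagonal stripe; parameters and the file name are assumptions, and the paper learns both the features and the correlation with CNNs.

import numpy as np
import librosa

y, sr = librosa.load("suspect_audio.wav", sr=16000)          # hypothetical file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)                           # (n_mels, frames)

# Cosine similarity between every pair of frames.
unit_frames = log_mel / (np.linalg.norm(log_mel, axis=0, keepdims=True) + 1e-8)
similarity = unit_frames.T @ unit_frames                     # (frames, frames)
print(similarity.shape)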
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)

18 pages, 3890 KiB  
Article
Pyramid Feature Attention Network for Speech Resampling Detection
by Xinyu Zhou, Yujin Zhang, Yongqi Wang, Jin Tian and Shaolun Xu
Appl. Sci. 2024, 14(11), 4803; https://doi.org/10.3390/app14114803 - 1 Jun 2024
Abstract
Speech forgery and tampering, increasingly facilitated by advanced audio editing software, pose significant threats to the integrity and privacy of digital speech avatars. Speech resampling is a post-processing operation of various speech-tampering means, and the forensic detection of speech resampling is of great significance. For speech resampling detection, most of the previous works used traditional methods of feature extraction and classification to distinguish original speech from forged speech. In view of the powerful ability of deep learning to extract features, this paper converts the speech signal into a spectrogram with time-frequency characteristics, and uses the feature pyramid network (FPN) with the Squeeze and Excitation (SE) attention mechanism to learn speech resampling features. The proposed method combines the low-level location information and the high-level semantic information, which dramatically improves the detection performance of speech resampling. Experiments were carried out on a resampling corpus made on the basis of the TIMIT dataset. The results indicate that the proposed method significantly improved the detection accuracy of various resampled speech. For the tampered speech with a resampling factor of 0.9, the detection accuracy is increased by nearly 20%. In addition, the robustness test demonstrates that the proposed model has strong resistance to MP3 compression, and the overall performance is better than the existing methods. Full article
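The Squeeze-and-Excitation (SE) attention that the abstract attaches to the feature pyramid network follows the standard published formulation; a generic PyTorch version (not the paper's exact configuration) is sketched below.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
    def forward(self, x):                      # x: (batch, channels, H, W) spectrogram features
        squeeze = x.mean(dim=(2, 3))           # global average pool per channel
        scale = self.fc(squeeze)               # channel-wise attention weights
        return x * scale[:, :, None, None]     # reweight the feature maps

features = torch.randn(2, 64, 32, 100)         # e.g. one FPN level over a spectrogram
print(SEBlock(64)(features).shape)             # torch.Size([2, 64, 32, 100])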
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)

13 pages, 324 KiB  
Article
Branch-Transformer: A Parallel Branch Architecture to Capture Local and Global Features for Language Identification
by Zeen Li, Shuanghong Liu, Zhihua Fang and Liang He
Appl. Sci. 2024, 14(11), 4681; https://doi.org/10.3390/app14114681 - 29 May 2024
Abstract
Currently, an increasing number of people are opting to use transformer models or conformer models for language identification, achieving outstanding results. Among them, transformer models based on self-attention can only capture global information, lacking finer local details. There are also approaches that employ conformer models by concatenating convolutional neural networks and transformers to capture both local and global information. However, this static single-branch architecture is difficult to interpret and modify, and it incurs greater inference difficulty and computational costs compared to dual-branch models. Therefore, in this paper, we propose a novel model called Branch-transformer (B-transformer). In contrast to traditional transformers, it consists of parallel dual-branch structures. One branch utilizes self-attention to capture global information, while the other employs a Convolutional Gated Multi-Layer Perceptron (cgMLP) module to extract local information. We also investigate various fusion methods for integrating global and local information and experimentally validate the effectiveness of our approach on the NIST LRE 2017 dataset. Full article
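A rough sketch of the parallel dual-branch idea (not the authors' B-transformer) is given below: one branch applies self-attention for global context, the other a convolutional gated unit standing in for cgMLP to capture local context, and the two outputs are concatenated and projected; dimensions and the fusion choice are illustrative assumptions.

import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    def __init__(self, dim=256, heads=4, kernel=15):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # global branch
        self.gate = nn.Linear(dim, 2 * dim)                               # local branch: gated
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.fuse = nn.Linear(2 * dim, dim)                               # concatenate-and-project fusion

    def forward(self, x):                       # x: (batch, time, dim)
        global_out, _ = self.attn(x, x, x)
        a, b = self.gate(x).chunk(2, dim=-1)    # split into content and gate paths
        b = self.conv(b.transpose(1, 2)).transpose(1, 2)   # depthwise conv over time
        local_out = a * torch.sigmoid(b)        # gated local features
        return self.fuse(torch.cat([global_out, local_out], dim=-1))

x = torch.randn(2, 120, 256)                    # a batch of frame embeddings
print(DualBranchBlock()(x).shape)               # torch.Size([2, 120, 256])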
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)

14 pages, 1802 KiB  
Article
Wav2wav: Wave-to-Wave Voice Conversion
by Changhyeon Jeong, Hyung-pil Chang, In-Chul Yoo and Dongsuk Yook
Appl. Sci. 2024, 14(10), 4251; https://doi.org/10.3390/app14104251 - 17 May 2024
Abstract
Voice conversion is the task of changing the speaker characteristics of input speech while preserving its linguistic content. It can be used in various areas, such as entertainment, medicine, and education. The quality of the converted speech is crucial for voice conversion algorithms to be useful in these various applications. Deep learning-based voice conversion algorithms, which have been showing promising results recently, generally consist of three modules: a feature extractor, feature converter, and vocoder. The feature extractor accepts the waveform as the input and extracts speech feature vectors for further processing. These speech feature vectors are later synthesized back into waveforms by the vocoder. The feature converter module performs the actual voice conversion; therefore, many previous studies separately focused on improving this module. These works combined the separately trained vocoder to synthesize the final waveform. Since the feature converter and the vocoder are trained independently, the output of the converter may not be compatible with the input of the vocoder, which causes performance degradation. Furthermore, most voice conversion algorithms utilize mel-spectrogram-based speech feature vectors without modification. These feature vectors have performed well in a variety of speech-processing areas but could be further optimized for voice conversion tasks. To address these problems, we propose a novel wave-to-wave (wav2wav) voice conversion method that integrates the feature extractor, the feature converter, and the vocoder into a single module and trains the system in an end-to-end manner. We evaluated the efficiency of the proposed method using the VCC2018 dataset. Full article
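The end-to-end integration described above can be pictured with a conceptual sketch (not the wav2wav implementation): the three stages are wrapped in one module so that a single waveform-level loss back-propagates through the vocoder and converter into the feature extractor; the tiny convolutional stand-ins are assumptions purely for illustration.

import torch
import torch.nn as nn

class EndToEndVC(nn.Module):
    def __init__(self, feat_dim=80):
        super().__init__()
        self.extractor = nn.Conv1d(1, feat_dim, kernel_size=400, stride=160)          # wave -> features
        self.converter = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)      # speaker mapping
        self.vocoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=400, stride=160)   # features -> wave

    def forward(self, wave):                 # wave: (batch, 1, samples)
        feats = self.extractor(wave)
        converted = self.converter(feats)
        return self.vocoder(converted)

model = EndToEndVC()
source = torch.randn(2, 1, 16000)            # one second of 16 kHz audio per item
target = torch.randn(2, 1, 16000)
output = model(source)
loss = nn.functional.l1_loss(output, target[..., :output.shape[-1]])  # single end-to-end loss
loss.backward()                              # gradients reach all three stages
print(output.shape)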
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)
