
Deep Learning for Speech, Image and Language Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 30 September 2025

Special Issue Editor


Prof. Dr. Dongsuk Yook
Guest Editor
Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
Interests: deep learning; machine learning; artificial intelligence; speech processing

Special Issue Information

Dear Colleagues,

Deep learning has become an essential technology in a wide range of application areas. It has begun to show performance comparable to that of humans in audio, image, video, and natural language processing applications. Recently, spoken language translation and multimodal large language model technologies have provided new interfaces and methods for inputting, manipulating, and generating text, sound, images, and video using computers. This Special Issue is dedicated to state-of-the-art research articles, as well as tutorials and reviews, in the field of deep learning for speech, image, and language processing.

Prof. Dr. Dongsuk Yook
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • speech processing
  • image processing
  • signal processing
  • language processing
  • applications and theories of deep learning

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (7 papers)


Research

24 pages, 5922 KiB  
Article
Age Prediction from Korean Speech Data Using Neural Networks with Diverse Voice Features
by Hayeon Ku, Jiho Lee, Minseo Lee, Seulgi Kim and Janghyeok Yoon
Appl. Sci. 2025, 15(3), 1337; https://doi.org/10.3390/app15031337 - 27 Jan 2025
Abstract
A person’s voice serves as an indicator of age, as it changes with anatomical and physiological influences throughout their life. Although age prediction is a subject of interest across various disciplines, age-prediction studies using Korean voices are limited. The few studies that have been conducted have limitations, such as the absence of specific age groups or detailed age categories. Therefore, this study proposes an optimal combination of speech features and deep-learning models to recognize detailed age groups using a large Korean-speech dataset. From the speech dataset, recorded by individuals ranging from their teens to their 50s, four speech features were extracted: the Mel spectrogram, log-Mel spectrogram, Mel-frequency cepstral coefficients (MFCCs), and ΔMFCCs. Using these speech features, four deep-learning models were trained: ResNet-50, 1D-CNN, 2D-CNN, and a vision transformer. A performance comparison of speech feature-extraction methods and models indicated that MFCCs + ΔMFCCs was the best for both sexes when trained on the 1D-CNN model; it achieved an accuracy of 88.16% for males and 81.95% for females. The results of this study are expected to contribute to the future development of Korean speaker-recognition systems. Full article
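As a rough illustration of the front end described above (not the authors' code), the sketch below extracts MFCCs and their deltas with librosa and stacks them into the kind of feature matrix a 1D-CNN classifier could consume; the file name and parameter values are assumptions.

# Minimal MFCC + ΔMFCC feature-extraction sketch; parameters are illustrative.
import numpy as np
import librosa

def mfcc_delta_features(wav_path, sr=16000, n_mfcc=40):
    """Return a (2 * n_mfcc, frames) array of MFCCs stacked with their deltas."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    delta = librosa.feature.delta(mfcc)                      # ΔMFCCs, same shape
    return np.vstack([mfcc, delta])

features = mfcc_delta_features("speaker_001.wav")  # hypothetical file name
print(features.shape)                              # e.g. (80, number_of_frames)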
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)

26 pages, 3823 KiB  
Article
Enhanced Conformer-Based Speech Recognition via Model Fusion and Adaptive Decoding with Dynamic Rescoring
by Junhao Geng, Dongyao Jia, Zihao He, Nengkai Wu and Ziqi Li
Appl. Sci. 2024, 14(24), 11583; https://doi.org/10.3390/app142411583 - 11 Dec 2024
Abstract
Speech recognition is widely applied in fields like security, education, and healthcare. While its development drives global information infrastructure and AI strategies, current models still face challenges such as overfitting, local optima, and inefficiencies in decoding accuracy and computational cost. These issues cause instability and long response times, hindering AI’s competitiveness. Therefore, addressing these technical bottlenecks is critical for advancing national scientific progress and global information infrastructure. In this paper, we propose improvements to the model structure fusion and decoding algorithms. First, based on the Conformer network and its variants, we introduce a weighted fusion method using training loss as an indicator, adjusting the weights, thresholds, and other related parameters of the fused models to balance the contributions of different model structures, thereby creating a more robust and generalized model that alleviates overfitting and local optima. Second, for the decoding phase, we design a dynamic adaptive decoding method that combines traditional decoding algorithms such as connectionist temporal classification and attention-based models. This ensemble approach enables the system to adapt to different acoustic environments, improving its robustness and overall performance. Additionally, to further optimize the decoding process, we introduce a penalty function mechanism as a regularization technique to reduce the model’s dependence on a single decoding approach. The penalty function limits the weights of decoding strategies to prevent over-reliance on any single decoder, thus enhancing the model’s generalization. Finally, we validate our model on the Librispeech dataset, a large-scale English speech corpus containing approximately 1000 h of audio data. Experimental results demonstrate that the proposed method achieves word error rates (WERs) of 3.92% and 4.07% on the development and test sets, respectively, significantly improving over single-model and traditional decoding methods. Notably, the method reduces WER by approximately 0.4% on complex datasets compared to several advanced mainstream models, underscoring its superior robustness and adaptability in challenging acoustic environments. The effectiveness of the proposed method in addressing overfitting and improving accuracy and efficiency during the decoding phase was validated, highlighting its significance in advancing speech recognition technology. Full article
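An illustrative sketch (not the paper's implementation) of two ideas named in the abstract: weighting fused models by their training loss, and interpolating CTC and attention-decoder scores when rescoring a hypothesis. The inverse-loss weighting and the variable names are assumptions; the paper's penalty-function regularization of the decoder weights is not reproduced here.

import numpy as np

def loss_based_weights(train_losses):
    """Lower training loss -> larger fusion weight; weights are normalized to sum to 1."""
    inv = 1.0 / np.asarray(train_losses, dtype=float)
    return inv / inv.sum()

def rescore(ctc_logprob, attn_logprob, w_ctc=0.5):
    """Hybrid rescoring: interpolate CTC and attention log-probabilities for one hypothesis."""
    return w_ctc * ctc_logprob + (1.0 - w_ctc) * attn_logprob

print(loss_based_weights([0.8, 1.2, 1.0]))   # fusion weights for three fused models
print(rescore(-4.2, -3.7, w_ctc=0.3))        # combined score for one hypothesis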
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)

17 pages, 896 KiB  
Article
Tibetan Speech Synthesis Based on Pre-Trained Mixture Alignment FastSpeech2
by Qing Zhou, Xiaona Xu and Yue Zhao
Appl. Sci. 2024, 14(15), 6834; https://doi.org/10.3390/app14156834 - 5 Aug 2024
Abstract
Most current research in Tibetan speech synthesis relies primarily on autoregressive models in deep learning. However, these models face challenges such as slow inference, skipped readings, and repetitions. To overcome these issues, we propose an enhanced non-autoregressive acoustic model combined with a vocoder for Tibetan speech synthesis. Specifically, we introduce the mixture alignment FastSpeech2 method to correct errors caused by hard alignment in the original FastSpeech2 method. This new method employs soft alignment at the level of Latin letters and hard alignment at the level of Tibetan characters, thereby improving alignment accuracy between text and speech and enhancing the naturalness and intelligibility of the synthesized speech. Additionally, we integrate pitch and energy information into the model, further enhancing overall synthesis quality. Furthermore, Tibetan has relatively small text-to-audio datasets compared to widely studied languages. To address these limited resources, we employ a transfer learning approach to pre-train the model with data from resource-rich languages. Subsequently, this pre-trained mixture alignment FastSpeech2 model is fine-tuned for Tibetan speech synthesis. Experimental results demonstrate that the mixture alignment FastSpeech2 model produces higher-quality speech compared to the original FastSpeech2 model, particularly when pre-trained on an English dataset, resulting in further improvements in clarity and naturalness. Full article
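The transfer-learning step described above can be pictured with a minimal, self-contained sketch (not the authors' code): weights pre-trained on a resource-rich language are loaded into an acoustic model, which is then fine-tuned on the target-language data. A tiny stand-in module and random tensors are used so the sketch runs; the real model would be mixture alignment FastSpeech2.

import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):          # stand-in for the real acoustic model
    def __init__(self, in_dim=80, out_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))
    def forward(self, x):
        return self.net(x)

model = TinyAcousticModel()
# Pretend these weights came from pre-training on an English corpus.
pretrained_state = TinyAcousticModel().state_dict()
model.load_state_dict(pretrained_state, strict=False)       # reuse whatever layers match

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # small learning rate for fine-tuning
criterion = nn.L1Loss()
for step in range(3):                                       # stand-in fine-tuning loop
    text_feats = torch.randn(8, 80)                         # placeholder "Tibetan" batch
    target_mel = torch.randn(8, 80)
    optimizer.zero_grad()
    loss = criterion(model(text_feats), target_mel)
    loss.backward()
    optimizer.step()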
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)

14 pages, 3407 KiB  
Article
An Audio Copy-Move Forgery Localization Model by CNN-Based Spectral Analysis
by Wei Zhao, Yujin Zhang, Yongqi Wang and Shiwen Zhang
Appl. Sci. 2024, 14(11), 4882; https://doi.org/10.3390/app14114882 - 4 Jun 2024
Abstract
In audio copy-move forgery forensics, existing traditional methods typically first segment audio into voiced and silent segments, then compute the similarity between voiced segments to detect and locate forged segments. However, audio collected in noisy environments is difficult to segment, and manually set heuristic similarity thresholds lack robustness. Existing deep learning methods extract features from audio and then use neural networks for binary classification, lacking the ability to locate forged segments. Therefore, for locating audio copy-move forgery segments, we have improved deep learning methods and propose a robust localization model using CNN-based spectral analysis. In the localization model, the Feature Extraction Module extracts deep features from Mel-spectrograms, while the Correlation Detection Module automatically determines the correlation between these deep features. Finally, the Mask Decoding Module visually locates the forged segments. Experimental results show that, compared to existing methods, the localization model improves the detection accuracy of audio copy-move forgery by 3.0–6.8% and improves the average detection accuracy of forged audio with post-processing attacks such as noise, filtering, resampling, and MP3 compression by over 7.0%. Full article
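As a simplified illustration of the model's input (not the paper's learned modules), the sketch below computes a log-Mel spectrogram and a frame-to-frame cosine-similarity matrix, the kind of correlation structure in which a copied-and-moved segment appears as an off-diagonal stripe; parameters and the file name are assumptions, and the paper learns both the features and the correlation with CNNs.

import numpy as np
import librosa

y, sr = librosa.load("suspect_audio.wav", sr=16000)          # hypothetical file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)                           # (n_mels, frames)

# Cosine similarity between every pair of frames.
unit_frames = log_mel / (np.linalg.norm(log_mel, axis=0, keepdims=True) + 1e-8)
similarity = unit_frames.T @ unit_frames                     # (frames, frames)
print(similarity.shape)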
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)

18 pages, 3890 KiB  
Article
Pyramid Feature Attention Network for Speech Resampling Detection
by Xinyu Zhou, Yujin Zhang, Yongqi Wang, Jin Tian and Shaolun Xu
Appl. Sci. 2024, 14(11), 4803; https://doi.org/10.3390/app14114803 - 1 Jun 2024
Abstract
Speech forgery and tampering, increasingly facilitated by advanced audio editing software, pose significant threats to the integrity and privacy of digital speech avatars. Speech resampling is a post-processing operation of various speech-tampering means, and the forensic detection of speech resampling is of great significance. For speech resampling detection, most of the previous works used traditional methods of feature extraction and classification to distinguish original speech from forged speech. In view of the powerful ability of deep learning to extract features, this paper converts the speech signal into a spectrogram with time-frequency characteristics, and uses the feature pyramid network (FPN) with the Squeeze and Excitation (SE) attention mechanism to learn speech resampling features. The proposed method combines the low-level location information and the high-level semantic information, which dramatically improves the detection performance of speech resampling. Experiments were carried out on a resampling corpus made on the basis of the TIMIT dataset. The results indicate that the proposed method significantly improved the detection accuracy of various resampled speech. For the tampered speech with a resampling factor of 0.9, the detection accuracy is increased by nearly 20%. In addition, the robustness test demonstrates that the proposed model has strong resistance to MP3 compression, and the overall performance is better than the existing methods. Full article
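The Squeeze-and-Excitation (SE) attention that the abstract attaches to the feature pyramid network follows the standard published formulation; a generic PyTorch version (not the paper's exact configuration) is sketched below.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
    def forward(self, x):                      # x: (batch, channels, H, W) spectrogram features
        squeeze = x.mean(dim=(2, 3))           # global average pool per channel
        scale = self.fc(squeeze)               # channel-wise attention weights
        return x * scale[:, :, None, None]     # reweight the feature maps

features = torch.randn(2, 64, 32, 100)         # e.g. one FPN level over a spectrogram
print(SEBlock(64)(features).shape)             # torch.Size([2, 64, 32, 100])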
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)

13 pages, 324 KiB  
Article
Branch-Transformer: A Parallel Branch Architecture to Capture Local and Global Features for Language Identification
by Zeen Li, Shuanghong Liu, Zhihua Fang and Liang He
Appl. Sci. 2024, 14(11), 4681; https://doi.org/10.3390/app14114681 - 29 May 2024
Abstract
Currently, an increasing number of people are opting to use transformer models or conformer models for language identification, achieving outstanding results. Among them, transformer models based on self-attention can only capture global information, lacking finer local details. There are also approaches that employ conformer models by concatenating convolutional neural networks and transformers to capture both local and global information. However, this static single-branch architecture is difficult to interpret and modify, and it incurs greater inference difficulty and computational costs compared to dual-branch models. Therefore, in this paper, we propose a novel model called Branch-transformer (B-transformer). In contrast to traditional transformers, it consists of parallel dual-branch structures. One branch utilizes self-attention to capture global information, while the other employs a Convolutional Gated Multi-Layer Perceptron (cgMLP) module to extract local information. We also investigate various fusion methods for integrating global and local information and experimentally validate the effectiveness of our approach on the NIST LRE 2017 dataset. Full article
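A rough sketch of the parallel dual-branch idea (not the authors' B-transformer) is given below: one branch applies self-attention for global context, the other a convolutional gated unit standing in for cgMLP to capture local context, and the two outputs are concatenated and projected; dimensions and the fusion choice are illustrative assumptions.

import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    def __init__(self, dim=256, heads=4, kernel=15):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # global branch
        self.gate = nn.Linear(dim, 2 * dim)                               # local branch: gated
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.fuse = nn.Linear(2 * dim, dim)                               # concatenate-and-project fusion

    def forward(self, x):                       # x: (batch, time, dim)
        global_out, _ = self.attn(x, x, x)
        a, b = self.gate(x).chunk(2, dim=-1)    # split into content and gate paths
        b = self.conv(b.transpose(1, 2)).transpose(1, 2)   # depthwise conv over time
        local_out = a * torch.sigmoid(b)        # gated local features
        return self.fuse(torch.cat([global_out, local_out], dim=-1))

x = torch.randn(2, 120, 256)                    # a batch of frame embeddings
print(DualBranchBlock()(x).shape)               # torch.Size([2, 120, 256])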
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)

14 pages, 1802 KiB  
Article
Wav2wav: Wave-to-Wave Voice Conversion
by Changhyeon Jeong, Hyung-pil Chang, In-Chul Yoo and Dongsuk Yook
Appl. Sci. 2024, 14(10), 4251; https://doi.org/10.3390/app14104251 - 17 May 2024
Abstract
Voice conversion is the task of changing the speaker characteristics of input speech while preserving its linguistic content. It can be used in various areas, such as entertainment, medicine, and education. The quality of the converted speech is crucial for voice conversion algorithms to be useful in these various applications. Deep learning-based voice conversion algorithms, which have been showing promising results recently, generally consist of three modules: a feature extractor, feature converter, and vocoder. The feature extractor accepts the waveform as the input and extracts speech feature vectors for further processing. These speech feature vectors are later synthesized back into waveforms by the vocoder. The feature converter module performs the actual voice conversion; therefore, many previous studies separately focused on improving this module. These works combined the separately trained vocoder to synthesize the final waveform. Since the feature converter and the vocoder are trained independently, the output of the converter may not be compatible with the input of the vocoder, which causes performance degradation. Furthermore, most voice conversion algorithms utilize mel-spectrogram-based speech feature vectors without modification. These feature vectors have performed well in a variety of speech-processing areas but could be further optimized for voice conversion tasks. To address these problems, we propose a novel wave-to-wave (wav2wav) voice conversion method that integrates the feature extractor, the feature converter, and the vocoder into a single module and trains the system in an end-to-end manner. We evaluated the efficiency of the proposed method using the VCC2018 dataset. Full article
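The end-to-end integration described above can be pictured with a conceptual sketch (not the wav2wav implementation): the three stages are wrapped in one module so that a single waveform-level loss back-propagates through the vocoder and converter into the feature extractor; the tiny convolutional stand-ins are assumptions purely for illustration.

import torch
import torch.nn as nn

class EndToEndVC(nn.Module):
    def __init__(self, feat_dim=80):
        super().__init__()
        self.extractor = nn.Conv1d(1, feat_dim, kernel_size=400, stride=160)          # wave -> features
        self.converter = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)      # speaker mapping
        self.vocoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=400, stride=160)   # features -> wave

    def forward(self, wave):                 # wave: (batch, 1, samples)
        feats = self.extractor(wave)
        converted = self.converter(feats)
        return self.vocoder(converted)

model = EndToEndVC()
source = torch.randn(2, 1, 16000)            # one second of 16 kHz audio per item
target = torch.randn(2, 1, 16000)
output = model(source)
loss = nn.functional.l1_loss(output, target[..., :output.shape[-1]])  # single end-to-end loss
loss.backward()                              # gradients reach all three stages
print(output.shape)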
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)
