Article

XTTS-Based Data Augmentation for Profanity Keyword Recognition in Low-Resource Speech Scenarios

1
Department of Automation Engineering, National Formosa University, Huwei 632301, Taiwan
2
Smart Machinery and Intelligent Manufacturing Research Center, National Formosa University, Huwei 632301, Taiwan
3
The Doctoral Degree Program in Smart Industry Technology Research and Development, National Formosa University, Huwei 632301, Taiwan
4
Department of Electronics Engineering, National Yunlin University of Science and Technology, Douliu 640301, Taiwan
5
Master’s Program, Department of Electrical Engineering, National Formosa University, Huwei 632301, Taiwan
6
Department of Electrical Engineering, National Formosa University, Huwei 632301, Taiwan
*
Author to whom correspondence should be addressed.
Appl. Syst. Innov. 2025, 8(4), 108; https://doi.org/10.3390/asi8040108
Submission received: 21 June 2025 / Revised: 18 July 2025 / Accepted: 30 July 2025 / Published: 31 July 2025
(This article belongs to the Special Issue Advancements in Deep Learning and Its Applications)

Abstract

As voice cloning technology rapidly advances, the risk of personal voices being misused by malicious actors for fraud or other illegal activities has significantly increased, making the collection of speech data increasingly challenging. To address this issue, this study proposes a data augmentation method based on XText-to-Speech (XTTS) synthesis to tackle the challenges of small-sample, multi-class speech recognition, using profanity as a case study to achieve high-accuracy keyword recognition. Two models were therefore evaluated: a CNN model (Proposed-I) and a CNN-Transformer hybrid model (Proposed-II). Proposed-I leverages local feature extraction, improving accuracy on a real human speech (RHS) test set from 55.35% without augmentation to 80.36% with XTTS-enhanced data. Proposed-II integrates CNN’s local feature extraction with Transformer’s long-range dependency modeling, further boosting test set accuracy to 88.90% while reducing the parameter count by approximately 41%, significantly enhancing computational efficiency. Compared to a previously proposed incremental architecture, the Proposed-II model achieves an 8.49% higher accuracy while reducing parameters by about 98.81% and MACs by about 98.97%, demonstrating exceptional resource efficiency. By utilizing XTTS and public corpora to generate a novel keyword speech dataset, this study enhances sample diversity and reduces reliance on large-scale original speech data. Experimental analysis reveals that an optimal synthetic-to-real speech ratio of 1:5 significantly improves the overall system accuracy, effectively addressing data scarcity. Additionally, the Proposed-I and Proposed-II models achieve accuracies of 97.54% and 98.66%, respectively, in distinguishing real from synthetic speech, demonstrating their strong potential for speech security and anti-spoofing applications.
Keywords: small-sample learning; data augmentation; insult speech recognition; speech generation; deep learning
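
The synthetic-to-real mixing ratio reported in the abstract can be illustrated with a minimal sketch. It assumes the 1:5 ratio means one synthetic clip for every five real clips and that it is applied to the training set as a whole; the abstract does not specify either point, nor whether the ratio is enforced per class. The function name `mix_by_ratio` and the file names are hypothetical and are not taken from the article.

```python
import random


def mix_by_ratio(real, synthetic, synth_parts=1, real_parts=5, seed=0):
    """Build a shuffled training list with synthetic:real near synth_parts:real_parts.

    Assumption: every real clip is kept and synthetic clips are subsampled
    to reach the requested ratio. This is an illustrative reading only.
    """
    rng = random.Random(seed)
    # Number of synthetic clips that matches the ratio for the real clips we have,
    # capped by how many synthetic clips are actually available.
    n_synth = min(len(synthetic), (len(real) * synth_parts) // real_parts)
    chosen = rng.sample(synthetic, n_synth)
    mixed = [(path, "real") for path in real]
    mixed += [(path, "synthetic") for path in chosen]
    rng.shuffle(mixed)
    return mixed


# Hypothetical file lists standing in for real human speech (RHS) and XTTS output.
real = [f"rhs_{i:03d}.wav" for i in range(50)]
synthetic = [f"xtts_{i:03d}.wav" for i in range(40)]

train = mix_by_ratio(real, synthetic)
n_synth = sum(1 for _, source in train if source == "synthetic")
print(len(train), n_synth)
```

Keeping the source label next to each path also makes it straightforward to reuse the same list for the real-versus-synthetic classification task mentioned at the end of the abstract.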

Share and Cite

MDPI and ACS Style

Lai, S.-C.; Zhu, Y.-C.; Wang, S.-T.; Chang, Y.-C.; Hung, Y.-H.; Tang, J.-K.; Tsai, W.-K. XTTS-Based Data Augmentation for Profanity Keyword Recognition in Low-Resource Speech Scenarios. Appl. Syst. Innov. 2025, 8, 108. https://doi.org/10.3390/asi8040108

AMA Style

Lai S-C, Zhu Y-C, Wang S-T, Chang Y-C, Hung Y-H, Tang J-K, Tsai W-K. XTTS-Based Data Augmentation for Profanity Keyword Recognition in Low-Resource Speech Scenarios. Applied System Innovation. 2025; 8(4):108. https://doi.org/10.3390/asi8040108

Chicago/Turabian Style

Lai, Shin-Chi, Yi-Chang Zhu, Szu-Ting Wang, Yen-Ching Chang, Ying-Hsiu Hung, Jhen-Kai Tang, and Wen-Kai Tsai. 2025. "XTTS-Based Data Augmentation for Profanity Keyword Recognition in Low-Resource Speech Scenarios" Applied System Innovation 8, no. 4: 108. https://doi.org/10.3390/asi8040108

APA Style

Lai, S.-C., Zhu, Y.-C., Wang, S.-T., Chang, Y.-C., Hung, Y.-H., Tang, J.-K., & Tsai, W.-K. (2025). XTTS-Based Data Augmentation for Profanity Keyword Recognition in Low-Resource Speech Scenarios. Applied System Innovation, 8(4), 108. https://doi.org/10.3390/asi8040108
