XTTS-Based Data Augmentation for Profanity Keyword Recognition in Low-Resource Speech Scenarios
Abstract
1. Introduction
- We propose a generative corpus method based on XTTS [6] that combines preprocessing with a minimal amount of raw data (see the synthesis sketch after this list). This approach reduces reliance on large raw corpora and leverages diverse voices from public datasets, enhancing sample variety and significantly boosting keyword recognition performance;
- Through experimental data analysis, we investigate the impact of training, validation, and test set ratios on model performance, further optimizing the proportion of synthetic versus real speech to determine the best configuration;
- The proposed data augmentation technique expands the dataset severalfold, improving system accuracy by 33.55% and demonstrating the effectiveness of speech synthesis in addressing data scarcity;
- By integrating real and synthetic corpora, we not only alleviate data shortages but also enable the same model architecture to discriminate between synthetic and real speech, achieving high-accuracy real-versus-synthetic speech identification.
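To make the augmentation step concrete, the following is a minimal sketch of how an XTTS-based synthetic keyword corpus could be generated with the open-source Coqui TTS package. The keyword texts, speaker reference clips, language code, and output layout are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: generate synthetic keyword utterances with Coqui XTTS v2.
# The keyword texts, speaker reference clips, and output layout below are
# illustrative placeholders, not the exact pipeline used in the paper.
from pathlib import Path

from TTS.api import TTS  # pip install TTS (Coqui)

# Zero-shot multilingual XTTS v2 checkpoint.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

keywords = {"dad1": "幹你老北", "mom1": "幹你娘"}   # target label -> keyword text (examples)
speaker_refs = ["spk01_ref.wav", "spk02_ref.wav"]   # short reference clips of public-dataset voices

out_dir = Path("synthetic_corpus")
out_dir.mkdir(exist_ok=True)

for label, text in keywords.items():
    for i, ref in enumerate(speaker_refs):
        # Clone the reference voice and synthesize the keyword utterance.
        tts.tts_to_file(
            text=text,
            speaker_wav=ref,      # reference voice to clone
            language="zh-cn",     # XTTS language code (assumed here)
            file_path=str(out_dir / f"{label}_spk{i:02d}.wav"),
        )
```

Looping the same keyword text over many reference voices is what yields the speaker diversity that the augmentation relies on; the resulting clips would then pass through the same preprocessing as the real recordings.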
2. Proposed Methods
2.1. Data Augmentation
2.2. Preprocessing
2.3. Proposed Classification Models
2.3.1. CNN-Based Algorithm (Proposed-I)
2.3.2. CNN-Transformer-Based Algorithm (Proposed-II)
2.4. Sliding Window and Confidence Score Statistics
2.5. Ethical Considerations
3. Experimental Results and Comparisons of Various Algorithms
3.1. Real Human Speech Dataset (RHS Dataset)
3.2. Experimental Setup
3.3. Performance Evaluation on RHS Dataset for the Proposed-I Algorithm
3.4. Performance Evaluation with Mixed Speech for the Proposed-I Algorithm
3.5. Performance Evaluation with RHS Dataset for the Proposed-I Algorithm
3.6. Evaluation Metrics
3.7. Sliding Window and Confidence Score Statistics Integration
3.8. Cross-Dataset Performance Evaluation for the Proposed-I Algorithm
3.9. Performance Evaluation on RHS Dataset for the Proposed-II Algorithm
3.10. Real vs. Synthetic Speech Identification
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
AP | Average Precision
AUROC | Area Under the Receiver Operating Characteristic Curve
CNN | Convolutional Neural Network
dB | Decibels
DCGAN | Deep Convolutional Generative Adversarial Network
DCT | Discrete Cosine Transform
FFT | Fast Fourier Transform
FN | False Negative
FP | False Positive
FPR | False Positive Rate
HMM | Hidden Markov Model
IQR | Interquartile Range
LAS | Listen, Attend, and Spell
MACs | Multiply–Accumulate Operations
mAP | mean Average Precision
MFCC | Mel-Frequency Cepstral Coefficients
RHS | Real Human Speech
RMS | Root Mean Square
RNN | Recurrent Neural Network
ROC | Receiver Operating Characteristic
TN | True Negative
TP | True Positive
TPR | True Positive Rate
VAD | Voice Activity Detection
XTTS | X Text-to-Speech
References
1. Bonet, D.; Cámbara, G.; López, F.; Gómez, P.; Segura, C.; Luque, J. Speech Enhancement for Wake-Up-Word Detection in Voice Assistants. arXiv 2021.
2. Supriya, N.; Surya, S.; Kiran, K.N. Voice Controlled Smart Home for Disabled. In Proceedings of the 2024 International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE), Bengaluru, India, 24–25 January 2024; pp. 1–4.
3. Alatawi, H.S.; Alhothali, A.M.; Moria, K.M. Detecting White Supremacist Hate Speech Using Domain Specific Word Embedding with Deep Learning and BERT. IEEE Access 2021, 9, 106363–106374.
4. Zhu, Y.C.; Hung, Y.H.; Chang, Y.C.; Tang, J.K.; Tsai, W.K.; Lai, S.C. Speech Abusive Language Detection System Using MFCC Speech Feature Extraction and Convolutional Neural Network. In Proceedings of the IEEE International Conference on Consumer Technology–Pacific 2025 (ICCT-Pacific 2025), Matsue, Japan, 29–31 March 2025.
5. Qin, Z.; Zhao, W.; Yu, X.; Sun, X. OpenVoice: Versatile Instant Voice Cloning. arXiv 2023, arXiv:2312.01479.
6. Casanova, E.; Davis, K.; Gölge, E.; Göknar, G.; Gulea, I.; Hart, L.; Aljafari, A.; Meyer, J.; Morais, R.; Olayemi, S.; et al. XTTS: A Massively Multilingual Zero-Shot Text-to-Speech Model. arXiv 2024, arXiv:2406.04904.
7. Liao, S.; Wang, Y.; Li, T.; Cheng, Y.; Zhang, R.; Zhou, R.; Xing, Y. Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis. arXiv 2024, arXiv:2411.01156.
8. Wenger, E.; Bronckers, M.; Cianfarani, C.; Cryan, J.; Sha, A.; Zheng, H.; Zhao, B.Y. “Hello, It’s Me”: Deep Learning-Based Speech Synthesis Attacks in the Real World. In Proceedings of the ACM Conference on Computer and Communications Security, Virtual, 15–19 November 2021; pp. 235–251.
9. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR Corpus Based on Public Domain Audio Books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 5206–5210.
10. Warden, P. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv 2018, arXiv:1804.03209.
11. Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. AISHELL-1: An Open-Source Mandarin Speech Corpus and a Speech Recognition Baseline. In Proceedings of the 2017 20th Conference of the Oriental Chapter of the International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA 2017), Seoul, Republic of Korea, 1–3 November 2017; pp. 1–5.
12. Robert, J. Pydub. Available online: https://github.com/jiaaro/pydub (accessed on 14 January 2025).
13. McFee, B.; Raffel, C.; Liang, D.; Ellis, D.; McVicar, M.; Battenberg, E.; Nieto, O. Librosa: Audio and Music Signal Analysis in Python. In Proceedings of the Python in Science Conference, Austin, TX, USA, 6–12 July 2015; pp. 18–24.
14. Shi, Y.; Bu, H.; Xu, X.; Zhang, S.; Li, M. AISHELL-3: A Multi-Speaker Mandarin TTS Corpus and the Baselines. arXiv 2020, arXiv:2010.11567.
15. Sharma, G.; Umapathy, K.; Krishnan, S. Trends in Audio Signal Feature Extraction Methods. Appl. Acoust. 2020, 158, 107020.
16. EBU. Tech 3341: Loudness Metering: “EBU Mode” Metering to Supplement EBU R 128 Loudness Normalization; EBU: Geneva, Switzerland, 2023.
17. Wiseman, J. py-webrtcvad. Available online: https://github.com/wiseman/py-webrtcvad (accessed on 15 January 2025).
18. Raschka, S. Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning. arXiv 2018, arXiv:1811.12808.
19. Kim, B.; Lee, M.; Lee, J.; Kim, Y.; Hwang, K. Query-by-Example On-Device Keyword Spotting. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), Singapore, 14–18 December 2019; pp. 532–538.
20. ASR-SCKwsptSC: A Scripted Chinese Keyword-Spotting Speech Corpus—MagicHub. Available online: https://magichub.com/datasets/mandarin-chinese-scripted-speech-corpus-keyword-spotting-2/ (accessed on 15 June 2025).
21. Ghandoura, A.; Hjabo, F.; Al Dakkak, O. Building and Benchmarking an Arabic Speech Commands Dataset for Small-Footprint Keyword Spotting. Eng. Appl. Artif. Intell. 2021, 102, 104267.
22. Galić, J.; Marković, B.; Grozdić, Đ.; Popović, B.; Šajić, S. Whispered Speech Recognition Based on Audio Data Augmentation and Inverse Filtering. Appl. Sci. 2024, 14, 8223.
23. Bahmei, B.; Birmingham, E.; Arzanpour, S. CNN-RNN and Data Augmentation Using Deep Convolutional Generative Adversarial Network for Environmental Sound Classification. IEEE Signal Process. Lett. 2022, 29, 682–686.
24. Seo, D.; Oh, H.S.; Jung, Y. Wav2KWS: Transfer Learning from Speech Representations for Keyword Spotting. IEEE Access 2021, 9, 80682–80691.
25. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), Graz, Austria, 15–19 September 2019; pp. 2613–2617.
26. Rezaul, K.M.; Jewel, M.; Islam, M.S.; Siddiquee, K.N.e.A.; Barua, N.; Rahman, M.A.; Shan-A-Khuda, M.; Sulaiman, R.B.; Shaikh, M.S.I.; Hamim, M.A.; et al. Enhancing Audio Classification Through MFCC Feature Extraction and Data Augmentation with CNN and RNN Models. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 37–53.
Layer | Configuration
---|---
Input | 80 Frames × 39 Coefficients
Layer 1 | Conv2D: filters: 32, kernel: 3 × 3, strides: 1 × 1, ReLU
Layer 2 | Conv2D: filters: 32, kernel: 3 × 3, strides: 1 × 1, ReLU
Layer 3 | Conv2D: filters: 64, kernel: 3 × 3, strides: 1 × 1, ReLU
Layer 4 | Conv2D: filters: 32, kernel: 3 × 3, strides: 1 × 1, ReLU
Layer 5 | Conv2D: filters: 16, kernel: 3 × 3, strides: 1 × 1, ReLU
Layer 6 | Conv2D: filters: 3, kernel: 3 × 3, strides: 1 × 1, ReLU
Layer 7 | Flatten
Layer 8 | Fully connected layer (3)
Output | 16 Classes, Softmax
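As a cross-check of the Proposed-I table above, the following PyTorch sketch reproduces the listed Conv2D stack. 'Same' padding and a single linear head mapping the flattened 3-channel feature map to the 16 classes are assumptions not stated in the table; with them, the sketch matches the 201,363-parameter count reported for the Proposed-I model in the comparison table further below.

```python
import torch
import torch.nn as nn


class ProposedICNN(nn.Module):
    """Sketch of the Proposed-I CNN (table above). padding='same' and a single
    flattened-feature classifier are assumptions that reproduce the reported
    201,363 trainable parameters."""

    def __init__(self, n_classes: int = 16):
        super().__init__()
        chans = [1, 32, 32, 64, 32, 16, 3]  # filters per Conv2D layer, per the table
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding="same"),
                       nn.ReLU()]
        self.features = nn.Sequential(*layers)
        # Flattened 3-channel map over the full 80 x 39 input -> 16 classes.
        self.classifier = nn.Linear(3 * 80 * 39, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 80 frames, 39 coefficients)
        z = self.features(x)
        z = torch.flatten(z, start_dim=1)
        return self.classifier(z)  # CrossEntropyLoss applies softmax internally


if __name__ == "__main__":
    model = ProposedICNN()
    print(sum(p.numel() for p in model.parameters()))  # 201363
    print(model(torch.randn(2, 1, 80, 39)).shape)      # torch.Size([2, 16])
```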
Layer | Configuration
---|---
Input | 80 Frames × 39 Coefficients
Layer 1 | Conv2D: filters: 24, kernel: 3 × 3, strides: 1 × 1, ReLU
Layer 2 | Conv2D: filters: 48, kernel: 3 × 3, strides: 1 × 1, ReLU
Layer 3 | MaxPool2D: 2 × 2
Layer 4 | Conv2D: filters: 24, kernel: 3 × 3, strides: 1 × 1, ReLU
Layer 5 | Conv2D: filters: 4, kernel: 3 × 3, strides: 1 × 1, ReLU
Layer 6 | Permute
Layer 7 | Reshape
Layer 8 | Projection
Layer 9 | Transformer: layers: 3, d_model: 64, nhead: 8
Layer 10 | Reshape
Layer 11 | Dropout: p = 0.1
Layer 12 | Fully connected layer (3)
Output | 16 Classes, Softmax
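The Proposed-II (CNN + Transformer) table can be read as the following hedged PyTorch sketch. The transformer feed-forward width, the 'same' convolutional padding, and mean pooling over time before the classifier are assumptions not given in the table, so the parameter count of this sketch will not exactly match the reported 118,516.

```python
import torch
import torch.nn as nn


class ProposedIICNNTransformer(nn.Module):
    """Sketch of the Proposed-II CNN + Transformer (table above). The feed-forward
    width (128), mean pooling over time, and 'same' padding are assumptions, so the
    parameter count differs from the reported 118,516."""

    def __init__(self, n_classes: int = 16, d_model: int = 64, n_head: int = 8, n_layers: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 24, 3, stride=1, padding="same"), nn.ReLU(),
            nn.Conv2d(24, 48, 3, stride=1, padding="same"), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(48, 24, 3, stride=1, padding="same"), nn.ReLU(),
            nn.Conv2d(24, 4, 3, stride=1, padding="same"), nn.ReLU(),
        )
        # 80 x 39 input -> 40 x 19 after pooling; 4 channels * 19 bins = 76 features/frame.
        self.proj = nn.Linear(4 * 19, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_head,
                                               dim_feedforward=128, dropout=0.1,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(d_model, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 80 frames, 39 coefficients)
        z = self.conv(x)                          # (batch, 4, 40, 19)
        z = z.permute(0, 2, 1, 3)                 # time axis becomes the sequence axis
        z = z.reshape(z.size(0), z.size(1), -1)   # (batch, 40, 76)
        z = self.proj(z)                          # (batch, 40, 64)
        z = self.encoder(z)                       # (batch, 40, 64)
        z = self.dropout(z.mean(dim=1))           # pool over time (assumed)
        return self.fc(z)


if __name__ == "__main__":
    model = ProposedIICNNTransformer()
    print(model(torch.randn(2, 1, 80, 39)).shape)  # torch.Size([2, 16])
```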
Hokkien Profanity | Pinyin | Target | Maximum Duration | Meaning |
---|---|---|---|---|
幹你老北 | gan4 ni3 lao3 bei3 | dad1 | <1.28 s | F**k your father |
塞你老北 | sai1 ni3 lao3 bei3 | dad2 | <1.44 s | |
去他爸的 | qu4 ta1 ba4 de | dad3 | <0.96 s | |
操你爸 | cao4 ni3 ba4 | dad4 | <1.15 s | |
幹你娘 | gan4 ni3 niang2 | mom1 | <1.47 s | F**k your mother |
塞你老母 | sai1 ni3 lao3 mu3 | mom2 | <1.37 s | |
去他媽的 | qu4 ta1 ma1 de | mom3 | <1.25 s | |
操你媽 | cao4 ni3 ma1 | mom4 | <1.30 s | |
幹你阿嬤 | gan4 ni3 a1 ma4 | gm1 | <1.44 s | F**k your grandmother |
塞你阿嬤 | sai1 ni3 a1 ma4 | gm2 | <1.22 s | |
去他奶奶的 | qu4 ta1 nai3 nai3 de | gm3 | <1.15 s | |
操你阿嬤 | cao4 ni3 a1 ma4 | gm4 | <1.18 s | |
幹你阿公 | gan4 ni3 a1 gong1 | gp1 | <1.41 s | F**k your grandfather |
塞你阿公 | sai1 ni3 a1 gong1 | gp2 | <1.24 s | |
去他爺爺的 | qu4 ta1 ye2 ye2 de | gp3 | <1.25 s | |
操你阿公 | cao4 ni3 a1 gong1 | gp4 | <1.13 s |
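Since the keyword clips above last at most about 1.5 s and both classifiers take an 80-frame × 39-coefficient input, a plausible feature-extraction step is 13 MFCCs with their delta and delta-delta features, padded or truncated to 80 frames. The sketch below uses librosa under that assumption; the exact frame length, hop size, and coefficient layout used in the paper may differ.

```python
import librosa
import numpy as np

N_MFCC, N_FRAMES = 13, 80  # 13 MFCCs (+ deltas) and 80 frames are assumptions
                           # chosen to match the 80 x 39 model input size.


def extract_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return an (80, 39) feature matrix: 13 MFCCs + delta + delta-delta per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)       # (13, T)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])    # (39, T)
    # Pad or truncate along time so every clip yields exactly 80 frames.
    t = feats.shape[1]
    if t < N_FRAMES:
        feats = np.pad(feats, ((0, 0), (0, N_FRAMES - t)))
    else:
        feats = feats[:, :N_FRAMES]
    return feats.T                                               # (80 frames, 39 coefficients)
```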
Hyperparameters | Value |
---|---|
Batch Size | 8 |
Epochs | 150 |
Learning Rate Scheduler | CosineAnnealingLR |
Initial Learning Rate (LR) | |
Loss Function | CrossEntropyLoss |
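A minimal training loop consistent with the hyperparameters above is sketched below. The optimizer (Adam) and the initial learning-rate value are assumptions, since the table lists CosineAnnealingLR and CrossEntropyLoss but the LR value is not shown here; the dataset object is a placeholder.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Hyperparameters from the table above; INITIAL_LR is a placeholder because the
# value is not shown here, and Adam is an assumed optimizer choice.
BATCH_SIZE, EPOCHS, INITIAL_LR = 8, 150, 1e-3


def train(model: nn.Module, train_set, device: str = "cuda") -> nn.Module:
    loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=INITIAL_LR)  # optimizer assumed
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

    for epoch in range(EPOCHS):
        model.train()
        for features, labels in loader:   # features: (B, 1, 80, 39)
            features, labels = features.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()                  # cosine-annealed LR, stepped once per epoch
    return model
```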
Dataset Ratio (Train:Validation:Test) | Loss | Accuracy | Improvement |
---|---|---|---|
5:3:2 | 2.319 | 55.35% | - |
6:2:2 | 2.288 | 58.48% | 3.13% |
7:1:2 | 2.258 | 61.33% | 5.98% |
RHS Dataset Ratio (Train:Validation:Test) | RHS vs. Synthetic Ratio | Loss | Accuracy | Improvement
---|---|---|---|---
5:3:2 | 1:1 | 2.292 | 58.18% | 2.83%
5:3:2 | 1:2 | 2.262 | 61.01% | 5.66%
5:3:2 | 1:3 | 2.271 | 60.27% | 4.92%
5:3:2 | 1:4 | 2.252 | 62.50% | 7.15%
7:1:2 | 1:1 | 2.251 | 62.50% | 7.15%
7:1:2 | 1:2 | 2.208 | 66.37% | 11.02%
7:1:2 | 1:3 | 2.218 | 64.88% | 9.53%
7:1:2 | 1:4 | 2.211 | 65.92% | 10.57%
7:1:2 | 1:5 | 2.205 | 66.97% | 11.62%
7:1:2 | 1:6 | 2.245 | 62.80% | 7.45%
7:1:2 | 1:7 | 2.253 | 62.05% | 6.70%
7:1:2 | 1:8 | 2.244 | 63.09% | 7.74%
7:1:2 | 1:9 | 2.194 | 67.86% | 12.51%
7:1:2 | 1:10 | 2.192 | 68.15% | 12.80%
7:1:2 | 1:11 | 2.219 | 65.17% | 9.82%
Accuracy by RHS vs. Synthetic Ratio:
TTS Method | 1:1 | 1:2 | 1:3 | 1:4 | 1:5
---|---|---|---|---|---
OpenVoice [5] | 60.57% | 61.61% | 59.82% | 60.42% | 59.67%
XTTS [6] | 62.50% | 66.37% | 64.88% | 65.92% | 66.97%
Fish-Speech [7] | 57.74% | 55.80% | 57.29% | 57.89% | 56.55%
Dataset | Classes | Accuracy (No Augmentation) | Accuracy (With Augmentation) |
---|---|---|---|
Qualcomm [19] | 4 | 80.95% | 97.54% |
ASR-SCKwsptSC [20] | 18 | 65.74% | 80.20% |
Arabic Speech Commands [21] | 16 | 69.1% | 79.95% |
Google Speech Commands [10] | 16 | 37.8% | 57.7% |
RHS [4] | 16 | 55.35% | 80.36% |
Year | Augmentation Method | Classification Model | Parameters | MACs | Accuracy | F1 Score | mAP
---|---|---|---|---|---|---|---
2019 [25] | SpecAugment | LAS | 3,640,657 | 88.97 M | 82.43% | 0.8232 | 0.8937
2021 [24] | TTS + Librosa | Wav2Vec 2.0 + CNN | 2,756,016 | 5.68 G | 87.50% | 0.8764 | 0.9431
2022 [23] | DCGAN | CNN + RNN | 20,408,016 | 15.47 G | 67.58% | 0.6646 | 0.7526
2024 [22] | Librosa | HMM | 6350 | 0.72 M | 58.55% | 0.5290 | 0.1101
2024 [26] | Librosa | CNN (Aug_model) | 9,964,016 | 4.36 G | 80.41% | 0.8043 | 0.8717
 | | RNN (Aug_model) | 415,088 | 1.08 G | 61.42% | 0.5855 | 0.6732
 | | CNN | 5,021,776 | 4.20 G | 74.96% | 0.7486 | 0.8131
 | | RNN | 216,456 | 657.36 M | 58.90% | 0.5475 | 0.6482
2025 [4] | - | CNN | 117,360 | 11.17 M | 52.06% | 0.5206 | 0.5535
Proposed | - | CNN | 201,363 | 80.27 M | 55.35% | 0.5044 | 0.4630
Proposed-I | XTTS | CNN | 201,363 | 80.27 M | 80.36% | 0.8001 | 0.8619
Proposed-II | XTTS | CNN + Transformer | 118,516 | 44.81 M | 88.90% | 0.8887 | 0.8951