AirSpeech: Lightweight Speech Synthesis Framework for Home Intelligent Space Service Robots
Abstract
1. Introduction
- We design a novel post-processing framework based on conditional GANs, integrating a Speech Feature Enhancement Module (SEM) with global normalization (GN) to stabilize adversarial training. This architecture significantly improves Mel spectrogram prediction accuracy and adaptability to diverse speech characteristics (see the first sketch after this list).
- We propose a global normalization strategy that replaces traditional batch normalization, eliminating noise accumulation in speech feature parameters (SFPs) and keeping feature distributions consistent during training. This improves stability and generalizability across home scenarios (see the second sketch after this list).
- We introduce the Sigmoid-Weighted Mean Absolute Error (SWMAE) loss, which dynamically weights spectral reconstruction errors to balance outlier robustness against fine-grained Mel spectrogram restoration, enabling a precise mapping from speech features to spectrograms (see the third sketch after this list).
- We validate AirSpeech on the LJSpeech and AISHELL3 datasets, achieving state-of-the-art performance on AISHELL3 and competitive results on LJSpeech in terms of both objective metrics (e.g., 0.558 SSIM, 8.76 MCD) and subjective MOS (4.27). These results indicate strong potential for high-quality interaction in diverse smart home applications.
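To make the first contribution concrete, here is a minimal sketch of one adversarial training step for a conditional post-processing generator that refines a coarse Mel spectrogram, using a least-squares GAN objective in the spirit of LSGAN (see References). The module interfaces (`generator`, `discriminator`), the optimizers, and the exact conditioning on SFPs are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def adversarial_step(generator, discriminator, g_opt, d_opt,
                     coarse_mel, sfp, real_mel):
    """One LSGAN-style step for a post-processing generator that refines
    a coarse Mel spectrogram conditioned on SFPs. Interfaces here are
    illustrative assumptions, not the paper's exact modules."""
    # Discriminator update: push real Mels toward 1, refined Mels toward 0.
    fake_mel = generator(coarse_mel, sfp)
    d_real = discriminator(real_mel, sfp)
    d_fake = discriminator(fake_mel.detach(), sfp)
    d_loss = 0.5 * (F.mse_loss(d_real, torch.ones_like(d_real)) +
                    F.mse_loss(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: make refined Mels look real to the discriminator.
    g_fake = discriminator(fake_mel, sfp)
    g_loss = F.mse_loss(g_fake, torch.ones_like(g_fake))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```

In a full system, the generator objective would also include a reconstruction term such as the SWMAE loss sketched in the third example below.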
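The second contribution, global normalization, can be sketched as a drop-in replacement for `BatchNorm1d` that normalizes with corpus-level statistics rather than per-batch statistics, so the feature distribution the network sees is identical at every training step. The sketch below assumes `global_mean` and `global_std` are precomputed over the whole training corpus; it illustrates the idea rather than reproducing the paper's exact layer.

```python
import torch
import torch.nn as nn

class GlobalNorm(nn.Module):
    """Normalizes SFPs with fixed, corpus-level statistics instead of
    per-batch statistics, avoiding the step-to-step distribution drift
    (and noise accumulation) of BatchNorm. Sketch only."""
    def __init__(self, num_features, global_mean, global_std, eps=1e-5):
        super().__init__()
        # Fixed statistics precomputed over the training corpus.
        self.register_buffer("mean", global_mean.view(1, num_features, 1))
        self.register_buffer("std", global_std.view(1, num_features, 1))
        self.eps = eps
        # Learnable affine parameters, mirroring BatchNorm's gamma/beta.
        self.gamma = nn.Parameter(torch.ones(1, num_features, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_features, 1))

    def forward(self, x):  # x: (batch, num_features, time)
        return self.gamma * (x - self.mean) / (self.std + self.eps) + self.beta
```

Unlike batch normalization, the statistics here never depend on batch composition, which is the property the contribution credits for stable adversarial training.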
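The paper's exact SWMAE formula is not reproduced here, so the third sketch shows one plausible reading: per-element absolute errors are reweighted by a sigmoid so that small errors keep near-full weight (preserving fine spectrogram detail) while large outlier errors are smoothly down-weighted (robustness). The hyper-parameters `alpha` and `tau`, and this exact functional form, are hypothetical.

```python
import torch

def swmae_loss(pred_mel, target_mel, alpha=5.0, tau=0.5):
    """One plausible sigmoid-weighted MAE, NOT the paper's published
    definition: weight stays near 1 for errors well below `tau` and
    decays smoothly toward 0 for outlier errors."""
    err = (pred_mel - target_mel).abs()
    weight = torch.sigmoid(alpha * (tau - err))
    return (weight * err).mean()
```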
2. Related Work
3. Method
3.1. Speech Feature Enhancement Module (SEM)
3.1.1. The Structure and Principle of SEM
3.1.2. A More Stable Adversarial Training Process
3.2. Feature Transformation Component (FTC)
3.3. Training Loss Terms
4. Experiment
4.1. Datasets and Speech Feature Selection
4.2. Hyper-Parameter Analysis
4.3. Comparison of TTS Performance
4.4. Ablation Study
5. Conclusions and Outlook
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.-Y. FastSpeech 2: Fast and high-quality end-to-end text to speech. arXiv 2020, arXiv:2006.04558.
- Łańcucki, A. FastPitch: Parallel text-to-speech with pitch prediction. In Proceedings of the ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 6588–6592.
- Chen, M.; Tan, X.; Li, B.; Liu, Y.; Qin, T.; Zhao, S.; Liu, T.-Y. AdaSpeech: Adaptive text to speech for custom voice. arXiv 2021, arXiv:2103.00993.
- Liu, S.; Su, D.; Yu, D. DiffGAN-TTS: High-fidelity and efficient text-to-speech with denoising diffusion GANs. arXiv 2022, arXiv:2201.11972.
- Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerry-Ryan, R. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: New York, NY, USA, 2018; pp. 4779–4783.
- Lim, D.; Jang, W.; Park, H.; Kim, B.; Yoon, J. JDI-T: Jointly trained duration informed transformer for text-to-speech without explicit alignment. arXiv 2020, arXiv:2005.07799.
- Huang, W.-C.; Hayashi, T.; Wu, Y.-C.; Kameoka, H.; Toda, T. Voice Transformer Network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining. arXiv 2019, arXiv:1912.06813.
- Ping, W.; Peng, K.; Gibiansky, A.; Arik, S.O.; Kannan, A.; Narang, S.; Raiman, J.; Miller, J. Deep Voice 3: 2000-speaker neural text-to-speech. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1094–1099.
- Wang, L.; Yu, Z.; Gao, S.; Mao, C.; Huang, Y. DETS: End-to-end single-stage text-to-speech via hierarchical diffusion GAN models. In Proceedings of the ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: New York, NY, USA, 2024; pp. 10916–10920.
- Song, K.; Zhang, Y.; Lei, Y.; Cong, J.; Li, H.; Xie, L.; He, G.; Bai, J. DSPGAN: A GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5.
- Du, C.; Yu, K. Phone-level prosody modelling with GMM-based MDN for diverse and controllable speech synthesis. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 30, 190–201.
- Xin, D.; Adavanne, S.; Ang, F.; Kulkarni, A.; Takamichi, S.; Saruwatari, H. Improving speech prosody of audiobook text-to-speech synthesis with acoustic and textual contexts. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5.
- Li, J.; Meng, Y.; Li, C.; Wu, Z.; Meng, H.; Weng, C.; Su, D. Enhancing speaking styles in conversational text-to-speech synthesis with graph-based multi-modal context modeling. In Proceedings of the ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 7917–7921.
- Zhang, Y.-J.; Song, W.; Yue, Y.; Zhang, Z.; Wu, Y.; He, X. MaskedSpeech: Context-aware speech synthesis with masking strategy. arXiv 2022, arXiv:2211.06170.
- Valle, R.; Santos, J.F.; Shih, K.J.; Badlani, R.; Catanzaro, B. High-acoustic fidelity text to speech synthesis with fine-grained control of speech attributes. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5.
- Peng, K.; Ping, W.; Song, Z.; Zhao, K. Non-autoregressive neural text-to-speech. In Proceedings of the 37th International Conference on Machine Learning, PMLR, Online, 13–18 July 2020; pp. 7586–7598.
- Kim, J.; Kim, S.; Kong, J.; Yoon, S. Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. Adv. Neural Inf. Process. Syst. 2020, 33, 8067–8077.
- Lu, H.; Wu, Z.; Wu, X.; Li, X.; Kang, S.; Liu, X.; Meng, H. VAENAR-TTS: Variational auto-encoder based non-autoregressive text-to-speech synthesis. arXiv 2021, arXiv:2107.03298.
- Guan, W.; Li, T.; Li, Y.; Huang, H.; Hong, Q.; Li, L. Interpretable style transfer for text-to-speech with ControlVAE and diffusion bridge. arXiv 2023, arXiv:2306.04301.
- Yang, D.; Yu, J.; Wang, H.; Wang, W.; Weng, C.; Zou, Y.; Yu, D. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 1720–1733.
- Choi, W.-G.; Kim, S.-J.; Kim, T.; Chang, J.-H. Prior-free guided TTS: An improved and efficient diffusion-based text-guided speech synthesis. Proc. Interspeech 2023, 4289–4293.
- Kim, J.; Kong, J.; Son, J. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 5530–5540.
- Guo, H.; Lu, H.; Wu, X.; Meng, H. A multi-scale time-frequency spectrogram discriminator for GAN-based non-autoregressive TTS. arXiv 2022, arXiv:2203.01080.
- Yang, J.; Bae, J.-S.; Bak, T.; Kim, Y.; Cho, H.-Y. GANSpeech: Adversarial training for high-fidelity multi-speaker speech synthesis. arXiv 2021, arXiv:2106.15153.
- Deng, Y.; Zhou, L.; Yi, Y.; Liu, S.; He, L. Prosody-aware SpeechT5 for expressive neural TTS. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5.
- Ren, Y.; Ruan, Y.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.-Y. FastSpeech: Fast, robust and controllable text to speech. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 3165–3174.
- van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Senior, A.; Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499.
- Prenger, R.; Valle, R.; Catanzaro, B. WaveGlow: A flow-based generative network for speech synthesis. In Proceedings of the ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: New York, NY, USA, 2019; pp. 3617–3621.
- Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst. 2020, 33, 17022–17033.
- Casanova, E.; Weber, J.; Shulby, C.D.; Junior, A.C.; Gölge, E.; Ponti, M.A. YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone. In Proceedings of the 39th International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 2709–2720.
- Chu, Y.; Xu, J.; Zhou, X.; Yang, Q.; Zhang, S.; Yan, Z.; Zhou, C.; Zhou, J. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv 2023, arXiv:2311.07919.
- Min, D.; Lee, D.B.; Yang, E.; Hwang, S.J. Meta-StyleSpeech: Multi-speaker adaptive text-to-speech generation. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 7748–7759.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
- Yang, Z.; Wu, Z.; Jia, J. Speaker characteristics guided speech synthesis. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; IEEE: New York, NY, USA, 2022; pp. 1–8.
- Pamisetty, G.; Easow, R.A.; Gupta, K.; Murty, K.S.R. Stream-TTS: A low-latency text-to-speech using Kolmogorov–Arnold networks for streaming speech applications. In Proceedings of the ICASSP 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: New York, NY, USA, 2025; pp. 1–2.
- Ellinas, N.; Vamvoukakis, G.; Markopoulos, K.; Chalamandaris, A.; Maniati, G.; Kakoulidis, P.; Raptis, S.; Sung, J.S.; Park, H.; Tsiakoulis, P. High quality streaming speech synthesis with low, sentence-length-independent latency. arXiv 2021, arXiv:2111.09052.
- Mehta, S.; Tu, R.; Beskow, J.; Székely, É.; Henter, G.E. Matcha-TTS: A fast TTS architecture with conditional flow matching. In Proceedings of the ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: New York, NY, USA, 2024; pp. 11341–11345.
- Anastassiou, P.; Chen, J.; Chen, J.; Chen, Y.; Chen, Z.; Chen, Z.; Cong, J.; Deng, L.; Ding, C.; Gao, L.; et al. Seed-TTS: A family of high-quality versatile speech generation models. arXiv 2024, arXiv:2406.02430.
- Kumar, K.; Kumar, R.; de Boissiere, T.; Gestin, L.; Teoh, W.Z.; de Brébisson, A.; Bengio, Y.; Courville, A.C. MelGAN: Generative adversarial networks for conditional waveform synthesis. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 14881–14892.
- Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Smolley, S.P. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2794–2802.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Ito, K. The LJ Speech Dataset. 2017. Available online: https://keithito.com/LJ-Speech-Dataset/ (accessed on 22 May 2025).
- Shi, Y.; Bu, H.; Xu, X.; Zhang, S.; Li, M. AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines. arXiv 2020, arXiv:2010.11567.
- Tan, X.; Chen, J.; Liu, H.; Cong, J.; Zhang, C.; Liu, Y.; Wang, X.; Leng, Y.; Yi, Y.; He, L.; et al. NaturalSpeech: End-to-end text-to-speech synthesis with human-level quality. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4234–4245.
- Chen, Y.; Niu, Z.; Ma, Z.; Deng, K.; Wang, C.; Zhao, J.; Yu, K.; Chen, X. F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 6255–6271.




| Model | Dataset | SSIM (↑) | MCD (↓) | F0 RMSE (↓) | STOI (↑) | PESQ (↑) | MOS (↑) |
|---|---|---|---|---|---|---|---|
| FastSpeech2 | LJSpeech | 0.502 | 10.76 | 46.62 | 0.792 | 1.302 | 3.93 ± 0.08 |
| FastSpeech2 | AISHELL3 | 0.465 | 15.84 | 60.68 | 0.705 | 1.343 | 3.82 ± 0.07 |
| FastPitch | LJSpeech | 0.511 | 10.65 | 47.01 | 0.779 | 1.307 | 3.95 ± 0.07 |
| FastPitch | AISHELL3 | 0.461 | 15.47 | 61.62 | 0.711 | 1.349 | 3.79 ± 0.08 |
| AdaSpeech | LJSpeech | 0.541 | 9.91 | 47.61 | 0.781 | 1.315 | 4.04 ± 0.07 |
| AdaSpeech | AISHELL3 | 0.484 | 14.99 | 60.33 | 0.707 | 1.352 | 3.94 ± 0.06 |
| DiffGAN-TTS | LJSpeech | 0.533 | 9.74 | 45.83 | 0.807 | 1.311 | 4.01 ± 0.07 |
| DiffGAN-TTS | AISHELL3 | 0.479 | 14.82 | 60.29 | 0.725 | 1.355 | 3.87 ± 0.06 |
| NaturalSpeech | LJSpeech | 0.565 | 8.50 | 42.50 | 0.895 | 1.350 | 4.56 ± 0.13 |
| NaturalSpeech | AISHELL3 | 0.495 | 14.05 | 58.90 | 0.765 | 1.375 | 4.09 ± 0.07 |
| F5-TTS | LJSpeech | 0.545 | 9.20 | 45.10 | 0.860 | 1.325 | 4.18 ± 0.06 |
| F5-TTS | AISHELL3 | 0.488 | 14.30 | 59.50 | 0.750 | 1.360 | 4.02 ± 0.08 |
| AirSpeech | LJSpeech | 0.558 | 8.76 | 43.96 | 0.885 | 1.336 | 4.27 ± 0.06 |
| AirSpeech | AISHELL3 | 0.501 | 13.74 | 58.17 | 0.781 | 1.388 | 4.13 ± 0.07 |
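For reference on the objective metrics above, mel-cepstral distortion (MCD) is conventionally computed as a scaled Euclidean distance between aligned mel-cepstral frames. The sketch below uses the standard 10·√2/ln 10 scale and skips the 0th (energy) coefficient; the paper's extraction settings (cepstral order, any DTW alignment) are not specified here and would need to match its setup.

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """MCD in dB between two already-aligned mel-cepstral sequences of
    shape (frames, order). Drops c0 and averages per-frame distances,
    as is conventional; frame alignment (e.g., DTW) is assumed done."""
    diff = mcep_ref[:, 1:] - mcep_syn[:, 1:]          # skip energy (c0)
    per_frame = np.sqrt((diff ** 2).sum(axis=1))      # per-frame distance
    return (10.0 * np.sqrt(2.0) / np.log(10.0)) * per_frame.mean()
```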
| Model | Params | RTF |
|---|---|---|
| FastSpeech2 | 25.54 M | 0.0058 |
| GANSpeech | 25.54 M | 0.0058 |
| DiffSpeech | 44.43 M | 0.2224 |
| DiffGAN-TTS | 32.81 M | 0.0105 |
| NaturalSpeech | 28.7 M | 0.013 |
| F5-TTS | 335.8 M | 0.15 |
| AirSpeech | 29.32 M | 0.0086 |
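The real-time factor (RTF) reported above is the ratio of synthesis wall-clock time to the duration of the generated audio, so values below 1.0 mean faster-than-real-time synthesis. A minimal measurement sketch follows, where `synthesize` is a hypothetical stand-in for any model's inference call returning a 1-D waveform:

```python
import time

def real_time_factor(synthesize, text, sample_rate):
    """RTF = wall-clock synthesis time / generated audio duration.
    `synthesize` is a hypothetical inference call returning a 1-D
    waveform array; `sample_rate` is the output rate in Hz."""
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(waveform) / sample_rate)
```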
| Model | LJSpeech SSIM/CMOS | AISHELL3 SSIM/CMOS |
|---|---|---|
| AirSpeech | 0.558/0 | 0.501/0 |
| with MAE | 0.551/−0.07 | 0.493/−0.08 |
| with BN | 0.545/−0.13 | 0.484/−0.17 |
| w/o SEM | 0.547/−0.11 | 0.489/−0.12 |