Article

AirSpeech: Lightweight Speech Synthesis Framework for Home Intelligent Space Service Robots

1 Beijing Research Institute of Automation for Machinery Industry Co., Ltd., Beijing 100120, China
2 Yanqi Lake Institute of Basic Manufacturing Technology Research Co., Ltd., Beijing 100044, China
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(1), 239; https://doi.org/10.3390/electronics15010239
Submission received: 12 November 2025 / Revised: 17 December 2025 / Accepted: 30 December 2025 / Published: 5 January 2026

Abstract

Text-to-Speech (TTS) methods typically employ a sequential approach with an Acoustic Model (AM) and a vocoder, using a Mel spectrogram as an intermediate representation. However, in home environments, TTS systems often struggle with issues such as inadequate robustness against environmental noise and limited adaptability to diverse speaker characteristics. The quality of the Mel spectrogram directly affects the performance of TTS systems, yet existing methods overlook the potential of enhancing Mel spectrogram quality through more comprehensive speech features. To address the complex acoustic characteristics of home environments, this paper introduces AirSpeech, a post-processing model for Mel-spectrogram synthesis. We adopt a Generative Adversarial Network (GAN) to improve the accuracy of Mel spectrogram prediction and enhance the expressiveness of synthesized speech. By incorporating additional conditioning extracted from synthesized audio using specified speech feature parameters, our method significantly enhances the expressiveness and emotional adaptability of synthesized speech in home environments. Furthermore, we propose a global normalization strategy to stabilize the GAN training process. Through extensive evaluations, we demonstrate that the proposed method significantly improves the signal quality and naturalness of synthesized speech, providing a more user-friendly speech interaction solution for smart home applications.

1. Introduction

In the speech interaction system of home intelligent service robots, the evolution of neural Text-to-Speech (TTS) technology directly impacts human–computer interaction experiences. The current mainstream two-stage TTS framework generates an intermediate acoustic representation (typically a Mel spectrogram) via an acoustic model (AM) and synthesizes waveform signals using a vocoder. This architecture faces unique challenges in home environments. To meet the real-time response requirements of service robots, non-autoregressive models (NAR-TTS) [1,2,3,4] have been shown to significantly enhance inference speed through parallel processing advantages, thereby avoiding the potential response delays associated with autoregressive models (AR-TTS) [5]. This improvement is particularly critical for smart home control commands that require rapid feedback. However, the simplification of temporal dependencies inherent in the NAR mechanism may result in a decrease in prosodic naturalness, which can manifest as insufficient emotional expression or a mechanical tone in complex interactive dialogs. To address these quality degradation issues, recent innovations in joint training paradigms [6] demonstrate potential for single-stage alignment learning without pretrained duration extractors, yet struggle to maintain precise phoneme alignment under multi-speaker scenarios. Although NAR-TTS methods address the inference speed [7] and stability [8] issues of AR-TTS approaches, the elimination of the autoregressive mechanism often leads to a reduction in the quality of the synthesized Mel spectrograms. Recent advancements in hierarchical diffusion GAN architectures [9] have demonstrated potential in improving one-to-many mapping capabilities through latent variable prediction, yet their application in multi-scenario home environments remains underexplored. Additionally, the distribution mismatch between the imperfectly predicted and the ground-truth Mel spectrograms emerges as a primary factor contributing to the degradation of TTS model performance. While novel vocoder designs employing time frequency domain supervision [10] show promise in bridging this gap through DSP-guided feature alignment, their integration with modern acoustic models requires further optimization. Moreover, challenges such as maintaining speech clarity and expressiveness across different speakers are commonly encountered in home environments.
Typically, two approaches are employed to mitigate this issue. The first involves enriching the conditioning beyond textual or phonemic encoding, incorporating factors such as prosody [11,12], semantics [13,14], and speech features [1,2]. This strategy enables the AM to more accurately predict Mel spectrograms, resulting in higher speech quality and naturalness [15]. Recent work on attention distillation mechanisms [16] provides enhanced alignment stability through layer-wise refinement, though it introduces computational overhead from parallel attention masking. Another approach integrates generative models into the AM, including Glow [17], VAE [18,19], and diffusion models [20,21], which have showcased outstanding performance across various domains. The emergence of adversarial duration predictors with hierarchical denoising mechanisms [9] has shown particular efficacy in enhancing prosody diversity, though their computational complexity poses deployment challenges. Studies employing forward attention mechanisms [6] reveal improved monotonic alignment through positional encoding priors, yet exhibit sensitivity to audio quality variations in practical applications. Recent innovations in universal vocoder design [10] address speaker/style generalization through hybrid DSP-neural architectures, yet struggle with preserving emotional expressiveness. However, despite enhancing synthesized Mel spectrograms, these methods are framework-specific and challenging to adapt to diverse non-autoregressive acoustic models, limiting deployment flexibility. Generative Adversarial Networks (GANs) exhibit promising potential in rectifying low-quality Mel spectrograms. For instance, training the vocoder [22] as an enhancement model that learns audio generation from smoothed Mel spectrograms has shown efficacy. Some studies have utilized adversarial training by integrating a discriminator into the AM to discern generated Mel spectrograms from ground-truth ones, narrowing the disparity between them [23,24]. The development of joint convolutional discriminators with multi-resolution analysis [9] further refines this paradigm, though stability issues persist in cross-domain applications. However, this approach necessitates extensive hyperparameter tuning and experimentation to balance the generator and discriminator, particularly when handling the complex acoustic variations inherent in multi-role home environments.
Utilizing adversarial learning and dynamic feature alignment, we propose AirSpeech, a lightweight non-autoregressive acoustic model tailored for home intelligent service robots, achieving superior robustness in speech synthesis under domestic noise. Our contributions are as follows:
  • We design a novel post-processing framework based on conditional GANs, integrating a Speech Feature Enhancement Module (SEM) with global normalization (GN) to stabilize adversarial training. This architecture significantly improves Mel spectrogram prediction accuracy and adaptability to diverse speech characteristics.
  • We propose a global normalization strategy to replace traditional batch normalization, eliminating noise accumulation in speech feature parameters (SFPs) and ensuring consistent feature distributions during training. This enhances stability and generalizability across home scenarios.
  • We introduce the Sigmoid-Weighted Mean Absolute Error (SWMAE) loss, which dynamically weights spectral reconstruction errors to balance outlier robustness and fine-grained Mel spectrogram restoration, enabling precise speech feature-to-spectrogram mapping.
  • We validate AirSpeech on the LJSpeech and AISHELL3 datasets, demonstrating state-of-the-art performance on AISHELL3 and competitive results on LJSpeech in terms of objective metrics (e.g., 0.558 SSIM and 8.76 MCD on LJSpeech) and subjective MOS (4.27), indicating strong potential for high-quality interaction in diverse smart home applications.

2. Related Work

Text-to-Speech (TTS) models based on deep learning have become mainstream by jointly optimizing text feature extraction and acoustic modeling [25,26], thereby significantly enhancing the naturalness and fluency of synthesized speech. For instance, FastSpeech 2 [1] introduces phoneme duration prediction and prosody modeling to generate speech that aligns more closely with the natural conversational rhythms encountered in home environments. Recent advancements in single-stage architectures, such as hierarchical diffusion GANs [9], further improve naturalness and prosody diversity by directly modeling latent variables through conditional adversarial learning, enabling one-to-many mappings for diverse pitch and rhythm variations. The emergence of jointly trained duration-informed transformers [6] demonstrates the feasibility of eliminating explicit alignment dependencies through auxiliary CTC losses and forward attention mechanisms, though its multi-speaker generalization requires further investigation. Subsequent innovations introduced conditional flow matching frameworks [17] to simplify training dynamics through linear optimal transport paths, achieving high-fidelity synthesis with minimal inference steps. Meanwhile, encoder–decoder architectures combining rotational positional embeddings and monotonic alignment constraints enhanced temporal coherence while reducing memory usage. In combination with vocoder technology [27,28,29], including robust universal vocoders guided by digital signal processing (DSP) principles [10], the fidelity of the speech is further improved, ensuring high-quality output even under low-resource conditions such as noisy domestic settings. These vocoders leverage time-frequency domain supervision from DSP-synthesized waveforms to mitigate spectral mismatches and artifacts.
In the domain of personalized and adaptive speech generation, techniques such as transfer learning [22,30] and meta-learning enable the rapid generation of voices that match the timbre and intonation preferences of individual household members, even with limited target user data or in a zero-shot learning scenario. Recent approaches employ self-distillation strategies to disentangle speaker characteristics through contrastive speech pairs without architectural modifications, enabling flexible voice conversion. Non-autoregressive paradigms like ParaNet [16] achieve parallel spectrogram prediction through iterative attention refinement and positional encoding priors, though they introduce computational overhead due to attention masking operations. Multimodal foundation models [31] further achieve in-context style adaptation through cross-attention mechanisms between speech prompts and text inputs. This supports multi-user personalized recognition and response. To meet the multilingual demands of smart homes, speech synthesis models based on multitask learning can facilitate functions [32] such as code-switching between Chinese and English or dialectal transitions, thereby accommodating the diverse interaction habits of different family members. Moreover, by integrating environmental sensor data (e.g., noise levels, time, and geographic location), the system can adaptively adjust parameters such as volume, speaking rate, and emotional expression, for example, automatically switching to a soft mode during nighttime.
In the realm of emotion- and intent-driven speech synthesis, the incorporation of mechanisms such as emotion embedding vectors [33] or conditional Generative Adversarial Networks (cGANs) [9,34] enables the model to modulate prosody and emotional intensity based on the affective content of the dialog (e.g., comforting, alerting, entertaining). Emerging methods integrate reinforcement learning with speech emotion recognition modules to optimize expressiveness through reward shaping. For instance, the system can generate speech with a rapid, high-pitched tone when broadcasting urgent alerts, and a softer, slower intonation when narrating a story. By integrating a semantic understanding module, the model achieves context-based style transfer, dynamically adjusting the formality or warmth of the broadcast according to the user’s query (e.g., weather updates, scheduling). Approaches like diffusion-based duration predictors [9] enhance prosody diversity by stochastically modeling phoneme durations, while adversarial training frameworks [10] improve robustness to unseen speakers and speaking styles. The development of VAE-based vocoders [16] offers an alternative pathway for parallel waveform generation without distillation requirements, though it faces challenges in matching the audio quality of teacher-student frameworks.
Furthermore, considering the resource constraints typical of home robotics, research has focused on model compression techniques—such as knowledge distillation and quantization-aware training—and streaming synthesis architectures [35,36] to reduce inference latency to within 200 ms while maintaining sound quality, thereby fulfilling real-time interaction requirements. Recent work on efficient diffusion-GAN hybrids [9] demonstrates high-fidelity synthesis with minimal denoising steps, balancing quality and computational efficiency. Non-autoregressive architectures [37] eliminate sequential dependency through parallel latent code prediction, while 1D U-Net decoders with factorized attention [38] optimize memory usage for edge deployment. Innovations in layer-wise attention distillation [16] enhance alignment stability for feed-forward transformers through multi-stage refinement, though require careful tuning of positional encoding rates to prevent spectral artifacts.
In this work, we integrate a dedicated encoder that encapsulates fine-grained feature encoding to extend the speech feature space beyond pitch and energy. By introducing multidimensional speech feature parameters as generation conditions, our approach significantly enhances the expressiveness and emotional adaptability of synthesized speech in home environments.

3. Method

The architecture of AirSpeech is depicted in Figure 1. AirSpeech is built upon a pre-trained acoustic model and incorporates a Feature Transformation Component (FTC) and a Speech Feature Enhancement Module (SEM). Specifically, the FTC includes a Fine-grained Feature Encoder (FFE), a Single-head Self-attention Block (SSB), and a decoder whose structure is the inverse of the encoder. The SEM is a generative model based on a conditional GAN (cGAN) that enhances its inputs through adversarial training.

3.1. Speech Feature Enhancement Module (SEM)

3.1.1. The Structure and Principle of SEM

Given fixed speech feature parameters, there are multiple possible Mel spectrograms. An SFPS that is closer to the ground-truth distribution allows the FTC to generate a higher-quality MelS. Thus, we design the SEM to enhance SFPS.
As shown in Figure 1c, the generator is a fully convolutional neural network. It up-samples SFPS through five transposed convolution layers and a standard convolution layer, generating SFPES. We use LReLU as the activation function of the generator. The discriminator distinguishes the ground-truth sequences from SFPES. Residual blocks are used to streamline the discriminator's learning process: they allow the discriminator to concentrate on subtle variations between the generated and ground-truth samples rather than having to learn the complete mapping. As in [29,39], we do not feed noise to the generator as an additional input.
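To make the generator structure concrete, the following is a minimal PyTorch sketch of the SEM generator described above. The channel width, kernel sizes, strides, and the six input SFP channels are assumptions of this sketch; the text only specifies five transposed convolutions, one standard convolution, and LReLU activations.

```python
import torch
import torch.nn as nn

class SEMGenerator(nn.Module):
    """Illustrative SEM generator: five transposed-convolution up-sampling
    layers followed by a standard convolution, with LReLU activations.
    Channel width, kernel sizes, and strides are placeholder assumptions."""
    def __init__(self, sfp_channels: int = 6, hidden: int = 64):
        super().__init__()
        layers, ch = [], sfp_channels
        for _ in range(5):                                   # five transposed convolution layers
            layers += [nn.ConvTranspose1d(ch, hidden, kernel_size=4, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            ch = hidden
        layers.append(nn.Conv1d(hidden, sfp_channels, kernel_size=7, padding=3))
        self.net = nn.Sequential(*layers)

    def forward(self, sfp_s: torch.Tensor) -> torch.Tensor:  # sfp_s: (batch, channels, time)
        return self.net(sfp_s)                               # enhanced sequence SFPES

# Example: enhance a batch of SFP sequences.
gen = SEMGenerator()
print(gen(torch.randn(2, 6, 50)).shape)   # torch.Size([2, 6, 1600]) after 2^5 up-sampling
```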

3.1.2. A More Stable Adversarial Training Process

The normalization of the discriminator affects the stability of the adversarial training process [40]. In speech synthesis, Batch Normalization (BN) is frequently utilized to address the discriminator's difficulty in making judgments based on the global features of the samples. BN ensures that all features in the hidden-layer outputs are normalized on the same scale by introducing learnable scaling and shifting factors. However, this dynamic normalization leads to noise accumulation in SFPS and MelS over training time in the discriminator, thereby impacting discrimination accuracy and training stability. SFPs possess definite physical meanings and absolute value ranges. BN forces the feature distribution of each mini-batch to approximate a standard normal distribution, which disrupts the physical consistency of these acoustic features and introduces stochastic noise dependent on batch composition. Unlike Spectral Normalization (SN) or Weight Normalization (WN), which constrain model weights to satisfy Lipschitz continuity, our objective is to stabilize the data manifold itself. We therefore propose a global normalization (GN) approach to address this issue:
b = a \Big/ \sqrt[\rho]{\frac{1}{N}\sum_{i=1}^{N}(a_i)^{\rho}}    (1)
where a and b represent the original and normalized feature vectors, respectively; a_i denotes the i-th value of a; N denotes the total number of frequency points; and ρ is a positive integer parameter. All parameters in Equation (1) are static, allowing the discriminator to observe consistent feature distributions at different training stages and facilitating the propagation of global information back to the generator. This contributes to improved adversarial training stability.
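For reference, a minimal implementation of the global normalization in Equation (1) could look as follows; the default ρ = 2 (an RMS-style scaling) and the small epsilon guard against division by zero are assumptions of this sketch.

```python
import torch

def global_normalize(a: torch.Tensor, rho: int = 2, eps: float = 1e-8) -> torch.Tensor:
    """Global normalization of Equation (1): divide each feature vector by the
    rho-th root of the mean of its elements raised to the power rho. There are
    no learnable scale/shift terms, so the statistics stay static over training."""
    denom = a.pow(rho).mean(dim=-1, keepdim=True).pow(1.0 / rho)
    return a / (denom + eps)

# Example: normalize a batch of speech feature parameter frames (batch, N).
feats = torch.randn(4, 80)
print(global_normalize(feats).shape)   # torch.Size([4, 80])
```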

3.2. Feature Transformation Component (FTC)

To capture the acoustic feature information in SFPES, we introduce the FFE. We add a speaker embedding to SFPES to preserve speaker information in multi-speaker tasks. Specifically, the speaker embedding is a learnable vector retrieved from a lookup table conditioned on the speaker ID. It represents the global acoustic characteristics of the target speaker and is concatenated with the input features to control the timbre of the synthesized speech. The FFE takes SFPES as input and generates the speech feature parameter encoding hf. As shown in Figure 1b, the FFE uses three residual block stacks for down-sampling [41], each operating on the SFPES data with one-dimensional convolutions. The skip connections in the residual blocks allow the input signal to propagate from lower to higher layers; we retain them not only to prevent gradient explosion and vanishing as the network deepens but also to preserve the temporal information in SFPES so that it can be restored during decoding. The FFE utilizes group normalization, which is insensitive to batch size, to enhance adaptability to small batches.
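The sketch below illustrates one possible realization of the FFE just described: three down-sampling residual stacks of one-dimensional convolutions with group normalization and skip connections, plus a speaker embedding lookup concatenated to the input. All channel counts, kernel sizes, strides, group sizes, and the number of speakers are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ResidualStack(nn.Module):
    """One down-sampling residual stack: 1-D convolutions with group
    normalization and a skip connection that preserves temporal cues."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 2, groups: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=5, stride=stride, padding=2),
            nn.GroupNorm(groups, out_ch),
            nn.LeakyReLU(0.2),
            nn.Conv1d(out_ch, out_ch, kernel_size=5, padding=2),
            nn.GroupNorm(groups, out_ch),
        )
        self.skip = nn.Conv1d(in_ch, out_ch, kernel_size=1, stride=stride)

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class FineGrainedFeatureEncoder(nn.Module):
    """Three residual stacks over SFPES, with a learnable speaker embedding
    looked up by speaker ID and concatenated to the input (multi-speaker case)."""
    def __init__(self, sfp_dim: int = 6, num_speakers: int = 200,
                 spk_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.spk_table = nn.Embedding(num_speakers, spk_dim)   # lookup table keyed by speaker ID
        self.stacks = nn.Sequential(
            ResidualStack(sfp_dim + spk_dim, hidden),
            ResidualStack(hidden, hidden),
            ResidualStack(hidden, hidden),
        )

    def forward(self, sfp_es: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # sfp_es: (batch, sfp_dim, time); speaker_id: (batch,)
        spk = self.spk_table(speaker_id).unsqueeze(-1).expand(-1, -1, sfp_es.size(-1))
        return self.stacks(torch.cat([sfp_es, spk], dim=1))    # h_f
```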
Introducing MelRS alongside hf can reduce decoding complexity and improve the accuracy of the decoder's predictions. To establish the alignment mapping between MelRS and hf, we employ the Single-head Self-attention Block (SSB) to fuse their features. First, we project MelRS to the same dimension as hf and align its length via linear interpolation. Then, a single-head self-attention mechanism processes the combined features. By calculating attention weights across the sequence, the SSB effectively mixes the coarse spectral context from MelRS with the fine-grained details of hf while preserving the original lengths. The output of the SSB is summed with hf and fed into the decoder.
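A minimal sketch of this fusion step is shown below, using PyTorch's built-in attention layer. The additive combination of the projected MelRS with hf and the specific dimensions are assumptions; the text only specifies projection, linear-interpolation length alignment, single-head self-attention, and a residual sum with hf.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadSelfAttentionBlock(nn.Module):
    """Illustrative SSB: project MelRS to the dimension of hf, align its length
    by linear interpolation, fuse both streams (additively here, an assumption),
    and process them with single-head self-attention before the residual sum."""
    def __init__(self, mel_dim: int = 80, hidden_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(mel_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=1, batch_first=True)

    def forward(self, mel_rs: torch.Tensor, h_f: torch.Tensor) -> torch.Tensor:
        # mel_rs: (batch, T_mel, mel_dim); h_f: (batch, T_hf, hidden_dim)
        mel = self.proj(mel_rs)                                  # match feature dimension
        mel = F.interpolate(mel.transpose(1, 2), size=h_f.size(1),
                            mode="linear", align_corners=False).transpose(1, 2)
        fused = mel + h_f                                        # combine coarse and fine features
        out, _ = self.attn(fused, fused, fused)                  # single-head self-attention
        return out + h_f                                         # summed with hf, fed to the decoder
```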

3.3. Training Loss Terms

The model is trained with a loss function combining SEM and FTC terms. For LSEM, we adopt the least-squares loss, because it encourages generated samples to approach the decision boundary of the discriminator as closely as possible. We also compute the L1 distance between the generated samples and the ground-truth samples as a reconstruction loss.
The Mean Absolute Error (MAE) performs poorly in the presence of outliers. Thus, LFTC is an adapted version of the MAE loss, which we refer to as the Sigmoid-Weighted Mean Absolute Error (SWMAE):
L_{FTC} = \frac{1}{N}\sum_{i=1}^{N}\left(y_t^{i}-y_p^{i}\right)\left[\sigma\!\left(y_t^{i}-y_p^{i}\right)-1\right]    (2)
where y_t^i represents the i-th value of the real MelGT sequence, y_p^i represents the i-th value of the predicted MelS sequence, and N denotes the total number of frequency points. By introducing the sigmoid weighting factor σ into the MAE, scaling of the differences between MelGT and MelS enhances robustness to outliers. Simultaneously, this error-driven mechanism automatically amplifies the focus on fine-grained spectral textures that are typically prone to over-smoothing. This allows the model to balance the reconstruction of high-energy formants and delicate spectral details without relying on rigid frequency-dependent constraints, yielding outputs with higher intelligibility. The total loss is then:
L = L_{SEM} + \gamma L_{FTC}    (3)
where γ is a hyperparameter of the proposed model that will be analyzed in Section 4.
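For illustration, the sketch below implements one plausible reading of the SWMAE in Equation (2), in which each absolute spectral error is weighted by a sigmoid of its own magnitude (keeping the weight below 1, consistent with the γ analysis in Section 4.2), together with the total loss of Equation (3). This is a hedged interpretation, not necessarily the exact published formulation.

```python
import torch

def swmae_loss(mel_gt: torch.Tensor, mel_pred: torch.Tensor) -> torch.Tensor:
    """One plausible reading of SWMAE (Eq. 2): the absolute spectral error is
    weighted by a sigmoid of its own magnitude, so the weight stays below 1
    while larger, over-smoothing-prone errors receive relatively more focus."""
    diff = (mel_gt - mel_pred).abs()
    return (torch.sigmoid(diff) * diff).mean()

def total_loss(loss_sem: torch.Tensor, mel_gt: torch.Tensor,
               mel_pred: torch.Tensor, gamma: float = 5.0) -> torch.Tensor:
    """Total objective of Eq. (3): adversarial SEM loss plus the gamma-weighted
    FTC reconstruction loss (gamma = 5 per the analysis in Section 4.2)."""
    return loss_sem + gamma * swmae_loss(mel_gt, mel_pred)
```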

4. Experiment

4.1. Datasets and Speech Feature Selection

We conducted experiments on two public speech corpora: the English LJSpeech [42] and the Mandarin AISHELL-3 [43]. The audio in both datasets is sampled at 22,050 Hz. We extracted 80-channel Mel spectrograms using a Short-Time Fourier Transform (STFT) with a window length of 1024 and a hop size of 256. During training, we used the Adam optimizer with β1 = 0.9, β2 = 0.98, and λ = 0.01. The initial learning rate was set to the standard value, and the batch size was set to 16. All model training and fine-tuning were performed on an NVIDIA RTX 3090 GPU. To ensure the effectiveness of the proposed AirSpeech model, we utilized FastSpeech2 as the pre-trained acoustic model and trained it in a fully end-to-end manner. In the initial phase of training, since the FTC could not yet stably generate meaningful training targets (MelS), the Speech Feature Enhancement Module (SEM) was not involved in the first 100 k steps; joint training was conducted only after the overall performance of the model stabilized, for a total of 350 k training steps. We used HiFi-GAN as the vocoder, and both the vocoder and the pre-trained acoustic model were trained separately on each dataset.
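The Mel spectrogram extraction settings above can be reproduced, for example, with librosa; the choice of toolkit and the log floor are assumptions, while the sampling rate, window length, hop size, and number of Mel channels follow the text.

```python
import librosa
import numpy as np

def extract_mel(wav_path: str) -> np.ndarray:
    """Compute an 80-channel Mel spectrogram with the STFT settings stated in
    Section 4.1 (22,050 Hz audio, window length 1024, hop size 256). librosa
    and the 1e-5 log floor are assumptions; the paper does not name its toolkit."""
    wav, sr = librosa.load(wav_path, sr=22050)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=1024, win_length=1024, hop_length=256, n_mels=80)
    return np.log(np.clip(mel, 1e-5, None))     # (80, frames) log-Mel spectrogram
```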
For our experiments, we used the Surfboard library to extract a comprehensive set of speech features from audio, including formant frequencies (F1, F2, F3, F4), fundamental frequency (F0), logarithmic energy (LE), zero-crossing rate (ZCR), intensity (INT), spectral centroid (SCE), spectral slope (SSL), and spectral spread (SSP). To quantify the contribution of each feature, we conducted a sensitivity analysis based on Permutation Feature Importance. The sensitivity score S_i for a given feature i is calculated as follows:
S_i = L_{perm}^{(i)} - L_{orig}
where L_orig represents the model's baseline loss on the validation set, and L_perm^(i) denotes the loss computed after randomly shuffling the values of feature i across the dataset. Based on this sensitivity analysis, we assessed the impact of each feature on AirSpeech's performance and removed F4 and LE due to their minimal influence, as indicated by their low scores.
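A compact sketch of this Permutation Feature Importance procedure is shown below; the model interface, loss function, and array layout are assumed for illustration.

```python
import numpy as np

def permutation_sensitivity(model, features: np.ndarray, targets: np.ndarray,
                            loss_fn, seed: int = 0) -> np.ndarray:
    """Permutation Feature Importance: S_i = L_perm(i) - L_orig.
    `model` (a callable returning predictions) and `loss_fn` are placeholders;
    `features` is a (num_samples, num_features) validation array."""
    rng = np.random.default_rng(seed)
    baseline = loss_fn(model(features), targets)            # L_orig on the validation set
    scores = []
    for i in range(features.shape[1]):
        shuffled = features.copy()
        shuffled[:, i] = rng.permutation(shuffled[:, i])    # destroy feature i's information
        scores.append(loss_fn(model(shuffled), targets) - baseline)
    return np.asarray(scores)                               # higher score = more influential feature
```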
Subsequently, we performed a Spearman correlation analysis on the remaining features, focusing on those with low intercorrelation to maintain the model’s generalizability. This refined selection process led us to retain F1, F2, F0, INT, SCE, and SSP as the core speech features. Notably, despite the high correlation between formant frequencies, we chose to keep F1 and F2 to preserve the model’s ability to perceive a full range of vowel spaces. The outcomes of these analyses are presented in Figure 2 and Figure 3.
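The Spearman-based redundancy screening can be sketched as follows with SciPy; the 0.8 correlation threshold is an assumption, since the text only states that features with low intercorrelation were retained (with F1 and F2 kept deliberately).

```python
import numpy as np
from scipy.stats import spearmanr

def redundant_pairs(feature_matrix: np.ndarray, names: list, thresh: float = 0.8):
    """Flag highly intercorrelated speech-feature pairs via Spearman's rho.
    `feature_matrix` is (num_utterances, num_features); `names` lists the
    candidate features in column order. The threshold is illustrative."""
    corr, _ = spearmanr(feature_matrix)                     # (F x F) rank-correlation matrix
    return [(names[i], names[j], float(corr[i, j]))
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if abs(corr[i, j]) > thresh]
```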

4.2. Hyper-Parameter Analysis

We investigate the effect of the hyperparameter γ in Equation (3) on model performance on LJSpeech. We conduct a weight analysis by varying γ while keeping the other variables constant. The Structural Similarity Index Measure (SSIM) and Mel Cepstral Distortion (MCD) are employed to evaluate the speech synthesized by AirSpeech at different training steps.
Figure 4 illustrates the impact of γ on model convergence and Mel spectrogram clarity. A lower γ value results in slower initial convergence, as LFTC, whose per-element weight σ is below 1, remains consistently smaller than the plain MAE loss, hindering the FTC's learning efficiency. Conversely, a higher γ accelerates convergence but can produce less distinct Mel spectrograms because the model struggles to learn from LSEM when LFTC is heavily weighted. Balancing these factors, a γ value of 5 is optimal; we therefore set γ to 5 in the studies of Section 4.3 and Section 4.4 to keep the training speed of the various modules consistent.

4.3. Comparison of TTS Performance

We evaluate our model and the comparison models with both subjective and objective evaluations. For the objective evaluation, we assess the performance of AirSpeech and other models [1,2,3,4,44,45] using SSIM, MCD, F0 root-mean-squared error (F0 RMSE), short-term objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ). For the subjective evaluation, we use the mean opinion score (MOS) to assess voice naturalness. We generate 50 utterances from the English and Mandarin test sets with each model to compare their performance. A team of 30 native English speakers and 30 native Mandarin speakers rates each audio sample on a scale from 1 to 5, and the average MOS is reported as the final result.
Table 1 presents the results for LJSpeech and AISHELL3, highlighting the superior performance of our proposed model, particularly on the AISHELL3 dataset. On AISHELL3, AirSpeech achieves the best F0 RMSE, STOI, and PESQ, indicating that the FFE can effectively model speech features, leading to improved accuracy, intelligibility, and signal quality in synthesized speech. Our model also obtains the best MCD on AISHELL3 and a higher SSIM than AdaSpeech on both datasets. Compared to recent advanced models such as NaturalSpeech and F5-TTS, AirSpeech demonstrates competitive performance on the single-speaker LJSpeech dataset and superior robustness on the multi-speaker AISHELL3 dataset, where it achieves the highest SSIM and MOS among all baselines. AdaSpeech requires additional recordings from the same speaker as input during inference, while the proposed model does not. Thus, the proposed cGAN-based feature enhancement method not only provides a more accurate prediction of SFPS but also effectively reintegrates absent high-frequency information, addressing the concern of over-smoothing in MelS. Subjectively, our model garners the highest MOS on AISHELL3, reflecting the naturalness boost from the SEM and FTC integration. Overall, the model's objective and subjective scores are highly competitive with state-of-the-art models, underscoring its effectiveness.
To verify the efficiency of the proposed framework, we further compared the model complexity and inference speed. We measured the Real-Time Factor (RTF) on an NVIDIA Tesla V100 GPU. As shown in Table 2, although AirSpeech introduces an additional Speech Feature Enhancement Module, the increase in model parameters is limited to 3.78 M compared to FastSpeech2. Thanks to the non-autoregressive and fully convolutional structure of the SEM, AirSpeech achieves an RTF of 0.0086. This is significantly faster than generative models requiring iterative steps, confirming its suitability for real-time interaction in home service robots.
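For reference, the RTF reported above is conventionally computed as synthesis wall-clock time divided by the duration of the generated audio; a minimal measurement sketch with placeholder model and inputs is shown below.

```python
import time
import torch

@torch.no_grad()
def real_time_factor(model, batch, generated_seconds: float) -> float:
    """RTF = synthesis time / duration of the generated audio; values below 1
    mean faster than real time. `model` and `batch` are placeholders."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()               # ensure queued GPU work is finished
    start = time.perf_counter()
    model(batch)                               # one synthesis forward pass
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / generated_seconds
```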

4.4. Ablation Study

We performed three ablation studies to affirm the effectiveness of the components in our proposed model. The comparative analysis, presented in Table 3, reveals the impact on SSIM and CMOS of altering different elements on the LJSpeech and AISHELL3 datasets. Switching from GN to BN in the SEM degrades performance, suggesting GN's superior role in stabilizing training and enhancing sample quality. Similarly, substituting SWMAE with the traditional MAE in the FTC shows SWMAE's advantage in managing outliers and improving MelS clarity. Removing the SEM leads to a slight decrease in both SSIM and CMOS. This demonstrates that, even when the errors introduced by the speech feature extractor are not corrected, the FTC can still extract acoustic feature information from the speech feature parameter sequence and reconstruct MelS with an acceptable loss.

5. Conclusions and Outlook

This study addresses the speech interaction requirements of home intelligent service robots by proposing an innovative non-autoregressive acoustic model architecture called AirSpeech. Leveraging adversarial learning and a Speech Feature Enhancement Module (SEM) with a dynamic feature alignment mechanism, the system demonstrates superior robustness in speech generation amidst typical domestic noise. The introduction of a global normalization method significantly enhances the model's adaptability to speech characteristics across different age groups. The method employs the Sigmoid-Weighted Mean Absolute Error (SWMAE) as the loss function to enable the conversion from specific speech features to Mel spectrograms within an acceptable loss margin. The optimization strategy based on SWMAE ensures a stable mapping from user-specific features to highly expressive speech spectrograms while maintaining real-time generation efficiency.
However, certain limitations inherent to the proposed framework must be acknowledged to provide a comprehensive evaluation. First, regarding environmental robustness, while AirSpeech effectively mitigates typical domestic background noise, its performance gains are limited under extremely low Signal-to-Noise Ratio conditions. In such scenarios, the precision of the extracted speech feature parameters is compromised and may introduce spectral artifacts in the synthesized output. Second, as a lightweight non-autoregressive architecture optimized for real-time edge deployment, the model prioritizes inference speed and stability. Consequently, compared to large-scale and computationally intensive diffusion models, AirSpeech may exhibit a constrained prosodic dynamic range when synthesizing highly expressive speech. Finally, in the synthesis of extremely short or isolated commands, the model occasionally demonstrates reduced prosodic naturalness due to the sparsity of contextual phoneme duration information.
In future work, we plan to develop an environment-aware adaptive noise suppression module to achieve real-time responses to sudden household noises (e.g., doorbells, appliance alarms). Additionally, integrating user facial expressions and body language captured by the robot’s visual sensors to build a dynamic association model between speech prosody and emotional states represents a promising direction for further research.
To facilitate reproducibility and future comparisons, the source code and pre-trained models of AirSpeech will be made publicly available upon publication.

Author Contributions

Conceptualization, X.Q. and F.P.; methodology, X.Q.; software, J.G.; validation, X.Q., S.H. and Y.S.; formal analysis, X.Z.; investigation, X.Q.; resources, F.P.; data curation, X.Q.; writing—original draft preparation, X.Q.; writing—review and editing, X.Q.; visualization, S.H.; supervision, Y.S.; project administration, X.Z.; funding acquisition, X.Q. All authors have read and agreed to the published version of the manuscript.

Funding

The General Project of Science and Technology Plan of Key Technologies and International Standard Development for Intelligent Service Robots (2023YFF0612100).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

X.Q. would like to thank Beijing Research Institute of Automation for Machinery Industry Co., Ltd. for supporting this work.

Conflicts of Interest

Authors Xiugong Qin, Jing Gao, Shilong Huang, Yichen Sun and Xiao Zhong were employed by the company Beijing Research Institute of Automation for Machinery Industry Co., Ltd. Author Fenghu Pan was employed by the company Yanqi Lake Institute of Basic Manufacturing Technology Research Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.-Y. Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv 2020, arXiv:2006.04558. [Google Scholar]
  2. Łańcucki, A. Fastpitch: Parallel text-to-speech with pitch prediction. In Proceedings of the ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 6588–6592. [Google Scholar]
  3. Chen, M.; Tan, X.; Li, B.; Liu, Y.; Qin, T.; Zhao, S.; Liu, T.-Y. Adaspeech: Adaptive text to speech for custom voice. arXiv 2021, arXiv:2103.00993. [Google Scholar] [CrossRef]
  4. Liu, S.; Su, D.; Yu, D. Diffgan-tts: High-fidelity and efficient text-to-speech with denoising diffusion gans. arXiv 2022, arXiv:2201.11972. [Google Scholar]
  5. Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerrv-Ryan, R. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: New York, NY, USA, 2018; pp. 4779–4783. [Google Scholar]
  6. Lim, D.; Jang, W.; Park, H.; Kim, B.; Yoon, J. JDI-T: Jointly trained duration informed transformer for text-to-speech without explicit alignment. arXiv 2020, arXiv:2005.07799. [Google Scholar]
  7. Huang, W.-C.; Hayashi, T.; Wu, Y.-C.; Kameoka, H.; Toda, T. Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining. arXiv 2019, arXiv:1912.06813. [Google Scholar]
  8. Ping, W.; Peng, K.; Gibiansky, A.; Arik, S.O.; Kannan, A.; Narang, S.; Raiman, J.; Miller, J. Deep voice 3: 2000-speaker neural text-to-speech. In Proceedings of the International Conference on Learning Representations, ICLR, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1094–1099. [Google Scholar]
  9. Wang, L.; Yu, Z.; Gao, S.; Mao, C.; Huang, Y. Dets: End-to-end single-stage text-to-speech via hierarchical diffusion gan models. In Proceedings of the ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: New York, NY, USA, 2024; pp. 10916–10920. [Google Scholar]
  10. Song, K.; Zhang, Y.; Lei, Y.; Cong, J.; Li, H.; Xie, L.; He, G.; Bai, J. DSPGAN: A Gan-based universal vocoder for high-fidelity tts by time-frequency domain supervision from dsp. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
  11. Du, C.; Yu, K. Phone-level prosody modelling with gmm-based mdn for diverse and controllable speech synthesis. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 30, 190–201. [Google Scholar] [CrossRef]
  12. Xin, D.; Adavanne, S.; Ang, F.; Kulkarni, A.; Takamichi, S.; Saruwatari, H. Improving speech prosody of audiobook text-to-speech synthesis with acoustic and textual contexts. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
  13. Li, J.; Meng, Y.; Li, C.; Wu, Z.; Meng, H.; Weng, C.; Su, D. Enhancing speaking styles in conversational text-to-speech synthesis with graph-based multi-modal context modeling. In Proceedings of the ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 7917–7921. [Google Scholar]
  14. Zhang, Y.-J.; Song, W.; Yue, Y.; Zhang, Z.; Wu, Y.; He, X. Masked-speech: Context-aware speech synthesis with masking strategy. arXiv 2022, arXiv:2211.06170. [Google Scholar]
  15. Valle, R.; Santos, J.F.; Shih, K.J.; Badlani, R.; Catanzaro, B. High-acoustic fidelity text to speech synthesis with fine-grained control of speech attributes. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
  16. Peng, K.; Ping, W.; Song, Z.; Zhao, K. Non-autoregressive neural text-to-speech. In Proceedings of the 37th International Conference on Machine Learning, PMLR, Online, 13–18 July 2020; pp. 7586–7598. [Google Scholar]
  17. Kim, J.; Kim, S.; Kong, J.; Yoon, S. Glow-tts: A generative flow for text-to-speech via monotonic alignment search. Adv. Neural Inf. Process. Syst. 2020, 33, 8067–8077. [Google Scholar]
  18. Lu, H.; Wu, Z.; Wu, X.; Li, X.; Kang, S.; Liu, X.; Meng, H. Vaenar-tts: Variational auto-encoder based non-autoregressive text-to-speech synthesis. arXiv 2021, arXiv:2107.03298. [Google Scholar]
  19. Guan, W.; Li, T.; Li, Y.; Huang, H.; Hong, Q.; Li, L. Interpretable style transfer for text-to-speech with controlVAE and diffusion bridge. arXiv 2023, arXiv:2306.04301. [Google Scholar]
  20. Yang, D.; Yu, J.; Wang, H.; Wang, W.; Weng, C.; Zou, Y.; Yu, D. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Trans-Actions Audio Speech Lang. Process. 2023, 31, 1720–1733. [Google Scholar] [CrossRef]
  21. Choi, W.-G.; Kim, S.-J.; Kim, T.; Chang, J.-H. Prior-free guided tts: An improved and efficient diffusion-based text-guided speech synthesis. Proc. Interspeech 2023, 2023, 4289–4293. [Google Scholar]
  22. Kim, J.; Kong, J.; Son, J. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 5530–5540. [Google Scholar]
  23. Guo, H.; Lu, H.; Wu, X.; Meng, H. A multi-scale time-frequency spectrogram discriminator for gan-based non-autoregressive tts. arXiv 2022, arXiv:2203.01080. [Google Scholar]
  24. Yang, J.; Bae, J.-S.; Bak, T.; Kim, Y.; Cho, H.-Y. Ganspeech: Adversarial training for high-fidelity multi-speaker speech synthesis. arXiv 2021, arXiv:2106.15153. [Google Scholar]
  25. Deng, Y.; Zhou, L.; Yi, Y.; Liu, S.; He, L. Prosody-aware speecht5 for expressive neural tts. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
  26. Ren, Y.; Ruan, Y.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.-Y. FastSpeech: Fast, Robust and Controllable Text to Speech. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 3165–3174. [Google Scholar]
  27. Oord, A.V.D.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Senior, A.; Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499. [Google Scholar] [CrossRef]
  28. Prenger, R.; Valle, R.; Catanzaro, B. Waveglow: A flow-based generative network for speech synthesis. In Proceedings of the ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: New York, NY, USA, 2019; pp. 3617–3621. [Google Scholar]
  29. Kong, J.; Kim, J.; Bae, J. Hifi-Gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst. 2020, 33, 17022–17033. [Google Scholar]
  30. Casanova, E.; Weber, J.; Shulby, C.D.; Junior, A.C.; Go, E.; Ponti, M.A. Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone. In Proceedings of the 39th International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 2709–2720. [Google Scholar]
  31. Chu, Y.; Xu, J.; Zhou, X.; Yang, Q.; Zhang, S.; Yan, Z.; Zhou, C.; Zhou, J. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv 2023, arXiv:2311.07919. [Google Scholar]
  32. Min, D.; Lee, D.B.; Yang, E.; Hwang, S.J. Meta-style-speech: Multi-speaker adaptive text-to-speech generation. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 7748–7759. [Google Scholar]
  33. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  34. Yang, Z.; Wu, Z.; Jia, J. Speaker characteristics guided speech synthesis. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; IEEE: New York, NY, USA, 2022; pp. 1–8. [Google Scholar]
  35. Pamisetty, G.; Easow, R.A.; Gupta, K.; Murty, K.S.R. Stream-tts: A low-latency text-to-speech using Kolmogorov-Arnold networks for streaming speech applications. In Proceedings of the ICASSP 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: New York, NY, USA, 2025; pp. 1–2. [Google Scholar]
  36. Ellinas, N.; Vamvoukakis, G.; Markopoulos, K.; Chalamandaris, A.; Maniati, G.; Kakoulidis, P.; Raptis, S.; Sung, J.S.; Park, H.; Tsiakoulis, P. High quality streaming speech synthesis with low, sentence-length-independent latency. arXiv 2021, arXiv:2111.09052. [Google Scholar]
  37. Mehta, S.; Tu, R.; Beskow, J.; Székely, E.; Henter, G.E. Matcha-tts: A fast tts architecture with conditional flow matching. In Proceedings of the ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: New York, NY, USA, 2024; pp. 11341–11345. [Google Scholar]
  38. Anastassiou, P.; Chen, J.; Chen, J.; Chen, Y.; Chen, Z.; Chen, Z.; Cong, J.; Deng, L.; Ding, C.; Gao, L.; et al. Seed-tts: A family of high-quality versatile speech generation models. arXiv 2024, arXiv:2406.02430. [Google Scholar]
  39. Kumar, K.; Kumar, R.; De Boissiere, T.; Gestin, L.; Teoh, W.Z.; De Brebisson, A.; Bengio, Y.; Courville, A.C. MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 14881–14892. [Google Scholar]
  40. Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Smolley, S.P. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2794–2802. [Google Scholar]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  42. Ito, K. The Lj Speech Dataset. 2017. Available online: https://keithito.com/LJ-Speech-Dataset/ (accessed on 22 May 2025).
  43. Shi, Y.; Bu, H.; Xu, X.; Zhang, S.; Li, M. Aishell-3: A multi-speaker mandarin tts corpus and the baselines. arXiv 2020, arXiv:2010.11567. [Google Scholar]
  44. Tan, X.; Chen, J.; Liu, H.; Cong, J.; Zhang, C.; Liu, Y.; Wang, X.; Leng, Y.; Yi, Y.; He, L.; et al. Natural speech: End-to-end text-to-speech synthesis with human-level quality. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4234–4245. [Google Scholar] [CrossRef] [PubMed]
  45. Chen, Y.; Niu, Z.; Ma, Z.; Deng, K.; Wang, C.; Zhao, J.; Yu, K.; Chen, X. F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 6255–6271. [Google Scholar]
Figure 1. Overall model framework. MelGT represents the reference Mel spectrogram, while MelRS, MelS, and MelES represent, in order, the output Mel spectrograms of the pre-trained acoustic model, the decoder, and the MEM. WaveGT and WaveRS denote the reference waveform and the waveform converted from MelRS, respectively. SFPGT and SFPS represent the speech feature parameter sequences extracted from WaveGT and WaveRS.
Figure 2. Sensitivity analysis scores of speech features.
Figure 3. Spearman correlation analysis between speech features.
Figure 4. Subfigures (a,b) depict the impact of varying the hyperparameter γ on the model performance with LJSpeech.
Table 1. Comparison evaluation between AirSpeech and other comparison models.
Model | Dataset | SSIM (↑) | MCD (↓) | F0 RMSE (↓) | STOI (↑) | PESQ (↑) | MOS (↑)
FastSpeech2 | LJSpeech | 0.502 | 10.76 | 46.62 | 0.792 | 1.302 | 3.93 ± 0.08
FastSpeech2 | AISHELL3 | 0.465 | 15.84 | 60.68 | 0.705 | 1.343 | 3.82 ± 0.07
FastPitch | LJSpeech | 0.511 | 10.65 | 47.01 | 0.779 | 1.307 | 3.95 ± 0.07
FastPitch | AISHELL3 | 0.461 | 15.47 | 61.62 | 0.711 | 1.349 | 3.79 ± 0.08
AdaSpeech | LJSpeech | 0.541 | 9.91 | 47.61 | 0.781 | 1.315 | 4.04 ± 0.07
AdaSpeech | AISHELL3 | 0.484 | 14.99 | 60.33 | 0.707 | 1.352 | 3.94 ± 0.06
DiffGAN-TTS | LJSpeech | 0.533 | 9.74 | 45.83 | 0.807 | 1.311 | 4.01 ± 0.07
DiffGAN-TTS | AISHELL3 | 0.479 | 14.82 | 60.29 | 0.725 | 1.355 | 3.87 ± 0.06
NaturalSpeech | LJSpeech | 0.565 | 8.50 | 42.50 | 0.895 | 1.350 | 4.56 ± 0.13
NaturalSpeech | AISHELL3 | 0.495 | 14.05 | 58.90 | 0.765 | 1.375 | 4.09 ± 0.07
F5-TTS | LJSpeech | 0.545 | 9.20 | 45.10 | 0.860 | 1.325 | 4.18 ± 0.06
F5-TTS | AISHELL3 | 0.488 | 14.30 | 59.50 | 0.750 | 1.360 | 4.02 ± 0.08
AirSpeech | LJSpeech | 0.558 | 8.76 | 43.96 | 0.885 | 1.336 | 4.27 ± 0.06
AirSpeech | AISHELL3 | 0.501 | 13.74 | 58.17 | 0.781 | 1.388 | 4.13 ± 0.07
Note. Bold indicates the best performance in each column; ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
Table 2. Comparisons of model complexity and inference speed.
Model | Params | RTF
FastSpeech2 | 25.54 M | 0.0058
GANSpeech | 25.54 M | 0.0058
DiffSpeech | 44.43 M | 0.2224
DiffGAN-TTS | 32.81 M | 0.0105
NaturalSpeech | 28.7 M | 0.013
F5-TTS | 335.8 M | 0.15
AirSpeech | 29.32 M | 0.0086
Note. Bold indicates the best performance in each column.
Table 3. AirSpeech ablation evaluation.
Model | LJSpeech SSIM/CMOS | AISHELL3 SSIM/CMOS
AirSpeech | 0.558 / 0 | 0.501 / 0
with MAE | 0.551 / −0.07 | 0.493 / −0.08
with BN | 0.545 / −0.13 | 0.484 / −0.17
w/o SEM | 0.547 / −0.11 | 0.489 / −0.12
