Article

Rate–Distortion–Perception Optimized Neural Speech Transmission System for High-Fidelity Semantic Communications †

1 Key Laboratory of Universal Wireless Communications, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 Department of Broadband Communication, Peng Cheng Laboratory, Shenzhen 518066, China
* Author to whom correspondence should be addressed.
† This paper is an extended version of our paper published in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023.
Sensors 2024, 24(10), 3169; https://doi.org/10.3390/s24103169
Submission received: 19 April 2024 / Revised: 14 May 2024 / Accepted: 15 May 2024 / Published: 16 May 2024
(This article belongs to the Section Communications)

Abstract

We consider the problem of learned speech transmission. Existing methods have exploited joint source–channel coding (JSCC) to encode speech directly into transmitted symbols to improve robustness over noisy channels. However, a fundamental limitation of these methods is that they fail to identify the content diversity across speech frames, leading to inefficient transmission. In this paper, we propose a novel neural speech transmission framework named NST. It can be optimized for superior rate–distortion–perception (RDP) performance toward the goal of high-fidelity semantic communication. In particular, a learned entropy model assesses the latent speech features to quantify the semantic content complexity, which facilitates adaptive transmission rate allocation. NST enables a seamless integration of the source content with channel state information through variable-length joint source–channel coding, which maximizes the coding gain. Furthermore, we present a streaming variant of NST, which adopts causal coding based on sliding windows. Experimental results verify that NST outperforms existing speech transmission methods, including separation-based and JSCC solutions, in terms of RDP performance. Streaming NST achieves low-latency transmission with only a slight quality degradation, which makes it well suited for real-time speech communication.

1. Introduction

The vast demand for streaming audio and video communication poses significant challenges to wireless communication systems, underscoring the need to elevate both the quality and efficiency of speech transmission. Current wireless communication systems suffer from the cliff effect, where the signal reconstruction quality breaks down if the channel quality falls below the level anticipated by the channel code. Learning-based speech transmission methods [1,2,3] are emerging as promising solutions to improve end-to-end transmission performance in the context of semantic communication [4,5,6,7]. They mostly leverage the idea of joint source–channel coding (JSCC) to produce transmitted symbols directly from raw speech signals with neural networks, which features graceful degradation with respect to channel quality [1,2,8]. However, these approaches fail to identify the content diversity among signals, leading to inefficient transmission. Streaming inference is also a fundamental requirement in real-time communication (RTC) scenarios. Although transmission errors can be compensated for by retransmission mechanisms such as hybrid automatic repeat request, these incur a loss of efficiency and additional transmission delay.
To address the above-mentioned issues, we make the first attempt to design a high-fidelity neural speech transmission framework (NST) for better end-to-end transmission performance. Motivated by learned data compression techniques [9,10], NST establishes a learned entropy model on latent speech features and then realizes semantics-guided variable-length joint source–channel coding, thus achieving a better coding gain. Specifically, a set of hyperprior variables is established upon the latent features, which estimates the entropy of the speech features via variational modeling. Under this guidance, the speech latent features are dynamically encoded into variable-length symbol sequences by a joint source–channel encoder. Building on our previous work [11], we further investigate real-time speech transmission in latency-sensitive contexts, such as online conferencing and voice calls. In particular, we develop a streaming variant of NST tailored for low-latency transmission. To satisfy the real-time property, all operators of the model are strictly causal and attend only to past speech signals. In addition, we design a sliding-window-based inference mechanism for joint source–channel coding, which balances the speech reconstruction performance against the overall delay.
We evaluate the performance by conducting simulations over wireless channels. The results demonstrate that the proposed NST model is source- and channel-adaptive. Compared with advanced speech coding combined with error-correction coding, as well as an existing JSCC solution, the proposed NST achieves a superior rate–distortion–perception tradeoff. This translates into high-fidelity speech reconstruction at a lower bandwidth cost. Notably, streaming NST makes only a slight compromise in speech quality to meet the low-latency requirement.
Notational Conventions: Throughout this paper, bold lowercase letters (e.g., $\mathbf{x}$) denote vectors, lowercase letters denote scalars, and bold uppercase letters (e.g., $\mathbf{V}$) denote sets. $\log(\cdot)$ is the logarithm to base 2. $p_x$ denotes the probability density function (pdf) of the continuous-valued random variable $x$. $\mathcal{U}(a-m, a+m)$ denotes a uniform distribution centered on $a$ with width $2m$. $\mathbb{R}$ and $\mathbb{C}$ denote the sets of real and complex numbers, respectively. $\mathbb{E}[\cdot]$ denotes the statistical expectation operator.

2. Methodology

2.1. Architecture

The NST system architecture is illustrated in Figure 1. Given a sequence of $T$ speech frames $\mathbf{x} = [x_1, x_2, \ldots, x_T]$, the analysis transform module $g_a(\cdot; \phi_g)$, which consists of convolutional neural networks (CNNs) with temporal downsampling, transforms them into a semantic latent feature sequence $\mathbf{y} = [y_1, y_2, \ldots, y_T]$. The latent features $\mathbf{y}$ are then fed into both a hyperprior encoder $h_a(\cdot; \phi_h)$ and a variable-length JSCC encoder $f_e(\cdot; \phi_f)$. On one hand, in order to conveniently quantify the amount of information in the speech features, each element of $\mathbf{y}$ is variationally modeled by a simple Gaussian whose parameters are encapsulated by the hyperprior variable $\mathbf{z}$. The means and variances of the Gaussians are produced by $h_a(\cdot; \phi_h)$ and $h_s(\cdot; \theta_h)$ to capture the dependencies within $\mathbf{y}$. On the other hand, $f_e(\cdot; \phi_f)$ encodes $\mathbf{y}$ into the channel-input sequence $\mathbf{s} = [s_1, s_2, \ldots, s_T]$, where $s_i \in \mathbb{C}^{k_i}$ is a $k_i$-dimensional complex vector used to transmit $y_i$. We consider a wireless channel denoted by $W(\cdot; \nu)$, where $\nu$ denotes the channel parameters. Thus, the receiver obtains the sequence $\hat{\mathbf{s}} = W(\mathbf{s}; \nu)$ with transition probability $p_{\hat{\mathbf{s}}|\mathbf{s}}(\hat{\mathbf{s}}|\mathbf{s})$. As illustrated in Figure 1, with a mirrored design, the JSCC decoder $f_d(\cdot; \theta_f)$ reconstructs the latent representation $\hat{\mathbf{y}}$, and the semantic synthesis transform $g_s(\cdot; \theta_g)$ recovers the speech waveform $\hat{\mathbf{x}}$. Hence, the total link of NST is formulated as

$$\mathbf{x} \xrightarrow{g_a(\cdot;\,\phi_g)} \mathbf{y} \xrightarrow{f_e(\cdot;\,\phi_f)} \mathbf{s} \xrightarrow{W(\cdot;\,\nu)} \hat{\mathbf{s}} \xrightarrow{f_d(\cdot;\,\theta_f)} \hat{\mathbf{y}} \xrightarrow{g_s(\cdot;\,\theta_g)} \hat{\mathbf{x}}, \tag{1}$$

with the latent prior $\mathbf{y} \xrightarrow{h_a(\cdot;\,\phi_h)} \mathbf{z} \xrightarrow{h_s(\cdot;\,\theta_h)} (\boldsymbol{\mu}, \boldsymbol{\sigma})$ and $(\boldsymbol{\theta}, \boldsymbol{\phi}) = (\phi_g, \phi_h, \phi_f, \theta_g, \theta_h, \theta_f)$ encapsulating the learnable parameters of the functions above. Moreover, the hyperprior $\mathbf{z}$ can be viewed as side information, which is optionally sent to the receiver via a digital link to refine the latent features $\mathbf{y}$.
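To make the dataflow in (1) concrete, the following minimal PyTorch sketch traces tensor shapes through toy stand-ins for each module. The frame length, latent channels, and downsampling factor roughly follow the values reported later in Section 3.1; the kernel sizes and the identity-plus-noise stand-ins for $f_e$, $W$, and $f_d$ are illustrative assumptions rather than the actual NST modules.

```python
import torch
import torch.nn as nn

T, L, C_g = 16, 128, 4                 # frames, samples per frame, latent channels

g_a = nn.Conv1d(1, C_g, kernel_size=8, stride=4, padding=2)            # analysis transform
g_s = nn.ConvTranspose1d(C_g, 1, kernel_size=8, stride=4, padding=2)   # synthesis transform
h_a = nn.Conv1d(C_g, 2, kernel_size=3, padding=1)                      # hyperprior encoder
h_s = nn.Conv1d(2, 2 * C_g, kernel_size=3, padding=1)                  # hyperprior decoder

x = torch.randn(T, 1, L)               # T speech frames
y = g_a(x)                             # latent features, shape (T, C_g, L/4)
z = h_a(y)                             # hyperprior (optional side information)
mu, sigma = h_s(z).chunk(2, dim=1)     # Gaussian parameters for the entropy model

# The variable-length JSCC codec and the channel are reduced to an identity
# mapping plus AWGN here, purely to show the end-to-end dataflow of Eq. (1).
s = y.flatten(1)                       # stand-in for f_e(y; phi_f)
s_hat = s + 0.1 * torch.randn_like(s)  # stand-in for the channel W(s; nu)
y_hat = s_hat.view_as(y)               # stand-in for f_d(s_hat; theta_f)
x_hat = g_s(y_hat)                     # reconstructed speech frames

print(y.shape, z.shape, x_hat.shape)   # (16, 4, 32), (16, 2, 32), (16, 1, 128)
```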

2.2. Dynamic Variable-Length Joint Source–Channel Coding

As defined previously, each $y_i$ is variationally modeled as a Gaussian with mean $\mu_i$ and variance $\sigma_i^2$, whose density function is factorized as

$$p_{\mathbf{y}|\mathbf{z}}(\mathbf{y}\,|\,\mathbf{z}; \theta_h, \psi_h) = \prod_i \underbrace{\left( \mathcal{N}(\mu_i, \sigma_i^2) * \mathcal{U}\!\left(-\tfrac{1}{2}, +\tfrac{1}{2}\right) \right)}_{p_{y_i|\mathbf{z}}}(y_i), \tag{2}$$

with $(\boldsymbol{\mu}, \boldsymbol{\sigma}) = h_s(\mathbf{z})$, where $*$ denotes convolution. Dithered quantization is adopted [12], such that we can derive a non-negative entropy estimate $-\log p_{\mathbf{y}|\mathbf{z}}(\mathbf{y}\,|\,\mathbf{z})$ by directly using the proxy $\tilde{y}_i = y_i + o$, $o \sim \mathcal{U}\!\left(-\tfrac{1}{2}, +\tfrac{1}{2}\right)$. The estimated entropy is directly linked to the channel bandwidth cost in the JSCC encoder for transmission. Intuitively, if $y_i$ is tagged with high entropy, it will be allocated more bandwidth, and vice versa.
In practice, the total bandwidth cost $K_y$ for transmitting $\mathbf{y}$ is formulated as

$$K_y = \sum_{i=1}^{T} \bar{k}_{y_i} = \sum_{i=1}^{T} Q(k_{y_i}) = \sum_{i=1}^{T} Q\!\left(-\eta_y \log p_{y_i|\mathbf{z}}(y_i\,|\,\mathbf{z})\right), \tag{3}$$

where $\eta_y$ controls the scaling between the estimated entropy and the number of transmitted symbols, and $Q(\cdot)$ denotes a $2^n$-level scalar quantization with the quantized value set $\mathcal{V} = \{v_1, v_2, \ldots, v_{2^n}\}$. Hence, $n$ bits are transmitted as side information to inform the receiver which $\bar{k}_{y_i} \in \mathcal{V}$ is selected for transmitting $y_i$.
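As a numerical illustration of (2) and (3), the sketch below (a toy calculation, not the learned allocation module) evaluates the per-frame entropy under the Gaussian-convolved-with-uniform model and snaps the scaled value to the nearest level in $\mathcal{V}$; the $\eta_y$ and $\mathcal{V}$ values are borrowed from Section 3.1, and SciPy is assumed to be available.

```python
import numpy as np
from scipy.stats import norm

eta_y = 0.16                                          # entropy-to-symbols scaling
V = np.array([10, 40, 90, 120, 200, 250, 300, 400])   # quantized bandwidth levels (Section 3.1)

def frame_entropy_bits(y, mu, sigma):
    """-log2 p(y_i | z): Gaussian N(mu, sigma^2) convolved with U(-1/2, +1/2)."""
    p = norm.cdf(y + 0.5, loc=mu, scale=sigma) - norm.cdf(y - 0.5, loc=mu, scale=sigma)
    return -np.log2(np.clip(p, 1e-9, 1.0)).sum()

def allocate_bandwidth(y_frames, mu_frames, sigma_frames):
    k_bar = []
    for y, mu, sigma in zip(y_frames, mu_frames, sigma_frames):
        k = eta_y * frame_entropy_bits(y, mu, sigma)  # target symbol count for this frame
        k_bar.append(int(V[np.abs(V - k).argmin()]))  # Q(.): snap to the nearest level in V
    return k_bar, sum(k_bar)                          # per-frame allocation and total K_y
```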
We adopt a Transformer-like [13] JSCC encoder and decoder pair as $f_e$ and $f_d$, as depicted in Figure 2. Guided by the entropy model $p_{\mathbf{y}|\mathbf{z}}(\mathbf{y}\,|\,\mathbf{z})$, a set of learnable rate token embeddings with the same dimension as $y_i$ is developed, each of which corresponds to a value in $\mathcal{V}$. To adapt to various channel environments, we assume channel state information feedback that informs the sender of the instantaneous signal-to-noise ratio (SNR), and a set of learnable SNR tokens is developed likewise. $T$ frames of speech features are gathered, fused with their respective rate tokens and an SNR token, and finally fed into the Transformer block with $N_e$ Transformer layers. A set of fully connected (FC) layers with output dimensions of $v_q$, $q = 1, 2, \ldots, 2^n$, is employed to map the embeddings into $s_i$ with the given dimensions. A toy visualization of the rate allocation result is displayed in Figure 3: more bandwidth is allocated to frames with prominent content and less to silent ones. The overall bandwidth is adjusted by tuning the hyperparameter $\eta_y$.
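The rate/SNR token fusion and the rate-specific FC heads described above can be sketched as follows; the embedding-table layout, the SNR quantization grid, and the packing of each head's $2v$ real outputs into $v$ complex symbols are assumptions made for illustration, and `transformer` is any stand-in callable (e.g., `nn.Identity()` for a shape check).

```python
import torch
import torch.nn as nn

d_model = 128
V = [10, 40, 90, 120, 200, 250, 300, 400]                       # quantized bandwidth levels
snr_bins = list(range(0, 20, 2))                                # assumed SNR quantization grid

rate_tokens = nn.Parameter(torch.randn(len(V), d_model))        # one learnable token per rate level
snr_tokens = nn.Parameter(torch.randn(len(snr_bins), d_model))  # one learnable token per SNR bin
heads = nn.ModuleList([nn.Linear(d_model, 2 * v) for v in V])   # 2*v reals -> v complex symbols

def jscc_encode(y_emb, rate_idx, snr_idx, transformer):
    """y_emb: (T, d_model) frame embeddings; rate_idx: list of T indices into V."""
    fused = y_emb + rate_tokens[rate_idx] + snr_tokens[snr_idx]  # token fusion
    h = transformer(fused)                                       # N_e Transformer layers
    symbols = []
    for t, q in enumerate(rate_idx):
        out = heads[q](h[t])                                     # FC head matching the chosen rate
        symbols.append(torch.complex(*out.chunk(2)))             # k_i complex channel symbols
    return symbols

# Shape check: 16 frames, all assigned the 90-symbol level (index 2), SNR bin 3.
s = jscc_encode(torch.randn(16, d_model), [2] * 16, 3, nn.Identity())
```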

2.3. Streaming NST for Real-Time Communication

In this subsection, we propose a streaming NST model variant to facilitate real-time speech communication.
Firstly, all the convolutional operators in $g_a$ and $g_s$ are substituted with causal ones, and the transposed convolutions in $g_s$ pad only on past steps to preserve causality. Secondly, the joint source–channel coding of the speech latent features is modified from that in Figure 2 to reduce the latency. Traditionally, the Transformer uses multi-head attention that jointly learns diverse relationships between queries and keys, which are the speech features $\mathbf{y}$ in this paper, from different representation subspaces, with the $j$-th head computing

$$Q_j = W_j^{Q}\mathbf{y}, \quad K_j = W_j^{K}\mathbf{y}, \quad V_j = W_j^{V}\mathbf{y}. \tag{4}$$
To meet the real-time requirement, a causal masked attention method is proposed together with a sliding-window inference mechanism. As shown in Figure 4, the JSCC encoder $f_e$ has a limited contextual window of $W$ frames, which hops along the temporal domain with a stride of $N$ frames. In particular, we follow Transformer-XL [14] and create a segment-level recurrence over the outputs of the intermediate layers. In this paper, we define the attention span of each layer as $2N$ frames, ending with the last frame of the current $N$ target frames. The self-attention is computed after a causal mask $M$ is applied, whose elements satisfy

$$M_{t,\tau} = \begin{cases} 1, & t - 2N < \tau \le t, \\ -\infty, & \text{otherwise}. \end{cases} \tag{5}$$
Then, the output of the $j$-th self-attention head $a_j$ is formulated as

$$a_j = \mathrm{Softmax}\!\left(\frac{Q_j K_j^{\mathsf{T}}}{\sqrt{d_h}} \odot M\right) V_j, \tag{6}$$

where $d_h$ is the dimension of each head. Thus, the length of the contextual window $W$ grows linearly with both the number of Transformer layers and the window stride, and can be written as $W = N(N_e + 1)$ frames.
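A minimal NumPy sketch of the band-limited causal mask in (5) and the masked attention in (6) is given below; it uses a single head and the additive $\{0, -\infty\}$ form of the mask, which affects the softmax in the same way as the formulation above.

```python
import numpy as np

def band_causal_mask(T, N):
    """Frame t may attend to frames tau with t - 2N < tau <= t (attention span 2N)."""
    t = np.arange(T)[:, None]
    tau = np.arange(T)[None, :]
    allowed = (tau <= t) & (tau > t - 2 * N)
    return np.where(allowed, 0.0, -np.inf)                       # additive mask: 0 keeps, -inf removes

def masked_self_attention(Q, K, V, N):
    d_h = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_h) + band_causal_mask(len(Q), N)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))      # row-wise stable softmax
    return (w / w.sum(axis=-1, keepdims=True)) @ V

# Example: T = 12 frames, stride N = 3, so each frame sees at most the past 2N = 6 frames.
T, d_h, N = 12, 16, 3
Q = K = V = np.random.randn(T, d_h)
out = masked_self_attention(Q, K, V, N)
```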

2.4. Optimization Goal

The analysis transform, together with the joint source–channel encoder, creates a parametric density $q_{\hat{\mathbf{s}}, \tilde{\mathbf{z}}|\mathbf{x}}$ to approximate the true posterior distribution $p_{\hat{\mathbf{s}}, \tilde{\mathbf{z}}|\mathbf{x}}$. The optimization goal is to minimize the Kullback–Leibler (KL) divergence between these two distributions. After reformulation, this amounts to minimizing its upper bound, i.e.,

$$\min \; \mathbb{E}_{\mathbf{x} \sim p_{\mathbf{x}}}\, D_{\mathrm{KL}}\!\left( q_{\hat{\mathbf{s}}, \tilde{\mathbf{z}}|\mathbf{x}} \,\middle\|\, p_{\hat{\mathbf{s}}, \tilde{\mathbf{z}}|\mathbf{x}} \right) \;\Rightarrow\; \min \; \mathbb{E}_{\mathbf{x} \sim p_{\mathbf{x}}}\, \mathbb{E}_{\hat{\mathbf{s}}, \tilde{\mathbf{z}} \sim q_{\hat{\mathbf{s}}, \tilde{\mathbf{z}}|\mathbf{x}}} \Big[ \underbrace{-\log p_{\tilde{\mathbf{z}}}(\tilde{\mathbf{z}})}_{\text{side info. coding rate}} \underbrace{-\log p_{\hat{\mathbf{s}}|\tilde{\mathbf{z}}}(\hat{\mathbf{s}}\,|\,\tilde{\mathbf{z}})}_{\text{bandwidth}} \underbrace{-\, \mathbb{E}_{\mathbf{y} \sim p_{\mathbf{y}|\hat{\mathbf{s}}, \tilde{\mathbf{z}}}} \log p_{\mathbf{x}|\mathbf{y}}(\mathbf{x}\,|\,\mathbf{y})}_{\text{distortion}} \Big] + \mathrm{const}. \tag{7}$$

The first term of (7) represents the cost of encoding the side information, assuming $p_{\tilde{\mathbf{z}}}$ as its entropy model, where $\tilde{z}_i = z_i + o$ is the proxy quantization of $z_i$. Since there is no prior information about $\mathbf{z}$, $p_{\tilde{\mathbf{z}}}(\tilde{\mathbf{z}})$ is modeled as a non-parametric, fully factorized density [9], $p_{\tilde{\mathbf{z}}}(\tilde{\mathbf{z}}) = \prod_i \left( p_{z_i|\psi^{(i)}}(\cdot\,|\,\psi^{(i)}) * \mathcal{U}\!\left(-\tfrac{1}{2}, \tfrac{1}{2}\right) \right)(\tilde{z}_i)$. The second term represents the bandwidth cost of transmitting $\hat{\mathbf{s}}$. In practice, the intermediate variable $\mathbf{y}$ is utilized via $p_{\hat{\mathbf{s}}|\tilde{\mathbf{z}}} = W(p_{\mathbf{s}|\tilde{\mathbf{z}}}\,|\,h) = W(f_e(p_{\mathbf{y}|\tilde{\mathbf{z}}})\,|\,h)$. The third term denotes the weighted distortion of the reconstructed speech waveform, where $d(\cdot, \cdot)$ indicates the objective signal distortion. To enrich the distortion term in alignment with human perceptual quality, a differentiable perceptual feature extractor $F(\cdot)$ is employed, and the distance between perceptual features $d_p(\cdot, \cdot)$ is minimized to improve the listening quality.
In summary, the RDP loss function is formulated as

$$\mathcal{L}_{\mathrm{RDP}}(\boldsymbol{\theta}, \boldsymbol{\phi}, \boldsymbol{\psi}) = \mathbb{E}_{\mathbf{x} \sim p_{\mathbf{x}}}\!\left[ -\eta_y \log p_{\mathbf{y}|\mathbf{z}}(\mathbf{y}\,|\,\mathbf{z}) - \eta_z \log p_{\tilde{\mathbf{z}}}(\tilde{\mathbf{z}}) + \lambda_D\, d(\mathbf{x}, \hat{\mathbf{x}}) + \lambda_P\, d_p\big(F(\mathbf{x}), F(\hat{\mathbf{x}})\big) \right], \tag{8}$$

where the Lagrange multipliers $\lambda_D$ and $\lambda_P$ control the tradeoff among the total transmission rate, the distortion, and the perceptual quality. The scaling factor $\eta_y$ is adjusted for the RDP tradeoff, while $\eta_z$ is determined according to the channel capacity of the optional transmission link.
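The loss in (8) can be assembled as in the following sketch, assuming the two rate terms ($-\log p_{\mathbf{y}|\mathbf{z}}$ and $-\log p_{\tilde{\mathbf{z}}}$) are supplied by the entropy models and that `feat_fn` is a differentiable perceptual feature extractor such as an MFCC front end (Section 3.1); the default hyperparameter values are placeholders, not the trained configuration.

```python
import torch.nn.functional as F

def rdp_loss(rate_y_bits, rate_z_bits, x, x_hat, feat_fn,
             eta_y=0.16, eta_z=1.0, lam_d=1.0, lam_p=0.1):
    """Weighted sum of the four terms in (8): rate, side-info rate, distortion, perception."""
    distortion = F.mse_loss(x_hat, x)                     # d(x, x_hat), time-domain MSE
    perceptual = F.mse_loss(feat_fn(x_hat), feat_fn(x))   # d_p(F(x), F(x_hat))
    return (eta_y * rate_y_bits + eta_z * rate_z_bits
            + lam_d * distortion + lam_p * perceptual)
```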

3. Results

In this section, we provide numerical results in terms of objective quality metrics and subjective scores to evaluate the quality of speech transmission.

3.1. Experimental Setup

The mono speech signals are sampled at 16 kHz from the TIMIT dataset [15]. Compared to our conference paper [11], a shorter frame length is considered here to adapt to RTC scenarios. Each speech frame has $L = 128$ samples with an overlap of eight samples. The analysis transform module $g_a$ and the synthesis transform module $g_s$ consist of stacks of 1D convolutional layers with residual connections. The number of channels of the output/input convolutional layer of $g_a$/$g_s$ is configured as $C_g = 4$, while that of $h_a$/$h_s$ is set to $C_h = 2$. In the variable-length JSCC encoder $f_e$ and decoder $f_d$, we use $N_e = 3$ Transformer layers with eight-head self-attention. The quantized channel bandwidth cost value set is defined as $\mathcal{V} = \{10, 40, 90, 120, 200, 250, 300, 400\}$. Each speech frame $x_i \in \mathbb{R}^{1 \times L}$ is transformed into a latent feature $y_i \in \mathbb{R}^{C_g \times \frac{L}{4}}$ with a downsampling factor of four. It is then flattened into an embedding vector with a dimension of $C_g \frac{L}{4} = 128$, which is identical to the Transformer dimension in the JSCC coders.
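A quick sanity check of the dimensions above (a trivial calculation, shown only to make the flattening step explicit):

```python
L, C_g, downsample = 128, 4, 4
frame_embedding_dim = C_g * (L // downsample)   # 4 * 32 = 128
assert frame_embedding_dim == 128               # matches the Transformer width of f_e / f_d
```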
In (8), the objective signal distortion $d$ is evaluated by the mean square error in the time domain. For perceptual optimization, we minimize the difference between Mel-frequency cepstral coefficients (MFCCs) [16], a hand-crafted speech perceptual feature. Specifically, a mean square loss function $d_p$ over the MFCCs is employed, where $F$ denotes the MFCC extractor.
We compare our NST model with traditional separation-based transmission schemes. Specifically, we employ the widely used speech codecs AMR-WB [17] and Opus [18] for source coding, convolutional codes and 5G LDPC codes [19] for channel coding, and follow the principle of adaptive modulation and coding (AMC) [20]. Moreover, we also compare our NST model with another JSCC model for speech transmission, DeepSC-S [1], which is a non-streaming model built from CNN modules. We modify its model to support lower-bandwidth transmission at 12 kHz and 32 kHz, separately.

3.2. Evaluation Metrics

In terms of objective metrics for perceptual quality, we report perceptual evaluation of speech quality (PESQ) [21] scores, which range from 1.0 to 4.5. Furthermore, we conduct a Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) subjective test [22] for human preference evaluation. As a widely used subjective quality assessment method, the MUSHRA test allows listeners to compare multiple variants of reconstructed audio and provides relative scores between 0 and 100. We randomly select 10 speech segments from the test set.

3.3. Results Analysis

Figure 5 reports the PESQ performance over additive white Gaussian noise (AWGN) channels. In Figure 5a, with a fixed channel bandwidth cost of $K$ = 10 kHz, we find that the proposed NST brings a performance gain at all SNRs by incorporating source and channel information into JSCC, especially in the low SNR region. In addition to traditional speech source coding methods, we also compare with a nonlinear neural speech compressor that employs a similar entropy model to that of NST to entropy-code the latent speech features. This scheme is marked in the figure as “NTC + QPSK, 1/2 Conv” when using convolutional codes with a rate of 1/2 and QPSK modulation, and as “NTC + 5G LDPC” when using LDPC codes. NST demonstrates graceful performance degradation as the SNR decreases, whereas the performance of separation-based methods breaks down (the cliff effect) when a single channel coding rate and modulation level is used, e.g., “NTC + QPSK, 1/2 Conv”. For the 5G LDPC scheme, we plot the envelope of several curves corresponding to different coding rates and modulation levels. Compared with the JSCC method DeepSC-S, our model achieves better perceptual quality at a much lower bandwidth cost by introducing an explicit perceptual loss function. In addition, we notice a slight quality drop under the streaming inference setting, but it remains better than the other methods. NST adapts well to various channel conditions using a single model by means of SNR token fusion in $f_e$ and $f_d$.
Figure 5b compares the rate–distortion–perception performance of the different methods over the 6 dB AWGN channel. Since the NST model learns an adaptive rate allocation mechanism, we traverse $\eta_y$ from 0.1 to 0.3 and finetune the model with fixed $\lambda_D$ and $\lambda_P$. It can be observed that NST accomplishes a remarkable bandwidth saving by integrating source semantic information as well as channel information.
To delineate the distortion–perception tradeoff, we conduct an ablation study examining the impact of perceptual optimization. We evaluate the signal-to-distortion ratio (SDR) to assess the traditional signal distortion. The results in Figure 6 demonstrate that the proposed NST using the RDP optimization objective (8) outperforms its counterpart solely optimized toward reduced signal distortion in terms of perceptual quality. NST with rate–distortion (RD) optimization (omitting the perceptual loss in (8)) exhibits inferior perceptual quality despite achieving lower objective signal distortion. The performance of the traditional speech coding method is also included in the figure, which further underscores the significance of perceptual optimization in addition to minimizing the objective signal distortion.
Figure 7 displays the effect of SNR fusion in our SNR-adaptive joint source–channel coding. It can be observed that the PESQ-SNR curve of the proposed NST trained under multiple SNRs with SNR token fusion closely approximates the envelope of the curves obtained from models trained using single SNR values.
We additionally carry out experiments on the widely used COST2100 fading channel [23] to verify the robustness of the NST model; Figure 8 shows the results. With feedback of the average SNR and SNR token fusion, our model adapts well to the channel states, whereas the performance of DeepSC-S is evaluated on models trained at multiple SNRs. At lower bandwidths in Figure 8, NST also shows better transmission efficiency than the traditional methods.
The subjective user rating results in Figure 9 verify that the proposed NST recovers perceptually satisfying speech over the 6 dB AWGN channel, even while consuming much less bandwidth than the separation-based speech coding methods. Compared to DeepSC-S, which is optimized only for lower distortion, the RDP-optimized NST achieves semantics-guided dynamic rate allocation, thus greatly improving the end-to-end system gain. The perceptual quality of streaming NST exhibits no substantial degradation compared to the non-streaming one, which is of practical value in RTC scenarios.

3.4. Discussion on the Quality–Latency Tradeoff

In terms of streaming NST, we investigate the tradeoff between perceptual quality and transmission latency. As defined previously, each frame of speech features $y_i$ accounts for 8 milliseconds (ms) of a 16 kHz signal.
Table 1 shows the tradeoff between speech quality and transmission delay, which consists of the encoding and decoding time (runtime) and the waiting latency. In sliding-window-based inference, a longer stride increases the latency, since the encoder must wait for future frames to arrive before it can collect all the features belonging to the same window.
We also compare the PESQ performance versus the window stride across different SNRs and bandwidth costs. The results in Figure 10 verify that a longer window stride, and hence a longer contextual window in JSCC, consistently yields a better coding gain across different transmission conditions, at the cost of a longer delay according to Table 1. Except in this subsection, the performance of streaming NST is reported with a stride of $N = 3$ and a total delay of less than 100 ms, which satisfies the real-time property while ensuring high-quality speech restoration. The runtime is evaluated on an Intel(R) Core i9-12900K CPU (Intel Corporation, Santa Clara, CA, USA). Table 2 presents a model complexity comparison in terms of both computational complexity (measured in giga floating point operations, GFLOPs) and space complexity. Owing to the tiny Transformer employed in the joint source–channel encoder, our model is comparatively lightweight and computationally efficient. Extra measures for accelerating inference may be taken to facilitate lightweight deployment on resource-limited devices.
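The numbers in Table 1 are consistent with a simple accounting, sketched below, in which the waiting latency equals the window stride times the 8 ms frame duration and the total delay adds the measured runtime (runtime values taken from Table 1):

```python
frame_ms = 8                                        # each latent frame covers 8 ms of audio
runtime_ms = {2: 51.1, 3: 59.2, 5: 72.7, 7: 84.1}   # measured encode/decode time (Table 1)

for stride, runtime in runtime_ms.items():
    max_latency = stride * frame_ms                 # wait for a full window hop of N frames
    total_delay = runtime + max_latency
    print(f"stride {stride}: latency {max_latency} ms, total delay {total_delay:.1f} ms")
# stride 3 -> 24 ms latency and 83.2 ms total delay, matching Table 1.
```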

4. Conclusions

In this paper, we present NST, a novel neural speech transmission framework. The model features dynamic rate allocation for variable-length JSCC, guided by the variational modeling of the speech latent features. It shows good adaptability to varying channel conditions through channel information fusion in JSCC. A streaming variant of NST is also designed for RTC. Simulation results verify that the proposed method consumes a much lower bandwidth cost than classical methods while achieving similar perceptual performance, highlighting NST's potential for high-efficiency and high-fidelity speech transmission in the realm of semantic communication.

Author Contributions

Conceptualization, S.Y.; Methodology, S.Y. and Z.X.; Validation, S.Y. and Z.X.; Formal analysis, S.Y. and Z.X.; Writing—original draft preparation, S.Y. and Z.X.; Writing—review and editing, S.Y.; Visualization, S.Y. and Z.X.; Supervision, K.N.; Project administration, K.N.; Funding acquisition, S.Y. and K.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China under Grant 92267301 and Grant 62071058 and in part by the BUPT Excellent Ph.D. Students Foundation under Grant CX2023305.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Weng, Z.; Qin, Z. Semantic communication systems for speech transmission. IEEE J. Sel. Areas Commun. 2021, 39, 2434–2444. [Google Scholar] [CrossRef]
  2. Han, T.; Yang, Q.; Shi, Z.; He, S.; Zhang, Z. Semantic-preserved communication system for highly efficient speech transmission. IEEE J. Sel. Areas Commun. 2022, 41, 245–259. [Google Scholar] [CrossRef]
  3. Guo, J.; Zhang, Y.; Liu, C.; Xu, W.; Bie, Z. SNR-Adaptive Multi-Layer Semantic Communication for Speech. In Proceedings of the 2023 IEEE 34th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Toronto, ON, Canada, 5–8 September 2023; pp. 1–6. [Google Scholar] [CrossRef]
  4. Qin, Z.; Tao, X.; Lu, J.; Tong, W.; Li, G.Y. Semantic communications: Principles and challenges. arXiv 2021, arXiv:2201.01389. [Google Scholar]
  5. Dai, J.; Zhang, P.; Niu, K.; Wang, S.; Si, Z.; Qin, X. Communication beyond transmitting bits: Semantics-guided source and channel coding. IEEE Wirel. Commun. 2023, 30, 170–177. [Google Scholar] [CrossRef]
  6. Xu, J.; Tung, T.Y.; Ai, B.; Chen, W.; Sun, Y.; Gündüz, D.D. Deep joint source-channel coding for semantic communications. IEEE Commun. Mag. 2023, 61, 42–48. [Google Scholar] [CrossRef]
  7. Lu, Z.; Li, R.; Lu, K.; Chen, X.; Hossain, E.; Zhao, Z.; Zhang, H. Semantics-empowered communications: A tutorial-cum-survey. IEEE Commun. Surv. Tutor. 2023, 26, 41–79. [Google Scholar] [CrossRef]
  8. Bourtsoulatze, E.; Kurka, D.B.; Gündüz, D. Deep joint source-channel coding for wireless image transmission. IEEE Trans. Cogn. Commun. Netw. 2019, 5, 567–579. [Google Scholar] [CrossRef]
  9. Ballé, J.; Laparra, V.; Simoncelli, E.P. End-to-end optimized image compression. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  10. Ballé, J.; Minnen, D.; Singh, S.; Hwang, S.J.; Johnston, N. Variational image compression with a scale hyperprior. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  11. Xiao, Z.; Yao, S.; Dai, J.; Wang, S.; Niu, K.; Zhang, P. Wireless deep speech semantic transmission. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  12. Schuchman, L. Dither signals and their effect on quantization noise. IEEE Trans. Commun. Technol. 1964, 12, 162–165. [Google Scholar] [CrossRef]
  13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017): 31st Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  14. Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.G.; Le, Q.; Salakhutdinov, R. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2978–2988. [Google Scholar]
  15. Garofolo, J.S. Timit Acoustic Phonetic Continuous Speech Corpus; Linguistic Data Consortium: Philadelphia, PA, USA, 1993. [Google Scholar]
  16. Muda, L.; Begam, M.; Elamvazuthi, I. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv 2010, arXiv:1003.4083. [Google Scholar]
  17. Bessette, B.; Salami, R.; Lefebvre, R.; Jelinek, M.; Rotola-Pukkila, J.; Vainio, J.; Mikkola, H.; Jarvinen, K. The adaptive multirate wideband speech codec (AMR-WB). IEEE Trans. Speech Audio Process. 2002, 10, 620–636. [Google Scholar] [CrossRef]
  18. Valin, J.M.; Vos, K.; Terriberry, T. Definition of the Opus Audio Codec, Technical Report. 2012. Available online: https://www.rfc-editor.org/rfc/pdfrfc/rfc6716.txt.pdf (accessed on 1 July 2022).
  19. Ryan, W.; Lin, S. Channel Codes: Classical and Modern; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
  20. Peng, F.; Zhang, J.; Ryan, W.E. Adaptive modulation and coding for IEEE 802.11n. In Proceedings of the 2007 IEEE Wireless Communications and Networking Conference, Hong Kong, 11–15 March 2007; pp. 656–661. [Google Scholar]
  21. ITU-T. Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs; International Telecommunication Union: Geneva, Switzerland, 2001. [Google Scholar]
  22. BS Series. Method for the Subjective Assessment of Intermediate Quality Level of Audio Systems; International Telecommunication Union: Geneva, Switzerland, 2014. [Google Scholar]
  23. Liu, L.; Oestges, C.; Poutanen, J.; Haneda, K.; Vainikainen, P.; Quitin, F.; Tufvesson, F.; De Doncker, P. The COST 2100 MIMO channel model. IEEE Wirel. Commun. 2012, 19, 92–99. [Google Scholar] [CrossRef]
Figure 1. The architecture of the Neural Speech Transmission system (NST).
Figure 2. The pipeline of the variable-length joint source–channel coding (JSCC) via $f_e$ and $f_d$. FC denotes fully connected layers.
Figure 3. Visualization of the rate allocation along the temporal domain. (a) Bandwidth $K$ = 10 kHz with $\eta_y$ = 0.16. (b) Bandwidth $K$ = 4 kHz with $\eta_y$ = 0.105.
Figure 4. Streaming joint source–channel encoding for real-time inference. It encodes the latent features of $N$ frames into transmitted symbols in each inference, e.g., $y_{t-N+1}, \ldots, y_t$ for the blue window in the figure, and then the ones in orange in the next inference.
Figure 5. Perceptual evaluation of speech quality (PESQ) performance over additive white Gaussian noise (AWGN) channel. (a) PESQ scores versus signal-to-noise ratio (SNR). The bandwidth of all methods K is 10 kHz, except those of DeepSC-S are 12 kHz and 32 kHz (yellow lines). (b) PESQ scores versus channel bandwidth cost when SNR = 6 dB.
Figure 6. Distortion–perception tradeoff using different optimization objectives with a 9 kHz channel bandwidth cost over the AWGN channel. (a) PESQ for assessing perceptual quality. (b) Signal-to-distortion ratio (SDR) for assessing signal distortion.
Figure 7. Effect of SNR token fusion in joint source–channel coding.
Figure 8. PESQ performance over COST2100 fading channel. (a) PESQ scores versus average SNR. (b) PESQ scores versus channel bandwidth cost.
Figure 9. MUSHRA scores evaluated under the 6 dB AWGN channel. Audio samples are available at https://ximoo123.github.io/NSTSpeech (accessed on 1 March 2024).
Figure 10. PESQ performances using different strides N over AWGN channels. (a) PESQ scores versus SNR. (b) PESQ scores versus bandwidth.
Table 1. Quality–delay tradeoff for the streaming NST model tested with SNR = 10 dB over the AWGN channel.

Stride   PESQ   Total Delay   Runtime   Maximum Latency
2        4.09   67.1 ms       51.1 ms   16 ms
3        4.13   83.2 ms       59.2 ms   24 ms
5        4.15   112.7 ms      72.7 ms   40 ms
7        4.17   140.1 ms      84.1 ms   56 ms
Table 2. Model complexity comparison.

Model          GFLOPs   #Params (Unit: Million)
[3]            >31      >106
DeepSC-S [1]   7.60     0.24
NST (Ours)     9.87     2.49
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
