1. Introduction
Recent advancements in deep learning have revolutionized wireless communication, and the concept of semantic communication has gained increasing attention as a fundamental shift from Shannon's [1] bit-accurate transmission to meaning-aware information exchange. This evolution aligns with the growing demands of IoT ecosystems, where heterogeneous devices, from industrial sensors to voice-enabled edge nodes, require adaptive communication frameworks that balance efficiency, intelligibility, and interoperability. Unlike conventional approaches, which primarily aim to recover transmitted symbols, semantic communication systems prioritize the preservation of meaning and intelligibility. As communication networks evolve toward 6G and beyond, new paradigms that prioritize semantic meaning over exact bit recovery have emerged. This shift is particularly important for bandwidth-limited and noisy environments, where reconstructing the underlying semantic content can be more beneficial than exact symbol recovery. Such environments are ubiquitous in IoT applications, ranging from smart factories with high electromagnetic interference to urban deployments with dense wireless traffic.
The transition from technical-level communication (Level A) to semantic-level communication (Level B) was first theorized by Weaver [2], who proposed that communication systems should not only transmit symbols accurately but also preserve their intended meaning. Several studies have attempted to formalize semantic information theory, laying the groundwork for modern AI-driven communication paradigms. Bao et al. [3] extend Shannon's classic information theory (CIT) [1] to incorporate semantic-level communication that goes beyond bit-accurate transmission to meaning-aware transmission. They introduce a model-theoretic approach to semantic data compression and reliable transmission. Niu and Zhang [4] formalize semantic information processing by defining entropy, compression, and transmission limits. Shao et al. [5] propose a structured approach to defining semantic information, semantic noise, and the fundamental limits of semantic coding.
Recent advances have demonstrated the significance of deep learning (DL) for physical-layer (PHY) communications. These perspectives lead to end-to-end learned transmission models, adaptive modulation techniques, and AI-driven semantic communication. Unlike conventional rule-based systems, deep learning facilitates global optimization across transmitter, channel, and receiver components, thereby paving the way for semantic-aware communication models. O'Shea and Hoydis [6] introduced the concept of autoencoder-based communication, where a transmitter, channel, and receiver are jointly optimized as a deep neural network. Their work shows that deep learning can substitute traditional encoding, modulation, and decoding blocks, offering improved performance under complex channel conditions. Qin et al. [7] investigated the potential of deep neural networks (DNNs) to enhance modulation, signal compression, and detection processes.
Early works such as DeepSC have demonstrated the potential of end-to-end deep learning-based semantic communication models [8], proposing a system that performs joint semantic-channel coding for text transmission using neural networks (NNs). Furthermore, DeepSC-S extends semantic communication to speech signals [9]. The utilization of semantic speech information has been shown to facilitate more efficient transmission compared to conventional systems. Specifically, the NN-based joint design of speech coding and channel coding has been developed to facilitate the learning and extraction of essential speech information. L-DeepSC [10] has been introduced as a lightweight alternative to DeepSC, with the objective of reducing computational complexity for Internet of Things (IoT) applications by employing model compression techniques such as quantization and pruning. Tong et al. [11] investigated Federated Learning (FL) for Audio Semantic Communication (ASC), introducing a wav2vec-based autoencoder that extracts semantic speech features for transmission over wireless networks. Although these studies are remarkable, the proposed models often assume fully neural transceivers, making them impractical for real-world integration with existing wireless infrastructure; they lack structured feature discretization and remain heavily dependent on black-box AI models. Many of these studies also assume analog modulation, in which continuous signals are transmitted without discretization into constellation symbols, allowing any value to be represented. However, this idealized modulation assumption poses significant challenges in practical implementation due to limitations inherent in hardware components, such as power amplifiers.
With regard to digital semantic communications, Huang et al. [12] introduced D²-JSCC, a digital deep joint source–channel coding (JSCC) framework for semantic communication that utilizes deep source coding to extract and encode semantic features before transmission and employs adaptive density models to jointly optimize digital source and channel coding for image-based transmission. Bo et al. [13] proposed a Joint Coding–Modulation (JCM) framework based on Variational Autoencoders (VAEs) to facilitate digital semantic communication, using the Gumbel-Softmax method [14] to create a differentiable constellation of symbols for image semantic communications. Yu et al. [15] introduced HybridBSC, a hybrid bit and semantic communication system that enables the co-existence of semantic information and bit information within the same transmission framework for image transmission. Alongside these works, there are numerous surveys and tutorials designed to give a broad overview of semantic communications [16,17,18,19,20]. While the field is still evolving, existing studies predominantly focus on image- or text-based communication.
In contrast to frameworks such as D²-JSCC, which focuses on image-based semantic communications through deep source–channel coding, our work introduces structured feature discretization, allowing it to be integrated with traditional digital telecommunication frameworks. Moreover, while HybridBSC incorporates semantic- and bit-based transmission, a crucial limitation of HybridBSC is the lack of an explicit bit allocation for hybrid bit–semantic transmission. Without a defined bit burden framework, the system's efficiency and adaptability to varying channel conditions remain unclear. In addition, its main focus is on image transmission, whereas speech signals require the preservation of perceptual quality rather than merely the minimization of pixel distortion. Yan and Li [21] present a technique for transmitting digital signals by utilizing images as carrier signals. Although this study addresses digital signal transmission within a semantic communication framework, its methodology fundamentally differs from ours in both encoding strategy and transmission approach. These gaps leave semantic-aware digital speech transmission underexplored.
To address these gaps, this paper proposes a semantic-aware digital speech communication system that serves as an intermediate transition between conventional and fully end-to-end semantic communication models. Unlike purely deep learning-based systems, our model is capable of the following:
Extracting structured semantic representations of speech while ensuring compatibility with conventional digital transmission techniques.
Discretizing feature tensors before transmission, opening a pathway for integration with existing bit-based communication systems.
Mirroring the transmitter architecture at the receiver side, reconstructing speech with high perceptual quality while maintaining structured signal integrity.
Our system is designed to function under practical wireless conditions, incorporating stochastic channel models (Additive White Gaussian Noise: AWGN, Rayleigh, Rician multipath fading) while ensuring robust speech transmission across varying signal-to-noise ratio (SNR) levels. Performance is evaluated using the Signal-to-Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI) to comprehensively assess both signal fidelity and intelligibility. By integrating semantic-aware encoding within a structured digital transmission framework, our system remains applicable to existing telecommunication infrastructures while demonstrating the potential of deep learning-based feature representation and reconstruction. This hybrid approach enables gradual adoption in practical deployments, making it more accessible for real-world applications, such as low-bandwidth speech communication, robust voice transmission over wireless networks, and AI-assisted speech processing for edge computing devices.
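To make the channel assumptions concrete, the following minimal NumPy sketch shows how such stochastic channels can be simulated; it assumes unit-power complex baseband symbols and zero-forcing equalization with perfect CSI, and the function and variable names are illustrative rather than taken from our implementation.

```python
import numpy as np

def apply_channel(x, snr_db, channel="awgn", k_factor=5, rng=None):
    """Pass complex baseband symbols x through a stochastic channel.

    Assumes unit average symbol power and perfect CSI at the receiver,
    so the fading coefficient h is known and can be equalized.
    """
    rng = rng or np.random.default_rng()
    n = len(x)
    noise_power = 10 ** (-snr_db / 10)
    noise = np.sqrt(noise_power / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

    if channel == "awgn":
        h = np.ones(n, dtype=complex)
    elif channel == "rayleigh":
        # Zero-mean complex Gaussian taps with unit average power.
        h = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
    elif channel == "rician":
        # Deterministic line-of-sight component plus scattered component (K = 5 in our tests).
        los = np.sqrt(k_factor / (k_factor + 1))
        nlos = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2 * (k_factor + 1))
        h = los + nlos
    else:
        raise ValueError(channel)

    y = h * x + noise
    return y / h  # zero-forcing equalization under perfect CSI
```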
This study provides a scalable and interpretable approach to integrating semantic-aware speech processing into digital communication frameworks, bridging the gap between conventional wireless systems and emerging AI-driven transmission paradigms. Our core novelty lies in the pragmatic fusion of deep learning-driven semantic speech understanding with established digital communication principles. By moving beyond purely end-to-end neural models, we introduce a structured discretization process that allows semantic features to be transmitted as digital signals, opening a pathway for integration with existing wireless infrastructure. Specifically, by focusing on speech, we address the critical need for perceptual quality preservation in voice communications, a requirement that significantly differentiates our approach from image- and text-centric semantic models. Our system serves as a transitional framework between conventional source–channel coding techniques and end-to-end deep learning-based semantic communication models. Instead of replacing existing paradigms outright, our work demonstrates how semantic processing can be incrementally introduced into current speech communication systems, ensuring practical viability. By integrating semantic-aware speech processing into IoT ecosystems, our framework enhances voice-driven interactions in critical domains such as industrial automation, healthcare monitoring, and smart city infrastructure. Designed for practical deployment, the system operates with existing edge networks, avoiding costly hardware upgrades while delivering reliable performance in resource-limited environments, an essential feature for scalable IoT solutions. The rest of this paper is organized as follows:
Section 2 describes the system model, Section 3 details the experimental setup and implementation, Section 4 presents the results and performance analysis, and Section 5 concludes the study with key findings and future research directions.
3. Results
We investigate the performance of our model compared to conventional communication systems for speech signals. The systems are tested under AWGN, Rician, and Rayleigh fading channels, with the assumption of perfect channel state information (CSI) at the receiver. These channel conditions are selected to represent a range of real-world wireless transmission scenarios, from ideal Gaussian noise environments to multipath-dominated fading conditions. The experiments utilize the Edinburgh DataShare dataset [25], which provides diverse speech recordings suitable for robust performance evaluation.
3.1. Dataset and Preprocessing
The dataset consists of a training set of 10,000 .wav files, a validation set of 800 files, and a test set of 50 files. The speech signals are originally sampled at 48 kHz, and in this study, they are downsampled to 8 kHz to match practical bandwidth constraints in wireless communication, as speech intelligibility is largely contained within the 4 kHz frequency range.
Before transmission, all speech samples undergo amplitude normalization to mitigate level variations across recordings and achieve a consistent dynamic range. The dataset was selected for its phonetic and acoustic diversity in order to evaluate the system's generalization ability across different speaker profiles and speech characteristics.
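As an illustration of this preprocessing, the sketch below uses librosa for resampling and peak amplitude normalization; the library choice and the peak-normalization variant are assumptions, since any standard resampler and level normalization would serve.

```python
import librosa
import numpy as np

def preprocess(path, target_sr=8000, eps=1e-9):
    # Load and downsample from the original 48 kHz to 8 kHz;
    # speech intelligibility is largely contained below 4 kHz.
    speech, _ = librosa.load(path, sr=target_sr)
    # Peak amplitude normalization for a consistent dynamic range
    # (one possible normalization; the exact variant is an assumption).
    return speech / (np.max(np.abs(speech)) + eps)
```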
For evaluation, each test speech sample undergoes five independent transmission iterations, each initialized with a different random seed. This introduces controlled randomness into the transmission process and guards against initialization bias. Additionally, performance is assessed across SNR levels and fading channels to simulate real-world wireless conditions.
To ensure a fair comparison between semantic-aware transmission and conventional digital speech communication, we enforce a bit allocation constraint based on conventional PCM encoding. In traditional digital speech transmission, an audio signal is represented by its sampling rate and bit depth, which determine the total number of bits required for transmission. For a speech signal of duration $T$ s, sampled at a rate $f_s = 8$ kHz with a bit depth of $b = 16$ bits, the total bit allocation is given by the following:

$$B_{\text{total}} = T \cdot f_s \cdot b$$

Numerically, in our case, this becomes the following:

$$B_{\text{total}} = T \times 8000 \times 16 = 128{,}000\,T \ \text{bits}$$
This bit allocation serves as a reference for conventional communication, where each sample is explicitly quantized into a 16-bit PCM format for digital transmission. To maintain a fair and analytically rigorous comparison, we impose the same bit burden constraint on our proposed model. Specifically, the output of the semantic compressor, after flattening and discretization, is mapped into 16-bit integers to match the bit depth of conventional PCM encoding. This ensures that the overall number of bits used for semantic-aware transmission aligns with conventional digital transmission, enabling an equitable performance evaluation. Furthermore, this structured feature discretization prevents uncontrolled increases in transmission overhead while preserving semantic integrity.
By enforcing this constraint, we confirm that our semantic-aware model aligns with the same bit-rate requirements as conventional methods, allowing for a meaningful comparison in terms of transmission efficiency, perceptual quality, and robustness to channel impairments.
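The following sketch illustrates the PCM bit-budget reference and one simple way to map flattened semantic features to 16-bit integers; the min-max quantizer shown is an illustrative choice, not necessarily the exact mapping used in our implementation.

```python
import numpy as np

def pcm_bit_budget(duration_s, fs=8000, bit_depth=16):
    # Conventional PCM reference: B_total = T * fs * b.
    return int(duration_s * fs * bit_depth)

def discretize_features(z, bit_depth=16):
    """Flatten the semantic feature tensor and map it to 16-bit integers,
    matching the bit depth of conventional PCM encoding."""
    z = np.asarray(z, dtype=np.float64).flatten()
    z_min, z_max = z.min(), z.max()
    levels = 2 ** bit_depth - 1
    q = np.round((z - z_min) / (z_max - z_min + 1e-12) * levels).astype(np.uint16)
    # The scale parameters must accompany the payload for dequantization.
    return q, (z_min, z_max)
```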
3.2. Implementation Details
The proposed semantic-aware transmission system is implemented using TensorFlow 2.10.0 and trained on an NVIDIA RTX 2080 Ti GPU. The training configuration is as follows:
Batch size: 64
Loss function: MSE (mean squared error)
Optimizer: RMSProp
Learning rate: 0.0005
Training epochs: 750
Fading model: Rayleigh
Training SNR: 8 dB
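For concreteness, the configuration above corresponds to a standard Keras setup along the following lines; `model`, `train_ds`, and `val_ds` are assumed placeholders for the end-to-end network (with a differentiable Rayleigh fading layer at 8 dB SNR) and the batched data pipelines.

```python
import tensorflow as tf

def compile_and_train(model: tf.keras.Model, train_ds, val_ds):
    """Apply the training configuration listed above.

    `model` is assumed to be the end-to-end semantic encoder/decoder with
    a differentiable Rayleigh channel layer (8 dB SNR) in between;
    `train_ds`/`val_ds` are assumed to yield batches of 64 waveforms.
    """
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(learning_rate=5e-4),
        loss="mse",
    )
    return model.fit(train_ds, validation_data=val_ds, epochs=750)
```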
The semantic encoder consists of the following components, as shown in Table 1:
A semantic feature extraction network, comprising two 2D CNN layers followed by five Squeeze-and-Excitation Residual Network (SE-ResNet) blocks, which extract meaningful speech representations.
A semantic compression module, consisting of two 2D CNN layers, which efficiently reduces redundant information before encoding.
A quantization module, which maps the compressed semantic representations to a finite set of discrete values before modulation.
We use SE-ResNet blocks as they are commonly employed for semantic speech encoding in the literature [26]. Thanks to their squeeze-and-excitation attention mechanism, SE blocks adaptively emphasize the essential speech information during feature extraction. The selection of hyperparameters, including the training SNR of 8 dB and the number of filters, was based on empirical tuning rather than an exhaustive grid search. This work mainly aims to demonstrate a proof of concept, and these hyperparameters were chosen as reasonable settings rather than globally optimized values. Future work could further refine these choices using hyperparameter search methods.
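A minimal Keras sketch of one SE-ResNet block is given below; the filter count and reduction ratio are illustrative placeholders rather than the exact values listed in Table 1.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_resnet_block(x, filters, reduction=8):
    """Squeeze-and-Excitation residual block (illustrative sizes).

    Assumes the input already has `filters` channels so the identity
    shortcut can be added without a projection.
    """
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)

    # Squeeze: global pooling summarizes each channel.
    s = layers.GlobalAveragePooling2D()(y)
    # Excitation: a bottleneck MLP learns per-channel rescaling weights.
    s = layers.Dense(filters // reduction, activation="relu")(s)
    s = layers.Dense(filters, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, filters))(s)

    y = layers.Multiply()([y, s])
    return layers.ReLU()(layers.Add()([shortcut, y]))
```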
In the semantic compression module, the first 2D CNN layer employs a ReLU activation function, while the second layer does not incorporate any activation function. At the receiver, a symmetric architecture is employed for semantic decoding, where the semantic expander reconstructs feature dimensions before the semantic interpreter synthesizes the speech waveform. The first 2D transposed CNN layer in the semantic expander operates without an activation function, whereas the second layer applies the ReLU activation function. In our empirical analyses, this configuration yielded the lowest reconstruction error. The last layer of the semantic compressor is designed without an activation function to ensure that the semantic information remains within an appropriate range [11,27]. Notably, we determined that omitting activation functions in specific layers led to the best performance in our case: the second layer of the semantic compressor and the first layer of the semantic expander achieved the best results without activation functions, and the output of the semantic interpreter likewise improved when no activation function was applied. The training phase of the proposed system differs from conventional communication models in that it does not involve explicit channel encoding, modulation, or decoding operations. These traditional signal processing components are inherently non-differentiable, which makes them unsuitable for the backpropagation-based optimization commonly used in deep learning models [13]. The conventional transmission system follows a standard digital communication approach, as summarized in Table 2.
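Referring back to the activation placement described above, the following sketch outlines the compressor/expander pair; kernel sizes, strides, and filter counts are placeholders, not the exact Table 1 configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def semantic_compressor(x, filters=32):
    # First 2D CNN layer uses ReLU; the second (output) layer has no
    # activation so the semantic features keep an unbounded range.
    x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    return layers.Conv2D(filters // 2, 3, strides=2, padding="same")(x)

def semantic_expander(x, filters=32):
    # Mirror of the compressor: the first transposed CNN layer has no
    # activation, and the second applies ReLU.
    x = layers.Conv2DTranspose(filters // 2, 3, strides=2, padding="same")(x)
    return layers.Conv2DTranspose(filters, 3, strides=2, padding="same", activation="relu")(x)
```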
We compare our proposed digital semantic-aware system against the conventional approach shown in Table 2. It is important to note that this conventional transmission chain is retained unchanged within the proposed system.
To objectively assess the accuracy of reconstructed speech signals, we analyze SDR, PESQ, and STOI at various SNR levels, which measure how well the system preserves both the structure of the signal and its perceptual quality under different channel conditions. The following figures illustrate the performance evaluation in AWGN, Rayleigh, and Rician (K = 5) channels, using SDR, STOI, and PESQ metrics.
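A sketch of how these metrics can be computed per transmission is shown below; it assumes the third-party pesq and pystoi packages, with SDR evaluated directly from the reference and reconstructed waveforms.

```python
import numpy as np
from pesq import pesq    # ITU-T P.862 implementation (assumed dependency)
from pystoi import stoi  # short-time objective intelligibility (assumed dependency)

FS = 8000  # 8 kHz sampling rate; 'nb' selects narrowband PESQ

def sdr_db(ref, est, eps=1e-9):
    # Signal-to-distortion ratio in dB between reference and reconstruction.
    return 10 * np.log10(np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + eps))

def evaluate(ref, est):
    return {
        "SDR": sdr_db(ref, est),
        "PESQ": pesq(FS, ref, est, "nb"),
        "STOI": stoi(ref, est, FS),
    }
```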
Figure 3 shows the SDR comparison, indicating that the proposed system is better equipped to handle multipath fading, leading to improved signal clarity and lower distortion in Rayleigh and Rician channels.
Figure 4 illustrates the STOI results, showing that the proposed system achieves slightly higher intelligibility scores in fading environments compared to the traditional system. However, under AWGN, the conventional method remains better.
Figure 5 presents the PESQ scores, demonstrating that while traditional methods perform well under AWGN, the proposed system provides better perceptual quality under severe channel impairments.
Next, we visually compare the performance of the proposed and conventional models through spectrograms for a sample speech signal.
Figure 6 illustrates the spectrogram of the original speech signal, while Figure 7 and Figure 8 present the spectrograms of the reconstructed speech signals under Rician and Rayleigh fading channels for both the proposed and conventional models. These visualizations demonstrate the effectiveness and robustness of our model in preserving spectral features within fading environments, outperforming conventional methods by maintaining critical spectral content.
4. Discussion
The results indicate that under AWGN conditions, the conventional system outperforms the proposed system when the SNR level is higher than 8 dB. However, under fading channels, the proposed semantic-aware system outperforms the conventional approach from low to high-mid SNR regions, demonstrating improved robustness against multipath fading effects. This highlights the ability of semantic-aware transmission to preserve intelligibility and perceptual quality even in highly degraded conditions.
It is important to note that although the proposed system shows improved PESQ and SDR scores in fading scenarios, its STOI performance does not always align with these results. While our system effectively enhances perceptual speech quality compared to the conventional system, as reflected in PESQ and SDR, it potentially introduces subtle distortions in the temporal envelope of the reconstructed speech. These distortions, though not significantly impacting overall perceptual quality, can affect the short-time temporal envelope correlation, which is the core quantity assessed by STOI. Specifically, we believe that the semantic decoder, in its effort to generalize and reconstruct speech from compressed semantic features, may smooth out or alter fine-grained temporal details that are critical for STOI. One potential mitigation strategy would be to use a hybrid loss function: incorporating a term that explicitly penalizes temporal envelope distortions during training could help align STOI with PESQ and SDR. This could involve a time-domain loss function or a STOI-specific loss component.
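As an illustration of this mitigation idea, the sketch below combines MSE with a hypothetical short-time envelope penalty; the framing parameters and the weight alpha are illustrative assumptions, not values we have validated.

```python
import tensorflow as tf

def hybrid_loss(y_true, y_pred, alpha=0.1, frame=256):
    # Standard MSE term on the waveform.
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    # Hypothetical envelope term: compare short-time RMS energies so
    # temporal envelope distortions are explicitly penalized.
    env_t = tf.sqrt(tf.reduce_mean(tf.square(tf.signal.frame(y_true, frame, frame // 2)), axis=-1) + 1e-9)
    env_p = tf.sqrt(tf.reduce_mean(tf.square(tf.signal.frame(y_pred, frame, frame // 2)), axis=-1) + 1e-9)
    return mse + alpha * tf.reduce_mean(tf.square(env_t - env_p))
```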
While we are aware of latency implications, this research places its core focus on the practical applicability of semantic communication. We posit that the semantic computations are executed on a high-performance edge device, mitigating potential latency concerns. Furthermore, the computational efficiency of the proposed model can be significantly enhanced through hardware acceleration, such as Tensor Processing Units (TPUs). Additionally, the application of established model compression techniques, including pruning [28] and quantization [29], effectively reduces the computational burden on resource-constrained IoT devices.
These results show the trade-off between bit-level accuracy and semantic reconstruction. While conventional systems excel in ideal conditions, semantic-aware models demonstrate resilience to real-world wireless impairments, making them more suitable for low-SNR and multipath environments.
Complexity Analysis
To provide a quantitative assessment of the computational complexity of Hybrid-DeepSCS, particularly relevant for IoT and edge device applications, we conducted a detailed analysis of the model's floating-point operations (FLOPs). We estimate the FLOPs for each of the model's main components; the results are summarized in Table 3.
From the table, we observe that the semantic interpreter requires the highest computational workload and model size. On the other hand, the semantic compressor has the lowest computational requirement and model size, reflecting its role in reducing data dimensionality before transmission. Future work could explore model compression techniques, such as pruning or quantization, as mentioned earlier, to improve computational efficiency without significant performance loss. Based on the computational power of a given edge device, the latency incurred by the semantic model can be estimated. Specifically, latency can be calculated as the ratio of the FLOP count to the peak floating-point operations per second (FLOPS) capability of the target edge device when the processing unit is operating at maximum utilization [30]. We denote FLOPs as the number of floating-point operations and FLOPS (note the uppercase 'S') as a measure of hardware performance. Mathematically, this can be expressed as follows [31]:

$$\text{Latency} = \frac{\text{FLOPs}}{\text{FLOPS}_{\text{peak}}}$$
This approach allows for a device-specific estimation of latency, providing a practical understanding of the system’s performance in resource-constrained environments.
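A one-line estimate under this maximum-utilization assumption might look as follows; the device figures in the example are illustrative, not measurements.

```python
def estimated_latency_s(flops: float, peak_flops_per_s: float) -> float:
    # Latency ~ FLOPs / peak FLOPS at full processor utilization.
    return flops / peak_flops_per_s

# Example: a 2-GFLOP forward pass on an edge accelerator rated at
# 0.5 TFLOPS (illustrative numbers, not measurements).
print(estimated_latency_s(2e9, 0.5e12))  # ~0.004 s
```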
5. Conclusions
In this paper, we present Hybrid-DeepSCS, a novel semantic-aware digital speech communication system that bridges the gap between traditional digital transmission and emerging deep learning-based semantic communication. Our approach integrates a semantic encoder and decoder within a conventional digital transmission chain, enabling the extraction, discretization, and reconstruction of structured semantic features from speech. This hybrid design allows compatibility with existing communication infrastructures while leveraging deep learning for enhanced feature representation and speech reconstruction.
Experimental results demonstrate that Hybrid-DeepSCS significantly improves performance in challenging channel conditions, such as Rayleigh and Rician fading. While conventional systems maintain an advantage in high-SNR AWGN channels, our method offers superior robustness and perceptual quality—measured through PESQ and SDR—under noisy and multipath fading conditions, making it particularly suited for real-world wireless environments. Although STOI performance varies, our findings highlight the trade-off between bit-level accuracy and semantic preservation. Prioritizing semantic content enhances speech intelligibility and resilience, particularly in low-SNR and fading scenarios.
This work marks an important step toward the practical deployment of semantic communication systems, providing a pathway for gradual integration with current technologies. Future research will focus on optimizing semantic encoding and decoding, exploring adaptive modulation strategies, and assessing the impact of different network topologies on overall system performance. Another promising direction is to investigate federated or transfer learning methods to improve the model's ability to adapt to various speakers and languages. A further direction is to analyze the system's performance under extreme bandwidth limitations by testing its effectiveness at ultra-low bit rates. We hope that our work will provide valuable insight for the gradual transition to practical digital semantic communication technologies for future networks.