Article

Stable Diffusion-Driven Semantic Coding Method for Image Transmission Under Low SNR Conditions

The Sixty-Third Research Institute, National University of Defense Technology, Nanjing 210007, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(11), 2459; https://doi.org/10.3390/electronics15112459
Submission received: 20 March 2026 / Revised: 10 May 2026 / Accepted: 23 May 2026 / Published: 4 June 2026

Abstract

With the advancement of wireless communication technologies, especially the emergence of mobile communication technologies such as satellite internet and sensor networks, the rapid proliferation of communication facilities has given rise to challenges such as the scarcity of spectrum bandwidth resources, heightened channel interference, and increased noise. Consequently, traditional image source coding technologies urgently require further improvements in their compression ratio and anti-interference capability. Targeting image transmission scenarios characterized by low signal-to-noise ratios and constrained channel bandwidths, this paper proposes an image semantic coding method based on the pre-trained Stable Diffusion model, producing a zero-shot universal image compressor. This compressor leverages the denoising network of the Stable Diffusion model, with feedback from channel SNR, to further enhance the adaptability of transmitted data to channel interference. Additionally, by designing quantization and entropy coding methods for feature tensors in the semantic space, the compression ratio of the image coding process is further improved. Simulation results demonstrate that the proposed method not only achieves superior compression performance but also ensures relatively high similarity between the decoded reconstructed image and the original. Notably, it delivers a significant improvement in the perceptual similarity of human visual quality. Furthermore, the method can adapt to Gaussian noise channels, Rician fading channels, and Rayleigh fading channels with low SNR, exhibiting broad application prospects in the field of wireless communication coding methods, where the electromagnetic environment is growing increasingly complex.

1. Introduction

In recent years, satellite communication networks have witnessed rapid development, emerging as a vital component of civil communication infrastructure and playing an indispensable role in wide-area communications under complex electromagnetic environments [1]. However, due to their long transmission distance, current satellite communication bandwidths are constrained, especially for newly developed satellite internet services, which struggle to meet users’ real-time transmission requirements, among which image transmission demands are the most urgent. Traditional image coding algorithms like JPEG have limited compression capability for data volume under certain distortion constraints. On the other hand, due to the open nature of wireless channels and the gradual scarcity of spectrum resources, interference and noise in transmission channels can no longer be ignored, and malicious interference in adversarial channels is unavoidable. Under harsh channels with low signal-to-noise ratio (SNR), traditional coding methods exhibit a “cliff effect” [2], where image quality deteriorates sharply after decompression. Furthermore, channel feedback enables adaptive coding and related strategies to enhance channel transmission efficiency and achievable rate [3]. With the development of artificial intelligence technology, semantic communication technology based on deep neural networks provides a new approach. Traditional wireless anti-interference technologies rely on Shannon’s formula, sacrificing communication bandwidth as a cost, while semantic communication converts images into semantic information via neural networks [4]. This semantic information contains key image details of human concern; further coding of such information removes redundant and non-critical semantic elements, thus compressing the bit data volume for transmission, reducing bandwidth requirements, and potentially improving interference tolerance beyond the limits of Shannon’s formula.
Based on the above requirements, there is a need to explore a coding technology that meets distortion constraints and exhibits better compression performance under low SNR conditions. With the development of deep learning algorithms and large models, semantic communication has become a research hotspot. Shi Guangming et al. [5] proposed a new semantic communication approach from the perspective of intelligent perception and discussed a semantic coding mechanism. Kalfa et al. [6] introduced a semantic signal-processing framework adaptable to different communication tasks at the receiver. Niu Kai et al. [7] and Zhang Ping et al. [4] explored the measurement of semantic information, proposed an intelligent and efficient semantic communication system architecture, and established a mathematical theory of semantic communication based on synonymous mapping [8]. For image source transmission, Bourtsoulatze et al. [9] proposed Joint Source-Channel Coding based on a convolutional neural network to perform image transmission over wireless channels, optimizing semantic codecs to enhance transmission performance. For image retrieval tasks, Jankowski et al. [9] developed an edge–cloud collaborative semantic communication method, significantly improving task performance. The emergence of large generative AI models has provided new means for developing ultra-low-rate high-fidelity semantic communication systems. Visual generative AI models such as Sora [10], Lumiere [11], and DALL·E [12], pre-trained on massive data, have acquired cognitive foundations about image distributions and can generate high-quality images from text prompts. Current generative models, such as GPT-4, are increasingly applied to image compression, achieving significantly higher compression ratios than traditional algorithms [13].
Current semantic coding approaches still face several critical limitations. Firstly, small models trained on specific datasets lack sufficient generalization capability, performing well only on their designated test sets and proving inadequate as universal compressors. Secondly, semantic compression based on large language models like GPT-4 presents inherent constraints in performance evaluation; typically measured using entire datasets, these methods utilize encoders that output fixed-dimensional vectors. Consequently, adjustments are possible only in input data volume, not the dimensionality of the encoded tensor representation. This renders compression that is solely reliant on large language models impractical at low data volumes [14]. Thirdly, research predominantly focuses on wide-bandwidth scenarios such as 5G mobile communications, largely neglecting low SNR scenarios characterized by significant channel interference.
The main contributions of this paper are outlined as follows.
First, a zero-shot image semantic coding framework is constructed based on the pre-trained Stable Diffusion model (SD_Semantic). The proposed architecture supports general image compression without dataset-specific fine-tuning, overcoming the poor generalization of conventional lightweight deep learning-based coding models.
Second, a channel-aware adaptive semantic optimization mechanism is proposed. By embedding SNR feedback into the denoising network of Stable Diffusion, the developed scheme adapts to adversarial channels with low SNR and strong malicious interference, mitigates the cliff effect of traditional coding, and enhances the transmission’s robustness against interference.
Third, latent-space entropy coding is tailored for semantic feature tensors in diffusion models. It further compresses redundant bit information and achieves high-fidelity image transmission at high compression ratios under low-SNR channel conditions.
Fourth, a multi-dimensional evaluation benchmark is established to compare the proposed scheme with traditional image coding methods and existing semantic coding methods. Experimental results in terms of PSNR, SSIM, LPIPS, bit rate and channel robustness demonstrate that the proposed method achieves superior overall performance over traditional coding schemes in low-SNR scenarios.
Conventional coding and decoding schemes only eliminate redundancy in the pixel domain; they do not exploit high-level image semantic priors and lack channel adaptation capability. As a result, they suffer from sharp image quality degradation under low-SNR channels with malicious interference. By leveraging the powerful visual semantic prior and generative modeling capability of Stable Diffusion, the proposed method removes redundancy at the semantic level. Meanwhile, the introduced channel state awareness mechanism enables the scheme to adapt to complex electromagnetic and interference channels, breaking through the performance bottleneck of traditional coding approaches. Existing CNN-based joint source-channel coding and lightweight semantic models are trained on task-specific datasets, leading to limited generalization and poor universality. Most of them are designed for ideal broadband communication scenarios and ignore low-SNR conditions and malicious interference. In contrast, this paper adopts a zero-shot architecture based on a pre-trained large generative model, which does not rely on scenario-specific datasets. It inherently adapts to harsh channel environments, thereby offering wider applicability and stronger robustness.

2. Pre-Trained Model-Based Semantic Coding Methods

This paper proposes an image semantic coding method, SD_Semantic, which is based on pre-trained generative diffusion models. The following sections elaborate on the semantic communication model, the image semantic coding workflow, and the neural network architecture.

2.1. Semantic Communication Model

Source coding is a process of data compression aimed at eliminating redundancy from the source data as much as possible, whereas channel coding introduces redundant information to enhance transmission reliability. The design of both source and channel coding often requires joint consideration. Figure 1 illustrates a semantic communication architecture based on pre-trained generative diffusion models for image coding. The input source is denoted by $x \in \mathbb{R}^m$. The semantic extraction network produces the semantic-space feature tensor $z = f_{se}(x; \varphi_{se})$, where $\varphi_{se}$ denotes the learnable parameters of the semantic extraction network. The tensor $z$ is mapped to the channel input symbol vector $y$ via quantization and entropy coding, $y = f_e(z; \varphi_e)$, where $\varphi_e$ represents the parameters for quantization and entropy coding. The symbol vector $y$ passes through the wireless channel and arrives at the receiver, with the wireless channel modeled as $\hat{y} = H(y, \nu)$, where $\nu$ denotes the wireless channel parameters. In this paper, we mainly consider the impacts of channel noise types and signal-to-noise ratio (SNR) on the transmitted symbols. At the receiver, an estimate $\hat{z} = g_d(\hat{y}; \varphi_d)$ of the semantic tensor $z$ is obtained through decoding. The estimate $\hat{x} = g_{sd}(\hat{z}; \varphi_{sd})$ of the original source is recovered via semantic reconstruction, where $\varphi_d$ denotes the parameters for entropy decoding, and $\varphi_{sd}$ represents the parameters of the semantic decoding and reconstruction network.
After designing the semantic extraction network architecture, quantization, and entropy coding methods, this study optimizes the parameters $\varphi_e$ and $\varphi_d$ to minimize the distortion between the source image and the reconstructed image at the receiver, as formalized in Equation (1).
$$(\varphi_{se}^{*}, \varphi_{e}^{*}, \varphi_{d}^{*}, \varphi_{sd}^{*}) = \arg\min_{\varphi_{se}, \varphi_{e}, \varphi_{d}, \varphi_{sd}} \mathbb{E}_{x \sim p_x} \mathbb{E}_{\hat{x} \sim p_{\hat{x}|x}} \left[ d(x, \hat{x}) \right] \tag{1}$$
Here, $\varphi_e^{*}$ and $\varphi_d^{*}$ denote the optimal trainable parameters of the encoding and decoding neural networks, while $d(x, \hat{x})$ represents the distortion between the source data and the decoded reconstruction after transmission. In the field of image processing, the similarity between two images is typically measured by the mean squared error (MSE), the peak signal-to-noise ratio (PSNR), and the structural similarity index (SSIM). This paper adopts PSNR, SSIM, and the Learned Perceptual Image Patch Similarity (LPIPS) as distortion metrics, where LPIPS is a perceptual similarity metric based on deep feature representations. PSNR is one of the metrics for measuring image distortion [15]. Its calculation is based on the definition of MSE, as formalized in Equations (2) and (3), given a clean image $I$ of size $m \times n$ and its noisy version $K$.
$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i,j) - K(i,j) \right]^2 \tag{2}$$
$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{MAX_I^2}{\mathrm{MSE}} \right) \tag{3}$$
Here, $MAX_I$ denotes the maximum possible pixel value of the image. If each pixel is quantized with 8 bits, then $MAX_I$ is 255. Denoting the bit depth of pixel quantization as $B$, we have $MAX_I = 2^B - 1$. Typically, for uint8 data, the maximum pixel value is 255, whereas for floating-point data, it is 1. The SSIM serves as a metric for quantifying structural similarity between two images [15]. It computes the mean luminance values of both images and utilizes these means as comparative parameters for luminance alignment, with the calculation method formalized in Equation (4).
$$\mathrm{SSIM} = \frac{(2 \mu_x \mu_y + c_1)(2 \sigma_x \sigma_y + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \tag{4}$$
Here, $\mu_x$ and $\mu_y$ denote the mean values of the luminance components for image $x$ and image $y$, respectively, $\sigma_x$ and $\sigma_y$ represent the standard deviations of image $x$ and image $y$ (so that $\sigma_x^2$ and $\sigma_y^2$ are the variances), while $c_1$ and $c_2$ are constants introduced for numerical stability.
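As an illustration of Equations (2) to (4), the following minimal Python sketch computes MSE, PSNR, and a single-window SSIM on small grayscale images stored as nested lists. It is not the evaluation code used in this paper. The SSIM function follows Equation (4) as printed, whereas widely used SSIM implementations replace $\sigma_x \sigma_y$ with the covariance $\sigma_{xy}$ and average over local windows. The constants $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ with $k_1 = 0.01$, $k_2 = 0.03$, $L = 2^B - 1$ are common defaults and are assumptions here.

```python
import math


def mse(ref, test):
    """Mean squared error between two equally sized 2-D images (Equation (2))."""
    m, n = len(ref), len(ref[0])
    total = sum((ref[i][j] - test[i][j]) ** 2 for i in range(m) for j in range(n))
    return total / (m * n)


def psnr(ref, test, bit_depth=8):
    """PSNR in dB (Equation (3)) with MAX_I = 2**bit_depth - 1."""
    err = mse(ref, test)
    if err == 0:
        return math.inf  # identical images have unbounded PSNR
    max_i = 2 ** bit_depth - 1
    return 10 * math.log10(max_i ** 2 / err)


def global_ssim(x, y, bit_depth=8, k1=0.01, k2=0.03):
    """Single-window SSIM following Equation (4) as printed.

    k1 and k2 are assumed default constants, not values taken from the paper.
    """
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((v - mu_x) ** 2 for v in xs) / n
    var_y = sum((v - mu_y) ** 2 for v in ys) / n
    dyn = 2 ** bit_depth - 1
    c1, c2 = (k1 * dyn) ** 2, (k2 * dyn) ** 2
    num = (2 * mu_x * mu_y + c1) * (2 * math.sqrt(var_x) * math.sqrt(var_y) + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den
```

For example, changing one pixel of a 2 × 2 image by 4 gray levels gives an MSE of 4 and a PSNR of roughly 42.1 dB at 8-bit depth.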

2.2. Semantic Encoding and Decoding Workflow

The source data in this study comprises image tensors of size $H \times W \times C$, where $H$, $W$, and $C$ represent the height, width, and channel dimensions, respectively. This paper proposes a semantic image coding method based on pre-trained generative diffusion models for low-SNR scenarios. As illustrated in Figure 2, the communication pipeline starts with natural-scene images as the source signal; these undergo semantic feature extraction, which transforms and compresses raw pixel data into latent-space semantic feature tensors, significantly reducing bitrate requirements. Subsequent quantization of these tensors further enhances compression efficiency, followed by entropy coding and modulation for channel transmission. After demodulation, the received bitstream is decoded and dequantized to reconstruct the semantic feature tensors, which are then processed by the decoder of the autoencoder to recover the image content. Simultaneously, DeepSeek-VL generates textual descriptions of the source images, which are transmitted to the receiver to guide the diffusion model's reconstruction via semantic alignment. Notably, Low-Density Parity-Check (LDPC) coding is adaptively activated based on real-time channel SNR measurements to optimize error correction without necessarily compromising throughput; both its deployment status and its code rate are dynamically configurable.

2.3. Semantic Coding Neural Network Model

The image semantic extraction and reconstruction modules employ the Variational Autoencoder (VAE) from the Stable Diffusion model. The pre-trained VAE transforms an input image into a semantic latent-space tensor via a deep neural network. As illustrated in Figure 3, the VAE encoder maps the original image to $n \times n$ independent Gaussian distributions; sampling from them yields an $n \times n$ feature tensor encapsulating the semantic characteristics of the source image [10]. The VAE encoder and decoder share a symmetric network architecture and are trained jointly.
The following introduces the principles of the VAE network [16]. First, we assume there is a dataset $X = \{x^{(i)}\}_{i=1}^{N}$, where the samples are independent and identically distributed (i.i.d.). We assume that each sample in this dataset is generated by a stochastic process, described as follows. First, a semantic-space tensor $z^{(i)}$ is sampled from the semantic-space variable distribution $p_\theta(z)$, which is assumed to be a continuous distribution. Then, based on the latent-space tensor $z^{(i)}$, a data sample $x^{(i)}$ is generated, which follows the conditional distribution $p_\theta(x \mid z = z^{(i)})$.
According to the above assumptions, the probability of each sample in the dataset can be calculated as $p_\theta(x^{(i)}) = \int p_\theta(x^{(i)} \mid z)\, p_\theta(z)\, dz$. If each term in the formula had an analytical expression, the model parameters $\theta$ could be solved through maximum likelihood estimation, with the maximum likelihood objective shown in Equation (5).
$$\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{N} \log p_\theta(x^{(i)}) = \arg\max_{\theta} \sum_{i=1}^{N} \log \int p_\theta(x^{(i)} \mid z)\, p_\theta(z)\, dz \tag{5}$$
However, since $p_\theta(x^{(i)}) = \int p_\theta(x^{(i)} \mid z)\, p_\theta(z)\, dz$ is intractable, the maximum likelihood function cannot be used directly to solve for $\theta$. VAE introduces a distribution $q_\phi(z \mid x)$, which serves as an approximation of the true distribution $p_\theta(z \mid x)$. It describes the distribution of the latent variable $z$ given the sample $x$, and the parameters $\phi$ are learned by the VAE network. In VAE, $q_\phi(z \mid x)$ is the encoder network that maps samples to latent-space tensors, while $p_\theta(x \mid z)$ is the decoder network that maps latent-space tensors back to sample data. Since the distribution $q_\phi(z \mid x)$ is an approximation of the true distribution $p_\theta(z \mid x)$, the KL divergence of this distribution with respect to the true distribution on the sample $x^{(i)}$ is expressed in Equation (6).
$$D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z \mid x^{(i)})\big) = \int q_\phi(z \mid x^{(i)}) \log \frac{q_\phi(z \mid x^{(i)})}{p_\theta(z \mid x^{(i)})}\, dz = \int q_\phi(z \mid x^{(i)}) \log \frac{q_\phi(z \mid x^{(i)})}{p_\theta(z, x^{(i)})}\, dz + \log p_\theta(x^{(i)}) \tag{6}$$
After rearranging terms, we obtain Equation (7).
$$\log p_\theta(x^{(i)}) = D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z \mid x^{(i)})\big) - \int q_\phi(z \mid x^{(i)}) \log \frac{q_\phi(z \mid x^{(i)})}{p_\theta(z, x^{(i)})}\, dz \tag{7}$$
The second term, including its sign, is denoted as $\eta(\theta, \phi, x^{(i)})$, as shown in Equation (8).
$$\eta(\theta, \phi, x^{(i)}) = \int q_\phi(z \mid x^{(i)}) \log \frac{p_\theta(z, x^{(i)})}{q_\phi(z \mid x^{(i)})}\, dz = \mathbb{E}_{z \sim q_\phi(z \mid x^{(i)})} \left[ \log p_\theta(z, x^{(i)}) - \log q_\phi(z \mid x^{(i)}) \right] \tag{8}$$
Equation (7) can be rewritten as Equation (9).
$$\log p_\theta(x^{(i)}) = D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z \mid x^{(i)})\big) + \eta(\theta, \phi, x^{(i)}) \tag{9}$$
Since the KL divergence is non-negative, $\eta(\theta, \phi, x^{(i)})$ serves as a lower bound for $\log p_\theta(x^{(i)})$. Therefore, the maximum likelihood objective $\sum_{i=1}^{N} \log p_\theta(x^{(i)})$ is transformed into maximizing $\sum_{i=1}^{N} \eta(\theta, \phi, x^{(i)})$, which is the evidence lower bound (ELBO). This can be further decomposed as shown in Equation (10).
$$\eta(\theta, \phi, x^{(i)}) = \int q_\phi(z \mid x^{(i)}) \log \frac{p_\theta(z, x^{(i)})}{q_\phi(z \mid x^{(i)})}\, dz = -D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\big) + \mathbb{E}_{z \sim q_\phi(z \mid x^{(i)})} \left[ \log p_\theta(x^{(i)} \mid z) \right] \tag{10}$$
To maximize the expectation term $\mathbb{E}_{z \sim q_\phi(z \mid x^{(i)})} [\log p_\theta(x^{(i)} \mid z)]$, whose sampling distribution itself depends on the parameters $\phi$, its gradient with respect to the parameters must be computed so that they can be updated. For convenience, we denote $f(z) = \log p_\theta(x^{(i)} \mid z)$. The gradient of $\mathbb{E}_{z \sim q_\phi(z \mid x^{(i)})} [f(z)]$ with respect to the parameter $\phi$ is then given by Equation (11).
$$\nabla_\phi\, \mathbb{E}_{z \sim q_\phi(z \mid x^{(i)})} [f(z)] = \mathbb{E}_{z \sim q_\phi(z \mid x^{(i)})} \left[ f(z)\, \nabla_\phi \log q_\phi(z \mid x^{(i)}) \right] \tag{11}$$
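Because a Monte Carlo estimate of this gradient typically has high variance, the VAE network instead uses a reparameterization trick. A distribution $p(\varepsilon)$ and a function $g_\phi(\varepsilon, x^{(i)})$, which depends on the parameters $\phi$, the noise $\varepsilon$, and the sample $x^{(i)}$, are constructed so as to satisfy Equation (12).
$$\mathbb{E}_{z \sim q_\phi(z \mid x^{(i)})} [f(z)] = \mathbb{E}_{\varepsilon \sim p(\varepsilon)} \left[ f\big(g_\phi(\varepsilon, x^{(i)})\big) \right] \tag{12}$$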
The above expectation is calculated using the Monte Carlo method to obtain Equation (13).
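$$\mathbb{E}_{z \sim q_\phi(z \mid x^{(i)})} [f(z)] = \mathbb{E}_{\varepsilon \sim p(\varepsilon)} \left[ f\big(g_\phi(\varepsilon, x^{(i)})\big) \right] \approx \frac{1}{L} \sum_{l=1}^{L} f\big(g_\phi(\varepsilon^{(l)}, x^{(i)})\big) \tag{13}$$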
where $\varepsilon^{(l)} \sim p(\varepsilon)$. Substituting Equation (13) into Equation (8) gives Equation (14).
$$\eta(\theta, \phi, x^{(i)}) \approx \frac{1}{L} \sum_{l=1}^{L} \left[ \log p_\theta(z^{(i,l)}, x^{(i)}) - \log q_\phi(z^{(i,l)} \mid x^{(i)}) \right] \tag{14}$$
where $z^{(i,l)} = g_\phi(\varepsilon^{(l)}, x^{(i)})$ and $\varepsilon^{(l)} \sim p(\varepsilon)$. Substituting Equation (13) into Equation (10) gives Equation (15).
$$\eta(\theta, \phi, x^{(i)}) \approx -D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} \mid z^{(i,l)}) \tag{15}$$
where $z^{(i,l)} = g_\phi(\varepsilon^{(l)}, x^{(i)})$ and $\varepsilon^{(l)} \sim p(\varepsilon)$. Both Equations (14) and (15) can be used to estimate $\eta(\theta, \phi, x^{(i)})$ and the gradients of the parameters $\theta$ and $\phi$. The difference between the two formulas is that Equation (15) requires calculating the KL divergence. When both $q_\phi(z \mid x^{(i)})$ and $p_\theta(z)$ are assumed to be Gaussian distributions, the KL divergence can be computed directly instead of estimated, so the gradient variance obtained with Equation (15) is smaller. The VAE network adopts the approach of calculating the KL divergence.
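The estimator in Equation (15) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the VAE used in Stable Diffusion: the encoder is assumed to output a diagonal Gaussian with mean `mu` and log-variance `log_var`, the prior is assumed to be $p_\theta(z) = \mathcal{N}(0, I)$, and `log_likelihood` is a hypothetical callable standing in for the decoder term $\log p_\theta(x \mid z)$.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I): one possible choice of g_phi(eps, x)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def elbo_estimate(x, mu, log_var, log_likelihood, num_samples=1, seed=0):
    """Equation (15): -KL + (1/L) * sum over l of log p(x | z^(l))."""
    rng = random.Random(seed)
    rec = sum(
        log_likelihood(x, reparameterize(mu, log_var, rng)) for _ in range(num_samples)
    ) / num_samples
    return -kl_to_standard_normal(mu, log_var) + rec
```

Because the KL term is evaluated in closed form, only the reconstruction term is subject to sampling noise, which is the variance advantage of Equation (15) noted above.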
The noise injection network progressively adds Gaussian noise to the latent-space vectors output by the VAE encoder. This process drives the semantic feature tensor toward a stochastic noise distribution. Subsequently, the denoising network predicts and removes the injected noise to recover the original semantic feature tensor. The denoising intensity parameter (denoted as $\lambda$) controls the noise magnitude: when $\lambda = 0$, no noise is added; when $\lambda = 1$, maximum noise is applied. In this study, the initial denoising intensity is set to $\lambda = 0.2$. Both the noise injection and denoising networks utilize pre-trained Stable Diffusion models, with their operational workflows illustrated in Figure 4 [10].
In this study, the prompt not only serves as a generic input for the Stable Diffusion model but also includes textual descriptions of the source image generated by the DeepSeek large model. Both the textual descriptions and the semantic feature vectors of the image are jointly fed as inputs for decoding and reconstruction. As channel noise increases and the SNR progressively decreases, this study linearly increases the denoising intensity from 0.04 to 0.2. Guided by the prompt, the noise injection and denoising networks adaptively perform their operations. Under extremely low SNR conditions where the semantic feature tensor is severely corrupted by noise, this paper assumes the channel can only transmit textual information. In such cases, the text-to-image capability of the Stable Diffusion model is activated, directly synthesizing images solely from the prompt.
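The SNR-dependent denoising intensity described above can be sketched as a simple linear map. The endpoints 0.04 and 0.2 come from the text; the SNR range of 0 to 50 dB, the clamping outside that range, and the total of 50 sampling steps are assumptions made only for this illustration, and the function names are hypothetical.

```python
def denoising_strength(snr_db, snr_min_db=0.0, snr_max_db=50.0, s_min=0.04, s_max=0.2):
    """Linearly map channel SNR (dB) to denoising intensity.

    Low SNR gives s_max, high SNR gives s_min; inputs outside the
    assumed SNR range are clamped to its endpoints.
    """
    snr = min(max(snr_db, snr_min_db), snr_max_db)
    frac = (snr_max_db - snr) / (snr_max_db - snr_min_db)
    return s_min + frac * (s_max - s_min)


def denoising_steps(strength, total_steps=50):
    """Number of denoising steps actually run for a given intensity (assumed rule)."""
    return max(1, round(strength * total_steps))
```

Under these assumptions, a 0 dB channel uses an intensity of 0.2 (10 of 50 steps), while a 50 dB channel uses 0.04 (2 of 50 steps), so a higher SNR corresponds to fewer denoising steps.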
The noise addition network processes the image semantic-space tensor $z$ by progressively introducing Gaussian noise, so that the distribution of the noisy data gradually converges to a Gaussian distribution [16]. Let the initial state of the diffusion process be denoted as $z_0$, which in this paper refers to the feature tensor obtained by superimposing quantization noise $n_q$ and channel noise $n_c$ onto the semantic-space feature tensor output by the autoencoder, i.e., $z_0 = z + n_q + n_c$. Here, $z_0 \sim q(z_0)$, where $q(z_0)$ represents the distribution of the data before any diffusion noise is added, and the state transition from time $t-1$ to time $t$ is characterized by Equation (16).
$$q(z_t \mid z_{t-1}) = \mathcal{N}\big(z_t;\ \sqrt{1 - \beta_t} \cdot z_{t-1},\ \beta_t \cdot I\big) \tag{16}$$
where $t \in \{0, 1, \ldots, T\}$, $\mathcal{N}$ denotes a Gaussian distribution, $\beta_t$ is a noise scaling factor associated with time instant $t$, and $I$ is an identity matrix of the same dimension as the initial state $z_0$. Given the input $z_0$, the joint distribution of $z_1, z_2, \ldots, z_T$ can be expressed as Equation (17).
$$q(z_1, z_2, \ldots, z_T \mid z_0) = \prod_{t=1}^{T} q(z_t \mid z_{t-1}) \tag{17}$$
According to the properties of Markov processes, the state at time $t$ given the input $z_0$ can be expressed in closed form as Equation (18).
$$q(z_t \mid z_0) = \mathcal{N}\big(z_t;\ \sqrt{\bar{\alpha}_t} \cdot z_0,\ (1 - \bar{\alpha}_t) \cdot I\big) \tag{18}$$
where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Based on Equation (16), the relationship between $z_t$ and $z_{t-1}$ is shown in Equation (19).
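$$z_t = \sqrt{\alpha_t} \cdot z_{t-1} + \sqrt{1 - \alpha_t} \cdot \epsilon_{t-1} \tag{19}$$
where $\epsilon_{t-1} \sim \mathcal{N}(0, I)$. The relationship between $z_t$ and $z_0$ can be obtained by recursion, as shown in Equation (20).
$$\begin{aligned} z_t &= \sqrt{\alpha_t} \cdot z_{t-1} + \sqrt{1 - \alpha_t} \cdot \epsilon_{t-1} \\ &= \sqrt{\alpha_t \alpha_{t-1}} \cdot z_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \cdot \bar{\epsilon}_{t-2} \\ &= \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}} \cdot z_{t-3} + \sqrt{1 - \alpha_t \alpha_{t-1} \alpha_{t-2}} \cdot \bar{\epsilon}_{t-3} \\ &= \cdots = \sqrt{\bar{\alpha}_t} \cdot z_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon \end{aligned} \tag{20}$$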
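where $\epsilon \sim \mathcal{N}(0, I)$, and $\bar{\epsilon}_{t-2}$ is the standard Gaussian noise obtained by merging two independent Gaussian noise terms. According to the properties of Gaussian noise, the sum of two independent Gaussian variables with distributions $\mathcal{N}(0, \sigma_1^2 \cdot I)$ and $\mathcal{N}(0, \sigma_2^2 \cdot I)$ follows $\mathcal{N}(0, (\sigma_1^2 + \sigma_2^2) \cdot I)$. The first merging step in Equation (20) is therefore derived as in Equation (21).
$$\begin{aligned} z_t &= \sqrt{\alpha_t} \cdot z_{t-1} + \sqrt{1 - \alpha_t} \cdot \epsilon_{t-1} \\ &= \sqrt{\alpha_t} \cdot \big( \sqrt{\alpha_{t-1}} \cdot z_{t-2} + \sqrt{1 - \alpha_{t-1}} \cdot \epsilon_{t-2} \big) + \sqrt{1 - \alpha_t} \cdot \epsilon_{t-1} \\ &= \sqrt{\alpha_t \alpha_{t-1}} \cdot z_{t-2} + \sqrt{\alpha_t (1 - \alpha_{t-1})} \cdot \epsilon_{t-2} + \sqrt{1 - \alpha_t} \cdot \epsilon_{t-1} \\ &= \sqrt{\alpha_t \alpha_{t-1}} \cdot z_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \cdot \bar{\epsilon}_{t-2} \end{aligned} \tag{21}$$
The standard deviation of the merged noise term is given by Equation (22).
$$\sqrt{\alpha_t (1 - \alpha_{t-1}) + (1 - \alpha_t)} = \sqrt{1 - \alpha_t \alpha_{t-1}} \tag{22}$$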
In the noise addition network, since the noise added at each step is independent Gaussian noise, the noisy state $z_T$ at time $T$ can be derived directly from the input $z_0$. As $T \to \infty$, $\bar{\alpha}_T \to 0$ and the distribution of $z_T$ approaches a standard Gaussian distribution, which can be defined as Equation (23).
$$q(z_T) := \int q(z_T \mid z_0)\, q(z_0)\, dz_0 \approx \mathcal{N}(z_T;\ 0,\ I) \tag{23}$$
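The closed form of the forward process can be checked numerically. The sketch below is illustrative only: the linear schedule with $\beta$ between $10^{-4}$ and $0.02$ over 1000 steps is an assumed, commonly used setting rather than the schedule of the pre-trained model, and the function names are hypothetical. It verifies that accumulating the noise variance step by step via Equation (19) gives $1 - \bar{\alpha}_t$, and draws $z_t$ from $z_0$ in one shot as in Equation (20).

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule beta_1, ..., beta_T."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s <= t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def stepwise_noise_variance(betas):
    """Noise variance after applying Equation (19) repeatedly:
    v_t = alpha_t * v_{t-1} + (1 - alpha_t), starting from v_0 = 0."""
    v, out = 0.0, []
    for b in betas:
        v = (1.0 - b) * v + b
        out.append(v)
    return out


def q_sample(z0, alpha_bar_t, rng):
    """Draw z_t ~ q(z_t | z_0) directly from the closed form in Equation (20)."""
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * v + s * rng.gauss(0.0, 1.0) for v in z0]
```

With this assumed schedule, $\bar{\alpha}_T$ is close to zero at $T = 1000$, so $z_T$ is almost pure Gaussian noise, consistent with Equation (23).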
The denoising network estimates the noise distribution by learning from the existing states, then obtains the state at the previous time instant, and gradually constructs real data from the Gaussian distribution. Based on the forward diffusion results, the distribution of the noisy state $z_T$ at time $T$ can be considered to satisfy $p(z_T) \approx \mathcal{N}(z_T;\ 0,\ I)$, and the joint distribution $p_\theta(z_0, z_1, \ldots, z_T)$ is also a Markov chain, which is defined as Equation (24).
$$p_\theta(z_0, z_1, \ldots, z_T) := p(z_T) \prod_{t=1}^{T} p_\theta(z_{t-1} \mid z_t) \tag{24}$$
The state $z_{t-1}$ at time $t-1$ can be obtained from the state $z_t$ at time step $t$, and its conditional distribution is expressed as Equation (25).
$$p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\big(z_{t-1};\ \mu_\theta(z_t, t),\ \Sigma_\theta(z_t, t)\big) \tag{25}$$
Here, $\mu_\theta(z_t, t)$ and $\Sigma_\theta(z_t, t)$ denote the mean and covariance obtained by the noise estimation network at time $t$, respectively, with $\theta$ being the parameters of the noise estimation network. Given the input $z_0$, the true conditional distribution of the previous state $z_{t-1}$ at time $t-1$ given the state $z_t$ at time $t$ is expressed as Equation (26).
$$q(z_{t-1} \mid z_t, z_0) = \mathcal{N}\big(z_{t-1};\ \tilde{\mu}_t(z_t, z_0),\ \tilde{\beta}_t \cdot I\big) \tag{26}$$
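where the parameters of the noise posterior distribution, $\tilde{\mu}_t$ and $\tilde{\beta}_t$, are given by Equation (27).
$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \cdot \epsilon_t \right), \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t \tag{27}$$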
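Here, $\Sigma_\theta(z_t, t) = \sigma_t^2 \cdot I$ with $\sigma_t^2 = \tilde{\beta}_t$, so the mean of the predicted posterior conditional distribution is as shown in Equation (28).
$$\mu_\theta(z_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \cdot \epsilon_\theta(z_t, t) \right) \tag{28}$$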
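From Equation (20), the state $z_t$ at time $t$ satisfies $z_t = \sqrt{\bar{\alpha}_t} \cdot z_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon_t$ with $\epsilon_t \sim \mathcal{N}(0, I)$. Therefore, the optimization objective of the denoising network is to make the estimated noise close to the real noise, as shown in Equation (29).
$$L_{LDM} = \mathbb{E}_{z_0,\, t,\, \epsilon_t \sim \mathcal{N}(0, I)} \left[ \left\| \epsilon_t - \epsilon_\theta\big( \sqrt{\bar{\alpha}_t} \cdot z_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon_t,\ t \big) \right\|_2^2 \right] \tag{29}$$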
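The state $z_{t-1}$ at time $t-1$ can be expressed as Equation (30).
$$z_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{z_t - \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon_\theta(z_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1}} \cdot \epsilon_\theta(z_t, t) \tag{30}$$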
Starting from $z_T \sim \mathcal{N}(0, I)$, the real data distribution can be gradually recovered through reverse sampling, based on the noise estimated by the noise estimation network at different time instants, as per Equation (30). In the image restoration task of this paper, a conditional diffusion model is employed to generate the expected restored image. Specifically, the semantic-space feature tensor containing quantization errors, channel errors, and noise is used as the initial input, and the text description of the image is introduced as a condition into the noise estimation network to estimate the conditional noise distribution. The conditional diffusion model used in this paper shares an identical forward diffusion process with the classical diffusion model. The only difference is that the image text description is introduced as a prompt during the reverse sampling process [17]. The text description $m$ is processed by an encoder $\tau_\varpi$ to obtain the corresponding conditional embedding tensor $\tau_\varpi(m)$, which is fused with the input semantic-space feature tensor $z_t$ via a cross-attention mechanism to guide image restoration, as shown in Equations (31) and (32).
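$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d}} \right) \cdot V \tag{31}$$
$$Q = W_Q^{(i)} \cdot \varphi_i(z_t), \qquad K = W_K^{(i)} \cdot \tau_\varpi(m), \qquad V = W_V^{(i)} \cdot \tau_\varpi(m) \tag{32}$$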
Here, $\varphi_i(z_t)$ denotes the intermediate layer representation of the denoising network. The objective function under this control condition can then be expressed as Equation (33).
$$L_{LDM} = \mathbb{E}_{z_0,\, t,\, m,\, \epsilon_t \sim \mathcal{N}(0, I)} \left[ \left\| \epsilon_t - \epsilon_\theta\big( z_t,\ t,\ \tau_\varpi(m) \big) \right\|_2^2 \right] \tag{33}$$
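To make the reverse update of Equation (30) and the cross-attention of Equations (31) and (32) concrete, the following Python sketch implements both on plain lists. It is a simplified illustration, not the network used in this paper: the noise prediction `eps_pred` is passed in directly instead of coming from a trained denoiser, and the projection matrices $W_Q^{(i)}$, $W_K^{(i)}$, $W_V^{(i)}$ are assumed to have been applied already, so `queries` come from image features while `keys` and `values` come from text embeddings.

```python
import math


def reverse_step(z_t, eps_pred, abar_t, abar_prev):
    """Deterministic reverse update of Equation (30) for one time step."""
    out = []
    for z, e in zip(z_t, eps_pred):
        # Estimate of z_0 implied by the predicted noise.
        z0_hat = (z - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
        # Re-noise the estimate to the level of step t-1 along the same noise direction.
        out.append(math.sqrt(abar_prev) * z0_hat + math.sqrt(1.0 - abar_prev) * e)
    return out


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d)) V, with one output row per query (Equation (31))."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return out
```

If the predicted noise equals the noise that was actually added, one call to `reverse_step` moves $z_t$ exactly onto $\sqrt{\bar{\alpha}_{t-1}} \cdot z_0 + \sqrt{1 - \bar{\alpha}_{t-1}} \cdot \epsilon$, which is why the accuracy of the noise estimate governs reconstruction quality.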

3. Simulation Experiment and Results Analysis

3.1. Simulation Parameter Settings

The simulation employs the Kodak benchmark dataset—a standard test set in image compression research comprising 24 RGB-format images with diverse visual styles and content. For communication channel modeling, 4-QAM modulation is implemented across Gaussian, Rayleigh, and Rician fading channels, with SNR varying from 0 to 50 dB. Detailed simulation parameters are listed in Table 1.
Furthermore, we adopt 8-bit uniform quantization combined with Huffman entropy coding. The denoising strength is controlled by the number of denoising steps, which is calculated as a linear function of the SNR: a higher SNR corresponds to fewer denoising steps. In the tests, we use the pre-trained weights of Stable Diffusion v1.4, and all network parameters are frozen during the experiments. For the compression ratio configuration, the compression ratios of the comparative methods are matched to that of the proposed model. This is realized by adjusting the quality factors of JPEG and WebP in the Python 3.8 code. For the Deep-JSCC scheme [2], the compression ratio is adjusted by configuring its bandwidth parameter.
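The combination of 8-bit uniform quantization and Huffman coding can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the latent values are assumed to lie in a known range `[lo, hi]`, the code table is assumed to be available at the receiver, and all function names are hypothetical.

```python
import heapq
from collections import Counter


def quantize(values, lo, hi, bits=8):
    """Uniformly quantize floats in [lo, hi] to integer levels 0 .. 2**bits - 1."""
    levels = 2 ** bits - 1
    scale = levels / (hi - lo)
    return [min(levels, max(0, round((v - lo) * scale))) for v in values]


def dequantize(indices, lo, hi, bits=8):
    """Map integer levels back to reconstruction points in [lo, hi]."""
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + q * step for q in indices]


def huffman_code(symbols):
    """Build a prefix code {symbol: bitstring} from empirical symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie-breaker, partial code table).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (n1 + n2, counter, merged))
        counter += 1
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the codewords of all symbols into one bitstring."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Invert encode() by greedy prefix matching."""
    inverse = {b: s for s, b in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out
```

Frequently occurring quantization levels receive shorter codewords, so the encoded bitstring is shorter than the fixed 8 bits per value whenever the level distribution is non-uniform.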

3.2. Experimental Procedure and Results Analysis

3.2.1. Comparison of Source Compression Capabilities

To compare the source compression capabilities of the SD_Semantic method with traditional coding methods, we set a high SNR of 100 dB to benchmark the compression ratio and transmission distortion between the proposed semantic compression method SD_Semantic and traditional source-channel coding schemes. Pixel-level distortion was quantified using conventional metrics: PSNR and SSIM. For perceptual quality and semantic fidelity evaluation, the LPIPS metric was employed. LPIPS leverages pre-trained convolutional neural networks (e.g., AlexNet) to extract image features, aligning with human judgments of visual similarity [18]. Since networks like AlexNet are widely adopted in downstream vision tasks (e.g., object detection, image classification, and semantic segmentation), LPIPS effectively captures human-perceived visual quality and semantic distortion. Experimental results are summarized in Table 2. The baseline schemes for comparison in this paper include JPEG, WebP, and Deep-JSCC [2]. Specifically, WebP is a modern image format developed by Google. It is designed to deliver image quality comparable to that of JPEG and PNG while achieving a significant reduction in file size. WebP supports both lossy and lossless compression, and provides a set of advanced features such as transparency, animation, and metadata processing. It mainly adopts intra-frame prediction combined with discrete cosine transform and entropy coding to reduce the file size while preserving image quality as much as possible. Compared with JPEG, WebP introduces an intra-frame prediction algorithm, which reduces storage data volume by predicting the color values of each pixel block. Each pixel block of size 16 × 16 or 4 × 4 can be predicted using the surrounding pixel values. Only the residuals between the actual pixel values and the predicted values are stored after prediction, thereby effectively reducing the amount of data.
As indicated in Table 2, the proposed semantic coding method based on a pre-trained generative diffusion model achieves the smallest compression ratio, i.e., the strongest compression (0.004831, versus 0.005203 for JPEG and 0.005504 for WebP). The distortion metrics rank the methods differently. In terms of PSNR, WebP outperforms JPEG, and JPEG outperforms SD_Semantic. In terms of SSIM, WebP exceeds SD_Semantic, and SD_Semantic exceeds JPEG. In terms of LPIPS (lower is better), SD_Semantic (0.10) is markedly better than WebP (0.26), which in turn is better than JPEG (0.29). Compared with Deep-JSCC, the proposed approach performs better in compression ratio, SSIM, and LPIPS, but has a lower PSNR (23.59 dB versus 25.08 dB). Under high-SNR conditions, therefore, the advantages of the proposed algorithm are confined to the compression ratio and LPIPS perceptual quality.
To validate these effects visually, we randomly selected five images from the dataset for qualitative comparison, as shown in Figure 5. Compared with traditional JPEG and WebP, SD_Semantic better preserves image details, such as the doorknob in the first image, the trees and river in the second image, the window texture in the third image, the flower stamen in the fourth image, and the skin and sweater textures in the fifth image. In other words, at a smaller compression ratio, SD_Semantic preserves more image detail and is closer to the original in terms of human visual perception. Compared with Deep-JSCC, SD_Semantic also preserves more image detail and delivers clearer visual quality.
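For reference, the pixel-level PSNR metric used in this comparison can be sketched as follows. This is the conventional definition, assuming 8-bit pixels (peak value 255); the function name `psnr` and the sample values are illustrative only, and the authors' exact evaluation code is not given in the paper.

```python
import math


def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy 8-bit example (hypothetical values, not taken from the Kodak dataset).
ref = [52, 55, 61, 66, 70, 61, 64, 73]
dec = [54, 55, 60, 66, 69, 62, 64, 70]
print(f"PSNR = {psnr(ref, dec):.2f} dB")
```

Because PSNR depends only on per-pixel squared error, a reconstruction that shifts fine texture slightly can score poorly on PSNR while still looking faithful, which is consistent with SD_Semantic ranking last on PSNR but first on LPIPS in Table 2.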

3.2.2. Comparison of Algorithm Adaptability Under Low SNR Conditions

With the compression ratio fixed as in the above experimental setup, image transmission distortion increases as channel quality degrades. To quantify this relationship, we first simulated the bit error rate (BER) of 4-QAM modulation under three channel conditions (Gaussian, Rayleigh, and Rician) across a range of SNRs. The results are shown in Figure 6, where the Rician channel has a Rician factor of K = 5. In both Rayleigh and Rician fading channels, the BER becomes intolerable when the SNR drops below 20 dB. Such channel conditions are common in the low-SNR satellite communication scenarios with severe interference considered in this work. We then ran simulations to evaluate the performance of the various coding and decoding schemes under these channel conditions.
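A BER experiment of this kind can be sketched with a small Monte Carlo simulation. The sketch below is not the authors' simulator: it assumes Gray-mapped 4-QAM with unit symbol energy, SNR defined as Es/N0, independent flat fading per symbol, and perfect channel knowledge at the receiver. The function name `simulate_ber` and its parameters are hypothetical.

```python
import math
import random


def simulate_ber(snr_db, channel="awgn", k_factor=5.0, n_symbols=20000, seed=0):
    """Monte Carlo BER of Gray-mapped 4-QAM with perfect channel knowledge."""
    rng = random.Random(seed)
    snr = 10.0 ** (snr_db / 10.0)         # Es/N0 as a linear ratio (assumed definition)
    sigma = math.sqrt(1.0 / (2.0 * snr))  # noise std per real dimension for Es = 1
    scale = 1.0 / math.sqrt(2.0)          # unit-energy points (+/-1 +/-1j) / sqrt(2)
    errors = 0
    for _ in range(n_symbols):
        b0, b1 = rng.getrandbits(1), rng.getrandbits(1)
        s = complex((1 - 2 * b0) * scale, (1 - 2 * b1) * scale)
        if channel == "awgn":
            h = 1.0 + 0.0j
        else:
            # Zero-mean complex Gaussian scatter component with unit power.
            scatter = complex(rng.gauss(0, 1), rng.gauss(0, 1)) / math.sqrt(2.0)
            if channel == "rayleigh":
                h = scatter
            elif channel == "rician":
                los = math.sqrt(k_factor / (k_factor + 1.0))  # line-of-sight part
                h = los + scatter / math.sqrt(k_factor + 1.0)
            else:
                raise ValueError(f"unknown channel: {channel}")
        noise = complex(rng.gauss(0, sigma), rng.gauss(0, sigma))
        y = (h * s + noise) / h           # equalise with the known channel gain
        errors += (y.real < 0) != bool(b0)
        errors += (y.imag < 0) != bool(b1)
    return errors / (2 * n_symbols)


for name in ("awgn", "rician", "rayleigh"):
    row = [simulate_ber(snr_db, name, n_symbols=5000) for snr_db in (0, 10, 20)]
    print(name, ["%.4f" % ber for ber in row])
```

Under these assumptions the fading channels show a higher BER than the Gaussian channel at the same SNR, with the Rician channel (K = 5) lying between the Gaussian and Rayleigh cases.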
The curves in Figure 7 show image decoding distortion versus SNR under the Gaussian channel. As SNR decreases, distortion worsens on all three metrics: PSNR and SSIM fall, while LPIPS rises. The results demonstrate that the proposed semantic coding method based on the pre-trained Stable Diffusion model (SD_Semantic) outperforms traditional methods under low-SNR conditions.
As shown in Figure 8 and Figure 9, the advantages of SD_Semantic are more pronounced in Rayleigh and Rician fading channels, because the BER of these channels is higher than that of the Gaussian channel at the same SNR.
When SNR is extremely low, the high BER may lead to the complete failure of signal recovery. Taking the Rayleigh channel as an example, Figure 10 visualizes the image transmission results with increasing SNR. The results intuitively demonstrate that the proposed SD_Semantic exhibits superior adaptability to low-SNR environments. Specifically, SD_Semantic progressively reconstructs images at low SNR levels, while image distortion gradually diminishes as SNR increases.

3.2.3. The Impact of Different Prompts on the SD_Semantic Method

Furthermore, different prompts affect image reconstruction differently. Taking the image “kodim15” as an example, we tested the impact of different textual descriptions on restoration performance, as detailed in Table 3. The text description generated by DeepSeek is as follows: “A close-up portrait of an adorable young girl. Her face has playful face paint, with a bright sun motif circling one eye, vibrant colors like yellow, red, and blue. Colorful ribbons adorn her hair. She’s dressed in a cozy, multicolored knitted sweater with a retro-like pattern. The lighting is gentle and even, creating a warm and lively atmosphere, photorealistic quality, high resolution, capturing the innocence and fun of childhood”. As shown in Table 3, when the SNR is 10 dB, the distortion in the transmitted semantic features is minimal. Consequently, image restoration relies little on prompt guidance: some prompts marginally improve PSNR or SSIM, but the differences across all prompts are negligible (about 0.01 dB in PSNR and below 0.001 in SSIM and LPIPS).
However, under low-SNR conditions, the semantic-space feature tensor is severely corrupted and may even misguide the image generation process, so that the image reconstructed at the receiver differs significantly from the original. We therefore attempted to reconstruct the image using only the textual description. For the “kodim15” image under Gaussian noise (SNR < 5 dB), the image generated solely from the DeepSeek prompt quoted above is shown in Figure 11b, with the original image in Figure 11a. The quantitative metrics between them are PSNR = 10.7824 dB, SSIM = 0.2080, and LPIPS = 0.7060. Although the reconstructed image does not reproduce the original visual content accurately, it remains semantically and perceptually similar to the source image. The description text can be adjusted adaptively according to channel quality, ranging from several bits to hundreds of bits. A larger number of text bits allows a more detailed semantic description and yields higher similarity between the generated image and the original.

3.2.4. The Compatibility of the SD_Semantic Coding Method

The SD_Semantic coding method is also compatible with conventional channel coding: it can be combined with LDPC channel coding to achieve stronger anti-interference capability. Its source compression capability conserves bandwidth while remaining compatible with LDPC codes that offer stronger error correction. Figure 12 and Figure 13 compare the combined semantic coding and LDPC(1/2) scheme against the traditional JPEG+LDPC(1/2) scheme. As shown in Figure 13, even at a higher degree of compression, integrating semantic coding allows image transmission to operate in lower SNR regions. Moreover, the combined scheme shows a significant LPIPS advantage at very high degrees of compression.
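The bandwidth effect of pairing a source coder with a rate-1/2 LDPC code and 4-QAM (Table 1) can be estimated with simple arithmetic. The sketch below assumes that the compression ratio in Table 2 is the ratio of compressed size to raw size, that a Kodak image is 768 × 512 pixels with 24 bits per pixel, and that LDPC block padding and protocol overhead are ignored. The helper name `channel_symbols` is hypothetical.

```python
import math


def channel_symbols(image_bits, compression_ratio, code_rate=0.5, bits_per_symbol=2):
    """Return (source_bits, coded_bits, symbols) for one image under the stated assumptions."""
    source_bits = math.ceil(image_bits * compression_ratio)  # after source coding
    coded_bits = math.ceil(source_bits / code_rate)          # after channel coding
    symbols = math.ceil(coded_bits / bits_per_symbol)        # after modulation
    return source_bits, coded_bits, symbols


kodak_bits = 768 * 512 * 3 * 8  # raw size of one 24-bit RGB Kodak image (assumed)
for name, ratio in (("SD_Semantic", 0.004831), ("JPEG", 0.005203)):
    src, coded, sym = channel_symbols(kodak_bits, ratio)
    print(f"{name}: {src} source bits, {coded} coded bits, {sym} 4-QAM symbols")
```

With a rate-1/2 code and 2 bits per symbol, the number of channel symbols equals the number of source bits, so any saving in source bits translates directly into fewer transmitted symbols.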

4. Conclusions

To address the challenge of image transmission under low-SNR and bandwidth-constrained conditions, we proposed an image semantic coding method, SD_Semantic, which is based on a pre-trained generative diffusion model and realizes a zero-shot universal image compressor. The method integrates SNR feedback into the denoising network of the diffusion model, enhancing its adaptability to channel interference. Meanwhile, by performing quantization and entropy coding on the feature tensors in the semantic space, the compression ratio is dynamically optimized. Simulation results demonstrate that, while maintaining superior compression efficiency, the method significantly improves human perceptual quality compared with traditional coding and decoding methods. Furthermore, it adapts robustly to lower-SNR scenarios in Gaussian noise channels, Rician fading channels, and Rayleigh fading channels, and therefore has broad application prospects in wireless image transmission under increasingly complex electromagnetic environments. The proposed algorithm is built on a diffusion model and entails relatively high computational complexity. Given the limited computing resources of mobile communication devices, making the model lightweight will be a key concern of our subsequent research.

Author Contributions

Conceptualization, S.L. and R.L.; methodology, S.L.; validation, S.L., J.Q. and Z.Y.; data curation, J.Q. and Z.Y.; writing—original draft preparation, S.L.; writing—review and editing, Y.Z.; supervision, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The validation dataset employed in this work is the widely used public dataset Kodak, which can be downloaded via the following URL: https://www.kaggle.com/datasets/sherylmehta/kodak-dataset (accessed on 20 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Silva, H.T.P.D.; Silva, H.S.; Figueiredo, F.A.P.; Anjos, A.A.D.; Souza, R.A.A. A survey on noise-based communication. arXiv 2025, arXiv:2511.04011.
  2. Bourtsoulatze, E.; Kurka, D.B.; Gunduz, D. Deep joint source-channel coding for wireless image transmission. IEEE Trans. Cogn. Commun. Netw. 2019, 5, 567–579.
  3. Tan, C.W. Optimal Power Control in Rayleigh-Fading Heterogeneous Wireless Networks. IEEE/ACM Trans. Netw. 2016, 24, 940–953.
  4. Zhang, P.; Xu, W.; Gao, H.; Niu, K.; Xu, X.; Qin, X.; Yuan, C.; Qin, Z.; Zhao, H.; Wei, J.; et al. Toward wisdom-evolutionary and primitive-concise 6G: A new paradigm of semantic communication networks. Engineering 2022, 8, 60–73.
  5. Shi, G.; Xiao, Y.; Li, Y.; Xie, X. From semantic communication to semantic-aware networking: Model, architecture, and open problems. IEEE Commun. Mag. 2021, 59, 44–50.
  6. Kalfa, M.; Gok, M.; Atalik, A.; Tegin, B.; Duman, T.M.; Arikan, O. Towards goal-oriented semantic signal processing: Applications and future challenges. Digit. Signal Process. 2021, 119, 103134.
  7. Wang, Y.; Han, H.; Feng, Y.; Zheng, J.; Zhang, B. Semantic communication empowered 6G networks: Techniques, applications, and challenges. IEEE Access 2025, 13, 28293–28314.
  8. Niu, K.; Zhang, P. A mathematical theory of semantic communication. arXiv 2024, arXiv:2401.13387.
  9. Jankowski, M.; Gunduz, D.; Mikolajczyk, K. Wireless image retrieval at the edge. IEEE J. Sel. Areas Commun. 2021, 39, 89–100.
  10. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 10684–10695.
  11. Karn, A.; Kumar, S.; Kushwaha, S.K.; Katarya, R. Image synthesis using GANs and diffusion models. In Proceedings of the 2023 IEEE International Conference on Contemporary Computing and Communications (InC4), Bangalore, India, 21–22 April 2023; IEEE: New York, NY, USA, 2023; Volume 1, pp. 1–6.
  12. Kingma, D.P.; Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. arXiv 2018, arXiv:1807.03039.
  13. Li, C.H.Z.; Wang, X.; Hu, H.; Wyeth, C.; Bu, D.; Yu, Q.; Gao, W.; Liu, X.; Li, M. Lossless data compression by large models. Nat. Mach. Intell. 2025, 7, 794–799.
  14. Deletang, G.; Ruoss, A.; Duquenne, P.-A.; Catt, E.; Genewein, T.; Mattern, C.; Grau-Moya, J.; Wenliang, L.K.; Aitchison, M.; Orseau, L.; et al. Language modeling is compression. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 March 2024.
  15. Hore, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369.
  16. Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. Camb. Explor. Arts Sci. 2024, 2.
  17. Luo, C. Understanding diffusion models: A unified perspective. arXiv 2022, arXiv:2208.11970.
  18. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595.
Figure 1. Semantic encoding and decoding communication workflow.
Figure 2. Semantic coding workflow based on stable pre-trained diffusion models.
Figure 3. Schematic architecture of VAE.
Figure 4. Schematic architecture of the noise injection network and the denoising network.
Figure 5. Visualized results of image restoration at SNR = 100 dB.
Figure 6. BER simulation results under different SNR conditions.
Figure 7. Image decoding distortion in Gaussian fading channel.
Figure 8. Image decoding distortion versus SNR in Rayleigh channel.
Figure 9. Image decoding distortion versus SNR in Rician channel (K = 5).
Figure 10. Visualized results of image transmission under SNR from 30 to 50 dB.
Figure 11. Comparison between the image generated solely from a textual description and the original image.
Figure 12. Image decoding distortion versus SNR in Gaussian channel.
Figure 13. Image decoding distortion versus SNR in Rayleigh channel with LDPC coding.
Table 1. Simulation parameters.
| Configuration Item | Value/Specification |
| --- | --- |
| Dataset | Kodak |
| Modulation Scheme | 4-QAM |
| Channel Type | Gaussian, Rayleigh, Rician |
| SNR | 0–50 dB |
| LDPC code rate | 1/2 |
Table 2. Compression ratio and image distortion at SNR = 100 dB.
| Evaluation Metric | SD_Semantic | JPEG | WebP | Deep-JSCC |
| --- | --- | --- | --- | --- |
| Compression ratio | 0.004831 | 0.005203 | 0.005504 | 0.33 |
| PSNR (dB) | 23.588698 | 25.061536 | 27.516189 | 25.0750 |
| SSIM | 0.806716 | 0.739952 | 0.859174 | 0.5999 |
| LPIPS | 0.101648 | 0.291946 | 0.255175 | 0.4470 |
Table 3. Impact of different prompts on distortion in reconstructed images.
| Prompts | PSNR (dB) | SSIM | LPIPS |
| --- | --- | --- | --- |
| No prompt | 25.41834 | 0.86590 | 0.13966 |
| High quality | 25.42040 | 0.86594 | 0.13975 |
| Masterpiece | 25.42122 | 0.86593 | 0.13987 |
| Best quality | 25.41823 | 0.86591 | 0.13967 |
| Highly detailed | 25.41381 | 0.86586 | 0.13984 |
| HDR | 25.41112 | 0.86585 | 0.14000 |
| Text description | 25.41912 | 0.86597 | 0.13982 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, S.; Lv, R.; Yang, Z.; Qin, J.; Zhu, Y. Stable Diffusion-Driven Semantic Coding Method for Image Transmission Under Low SNR Conditions. Electronics 2026, 15, 2459. https://doi.org/10.3390/electronics15112459
