This thesis proposes an image semantic coding method, SD_Semantic, which is based on pre-trained generative diffusion models. The following sections will elaborate on the semantic communication model, image semantic coding workflow, and neural network architecture.
2.1. Semantic Communication Model
Source coding is a process of data compression aimed at eliminating redundancy from the source data as much as possible, whereas channel coding introduces redundant information to enhance transmission reliability. The design of both source and channel coding often requires joint consideration.
Figure 1 illustrates a semantic communication architecture based on pre-trained generative diffusion models for image coding. The input source is denoted by
. The semantic spatial feature tensor
z obtained by the semantic extraction network is derived through the semantic extraction function
, where
denotes the learnable parameters of the semantic extraction network. The tensor
z is mapped to the channel input symbol vector
y via quantization and the function
, where
represents the parameters for quantization and entropy coding. The symbol vector
y passes through the wireless channel and arrives at the receiver, with the wireless channel modeled as
, where
denotes the wireless channel parameters. In this paper, we mainly consider the impacts of channel noise types and signal-to-noise ratio (SNR) on the transmitted symbols. At the receiver, an estimate
of the semantic tensor
z is obtained through decoding. The estimate
of the original source is recovered via semantic reconstruction, where
denotes the parameters for entropy decoding, and
represents the parameters of the semantic decoding and reconstruction network.
After designing the semantic extraction network architecture, quantization, and entropy coding methods, this study optimizes the parameters
and
to minimize the distortion between the source image and the reconstructed image at the receiver, as formalized in Equation (
1).
Here,
and
denote the optimal trainable parameters of the encoding and decoding neural networks, while
represents the distortion between the source data and the decoded reconstruction after transmission. In the field of image processing, the similarity between two images is typically measured by the minimum mean squared error (MSE), the peak signal-to-noise ratio (PSNR), and the structural similarity index (SSIM). This paper adopts PSNR, SSIM, and the Learned Perceptual Image Patch Similarity (LPIPS) as distortion metrics, where LPIPS is a perceptual similarity metric based on deep feature representations. PSNR is one of the metrics for measuring image distortion [
15]. Its calculation is based on the definition of MSE, as formalized in Equations (
2) and (
3), given a clean image
I of size
and its noisy version
K.
Here,
denotes the maximum possible pixel value of the image. If each pixel is quantized with 8 bits, then
is 255. This paper denotes the bit depth of pixel quantization as
B, then
. Typically, for uint8 data, the maximum pixel value is 255, whereas for floating-point data, it is 1. The SSIM serves as a metric for quantifying structural similarity between two images [
15]. It computes the mean luminance values of both images and utilizes these means as comparative parameters for luminance alignment, with the calculation method formalized in Equation (
4).
Here, and denote the mean values of the luminance components for image x and image y, respectively, and represent the variances of image x and image y, while and are constants introduced for numerical stability.
2.3. Semantic Coding Neural Network Model
The image semantic extraction and reconstruction modules employ the Variational Autoencoder (VAE) from the stable diffusion model. VAE utilizes pre-trained models to transform an input image into a semantic latent space tensor via a deep neural network. As illustrated in
Figure 3, the VAE encoder decomposes the original image into
independent Gaussian distributions, sampling from which yields an
feature tensor encapsulating the semantic characteristics of the source image [
10]. The VAE encoder and decoder share a symmetric network architecture, trained jointly during optimization.
The following introduces the principles of the VAE network [
16]. First, we assume there is a dataset
, where samples are independent and identically distributed (i.i.d). We assume that each sample in this dataset is generated from a stochastic process, which is described as follows: First, a semantic space tensor
is sampled from the semantic space variable distribution
. We assume that the semantic space distribution
here is a continuous distribution. Then, based on the latent space tensor
, a data sample
is generated, which follows the conditional distribution
.
According to the above assumptions, the probability of each sample in the dataset can be calculated as
. If each term in the formula has an analytical expression, we can solve the model parameters
through maximum likelihood estimation, and the objective function of maximum likelihood is shown in Equation (
5).
However, since
is non-computable, the maximum likelihood function cannot be used to solve for
. VAE introduces a distribution
, which serves as an approximation of the true distribution
. This
z represents the latent space distribution given the sample
x, and we employ machine learning via the VAE network to solve for the parameters
. In VAE,
is the encoder network that maps samples to latent space tensors, while
is the decoder network that maps latent space tensors back to sample data. Since the distribution
is an approximation of the true distribution
, the KL divergence of this distribution with respect to the true distribution on the sample
is as expressed in Equation (
6).
After transposing terms, we can obtain Equation (
7).
The second term is denoted as
, as shown in Equation (
8).
Equation (
7) can be rewritten as Equation (
9).
Since the KL divergence is non-negative,
serves as a lower bound for
. Therefore, the maximum likelihood objective
is transformed into maximizing
, which is the ELBO (Evidence Lower Bound). This can be further decomposed as shown in Equation (
10).
To solve the expectation
containing parameters, it is necessary to compute the gradient with respect to the parameters and update them. For convenience, we denote
. Then, we compute the gradient of
with respect to the parameter
, as shown in Equation (
11).
In the VAE network, a reparameterization trick is proposed. A distribution
and a function
dependent on parameters
are constructed, satisfying Equation (
12).
The above expectation is calculated using the Monte Carlo method to obtain Equation (
13).
where
, substituting Equation (
13) into Equation (
8) gives Equation (
14).
where
, substituting Equation (
13) into Equation (
10) gives Equation (
15).
where
, both Equations (
14) and (
15) can be used to estimate
and the gradients of parameters
and
. The difference between the two formulas is that Equation (
15) requires calculating the KL divergence. When we assume that both
and
are Gaussian distributions, the KL divergence can be directly computed instead of estimated, so the gradient variance calculated by the latter is smaller. The VAE network adopts the approach of calculating the KL divergence.
The noise injection network progressively adds Gaussian noise to the latent space vectors output by the VAE encoder. This process drives the semantic feature tensor toward a stochastic noise distribution. Subsequently, the denoising network predicts and removes the injected noise to recover the original semantic feature tensor. The denoising intensity parameter (denoted as
) controls the noise magnitude: when
, no noise is added; when
, maximum noise is applied. In this study, the initial denoising intensity is set to
. Both the noise injection and denoising networks utilize pre-trained stable diffusion models, with their operational workflows illustrated in
Figure 4 [
10].
In this study, the prompt not only serves as a generic input for the Stable Diffusion model but also includes textual descriptions of the source image generated by the DeepSeek large model. Both the textual descriptions and the semantic feature vectors of the image are jointly fed as inputs for decoding and reconstruction. As channel noise increases and the SNR progressively decreases, this study linearly increases the denoising intensity from 0.04 to 0.2. Guided by the prompt, the noise injection and denoising networks adaptively perform their operations. Under extremely low SNR conditions where the semantic feature tensor is severely corrupted by noise, this thesis assumes the channel can only transmit textual information. In such cases, the text-to-image capability of the stable diffusion model is activated, directly synthesizing images solely from the prompt.
The noise addition network processes the image semantic space tensor
z by progressively introducing Gaussian noise, ensuring that the distribution of the noisy data gradually converges to a Gaussian distribution associated with the input data [
16]. Let the noise-free data be denoted as
, which in this paper refers to the feature tensor obtained by superimposing quantization noise
and channel noise
onto the semantic space feature tensor output by the autoencoder, i.e.,
. Here,
, and
represents the original noise-free data distribution, and the state transition from time
instant to
t is characterized by Equation (
16).
where
,
denotes a Gaussian distribution,
is a noise scaling factor associated with time instant
t, and
is an identity matrix of the same dimension as the initial state
. Given the input
, the joint distribution of
can be expressed as Equation (
17).
According to the properties of Markov processes, the state at time
t given the input
can be expressed as Equation (
18).
where
,
. Based on Equation (
16), the relationship between
and
is shown in Equation (
19).
where
, the relationship between
and
can be obtained by recursion as shown in Equation (
20).
where
, and
is the distribution obtained by summing two Gaussian distributions. According to the properties of Gaussian noise, for two Gaussian distributions with different variances
and
, their summed Gaussian distribution is
. Therefore, Equation (
20) can be rewritten as Equation (
21).
The standard deviation of the sum of two Gaussian distributions is given by Equation (
22).
In the noise addition network, since the noise added at each step is identically distributed Gaussian noise, the noisy state
at time
T can be directly derived from the input
When
, the distribution of
at time
T is nearly a Gaussian distribution, which can be defined as Equation (
23).
The denoising network estimates the noise distribution by learning from the existing states, further obtains the state at the previous time instant, and gradually constructs real data from the Gaussian distribution. Based on the forward diffusion results, it can be considered that the posterior distribution of the noisy state
at time
T satisfies
, and the joint distribution
is also a Markov chain, which is defined as Equation (
24).
The state
at time
can be obtained from the state
at the previous time step
t, and its conditional distribution is expressed as Equation (
25).
Here
and
denote the noise mean and variance obtained by the noise estimation network at time
t, respectively, with
being the parameters of the noise estimation network. In this case, given the input
, the true conditional distribution between the state
at time
t and the previous state
at time
is expressed as Equation (
26).
where the parameters of the noise posterior distribution,
and
, are given by Equation (
27).
Here
, that is,
, so the predicted posterior conditional distribution is shown in Equation (
28).
Based on the known formula, the state
at time
t satisfies
. Therefore, the optimization objective of the denoising network is to make the estimated noise distribution close to the real noise distribution, as shown in Equation (
29).
The state
at time
can be expressed as Equation (
30).
where
, the real data distribution can be gradually obtained through reverse sampling based on the noise distribution estimated by the noise estimation network at different time instants, as per Equation (
30). In the image restoration task of this paper, a conditional diffusion model must be employed to generate the expected restored image. Specifically, the semantic space feature tensor with quantization errors, channel errors, and noise is used as the initial input image, and the text description of the image is introduced as a condition into the noise estimation network to estimate the conditional noise distribution. The conditional diffusion model used in this paper shares an identical forward diffusion process with the classical diffusion model. The only difference lies in whether the image text description is introduced as a prompt during the reverse sampling process [
17]. The text description
m is processed by an encoder
to obtain the corresponding conditional embedding tensor
, which is fused with the input semantic space feature tensor
via cross-attention mechanism to guide image restoration, as shown in Equations (
31) and (
32).
Here,
denotes the intermediate layer representation of the denoising network. Then the objective function under this control condition can be expressed as Equation (
33).