Article

Ultra-Low Bitrate Predictive Portrait Video Compression with Diffusion Models

by Xinyi Chen, Weimin Lei *, Wei Zhang, Yanwen Wang and Mingxin Liu
The School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(6), 913; https://doi.org/10.3390/sym17060913
Submission received: 8 April 2025 / Revised: 2 June 2025 / Accepted: 5 June 2025 / Published: 10 June 2025

Abstract

Deep neural video compression codecs have shown great promise in recent years. However, ultra-low bitrate video coding still poses considerable challenges. Inspired by recent attempts to apply diffusion models to image and video compression, we leverage diffusion models for ultra-low bitrate portrait video compression. In this paper, we propose a predictive portrait video compression method that exploits the temporal prediction capabilities of diffusion models. Specifically, we develop a temporal diffusion predictor based on a conditional latent diffusion model, with the predicted results serving as decoded frames. We symmetrically integrate a temporal diffusion predictor at the encoding side and the decoding side. When the perceptual quality of the predicted results at the encoding end falls below a predefined threshold, a new frame sequence is employed for prediction, while the predictor at the decoding end directly generates the predicted frames as reconstructions based on the transmitted evaluation results. This symmetry ensures that the prediction frames generated at the decoding end are consistent with those at the encoding end. We also design an adaptive coding strategy that incorporates frame quality assessment and adaptive keyframe control. To ensure consistent quality of subsequent predicted frames and achieve high perceptual reconstruction quality, this strategy dynamically evaluates the visual quality of the predicted results during encoding, retains the predicted frames that meet the quality threshold, and adaptively adjusts the length of the keyframe sequence based on motion complexity. The experimental results demonstrate that, compared with traditional video codecs and other popular methods, the proposed scheme provides superior compression performance at ultra-low bitrates while remaining competitive in visual quality, achieving more than 24% bitrate savings compared with VVC in terms of perceptual distortion.

1. Introduction

With the rapid development of digital media technology, videos have emerged as the primary medium for information exchange, knowledge dissemination, and entertainment consumption. Driven by the widespread adoption of the Internet and mobile devices, video data is growing exponentially. This trend is evident across industries such as entertainment, education, healthcare and business. Over the past few decades, traditional video coding standards have advanced significantly, from H.264/AVC [1] and H.265/HEVC [2] to H.266/VVC [3]. Each technological iteration has substantially improved video compression efficiency and visual fidelity. However, existing traditional video coding standards are largely constrained by their reliance on manually designed components, thus preventing end-to-end optimization. Furthermore, as temporal and spatial prediction algorithms grow more complex, traditional coding standards face challenges in improving performance. Especially for human-centric video content, manually designed features struggle to sufficiently capture complex non-rigid motions and structural characteristics.
In recent years, the rise of deep learning has presented new opportunities for advancements in video coding. Initially, deep learning was employed to replace specific modules within traditional coding frameworks. These modifications yielded incremental improvements in coding performance. However, these hybrid approaches remain constrained by limitations in global optimization, thereby hindering their ability to fully leverage the capabilities of deep neural networks [4]. To address this challenge, the end-to-end video coding framework DVC [5] pioneered a new approach by replacing all traditional components with neural networks. This approach enabled global optimization and achieved superior encoding performance compared to H.264/AVC under specific conditions. Following this breakthrough, a series of end-to-end video coding frameworks emerged. Departing from DVC’s residual coding approach, Li et al. [6] proposed an end-to-end video coding framework based on conditional coding, called DCVC. By incorporating context-based methods, DCVC overcame the limitations of residual coding. Since then, conditional coding-based end-to-end video coding schemes have made rapid progress, further expanding the potential of deep learning in video compression.
Diffusion models, a type of generative model, draw their foundational concept from the physical diffusion process [7]. In recent years, diffusion models have garnered significant attention in the field of visual generation. The core idea involves gradually adding noise to corrupt the structure of original data and subsequently learning to reverse this process to estimate the original data distribution, thereby enabling the generation of novel, realistic samples. Recently, video prediction and generation based on conditional diffusion models [8,9,10] have garnered growing attention, with latent diffusion models [11] commonly used to reduce inference costs. Meanwhile, due to their strong generative capabilities, diffusion model-based approaches to image and video compression [12,13,14,15,16] have also attracted widespread attention. These methods seek to achieve efficient encoding and decoding processes by learning underlying data distributions. Beyond enhancing compression efficiency and reconstruction quality, these approaches aim to tackle challenges inherent in traditional compression techniques, such as blocking artifacts and blurring. However, the complex structure and non-rigid motion inherent in portrait videos pose significant challenges for diffusion model-based compression methods. Accurately capturing human motion during compression while achieving ultra-low bitrates remains an urgent and unresolved issue. To address this challenge, we propose leveraging the powerful generative capabilities of diffusion models by designing a diffusion model-based prediction model that implicitly captures motion information through video prediction. The high-quality predicted results are then directly utilized as decoded frames, thereby reducing the reliance on keyframes and enabling ultra-low bitrate compression.
Leveraging the temporal prediction capabilities of diffusion models, we propose a predictive portrait video compression method. Specifically, we divide the video frames into keyframes and predicted frames, where the predicted frames are generated using keyframes. We introduce a video prediction model based on conditional latent diffusion, called the temporal diffusion prediction model, which is designed to generate high-quality predicted frames. To fully exploit the high temporal correlation in video frames, we combine 3D causal convolution [17] with the temporal attention mechanism [18], enabling effective utilization of temporal information in video data. We predict from the decoded keyframes sequentially and use each prediction result as a decoded frame, progressively reconstructing the complete video content. To ensure the perceptual quality of the prediction results meets practical requirements, we design an adaptive coding strategy at the encoder side. This strategy comprises two key components: frame quality assessment and adaptive keyframe control. The frame quality assessment module quantifies discrepancies between predicted and original frames, dynamically evaluates the visual fidelity of the prediction results, and selects predicted frames that meet the quality threshold. The adaptive keyframe control dynamically adjusts the length of the keyframe sequence based on video motion complexity, thereby minimizing keyframe transmission costs while maximizing the reconstruction quality of the entire video.
Overall, the contributions of this paper are summarized as follows:
  • We propose a predictive portrait video compression framework that encodes only keyframes and leverages the predicted results as reconstructed frames, thereby enabling high-quality compression at ultra-low bitrates.
  • We design a video frame prediction model based on conditional latent diffusion, termed the temporal diffusion prediction model. By integrating the 3D causal convolution with temporal attention mechanism, the model fully exploits the temporal correlations in video data, thereby enabling high-quality video prediction.
  • An adaptive coding strategy is developed. This strategy ensures predicted frames meet the current quality threshold for reconstruction while dynamically controlling the sequence length of keyframes, thereby effectively reducing transmission costs.
The remainder of this paper is organized as follows. Section 2 describes the related works. Section 3 introduces the implementation details of the predictive portrait video compression scheme based on the conditional latent diffusion model. The experimental results used to validate the effectiveness of the proposed method are shown in Section 4. Section 5 draws the conclusion.

2. Related Work

2.1. Traditional and Hybrid Video Compression

To address the growing demand for video transmission and storage, a series of advanced video coding standards have been developed, such as H.264/AVC [1], H.265/HEVC [2], VP9 [19] and AV1 [20]. Next-generation standards like H.266/VVC [3] and AVS3 [21] continue to be enhanced and refined. By adopting advanced algorithms and techniques, these standards significantly improve compression efficiency while maintaining video quality. This enables better video quality at the same bitrate or the same video quality at a lower bitrate, which is critical for reducing bandwidth consumption in applications such as video streaming, TV broadcasting, and video conferencing. Notably, Fraunhofer HHI has released an efficient open-source VVC codec called VVenC [22]. VVenC is designed to achieve compression performance comparable to the VVC standard while optimizing encoding complexity and significantly reducing computational resource requirements. This ensures that VVenC not only provides efficient compression, but is also easier to integrate into existing multimedia processing systems, making it a convenient component of other encoding solutions. In recent years, video coding tools have been significantly enhanced, leveraging the powerful representation capabilities of deep learning technologies. These tools include intra prediction [23,24,25,26], inter prediction [27,28], entropy coding [29] and loop filtering [30,31]. Akbulut et al. [23] introduced an adaptive subpartition mechanism instead of a fixed number of subpartitions to obtain a better coding gain compared to the reference software VTM11.0. In this paper, we aim to build a relatively basic and general compression framework to more clearly demonstrate the improvements of our method on video coding standards, thus laying the foundation for further in-depth exploration. In addition, since the initial goal of this research is to verify the effectiveness of our method, and VVC [3] has been widely recognized and adopted in the industry, using it as a general and widely accepted benchmark enables us to better evaluate the performance of the new method.

2.2. End-to-End Video Compression

Unlike aforementioned approaches that directly substitute video coding tools with neural network models, learning-based video compression approaches optimize the entire compression framework in an end-to-end manner. Lu et al. [5] pioneered the first end-to-end deep video compression framework, DVC, which integrates critical video coding components, including motion estimation, motion compensation, motion compression, residual compression, quantization and bitrate estimation. Hu et al. [32] introduced a feature-space video coding network, which addresses the limitations of traditional pixel-space operations, such as inaccurate motion estimation and poor compensation effects, by performing core encoding operations in the feature space. Subsequently, numerous end-to-end video compression schemes have rapidly emerged [33,34]. However, residual coding, which removes inter-frame redundancy through simple subtraction, may not be the most effective method for reducing temporal redundancy. To address this, Li et al. [6] proposed the first conditional coding-based end-to-end video coding scheme, DCVC, which extracts temporal context features from motion-compensated reference frames and automatically learns the optimal strategy for reducing temporal redundancy. Subsequently, numerous conditional coding-based end-to-end video compression schemes have rapidly been proposed [35,36,37,38]. Additionally, capitalizing on the powerful generative capabilities of generative adversarial networks (GANs) [39], generative video compression [40,41,42] has emerged as a promising new research direction. Specifically, Konuko et al. [40] constructed a fully hybrid compression framework for video conferencing. The keypoints are extracted at the encoder side, and motion modeling and frame reconstruction are performed at the decoder side. Based on [40], Konuko et al. [41] further proposed the hybrid deep animation codec, a hierarchical hybrid coding scheme that introduces HEVC bitstreams as auxiliary streams, to address the issue of rapidly saturated performance as bandwidth increases. Konuko et al. [42] also proposed to generate animation residuals of predicted and current frames and achieved predictive coding by modeling the temporal correlation of these residuals, which can significantly improve coding performance.

2.3. Diffusion Model Based Image and Video Compression

As a novel generative paradigm that has emerged in recent years, diffusion models provide new solutions for the field of image and video compression due to their outstanding generative capabilities and powerful ability to model data distributions. Some studies [12,13] have already employed diffusion models to achieve ultra-low bitrate image compression. Pan et al. [12] proposed a novel approach for ultra-low bitrate image compression, which transformed images into semantic text embeddings and reconstructed high-fidelity images using a pre-trained text-to-image diffusion model. Yang and Mandt [13] replaced the traditional VAE [43] decoder with a conditional diffusion model, which divides the image information into two latent variables, content and texture and reconstructs a high-quality image with a small number of decoding steps through parameterized training. In the field of video compression, some works [14,15] have effectively overcome the limitations of diffusion models in temporal modeling by introducing strategies such as spatio-temporal bidirectional compression and incorporating temporal context. Specifically, to tackle the limitation of existing latent diffusion models that rely on 2D image VAEs for spatial compression, Chen et al. [14] proposed an omni-dimension compression Variational Autoencoder (OD-VAE), which can compress videos both temporally and spatially, addressing the gap in temporal compression for concise latent representations. Ma and Chen [15] proposed fusing the reconstructed latent representation of the current frame with the temporal context of previously decoded frames via a conditional coding paradigm to achieve high-fidelity reconstruction. For video encoding that also leverages the predictive capabilities of diffusion models, Li et al. [16] use a subset of video frames as keyframes for compression and then predict the remaining frames at the decoder with a conditional diffusion model [8]. This approach achieves better visual quality than traditional methods at very low bitrates, but its generation efficiency still needs improvement. Unlike the keyframe selection in [16], we dynamically adjust the length of keyframe sequences based on motion complexity, so that more important information is kept in dynamic scenes, further improving visual quality.

3. Methodology

3.1. Overview

In this work, we propose a predictive portrait video compression framework that leverages the temporal prediction ability of diffusion models to enhance compression performance. The framework consists of two main components: keyframe compression and temporal diffusion-based prediction for subsequent frames. The general architecture of the proposed scheme is shown in Figure 1. During the encoding stage, $k$ frames are sequentially selected from the image sequence $V_{1:N}$, and high-quality encoding and transmission are performed using VVenC [22] in intra-frame mode. Then, the temporal diffusion prediction model iteratively predicts from the $k$ keyframes, one frame at a time, finally yielding $j$ initial predicted frames $P_{t+k+1:t+k+j}$. To ensure reconstruction quality, the frame quality assessment module uses Learned Perceptual Image Patch Similarity (LPIPS) [44] to assess the perceptual quality difference between each initial predicted frame and the original frame. If the quality of the $(i+1)$-th frame falls below the threshold $\rho$, this frame and all subsequent predicted frames are considered to fail the requirements, and only the first $i$ predicted frames $P_{t+k+1:t+k+i}$ are retained. In addition, the frame quality assessment module sends the current predicted frame end position $(t+k+i)$ and the keyframe sequence length $k$ (parameter $\delta$ in Figure 1) to the arithmetic encoder [45]. Then, $\delta$ is losslessly compressed and sent to the decoding end. At the same time, parameter $\delta$ is transmitted to the adaptive keyframe control module to determine the starting position $(t+k+i+1)$ of the next group of keyframes. The sequence length $k$ of the next group of keyframes is dynamically updated by the adaptive keyframe control module based on the average motion complexity [46] of the previous group of keyframes. During the decoding stage, according to the received keyframe sequence $\hat{V}_{t+1:t+k}$ and parameter $\delta$, the decoder uses the temporal diffusion prediction model to generate the final predicted frames $\hat{P}_{t+k+1:t+k+i}$ as the subsequent decoded frames. The entire encoding and decoding process proceeds in sequence until all $N$ video frames have been processed. The framework is symmetric at both ends of the bitstream: at the encoder, the temporal diffusion predictor decides how many frames to predict; at the decoder, it generates the same predicted frames.

3.2. Temporal Diffusion Prediction Model

In the proposed predictive portrait video coding framework, the reconstructed frames at the decoding end depend entirely on the frame prediction ability of the pre-trained temporal diffusion prediction model. Most widely used diffusion models are derived from DDPMs (Denoising Diffusion Probabilistic Models) [7], which comprise two processes: the forward process and the reverse diffusion process. The forward process gradually adds Gaussian noise to the data until it is completely transformed into random noise. It can be represented as

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

where normally distributed noise is slowly added to each sample $x_{t-1}$ to obtain $x_t$, $\beta_t$ represents the variance parameter over time, $t \in \{1, \dots, T\}$, and $T$ is the total number of steps in the diffusion chain. The reverse diffusion process of the diffusion model, $p(x_{t-1} \mid x_t)$, is a denoising process that aims to recover the original data from the noise. Since both the noising and denoising processes in diffusion models are performed on images, the loss function for pixel-domain diffusion models is as follows:

$$L_{DM} = \mathbb{E}_{t, x_0, \epsilon}\left[\,\big\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right) \big\|^2\,\right]$$

where $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$, $\epsilon \sim \mathcal{N}(0, I)$ is the added Gaussian noise, and $\epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, t)$ denotes a neural network parameterized by $\theta$ that predicts the noise added in going from $x_0$ to $x_t$.
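For completeness, composing these Gaussian steps yields the standard closed-form marginal of the forward process,

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right), \qquad \text{i.e.,} \quad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ \ \epsilon \sim \mathcal{N}(0, I),$$

which is why the noisy input to the network in $L_{DM}$ takes the form $\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$.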
Inspired by [8], given $k$ keyframes as past frames for video prediction, the conditional diffusion model $p(P_{t+k+1} \mid V_{t+1:t+k})$ can be directly established. Therefore, based on $L_{DM}$, the loss function for the conditional diffusion model is constructed as follows:

$$L_{CDM} = \mathbb{E}_{t, [x_0, y], \epsilon}\left[\,\big\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t,\ y\right) \big\|^2\,\right]$$

where $x_0$ and $y$ represent the predicted and past frames, respectively. To reduce computational cost and memory requirements, our temporal diffusion prediction model adopts a latent diffusion model [11], which is trained and performs inference in a low-dimensional latent space through a pre-trained VAE rather than operating directly in pixel space. Specifically, high-dimensional image data $x$ is mapped to a low-dimensional latent representation $z = \mathcal{E}(x)$ through the encoder $\mathcal{E}$, while the decoder $\mathcal{D}$ performs the reverse operation. Therefore, based on the conditional diffusion model loss $L_{CDM}$, the objective function of the conditional latent diffusion model can be expressed as

$$L_\theta = \mathbb{E}_{t, [z_0, y], \epsilon}\left[\,\big\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t,\ y\right) \big\|^2\,\right]$$

where $z_0 = \mathcal{E}(x_0)$ is the latent representation of the real image sample $x_0$. In this paper, $L_\theta$ is used as the loss function.
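As an illustration, a minimal PyTorch sketch of one training step under this objective is given below. The names vae, denoiser, and past_frames are placeholders for the pre-trained VAE, the conditional noise-prediction network $\epsilon_\theta$, and the keyframe conditioning input; the actual architecture is described in the following paragraphs and is not reproduced here.

    import torch
    import torch.nn.functional as F

    def latent_diffusion_step(vae, denoiser, x0, past_frames, alphas_bar, optimizer):
        """One training step for the conditional latent diffusion objective L_theta.
        x0: target frame (B, 3, H, W); past_frames: keyframe conditioning tensor.
        alphas_bar: precomputed cumulative products of (1 - beta_s), shape (T,)."""
        with torch.no_grad():
            z0 = vae.encode(x0)                      # map the target frame to latent space
        B = z0.shape[0]
        t = torch.randint(0, alphas_bar.shape[0], (B,), device=z0.device)
        a_bar = alphas_bar[t].view(B, 1, 1, 1)
        eps = torch.randn_like(z0)                   # Gaussian noise epsilon ~ N(0, I)
        z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps   # closed-form noising of z0
        eps_pred = denoiser(z_t, t, past_frames)     # predict noise, conditioned on keyframes
        loss = F.mse_loss(eps_pred, eps)             # || eps - eps_theta(...) ||^2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()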
To accurately model the distribution of video frames, our temporal diffusion prediction model needs to analyze spatial features and model dynamic evolution along the temporal dimension. Consequently, the network must incorporate explicit modeling of temporal correlation. Inspired by [47,48], we introduce temporal expansion into the text-to-image latent diffusion model [11]. Figure 2 shows the structure of our proposed temporal diffusion prediction model. We employ 3D causal convolution [17] in the shallow layers to extract short-term dynamics and utilize the temporal attention mechanism [18] in the deep layers to capture long-term dependencies.
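To make this temporal expansion concrete, the following PyTorch sketch shows the two building blocks in isolation: a 3D convolution made causal along the time axis and a self-attention layer applied over the time dimension. The module names, channel counts, and kernel sizes are illustrative assumptions rather than the paper's exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConv3d(nn.Module):
        """3D convolution that is causal along the time axis: the output at frame t
        only sees frames <= t. Kernel and channel sizes are illustrative."""
        def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
            super().__init__()
            kt, kh, kw = kernel
            self.pad_t = kt - 1                        # pad only on the past side in time
            self.conv = nn.Conv3d(in_ch, out_ch, kernel, padding=(0, kh // 2, kw // 2))

        def forward(self, x):                          # x: (B, C, T, H, W)
            x = F.pad(x, (0, 0, 0, 0, self.pad_t, 0))  # (W_l, W_r, H_l, H_r, T_past, T_future)
            return self.conv(x)

    class TemporalAttention(nn.Module):
        """Self-attention along the time dimension at each spatial location,
        used in deeper layers to capture long-term dependencies.
        channels must be divisible by heads."""
        def __init__(self, channels, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

        def forward(self, x):                          # x: (B, C, T, H, W)
            b, c, t, h, w = x.shape
            seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)  # one sequence per pixel
            out, _ = self.attn(seq, seq, seq)
            return out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2) + x  # residual connection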

3.3. Adaptive Coding Strategy

In the proposed predictive portrait video compression framework, the decision-making process at the encoding end constitutes the core mechanism of the entire framework. By leveraging the trained temporal diffusion prediction model, it is possible to strategically avoid encoding most of the video frames as keyframes, thereby significantly reducing encoding cost without affecting reconstruction quality. We design an adaptive encoding strategy wherein the keyframe sequence length is dynamically adjusted according to the motion complexity of the video [46]. The pseudocode of this strategy is shown in Algorithm 1. During the encoding stage, $k$ keyframes $K = V_{t+1:t+k}$ are sequentially selected. The initial length $k$ of the first keyframe group is 2. For subsequent keyframe sequences, the length is determined by comparing the motion complexity $m_q$ of the current keyframe group with the motion complexity $m_{q-1}$ of the previous group, with adjustments made in steps of $\Delta k = 1$ and the maximum length empirically capped at $max_k = 4$, where $q$ denotes the group number. The selected keyframes are then encoded using the video encoder $Enc(\cdot)$ and transmitted. Subsequently, the predicted frame sequence $P_{t+k+1:t+k+j}$ is generated by the video prediction model $G_\theta(\cdot)$, which takes the keyframes $K$ as input to predict the next $j$ frames. The prediction results are compared against the uncompressed frames under a distortion perception threshold $\rho > 0$. For each frame in the predicted sequence, if it meets the threshold, i.e., $D(V_{t+k+i+1}, P_{t+k+i+1}) < \rho$, the comparison proceeds to the next frame and the current position is recorded. Otherwise, the end position of the predicted frames is updated to $t + k + i$, and the updated parameter $\delta$, which contains the current keyframe length and the end position of the predicted frames, is sent to the decoding end. During the decoding stage, the decoder first decodes the keyframe sequence $\hat{K}$ and employs the same video prediction model $G_\theta(\cdot)$ for prediction. Based on the predicted frame end position in the decoded parameter $\hat{\delta}$, the first $i$ predicted frames are selected. Finally, according to $\hat{\delta}$, the decoded keyframe sequence is concatenated with the predicted frame sequence to yield the final reconstructed frames $\hat{V}$.
Algorithm 1 Adaptive coding strategy
Input: Original video frames $V_{1:N}$, quality threshold $\rho$, pre-trained temporal diffusion prediction model $G_\theta(\cdot)$, VVenC encoder $Enc(\cdot)$, VVenC decoder $Dec(\cdot)$, arithmetic encoder $AE(\cdot)$, arithmetic decoder $AD(\cdot)$, keyframe length adjustment step $\Delta k$, maximum keyframe length $max_k$
Output: Decoded keyframe set $\hat{K}$, predicted subsequent frame set $\hat{P}$, decoded frame set $\hat{V}$
 1: // Encoding stage
 2: $k \leftarrow 0$, $q \leftarrow 0$, $\delta \leftarrow 0$
 3: while $t \le N-1$ do
 4:     if $t = 0$ then
 5:         $k \leftarrow 2$    // Initial keyframe sequence length
 6:         $m_q = MOTION(V_{1:2})$    // Motion complexity
 7:     else
 8:         $m_q = MOTION(K)$
 9:         if $m_q > m_{q-1}$ then
10:             $k \leftarrow MIN(k + \Delta k,\ max_k)$
11:         else if $m_q < m_{q-1}$ then
12:             $k \leftarrow MAX(k - \Delta k,\ 2)$
13:         else
14:             $k \leftarrow k$
15:         end if
16:     end if
17:     $K \leftarrow V_{t+1:t+k}$    // Get k keyframes
18:     SEND $Enc(K)$    // Compress keyframes
19:     $P_{t+k+1:t+k+j} = G_\theta(K)$    // Get j predicted frames
20:     for $i = 1$ to $j-1$ do
21:         if $D(V_{t+k+i+1}, P_{t+k+i+1}) < \rho$ then
22:             $t \leftarrow t+k+i+1$
23:             CONTINUE
24:         else
25:             $t \leftarrow t+k+i$
26:             BREAK
27:         end if
28:     end for
29:     $\delta \leftarrow (k, t)$
30:     SEND $AE(\delta)$
31:     $q \leftarrow q + 1$
32: end while
33: // Decoding stage
34: $\hat{K} \leftarrow Dec(Enc(K))$, $\hat{\delta} \leftarrow AD(AE(\delta))$    // Decompress
35: $\hat{P} = G_\theta(\hat{K})|_{\hat{\delta}}$, $\hat{V} = \hat{K} \cup \hat{P}|_{\hat{\delta}}$
36: return $\hat{K}$, $\hat{P}$, $\hat{V}$
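For readers who prefer running code, the following Python sketch mirrors the encoding stage of Algorithm 1. The callables motion, predict, lpips_dist, and encode_keyframes are hypothetical stand-ins for the motion complexity measure [46], the temporal diffusion predictor $G_\theta(\cdot)$, the LPIPS distortion $D(\cdot,\cdot)$, and the VVenC intra coder; indexing is 0-based and error handling is omitted.

    def adaptive_encode(frames, motion, predict, lpips_dist, encode_keyframes,
                        rho=0.1, dk=1, max_k=4, j=8):
        """Encoder-side sketch of Algorithm 1 (encoding stage).
        Returns the keyframe bitstreams and the side information delta = (k, end)
        transmitted for each group."""
        t, k = 0, 2
        prev_m, prev_K = None, None
        bitstreams, deltas = [], []
        N = len(frames)
        while t <= N - 1:
            if prev_K is not None:
                m = motion(prev_K)                       # motion complexity of previous group
                if m > prev_m:
                    k = min(k + dk, max_k)               # faster motion: longer keyframe group
                elif m < prev_m:
                    k = max(k - dk, 2)                   # slower motion: shorter keyframe group
                prev_m = m
            else:
                prev_m = motion(frames[0:2])             # initial group of length 2
            K = frames[t:t + k]
            prev_K = K
            bitstreams.append(encode_keyframes(K))       # VVenC intra coding of keyframes
            preds = predict(K, j)                        # j frames predicted from keyframes
            end = t + k                                  # position after last accepted frame
            for i, p in enumerate(preds):
                idx = t + k + i
                if idx >= N or lpips_dist(frames[idx], p) >= rho:
                    break                                # quality check failed: stop accepting
                end = idx + 1
            deltas.append((k, end))                      # losslessly coded and sent as delta
            t = end
        return bitstreams, deltas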

4. Experimental Results

4.1. Datasets and Evaluation Metrics

4.1.1. Datasets

We employ three portrait video datasets for training and evaluation, described as follows:
  • Penn Action [49] is a human action dataset that contains 2326 RGB video sequences of 15 different actions. Following [50], we employ eight kinds of actions. We evenly divide the standard dataset into two subsets for training and evaluation and crop and resize them to a resolution of 128 × 128 based on the bounding boxes of the bodies.
  • TaiChiHD [51] is a dataset of TaiChi performance videos of 84 individuals. It contains 2884 training videos and 285 test videos, which we resize to a resolution of 128 × 128.
  • Fashion [52] is a dataset of a single model dressed in diverse textured clothing. It contains 500 training videos and 100 test videos, which we resize to a resolution of 128 × 128.

4.1.2. Evaluation Metrics

To measure the effectiveness of the proposed method, we adopt the Peak Signal-to-Noise Ratio (PSNR) as a pixel-level distortion metric. The PSNR is the ratio of the peak signal energy of an image to the average noise energy. Essentially, the PSNR measures the pixel value differences between two images. The PSNR is greater than 0, expressed in dB, and higher values indicate less distortion. The formula is as follows:
$$PSNR = 20 \cdot \log_{10}\!\left(\frac{MAX_I}{\sqrt{MSE}}\right)$$

$$MSE = \frac{1}{hw}\sum_{i=0}^{h-1}\sum_{j=0}^{w-1}\big[I(i,j) - K(i,j)\big]^2$$

where $MSE$ denotes the mean squared error between two images $I$ and $K$ of resolution $h \times w$, and $MAX_I$ represents the maximum pixel value of the image, which is 255 for 8-bit depth.
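A direct NumPy implementation of these two formulas for 8-bit images reads as follows (a minimal sketch):

    import numpy as np

    def psnr(img_ref, img_dist, max_i=255.0):
        """PSNR in dB between two uint8 images of the same shape."""
        mse = np.mean((img_ref.astype(np.float64) - img_dist.astype(np.float64)) ** 2)
        if mse == 0:
            return float("inf")                     # identical images
        return 20.0 * np.log10(max_i / np.sqrt(mse))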
For perceptual quality evaluation, we employ Deep Image Structure and Texture Similarity (DISTS) [53] and Learned Perceptual Image Patch Similarity (LPIPS) [44] as evaluation metrics to measure the quality of reconstruction. LPIPS and DISTS evaluate the perceptual quality of images by calculating the feature distance between two images in a convolutional network, which is more consistent with human visual perception. The values of LPIPS and DISTS lie in the range (0, 1), where lower values indicate higher similarity between the two images and higher values indicate greater differences. Given a reference patch of the real image $x$ and a distorted patch of the noisy image $x_0$, LPIPS is defined as

$$LPIPS(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \big\| w_l \odot (\hat{y}^{\,l}_{hw} - \hat{y}^{\,l}_{0hw}) \big\|_2^2$$

where $l$ denotes the convolutional layer index, and $H_l$ and $W_l$ are the height and width of the feature maps at the $l$-th layer, respectively. $w_l$ represents the scaling weight, while $\hat{y}^{\,l}_{hw}$ and $\hat{y}^{\,l}_{0hw}$ denote the outputs of the convolutional layer for image patches $x$ and $x_0$, respectively. Given a real image $x$ and a noisy image $y$, DISTS is defined as

$$DISTS(x, y; \alpha, \beta) = 1 - \sum_{i}\sum_{j} \Big[ \alpha_{ij}\, l\big(\tilde{x}^{(i)}_j, \tilde{y}^{(i)}_j\big) + \beta_{ij}\, s\big(\tilde{x}^{(i)}_j, \tilde{y}^{(i)}_j\big) \Big]$$

where the sums run over the network layers $i$ and the feature maps $j$ at each layer, $l(\tilde{x}^{(i)}_j, \tilde{y}^{(i)}_j)$ denotes the feature texture metric at the $i$-th network layer, $s(\tilde{x}^{(i)}_j, \tilde{y}^{(i)}_j)$ denotes the feature structure metric, and $\alpha_{ij}$ and $\beta_{ij}$ are the weights.
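In practice both metrics are computed with the authors' released implementations; the sketch below uses the lpips package as an example (the DISTS implementation is used analogously). The AlexNet backbone and the scaling of inputs to [-1, 1] are assumptions of this example rather than settings stated in the paper.

    import torch
    import lpips                                   # pip install lpips

    loss_fn = lpips.LPIPS(net='alex')              # backbone choice is an assumption
    # img_ref, img_rec: (N, 3, H, W) tensors scaled to [-1, 1]
    img_ref = torch.rand(1, 3, 128, 128) * 2 - 1
    img_rec = torch.rand(1, 3, 128, 128) * 2 - 1
    with torch.no_grad():
        d = loss_fn(img_ref, img_rec)              # lower = perceptually closer
    print(float(d))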

4.2. Implementation Details

Our predictive portrait video compression scheme employs the output of the trained prediction model as the reconstructed frame. Consequently, the only component of the entire framework that requires training is the temporal diffusion prediction model. Throughout the training process, we utilize the Adam optimizer with a learning rate of 0.0001 and a batch size of 4. All models are trained in parallel on 2 GeForce RTX 3090 GPUs and a 12-core CPU (Intel(R) Core(TM) i9-10920X CPU @ 3.50 GHz) using Python 3.8 and PyTorch 1.7.1. The total training time for 100 epochs is approximately 8 days. To accelerate sampling, we adopt the fast sampler DDIM (Denoising Diffusion Implicit Models) [54] in all experiments. The prediction model is trained to generate a single frame at a time. Because long-term prediction is resource-intensive under our GPU constraints, we use the same autoregressive scheme as MCVD [8]. Additionally, we define a quality threshold based on the LPIPS metric, with an empirical range of [0.02, 0.30].
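As an illustration of the accelerated sampling configuration (50 DDIM steps), the following sketch uses the DDIMScheduler from the diffusers library. The paper does not state that it uses diffusers; denoiser and cond are placeholders for the trained noise-prediction network and the keyframe conditioning.

    import torch
    from diffusers import DDIMScheduler            # used here only for illustration

    @torch.no_grad()
    def ddim_sample(denoiser, cond, latent_shape, steps=50, device="cuda"):
        """Generate one predicted-frame latent with DDIM accelerated sampling [54]."""
        scheduler = DDIMScheduler(num_train_timesteps=1000)
        scheduler.set_timesteps(steps)
        latents = torch.randn(latent_shape, device=device)   # start from pure noise
        for t in scheduler.timesteps:
            eps = denoiser(latents, t, cond)                  # conditional noise prediction
            latents = scheduler.step(eps, t, latents).prev_sample
        return latents                                        # decode with the VAE afterwards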

4.3. Compression Performance Comparison

4.3.1. Baselines

We compare the proposed framework with the following video compression methods. For traditional video codecs, we compare our method with the state-of-the-art VVC [3], configured with the VTM-21.2 reference software in LowDelay-P mode. Among learning-based video compression methods, we select DCVC-DC [36], a conditional coding-based approach. Given that our method employs a predictive encoding scheme, we also include RDAC [42], an end-to-end animation model-based method, and Extreme [16], a diffusion model-based approach, for a comparative analysis of video compression performance. To ensure the fairness of the comparison, we retrained these baseline models on the selected datasets.

4.3.2. Quantitative Results

Table 1 presents the PSNR, DISTS, and LPIPS metrics for various video compression methods. Notably, the bitrates of all compared methods are either approximately equal to or higher than those of our method. The bitrate and the distortion of the reconstructed video in video compression are mutually constrained; that is, a lower bitrate may achieve a higher compression ratio but results in greater distortion, whereas a higher bitrate retains more details and yields less distortion in the reconstructed video, but with a lower compression ratio. Thus, as shown in Table 1, our method achieves less distortion at lower bitrates, which fully demonstrates the superior coding performance of our approach. Specifically, it can be observed that in terms of the pixel-level distortion metric PSNR, our method achieves performance comparable to the latest codec standard VVC [3] at similar bitrates. Although our method does not attain the optimal PSNR performance, it demonstrates superior performance on most perceptual metrics, namely DISTS and LPIPS, particularly at lower bitrates. Moreover, compared to other predictive coding schemes such as RDAC [42] and Extreme [16], our method achieves better performance across all metrics. This is because the predictive portrait video compression scheme reduces the need for high-bitrate-encoded keyframes, thereby enabling ultra-low bitrate compression. Meanwhile, our temporally extended diffusion model fully exploits temporal correlations to achieve high-quality video prediction results. Extreme [16] is the most comparable to our approach, and the superior performance of our method demonstrates the advantages of the proposed adaptive coding strategy.
Figure 3 illustrates the rate–distortion performance on the TaiChiHD dataset. We report both the pixel-level distortion metric PSNR and the perceptual metrics LPIPS and DISTS, which better align with human perception, particularly at low bitrates. The green, brown, yellow and blue curves correspond to the results of VVC [3], DCVC-DC [36], RDAC [42] and Extreme [16], respectively, while the orange curve represents our method’s performance. The results show that although DCVC-DC [36] performs well at high bitrates, it encounters limitations in low-bitrate scenarios, such as bandwidth-constrained environments. Our method, however, maintains strong performance even under ultra-low bitrate conditions, especially in terms of the perceptual metrics LPIPS and DISTS. Compared to VVC [3], our method achieves an average bitrate saving of 24.294% in terms of DISTS. Compared to other predictive video coding frameworks, namely RDAC [42] and Extreme [16], our approach achieves superior performance across all bitrate ranges. Even in terms of the pixel-level distortion metric PSNR, our method delivers results comparable to those of the standard codec VVC [3].
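Bitrate savings of this kind are conventionally reported as a Bjøntegaard delta rate, obtained by fitting both rate–distortion curves and integrating the log-rate gap over the overlapping quality range. The sketch below shows the common cubic-fit variant (the paper does not detail its exact averaging procedure); for lower-is-better metrics such as LPIPS or DISTS, the negated metric is passed as the quality axis.

    import numpy as np

    def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
        """Average % bitrate difference of the test codec vs. the anchor at equal quality.
        Rates in kbps; quality must increase with fidelity (use PSNR, -LPIPS, or -DISTS).
        A negative return value means a bitrate saving for the test codec."""
        lr_a, lr_t = np.log(np.asarray(rate_anchor)), np.log(np.asarray(rate_test))
        p_a = np.polyfit(quality_anchor, lr_a, 3)       # cubic fit: quality -> log(rate)
        p_t = np.polyfit(quality_test, lr_t, 3)
        lo = max(min(quality_anchor), min(quality_test))
        hi = min(max(quality_anchor), max(quality_test))
        int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
        int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
        avg_log_diff = (int_t - int_a) / (hi - lo)      # mean log-rate gap over the overlap
        return (np.exp(avg_log_diff) - 1.0) * 100.0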

4.3.3. Qualitative Results

Figure 4 and Figure 5 illustrate visual comparisons between our method and the baselines on the TaiChiHD [51] and Fashion [52] datasets, respectively. The bitrates of all baselines are either comparable to or higher than those of our method. It can be seen that our method produces superior visual quality and temporally consistent results in long video sequence compression. The rightmost column in Figure 5 displays the magnified area, where our method demonstrates texture quality comparable to that of DCVC-DC [36]. Additionally, Figure 6 presents a qualitative comparison on the Penn Action [49] dataset. This dataset involves more complex motion patterns. We select the highly representative “push-up” action for detailed demonstration. The results show that our method generates smoother frames, maintains high image quality and temporal coherence, and achieves visual consistency with DCVC-DC [36]. This indicates that our approach can accurately capture key poses of the motion and exhibits superior capability in preserving contextual information. This is attributed to our adaptive adjustment of keyframe length based on motion complexity, enabling video sequences with higher motion intensity to utilize longer keyframe sequences as input, thereby providing more reference information for the prediction model.

4.3.4. Complexity

The complexity comparison is shown in Table 2. To ensure the objectivity of the evaluation results, we select video sequences from the Fashion dataset with a length of 100 frames and a resolution of 128 × 128 to measure the encoding and decoding time. We set the quantization parameters (QPs) of the key frames to 32, 37, 42, 47, and 51 and use the average encoding and decoding time under these QPs as the actual encoding and decoding time. With the sampling steps set to 50, we find that our method achieves lower encoding and decoding time compared to VVC [22]. This is attributed to DDIM acceleration [54] in our method, which enables high-quality results with fewer sampling steps. In addition, it can be observed that the encoding time of our method is higher than the decoding time. This is because, during encoding, frame quality evaluation needs to be performed alongside prediction, whereas during decoding, only prediction is required.

4.4. Ablation Study

In our predictive portrait video compression framework, the temporal diffusion prediction model employs temporal expansion to fully leverage the temporal information within video data. Furthermore, the adaptive coding strategy constitutes a key mechanism of the entire framework, ensuring that the proposed scheme achieves superior compression performance and visual quality. To demonstrate the effectiveness of these contributions, we conduct ablation studies on the TaiChiHD [51] dataset. As shown in Table 3, “w/o temp” and “w/o adap” denote the removal of temporal expansion in the predictive model and the adaptive coding strategy, respectively. Specifically, “w/o temp” indicates the replacement of 3D causal convolution with 2D convolution and the removal of the temporal attention mechanism in the bottleneck layer. Meanwhile, “w/o adap” implies fixing the keyframe length to 2. From Table 3, it is evident that our proposed improvements significantly enhance video perceptual quality, with further enhancements achieved through the combination of both contributions. Additionally, Figure 7 illustrates that each component and mechanism of our framework improves rate–distortion performance across both the pixel-level distortion metric (PSNR) and perceptual metrics (DISTS and LPIPS).

4.5. Discussion

We propose a predictive portrait video compression framework that achieves high-quality compression at ultra-low bitrates. Specifically, we innovatively design a conditional latent diffusion-based temporal prediction model, which effectively exploits temporal correlations in videos and enhances video prediction quality by integrating 3D causal convolution and temporal attention mechanism. Meanwhile, an adaptive coding strategy dynamically controls the length of keyframe sequences, reducing transmission overhead while ensuring the reconstruction quality of predicted frames. However, there are still some limitations in this research. For example, the model may require further optimization and adjustment to handle complex scene variations, and its computational complexity needs to be reduced to better accommodate scenarios with higher real-time requirements. In the future, improvements can be made to enhance the model’s generalization ability, further reduce computational costs, and expand the applications to more types of video content.

5. Conclusions

In this paper, we propose a predictive portrait video compression framework based on diffusion models. By leveraging the temporal prediction capabilities of the designed temporal diffusion prediction model, we achieve efficient video compression performance. The prediction model integrates 3D causal convolution and temporal attention mechanisms, fully exploiting the temporal correlation in video data. In addition, we develop an adaptive coding strategy comprising two key components: frame quality assessment and adaptive keyframe control. This strategy ensures the perceptual quality of the predicted results and dynamically adjusts the keyframe sequence length according to the motion complexity of the video content. The experimental results demonstrate that under ultra-low bitrates, our framework outperforms the traditional video codec and other popular methods in compression efficiency, achieving more than 24% bitrate savings compared to VVC in terms of perceptual distortion. It also exhibits significant competitiveness in visual quality. For future work, we will further generalize this approach to a broader range of video content to facilitate wider application. In addition, we will further explore how to integrate our method with other video processing techniques, such as video super-resolution and video restoration, in order to build a more comprehensive video processing pipeline.

Author Contributions

Conceptualization, X.C.; Data Curation, M.L.; Investigation, Y.W.; Project Administration, W.Z.; Software, X.C.; Supervision, W.L.; Validation, X.C.; Formal Analysis, W.Z.; Formal Analysis, W.L.; Formal Analysis, M.L.; Visualization, Y.W.; Writing—Original Draft, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fundamental Research Funds for the Central Universities of China (No.TN2216010), the ‘Jie Bang Gua Shuai’ Science and Technology Major Project of Liaoning Province in 2022 (No.2022JH1/10400025) and the National Key Research and Development Program of China (No.2018YFB1702000).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wiegand, T.; Sullivan, G.J.; Bjontegaard, G.; Luthra, A. Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol. 2003, 13, 560–576. [Google Scholar] [CrossRef]
  2. Sullivan, G.J.; Ohm, J.R.; Han, W.J.; Wiegand, T. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [Google Scholar] [CrossRef]
  3. Bross, B.; Wang, Y.K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G.J.; Ohm, J.R. Overview of the versatile video coding (VVC) standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764. [Google Scholar] [CrossRef]
  4. Hoang, T.M.; Zhou, J. Recent trending on learning based video compression: A survey. Cogn. Robot. 2021, 1, 145–158. [Google Scholar] [CrossRef]
  5. Lu, G.; Ouyang, W.; Xu, D.; Zhang, X.; Cai, C.; Gao, Z. Dvc: An end-to-end deep video compression framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11006–11015. [Google Scholar]
  6. Li, J.; Li, B.; Lu, Y. Deep contextual video compression. Adv. Neural Inf. Process. Syst. 2021, 34, 18114–18125. [Google Scholar]
  7. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  8. Voleti, V.; Jolicoeur-Martineau, A.; Pal, C. Mcvd-masked conditional video diffusion for prediction, generation, and interpolation. Adv. Neural Inf. Process. Syst. 2022, 35, 23371–23385. [Google Scholar]
  9. Yang, R.; Srivastava, P.; Mandt, S. Diffusion probabilistic modeling for video generation. Entropy 2023, 25, 1469. [Google Scholar] [CrossRef]
  10. Zhang, Z.; Hu, J.; Cheng, W.; Paudel, D.; Yang, J. Extdm: Distribution extrapolation diffusion model for video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19310–19320. [Google Scholar]
  11. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  12. Pan, Z.; Zhou, X.; Tian, H. Extreme generative image compression by learning text embedding from diffusion models. arXiv 2022, arXiv:2211.07793. [Google Scholar]
  13. Yang, R.; Mandt, S. Lossy image compression with conditional diffusion models. Adv. Neural Inf. Process. Syst. 2023, 36, 64971–64995. [Google Scholar]
  14. Chen, L.; Li, Z.; Lin, B.; Zhu, B.; Wang, Q.; Yuan, S.; Zhou, X.; Cheng, X.; Yuan, L. Od-vae: An omni-dimensional video compressor for improving latent video diffusion model. arXiv 2024, arXiv:2409.01199. [Google Scholar]
  15. Ma, W.; Chen, Z. Diffusion-based perceptual neural video compression with temporal diffusion information reuse. arXiv 2025, arXiv:2501.13528. [Google Scholar]
  16. Li, B.; Liu, Y.; Niu, X.; Bait, B.; Han, W.; Deng, L.; Gunduz, D. Extreme Video Compression with Prediction Using Pre-trained Diffusion Models. In Proceedings of the 2024 16th International Conference on Wireless Communications and Signal Processing (WCSP), Hefei, China, 24–26 October 2024; pp. 1449–1455. [Google Scholar]
  17. Yu, L.; Lezama, J.; Gundavarapu, N.B.; Versari, L.; Sohn, K.; Minnen, D.; Cheng, Y.; Birodkar, V.; Gupta, A.; Gu, X.; et al. Language Model Beats Diffusion–Tokenizer is Key to Visual Generation. arXiv 2023, arXiv:2310.05737. [Google Scholar]
  18. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Yu, Q.; Dai, J. Bevformer: Learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 2020–2036. [Google Scholar] [CrossRef] [PubMed]
  19. Mukherjee, D.; Bankoski, J.; Grange, A.; Han, J.; Koleszar, J.; Wilkins, P.; Xu, Y.; Bultje, R. The latest open-source video codec VP9-an overview and preliminary results. In Proceedings of the 2013 Picture Coding Symposium (PCS), San Jose, CA, USA, 8–11 December 2013; pp. 390–393. [Google Scholar]
  20. Chen, Y.; Murherjee, D.; Han, J.; Grange, A.; Xu, Y.; Liu, Z.; Parker, S.; Chen, C.; Su, H.; Joshi, U.; et al. An overview of core coding tools in the AV1 video codec. In Proceedings of the 2018 Picture Coding Symposium (PCS), San Francisco, CA, USA, 24–27 June 2018; pp. 41–45. [Google Scholar]
  21. Zhang, J.; Jia, C.; Lei, M.; Wang, S.; Ma, S.; Gao, W. Recent development of AVS video coding standard: AVS3. In Proceedings of the 2019 Picture Coding Symposium (PCS), Ningbo, China, 11–15 November 2019; pp. 1–5. [Google Scholar]
  22. Wieckowski, A.; Brandenburg, J.; Hinz, T.; Bartnik, C.; George, V.; Hege, G.; Helmrich, C.; Henkel, A.; Lehmann, C.; Stoffers, C.; et al. VVenC: An open and optimized VVC encoder implementation. In Proceedings of the 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shenzhen, China, 5–9 July 2021; pp. 1–2. [Google Scholar]
  23. Akbulut, O.; Konyar, M.Z. Improved intra-subpartition coding mode for versatile video coding. Signal Image Video Process 2022, 16, 1363–1368. [Google Scholar] [CrossRef]
  24. Amna, M.; Imen, W.; Soulef, B.; Fatma Ezahra, S. Machine Learning-Based approaches to reduce HEVC intra coding unit partition decision complexity. Multimed. Tools Appl. 2022, 81, 2777–2802. [Google Scholar] [CrossRef]
  25. Yang, R.; Liu, H.; Zhu, S.; Zheng, X.; Zeng, B. DFCE: Decoder-friendly chrominance enhancement for HEVC intra coding. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 1481–1486. [Google Scholar] [CrossRef]
  26. Zhao, T.; Huang, Y.; Feng, W.; Xu, Y.; Kwong, S. Efficient VVC intra prediction based on deep feature fusion and probability estimation. IEEE Trans. Multimed. 2022, 25, 6411–6421. [Google Scholar] [CrossRef]
  27. Wang, Y.; Fan, X.; Xiong, R.; Zhao, D.; Gao, W. Neural network-based enhancement to inter prediction for video coding. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 826–838. [Google Scholar] [CrossRef]
  28. Shang, X.; Li, G.; Zhao, X.; Zuo, Y. Low complexity inter coding scheme for Versatile Video Coding (VVC). J. Vis. Commun. Image Represent. 2023, 90, 103683. [Google Scholar] [CrossRef]
  29. Ma, C.; Liu, D.; Peng, X.; Li, L.; Wu, F. Convolutional neural network-based arithmetic coding for HEVC intra-predicted residues. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1901–1916. [Google Scholar] [CrossRef]
  30. Meng, X.; Jia, C.; Zhang, X.; Wang, S.; Ma, S. Deformable Wiener Filter for Future Video Coding. IEEE Trans. Image Process. 2022, 31, 7222–7236. [Google Scholar] [CrossRef] [PubMed]
  31. Zhu, L.; Zhang, Y.; Li, N.; Wu, W.; Wang, S.; Kwong, S. Neural Network Based Multi-Level In-Loop Filtering for Versatile Video Coding. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 12092–12096. [Google Scholar] [CrossRef]
  32. Hu, Z.; Lu, G.; Xu, D. FVC: A New Framework towards Deep Video Compression in Feature Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1502–1511. [Google Scholar]
  33. Guo, H.; Kwong, S.; Jia, C.; Wang, S. Enhanced motion compensation for deep video compression. IEEE Signal Process. Lett. 2023, 30, 673–677. [Google Scholar] [CrossRef]
  34. Hu, Y.; Jung, C.; Qin, Q.; Han, J.; Liu, Y.; Li, M. HDVC: Deep video compression with hyperprior-based entropy coding. IEEE Access 2024, 12, 17541–17551. [Google Scholar] [CrossRef]
  35. Li, J.; Li, B.; Lu, Y. Hybrid spatial-temporal entropy modelling for neural video compression. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 1503–1511. [Google Scholar]
  36. Li, J.; Li, B.; Lu, Y. Neural video compression with diverse contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 22616–22626. [Google Scholar]
  37. Sheng, X.; Li, L.; Liu, D.; Li, H. Vnvc: A versatile neural video coding framework for efficient human-machine vision. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4579–4596. [Google Scholar] [CrossRef]
  38. Li, J.; Li, B.; Lu, Y. Neural video compression with feature modulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26099–26108. [Google Scholar]
  39. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 139–144. [Google Scholar]
  40. Konuko, G.; Valenzise, G.; Lathuilière, S. Ultra-low bitrate video conferencing using deep image animation. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 4210–4214. [Google Scholar]
  41. Konuko, G.; Lathuilière, S.; Valenzise, G. A hybrid deep animation codec for low-bitrate video conferencing. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 1–5. [Google Scholar]
  42. Konuko, G.; Lathuilière, S.; Valenzise, G. Predictive coding for animation-based video compression. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 2810–2814. [Google Scholar]
  43. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014; p. 14. [Google Scholar]
  44. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  45. Witten, I.H.; Neal, R.M.; Cleary, J.G. Arithmetic coding for data compression. Commun. ACM 1987, 30, 520–540. [Google Scholar] [CrossRef]
  46. Chen, X.; Lei, W.; Zhang, W.; Meng, H.; Guo, H. Model-based portrait video compression with spatial constraint and adaptive pose processing. Multimed. Syst. 2024, 30, 311. [Google Scholar] [CrossRef]
  47. Gu, X.; Wen, C.; Ye, W.; Song, J.; Gao, Y. Seer: Language instructed video prediction with latent diffusion models. arXiv 2023, arXiv:2303.14897. [Google Scholar]
  48. Blattmann, A.; Rombach, R.; Ling, H.; Dockhorn, T.; Kim, S.W.; Fidler, S.; Kreis, K. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22563–22575. [Google Scholar]
  49. Zhang, W.; Zhu, M.; Derpanis, K.G. From actemes to action: A strongly-supervised representation for detailed action understanding. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 2248–2255. [Google Scholar]
  50. Zhao, L.; Peng, X.; Tian, Y.; Kapadia, M.; Metaxas, D. Learning to forecast and refine residual motion for image-to-video generation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 387–403. [Google Scholar]
  51. Siarohin, A.; Lathuilière, S.; Tulyakov, S.; Ricci, E.; Sebe, N. First order motion model for image animation. Adv. Neural Inf. Process. Syst. 2019, 32, 7137–7147. [Google Scholar]
  52. Zablotskaia, P.; Siarohin, A.; Zhao, B.; Sigal, L. Dwnet: Dense warp-based network for pose-guided human video generation. arXiv 2019, arXiv:1910.09139. [Google Scholar]
  53. Ding, K.; Ma, K.; Wang, S.; Simoncelli, E.P. Image quality assessment: Unifying structure and texture similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2567–2581. [Google Scholar] [CrossRef]
  54. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502. [Google Scholar]
Figure 1. An overview of our framework. The blocks with color are the proposed modules. Parameter δ contains the current prediction frame end position t + k + i and the length of the current keyframe sequence k.
Figure 2. Structure of temporal diffusion prediction model.
Figure 3. Rate–distortion performance on the TaiChiHD dataset.
Figure 4. Qualitative comparison on TaiChiHD dataset.
Figure 5. Qualitative comparison on Fashion dataset. The rightmost column represents the partially enlarged region at t = 18.
Figure 6. Qualitative comparison on Penn Action dataset.
Figure 7. Ablation study results of rate–distortion performance on the TaiChiHD dataset.
Table 1. Quantitative results of the perceptual performance for different video codecs in terms of PSNR, DISTS and LPIPS.

Dataset       Metric           VVC [3]   RDAC [42]  Extreme [16]  DCVC-DC [36]  Proposed
Penn Action   Bitrate (kbps)   7.8992    19.3188    8.2875        21.5244       7.7517
              PSNR (↑)         29.0025   23.8252    22.4154       36.5420       28.6248
              LPIPS (↓)        0.2420    0.2251     0.1223        0.0936        0.1007
              DISTS (↓)        0.2635    0.2134     0.1050        0.1032        0.0886
TaiChiHD      Bitrate (kbps)   8.9744    16.9623    8.9214        20.8461       8.5186
              PSNR (↑)         28.0949   24.1158    24.8417       35.2847       27.4569
              LPIPS (↓)        0.2140    0.1662     0.1525        0.0681        0.0614
              DISTS (↓)        0.2319    0.2040     0.1396        0.1185        0.0945
Fashion       Bitrate (kbps)   8.9421    14.3497    9.4523        19.4912       8.2478
              PSNR (↑)         29.0217   26.1536    26.4863       32.1523       28.3516
              LPIPS (↓)        0.1627    0.1752     0.1219        0.0702        0.0621
              DISTS (↓)        0.2012    0.1514     0.1021        0.0986        0.1032
Table 2. Complexity comparison.

              Encoding Time   Decoding Time   All
VVC [22]      326.861 s       0.142 s         327.003 s
Proposed      65.352 s        43.811 s        109.163 s
Table 3. Ablation study results in terms of PSNR, DISTS and LPIPS. “w/o temp” and “w/o adap” denote the removal of the temporal expansion of the predictive model and the adaptive coding strategy, respectively.

Dataset       Metric           w/o Temp   w/o Adap   w/o Temp + Adap   Full Model
Penn Action   Bitrate (kbps)   10.5221    9.6541     13.1585           7.7517
              PSNR (↑)         25.5422    26.7845    23.8561           28.6248
              LPIPS (↓)        0.1254     0.1342     0.1473            0.1007
              DISTS (↓)        0.1128     0.1231     0.1385            0.0886
TaiChiHD      Bitrate (kbps)   12.2441    11.8795    15.2528           8.5186
              PSNR (↑)         23.5247    24.4163    23.1254           27.4569
              LPIPS (↓)        0.1025     0.1094     0.1254            0.0614
              DISTS (↓)        0.1284     0.1732     0.1424            0.0945
Fashion       Bitrate (kbps)   10.0125    10.5569    11.5814           8.2478
              PSNR (↑)         26.8458    27.5967    24.5568           28.3516
              LPIPS (↓)        0.1158     0.1088     0.1205            0.0621
              DISTS (↓)        0.1246     0.1147     0.1384            0.1032
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
