Article

Hierarchical Reinforcement Learning-Based Adaptive Initial QP Selection and Rate Control for H.266/VVC

1 School of Information Science and Technology, Hainan Normal University, Haikou 571158, China
2 Hainan Provincial Engineering Research Center for Artificial Intelligence and Equipment for Monitoring Tropical Biodiversity and Ecological Environment, Haikou 571158, China
3 School of Electronic and Information, Guangdong Polytechnic Normal University, Guangzhou 510640, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(24), 5028; https://doi.org/10.3390/electronics13245028
Submission received: 7 November 2024 / Revised: 9 December 2024 / Accepted: 18 December 2024 / Published: 20 December 2024

Abstract

In video encoding rate control, adaptive selection of the initial quantization parameter (QP) is a critical factor affecting both encoding quality and rate control precision. Due to the diversity of video content and the dynamic nature of network conditions, accurately and efficiently determining the initial QP remains a significant challenge. The optimal setting of the initial QP not only influences bitrate allocation strategies but also impacts the encoding efficiency and output quality of the encoder. To address this issue in the H.266/VVC standard, this paper proposes a novel hierarchical reinforcement learning-based method for adaptive initial QP selection. The proposed method introduces a hierarchical reinforcement learning framework that decomposes the initial QP selection task into high-level and low-level strategies, handling coarse-grained and fine-grained QP decisions, respectively. The high-level strategy quickly determines a rough QP range based on global video features and network conditions, while the low-level strategy refines the specific QP value within this range to enhance decision accuracy. This framework integrates spatiotemporal video complexity, network conditions, and rate control objectives to form an optimized model for adaptive initial QP selection. Experimental results demonstrate that the proposed method significantly improves encoding quality and rate control accuracy compared to traditional methods, confirming its effectiveness in handling complex video content and dynamic network environments.

1. Introduction

The rapid development of 5G and ultra-broadband networks has led to the emergence of new media applications, such as ultra-high-definition video, virtual reality (VR), and augmented reality (AR), which place higher demands on video encoding technologies [1,2]. The latest video coding standard, H.266/Versatile Video Coding (VVC), has significantly improved compression efficiency and flexibility. However, the dynamic network environments and the diverse nature of video content present substantial challenges in rate control, particularly in selecting the initial quantization parameter (QP). The initial QP selection directly influences output bitrate, video quality, and computational complexity, making it a critical factor in rate control for video encoding [3,4]. Traditional methods for initial QP selection struggle to adapt to real-time network fluctuations and varied content characteristics, highlighting the need for more intelligent and adaptive strategies.

1.1. The Role of Rate Control in Video Encoding

As multimedia technology and network communication evolve rapidly, video applications are becoming increasingly indispensable. From HD and UHD video to VR, the volume of video data is growing exponentially. Video encoding technology has become crucial for efficient video transmission and storage within the limits of bandwidth and storage resources. Rate control plays a pivotal role in this process. Its primary objective is to maximize both subjective and objective video quality while satisfying network bandwidth and device constraints [2]. By dynamically adjusting encoding parameters, rate control ensures the output bitrate stays within a target range, preventing noticeable quality fluctuations. This is especially important for real-time video communication, video-on-demand, and live streaming applications, where variations in network bandwidth and latency must be effectively managed [4,5]. In modern video coding standards like H.264/AVC, H.265/HEVC, and H.266/VVC, rate control involves complex algorithms, including bit allocation, quantization parameter adjustments, and buffer management. Effective rate control strategies not only improve encoding efficiency but also enhance user experience and reduce transmission load on networks.

1.2. The Role of Initial Quantization Parameter Selection in Video Encoding

The quantization parameter (QP) is a crucial element in video encoding that affects both encoding quality and bitrate. A higher QP reduces quantization precision, lowering the bitrate but also diminishing video quality, while a lower QP improves quality at the cost of a higher bitrate [4]. Therefore, optimal QP selection is essential for effective rate control. The initial QP setting is particularly important as it serves as the baseline for further encoding adjustments. If chosen poorly, the encoder may require frequent adjustments, increasing computational complexity and potentially failing to meet rate control targets [6]. Traditional methods for initial QP selection often rely on fixed values, empirical formulas, or simple linear models, which fail to account for the complexities of video content and dynamic network conditions. This limitation calls for adaptive approaches based on spatiotemporal video features and network conditions, which have become a focal point of recent research in video encoding.

1.3. Paper Organization

This paper addresses the challenge of adaptive initial QP selection in video encoding rate control by proposing a hierarchical reinforcement learning-based solution. Section 2 reviews related research, including advancements in H.266/VVC rate control, methods for initial QP selection, and the application of reinforcement learning in video encoding, with a focus on hierarchical reinforcement learning. Section 3 presents a detailed problem formulation, identifying the challenges of adaptive initial QP selection and providing a foundation for model development. Section 4 introduces the hierarchical reinforcement learning model, decomposing the initial QP selection task into high-level and low-level strategies for coarse- and fine-grained decisions, respectively. Section 5 outlines the algorithm design, including the network architecture, training methods, and flow of operations. Section 6 presents experimental validation and a comparative analysis against baseline approaches. Finally, Section 7 concludes the paper with a summary of the findings.

2. Related Work

2.1. Research on Rate Control in H.266/VVC

H.266/Versatile Video Coding (VVC) is a next-generation video coding standard jointly developed by the International Organization for Standardization (ISO) and the International Telecommunication Union (ITU). Compared to its predecessor, H.265/HEVC, H.266/VVC achieves approximately 50% bitrate savings at the same subjective quality level [5]. However, as encoding complexity increases, the challenge of achieving efficient rate control while maintaining encoding performance has become a central research focus.
Current research on VVC rate control focuses on several key areas. The first is improving rate-distortion optimization (RDO) models: traditional rate control methods rely largely on Lagrangian multipliers for RDO [7,8,9,10,11,12,13], and because the new coding tools and modes in VVC introduce additional complexity, researchers have studied Lagrangian multiplier estimation and adjustment to accommodate these features. To improve the accuracy of rate control, other works [4,5,6,7,8,9,10,11,12,13,14,15,16] have incorporated machine learning methods; for instance, neural networks have been applied to predict the complexity of coding units, enabling more precise bitrate allocation [14,15,16]. For content-adaptive rate control, studies [17,18,19] have proposed methods that dynamically adjust rate control parameters based on video texture and motion characteristics.
Nevertheless, research on rate control in VVC is still in its early stages, and significant challenges remain, especially for real-time encoding under complex network conditions. Therefore, developing efficient rate control methods suitable for VVC is of great importance.

2.2. Research on Initial Quantization Parameter Selection in Rate Control

The selection of the initial quantization parameter (QP) is a critical step in rate control, directly impacting the encoder’s output bitrate and video quality. Traditional initial QP selection methods include the use of fixed QP values, where a constant initial QP is set for all video sequences [6]. However, this approach fails to account for content complexity, potentially resulting in unbalanced encoding quality. Empirical methods [2] have used formulas based on average luminance or motion information, but these linear models often fail to capture the nonlinear characteristics of diverse video content. More advanced approaches [20,21,22,23,24,25] map initial QP values to video features through statistical analysis of historical data, thus enabling more accurate rate control. However, these methods require extensive data and have limited generalizability. Recently, machine learning models such as support vector machines (SVMs) and decision trees [6,26,27] have been used to predict initial QP values, capturing more complex mapping relationships, but they may face the “curse of dimensionality” when dealing with high-dimensional features.
Overall, traditional initial QP selection methods struggle to balance accuracy and computational efficiency, especially in high-resolution and real-time encoding scenarios. There is an urgent need for more effective strategies for initial QP selection.

2.3. Research on Reinforcement Learning in Rate Control

Reinforcement learning (RL), a significant branch of machine learning, has shown great promise in decision-making and control. Recently, RL has been applied to video encoding rate control, with techniques like Q-learning used to optimize parameter adjustments [28]. However, Q-learning faces limitations in high-dimensional, continuous spaces. The introduction of deep reinforcement learning (DRL) has overcome some of these challenges by using neural networks to approximate value functions or policies, making it suitable for handling high-dimensional state spaces. DRL has been applied to adaptive bitrate adjustments [29,30,31,32,33,34] and encoder parameter selections, improving encoding efficiency [35]. However, RL still faces challenges, such as reward function design, high-dimensional state and action spaces, and training efficiency. Traditional RL methods also struggle with hierarchical decision-making, which presents opportunities for the exploration of hierarchical reinforcement learning (HRL) in rate control.

2.4. Research on Hierarchical Reinforcement Learning

Hierarchical reinforcement learning (HRL) is an approach that decomposes complex tasks into subtasks, improving learning efficiency and policy generalization. The “select–execute” structure, where high-level policies select subtasks and low-level policies execute actions, is widely used in HRL, especially for tasks with clear hierarchical structures [36,37,38,39]. The options framework, introduced by Sutton et al., allows high-level policies to choose from sub-policies, enabling temporal abstraction and long-span decision-making [40]. Despite HRL’s broad applicability, its use in video encoding rate control, especially for adaptive initial QP selection, remains limited. Some initial studies have explored HRL for hierarchical adjustments of encoding parameters, but a systematic approach is yet to be developed [41,42]. Given HRL’s potential, further research is needed to explore its full applicability in video encoding rate control.

3. Problem Statement

3.1. Background

In video encoding, the selection of the initial quantization parameter (QP) is a critical step in rate control. For the H.266/VVC standard, the use of complex coding tools and a variety of coding modes has made initial QP selection increasingly challenging. A well-chosen initial QP significantly impacts the encoder’s output bitrate, video quality, and computational complexity. Traditional methods for initial QP selection often rely on fixed values or simple empirical formulas, which fail to adapt to varying video content and network conditions, resulting in suboptimal encoding efficiency.

3.2. Challenges and Requirements

The diversity of video content and the dynamic nature of network conditions make initial QP selection a high-dimensional and nonlinear optimization problem. Additionally, in practical applications, the encoder must complete the initial QP selection within a limited time frame to meet real-time encoding requirements. Therefore, an effective method is needed to adaptively adjust the initial QP based on diverse video features and network conditions, thereby optimizing encoding performance. The desired approach should improve encoding quality and rate control accuracy without significantly increasing computational complexity.

3.3. Problem Definition

Based on the above challenges, this study addresses the problem of leveraging a hierarchical reinforcement learning (HRL) framework to adaptively select the initial quantization parameter in H.266/VVC encoding. The goal is to enhance encoding quality and rate control accuracy while meeting real-time processing requirements.
Specifically, the objective is to design a hierarchical reinforcement learning model in which the high-level policy, based on global video features and network conditions, quickly determines a rough range for the initial QP. Within this range, the low-level policy, guided by finer-grained video features, selects a specific initial QP value. This hierarchical strategy effectively reduces the search space, improving prediction accuracy and efficiency, thus enabling adaptive selection of the initial QP.

4. Model Construction

4.1. Hierarchical Reinforcement Learning Architecture

To address the adaptive selection of initial QP, this study employs a hierarchical reinforcement learning (HRL) architecture, as shown in Figure 1. Here, the video encoder acts as the environment, with video frame features, encoder information, and network conditions representing the environmental states. To optimize the quantization parameter selection for initial I- and P-frames, an HRL structure is utilized, decomposing the problem into high-level and low-level strategies. The high-level policy is responsible for long-term planning, establishing a coarse range for the initial QP based on global information to meet target rate control and rate-distortion objectives. The low-level policy, focusing on short-term decision-making, selects the specific QP within this range. To address sparse rewards, experience replay is applied, and both the high- and low-level policies are based on the Deep Q-Network (DQN) algorithm. The high-level policy is updated based on external rewards, while the low-level policy is refined with internal rewards from the high-level controller.
Figure 1 also shows how the video encoder, the hierarchical reinforcement learning agent, and the offline data trainer interact. The video encoder serves as the environment: it accepts the agent's action (the selected initial quantization parameter QP), performs the video encoding with that parameter, produces the corresponding compressed result, and feeds back the encoded video quality metrics. The environment's state space comprises the characteristics of the video sequence. The agent is the decision-making core of the system and adopts the hierarchical reinforcement learning (HRL) algorithm, consisting of two parts. The high-level policy quickly determines a rough range for the initial QP from the global video features and network status provided by the environment; its role is to reduce the search space and hand the low-level policy a reasonable initial value range. The low-level policy then makes a fine-grained QP selection within that range based on finer video and network features, with the goal of accurately choosing the final QP value that achieves the best encoding quality. Action selection is the step in which the agent, given the environment's feedback and its current policies, outputs the value of the initial QP.
The main task of the offline data trainer is to use historical data for agent training and policy optimization. The offline data typically contain video features, quality indicators, and quantization parameters collected under various encoding scenarios and network conditions; these data are used to train the high-level and low-level policies and to improve the agent's adaptability across encoding scenarios. The training data consist of encoded video samples, each carrying its encoding quality, bitrate, QP setting, and related information. Through offline training, the reinforcement learning algorithm optimizes the agent's policies: the agent accumulates experience from recorded interactions with the environment and continuously adjusts its behavior to achieve the best QP selection.

4.2. Definition of States, Actions, and Rewards

4.2.1. State Space

(1) High-level State S_h:
The high-level state S_h is defined as follows:
$S_h = \{\bar{y}_{frame},\ Var_{frame},\ Diff_{y},\ \bar{y}_{CU},\ Var_{CU},\ Diff_{IP},\ T_{bpp}\}$
The specific parameters in the formula are defined as follows:
Current Frame Average Luminance:
$\bar{y}_{frame} = \frac{\sum y}{N_{frame}}$
where y is the pixel luminance value and N_frame is the number of pixels in the frame.
Current Frame Luminance Variance:
$Var_{frame} = \frac{\sum (y - \bar{y}_{frame})^2}{N_{frame}}$
Max–Min Pixel Difference:
$Diff_{y} = \max(y) - \min(y)$
Current CU Average Luminance:
$\bar{y}_{CU} = \frac{\sum y}{N_{CU}}$
Current CU Luminance Variance:
$Var_{CU} = \frac{\sum (y - \bar{y}_{CU})^2}{N_{CU}}$
where NCU is the number of pixels in the CU.
Average Difference Between I- and P-Frames:
$Diff_{IP} = \frac{\sum |I_{ij} - P_{ij}|}{N_{frame}}$
where I_ij and P_ij are the pixel values of the I-frame and the P-frame at position (i, j), respectively.
Initial Target Bits per Pixel:
$T_{bpp,Init} = \frac{TBR}{FR \times FP}$
where TBR, FR, and FP are the target bitrate, frame rate, and frame pixel count, respectively.
The target bits per pixel for the current coding unit are then calculated as follows:
$T_{bpp,current} = \frac{T_{bpp,Init} \times (AFN - 1) \times FP - R_{coded}}{(AFN - 1) \times FP}$
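As a concrete illustration of the high-level state defined above, the following numpy sketch assembles S_h from a luma frame, the current CU, the previous I- and P-frames, and the encoder settings. The function and argument names are ours and do not come from the authors' implementation.

```python
import numpy as np

def high_level_state(luma, cu, i_frame, p_frame,
                     target_bitrate, frame_rate, frame_pixels,
                     total_frames, bits_coded):
    """Assemble the 7-dimensional high-level state S_h (illustrative sketch)."""
    y_bar_frame = luma.mean()                    # average luminance of the current frame
    var_frame = luma.var()                       # luminance variance of the current frame
    diff_y = float(luma.max() - luma.min())      # max-min pixel difference
    y_bar_cu = cu.mean()                         # average luminance of the current CU
    var_cu = cu.var()                            # luminance variance of the current CU
    diff_ip = np.abs(i_frame.astype(np.int32)
                     - p_frame.astype(np.int32)).sum() / luma.size  # mean I/P difference
    # target bits per pixel for the current coding unit
    t_bpp_init = target_bitrate / (frame_rate * frame_pixels)
    t_bpp = (t_bpp_init * (total_frames - 1) * frame_pixels - bits_coded) \
            / ((total_frames - 1) * frame_pixels)
    return np.array([y_bar_frame, var_frame, diff_y, y_bar_cu, var_cu, diff_ip, t_bpp],
                    dtype=np.float32)
```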
(2) Low-level State S_l:
The low-level state S_l is defined as follows:
$S_l = \{\bar{y}_{CU},\ Var_{CU},\ QP_{range},\ \bar{D}_{coded},\ D_{expected},\ R_{expected},\ T_{bpp}\}$
Average Luminance of Current CU:
For the I-frame, this is the mean luminance of the current CU, $\bar{y}_{CU} = \frac{\sum y}{N_{CU}}$; for the P-frame, it is the average luminance difference between the current CU and the corresponding CU in the reference frame, $\bar{y}_{CU} = \frac{\sum d_y}{N_{CU}}$, where d_y denotes the per-pixel luminance difference.
Variance of Current CU:
Similarly, for the I-frame, this is the luminance variance of the current CU:
$Var_{CU} = \frac{\sum (y - \bar{y}_{CU})^2}{N_{CU}}$
while for the P-frame, it is the variance of the difference between the current CU and the corresponding CU in the reference frame, $Var_{CU} = \frac{\sum (d_y - \bar{y}_{CU})^2}{N_{CU}}$.
QP Range for High-Level Policy:
$QP_{range} = QP_{up} - QP_{down}$
where QPup and QPdown are the upper and lower limits of the QP interval, respectively.
Coded Average Distortion:
$\bar{D}_{coded} = \frac{\sum D_{coded}}{N_{coded}}$
where Dcoded is the distortion of the coded unit and Ncoded is the number of coded units.
Predicted Distortion:
$D_{expected} = \gamma \cdot \lambda(QP)^{\tau}$
Predicted Bitrate:
$R_{expected} = \beta \cdot (D_{expected})^{K}$
where γ, τ, β, and K are hyperparameters, and λ(QP) is a function of the quantization parameter.
The target bits per pixel (Tbpp) for the first coding unit are calculated as follows:
$T_{bpp,Init} = \frac{TBR}{FR \times FP}$
where TBR, FR and FP represent the target bitrate, frame rate, and number of pixels per frame, respectively.
For the current coding unit, the target bits per pixel are as follows:
$T_{bpp,current} = \frac{T_{bpp,Init} \times (AFN - 1) \times FP - R_{coded}}{(AFN - 1) \times FP}$
where AFN and Rcoded denote the total number of frames in the video sequence and the number of bits already coded in the CU, respectively.
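The predicted distortion and bitrate above depend on the hyperparameters γ, τ, β, and K and on the λ(QP) mapping, none of which are given numerically in the text. The sketch below therefore uses placeholder constants and the common HEVC/VVC-style relation λ(QP) = c·2^((QP−12)/3) purely as assumptions.

```python
# Illustrative constants: GAMMA_D, BETA_R, K_EXP and TAU stand in for the paper's
# unspecified hyperparameters gamma, beta, K and tau; the lambda(QP) model below is
# the common HEVC/VVC-style relation and is assumed only for this sketch.
GAMMA_D, BETA_R, K_EXP, TAU = 1.0, 1.0, -0.8, 1.0

def lam_of_qp(qp, c=0.57):
    # assumed mapping: lambda(QP) = c * 2^((QP - 12) / 3)
    return c * 2.0 ** ((qp - 12.0) / 3.0)

def expected_distortion(qp):
    # D_expected = gamma * lambda(QP)^tau
    return GAMMA_D * lam_of_qp(qp) ** TAU

def expected_rate(d_expected):
    # R_expected = beta * D_expected^K
    return BETA_R * d_expected ** K_EXP
```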

4.2.2. Action Space

High-Level Action Ah: This action selects a coarse range QPrange for the initial QP by dividing the QP values into fixed intervals and choosing one of these intervals.
Low-Level Action Al: Within the selected QPrange, the low-level policy selects the specific initial QP value QPselected.
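As an illustration of this two-level action space, the sketch below maps a high-level action index to one of four coarse QP intervals and a low-level action index to one of three candidate QP values inside that interval, matching the 4-output and 3-output DQN heads described in Section 5. The concrete interval boundaries are assumptions for illustration only.

```python
# Four coarse QP intervals for the high-level action (boundaries are illustrative)
QP_INTERVALS = [(17, 24), (25, 30), (31, 36), (37, 42)]

def high_level_to_range(a_h):
    """Map a high-level action index (0..3) to a coarse QP interval."""
    return QP_INTERVALS[a_h]

def low_level_to_qp(a_l, qp_range):
    """Map a low-level action index (0..2) to a specific QP inside the interval."""
    qp_down, qp_up = qp_range
    candidates = [qp_down, (qp_down + qp_up) // 2, qp_up]  # low / middle / high
    return candidates[a_l]

# Example: a_h = 2 selects (31, 36); a_l = 1 then yields QP = 33.
```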

4.2.3. Reward Function

High-Level Reward Rh:
$R_h = \beta D_{expected} + (1 - \beta)\left|R_{expected} - R_{target}\right|$
This reward evaluates the impact of the high-level policy’s choice of QPrange on overall encoding performance, considering both encoding quality and rate control accuracy. Here, Dexpected represents the average expected distortion within QPrange, and Rexpected represents the expected average bitrate, calculated using Formulas (14) and (15).
Low-Level Reward Rl:
$R_l = -\left(\beta D_{actual} + (1 - \beta)\left|R_{actual} - R_{target}\right|\right)$
This reward evaluates the effect of the low-level policy’s selection of QPselected on actual encoding performance, where Dactual and Ractual are the actual encoding distortion and bitrate.
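A small sketch of the two reward terms as written above. The trade-off weight β is not specified numerically in the paper, so the default value here is an assumption, and the signs follow the formulas as reconstructed.

```python
def high_level_reward(d_expected, r_expected, r_target, beta=0.5):
    """R_h: expected distortion plus expected rate deviation over the chosen QP range."""
    return beta * d_expected + (1.0 - beta) * abs(r_expected - r_target)

def low_level_reward(d_actual, r_actual, r_target, beta=0.5):
    """R_l: negated cost of actual distortion and actual rate deviation."""
    return -(beta * d_actual + (1.0 - beta) * abs(r_actual - r_target))
```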

4.3. Reinforcement Learning Objective

The objective function is defined as follows:
$\max_{\pi_h, \pi_l} \ \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t}\left(R_t^{h} + R_t^{l}\right)\right]$
where π h and π l are the high-level and low-level policies, γ is the discount factor, and T is the time step (number of encoding units).
For example, consider an input video with a resolution of 1920 × 1080, a frame rate of 30 fps, and a target bitrate of 3 Mbps. The agent selects the initial QP through its high-level and low-level strategies. The high-level strategy quickly determines a rough QP range based on the characteristics reported by the video encoder [30,35], and the low-level strategy then selects a specific QP value of 32 based on finer-grained features within that range. The encoder encodes the video with the selected QP (QP = 32) and obtains a peak signal-to-noise ratio (PSNR) of 35 dB at a bitrate of 2.8 Mbps, close to the 3 Mbps target. The encoder returns this feedback (PSNR and bitrate) to the agent as the reward signal.
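The worked example corresponds to a single hierarchical decision step, sketched below. The encoder object and its encode method are hypothetical stand-ins for a real VTM encoding run, and the mapping helpers are those from the sketch in Section 4.2.2.

```python
def select_initial_qp(high_policy, low_policy, s_h, build_low_state, encoder):
    """One hierarchical decision step: coarse QP range first, then a specific QP."""
    a_h = high_policy.act(s_h)              # high-level: pick a coarse QP interval
    qp_range = high_level_to_range(a_h)     # e.g. (31, 36)
    s_l = build_low_state(qp_range)         # build the low-level state for this range
    a_l = low_policy.act(s_l)               # low-level: pick the QP inside the range
    qp = low_level_to_qp(a_l, qp_range)     # e.g. QP = 32
    psnr, bitrate = encoder.encode(qp)      # encode and measure quality / rate
    return qp, psnr, bitrate                # feedback is turned into reward signals
```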

4.4. Model Advantages

The proposed hierarchical approach decomposes the complex initial QP selection problem into two levels, reducing the complexity of each subproblem. The model dynamically adjusts its strategy based on real-time state information, adapting to diverse video content and network conditions. By narrowing the search space through the high-level policy, the decision efficiency of the low-level policy is enhanced, facilitating more accurate and efficient initial QP selection.

5. Algorithm Design

This section integrates the hierarchical reinforcement learning (HRL) method into the H.266/VVC video encoder. Because encoding complexity varies, different video content can consume very different numbers of bits even at the same QP. While content complexity is often a reliable indicator of intra-frame complexity, mismatches between encoding content and distortion mean that the same QP value can yield differing results. To address this, the proposed approach employs an HRL-based adaptive initial QP selection method. The rate control problem is formulated as a hierarchical optimization problem in which the video encoder serves as the environment, the HRL controller acts as the agent, the QP values are the actions, and the rate-distortion performance metrics are the rewards. The state comprises the characteristics of the frame being encoded and the network state.
In the HRL framework, shown in Figure 1, two DQN networks are employed, each utilizing a deep neural network to approximate the Q-function. The high-level DQN comprises six fully connected layers: an input layer with 7 nodes representing the seven high-level state features, four hidden layers with 200 nodes each, and an output layer with 4 nodes corresponding to the four temporary QP range targets. The low-level DQN follows a similar structure, with an input layer of 7 nodes, four hidden layers of 200 nodes, and an output layer of 3 nodes representing the specific QP values. Both DQNs use the ReLU function as the activation function.
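A minimal PyTorch sketch of the two Q-networks as described (7 input features, four hidden layers of 200 ReLU units, and 4 or 3 outputs). The use of PyTorch is an assumption, since the paper only states that Python 3.6 was used for the hierarchical DQN.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected Q-network: 7 state features, four 200-node hidden layers."""
    def __init__(self, state_dim=7, num_actions=4, hidden=200):
        super().__init__()
        layers, in_dim = [], state_dim
        for _ in range(4):                                 # four hidden layers of 200 nodes
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, num_actions))      # one Q-value per action
        self.net = nn.Sequential(*layers)

    def forward(self, state):
        return self.net(state)

high_level_dqn = QNetwork(state_dim=7, num_actions=4)      # 4 coarse QP ranges
low_level_dqn  = QNetwork(state_dim=7, num_actions=3)      # 3 candidate QP values
```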
In this work, an offline reinforcement learning approach is used, leveraging pre-collected interaction data (state, action, reward, and next state) to train the policy. This approach maximizes cumulative reward and reduces the high cost associated with interacting with the video encoder environment, enabling adaptive selection of initial QP based on HRL.
Offline HRL requires addressing the following three key challenges. (1) Dataset Construction: Collecting a sufficient and representative offline dataset that includes states, actions, rewards, and next states. (2) Policy Learning: Developing an offline reinforcement learning algorithm to learn optimal high-level and low-level policies using offline data. (3) Distribution Shift: Addressing distribution mismatches between the learned policy and the offline dataset to avoid unsafe or ineffective policy exploration.
For dataset construction, high-level states include global video features and network conditions, while low-level states encompass local video features and encoding parameters. The actions include a rough initial QP range set by the high-level policy and a specific QP value selected by the low-level policy. Rewards are calculated based on rate-distortion performance metrics, and the next state reflects the system state after executing the chosen actions.
To ensure diverse data, we segment various standard test video sequences into sub-sequences of varying frame counts, covering a range of resolutions, frame rates, content complexities, and frame numbers. Network conditions are simulated by target bitrates derived from different fixed QP settings, modeling different network bandwidths, latencies, and jitters.
Action Selection Strategy: Multiple strategies are employed to generate actions, including random, rule-based, and default encoder strategies, ensuring action space diversity. Data collection is conducted by initializing states for the video sequence, network, and encoder settings, selecting high- and low-level actions, executing encoding, calculating rewards, and storing the tuple (S, A, R, S′) in the dataset. A total of 1664 data points from 24 video sequences were collected, covering a wide range of state and action spaces to ensure data diversity and representativeness, avoiding concentrated data distributions.
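A simplified sketch of this offline data collection loop is given below. The encoder interface (reset, low_level_state, encode_and_measure) is hypothetical and stands in for running the VTM encoder and computing the reward signals.

```python
import random

def collect_offline_dataset(sequences, behaviour_policies, encoder, dataset):
    """Collect (state, action, reward, next_state) tuples for offline HRL training."""
    for seq in sequences:
        s_h = encoder.reset(seq)                       # initial high-level state
        policy = random.choice(behaviour_policies)     # random / rule-based / encoder default
        a_h = policy.high_level(s_h)                   # coarse QP range
        s_l = encoder.low_level_state(a_h)             # low-level state inside that range
        a_l = policy.low_level(s_l)                    # specific QP value
        r_h, r_l, s_h_next, s_l_next = encoder.encode_and_measure(a_h, a_l)
        dataset.append((s_h, a_h, r_h, s_h_next))      # high-level transition
        dataset.append((s_l, a_l, r_l, s_l_next))      # low-level transition
```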
Offline Reinforcement Learning: During policy updates, constraints are applied to control policy changes, preventing excessive deviation from the dataset’s distribution. Value function estimates are also corrected to prevent overestimation, which could destabilize the policy. For high-level policy training, offline policy gradient methods are employed to maximize cumulative rewards, while the low-level policy is trained using offline reinforcement learning techniques.
Training Algorithms 1 and 2:
Algorithm 1. Training the High-Level Policy Network
1: Initialize the high-level policy network parameters θ_h and the target network parameters θ_h′.
2: Batch sampling: randomly sample a mini-batch of transitions (S_h, A_h, R_h, S_h′) from the offline dataset.
3: Compute the target value: y = R_h + γ max_{A′} Q(S_h′, A′; θ_h′).
4: Compute the loss function: L(θ_h) = E[(y − Q(S_h, A_h; θ_h))²].
5: Update the policy network parameters: θ_h ← θ_h − α ∇_{θ_h} L(θ_h).
6: Update the target network parameters: θ_h′ ← θ_h every fixed number of steps.
Algorithm 2. Training the Low-Level Policy Network
1: Initialize the low-level policy network parameters θ_l and the target network parameters θ_l′.
2: Batch sampling: randomly sample a mini-batch of transitions (S_l, A_l, R_l, S_l′) from the offline dataset.
3: Compute the target value: y = R_l + γ max_{A′} Q(S_l′, A′; θ_l′).
4: Compute the loss function: L(θ_l) = E[(y − Q(S_l, A_l; θ_l))²].
5: Update the policy network parameters: θ_l ← θ_l − α ∇_{θ_l} L(θ_l).
6: Update the target network parameters: θ_l′ ← θ_l every fixed number of steps.
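Algorithms 1 and 2 share the same DQN update form and differ only in the states, actions, and rewards they consume. A hedged PyTorch sketch of one mini-batch update is shown below; it assumes the sampled transitions have already been converted to tensors (with actions as int64 indices).

```python
import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, optimizer, batch, gamma=0.95):
    """One mini-batch update in the form of Algorithms 1 and 2 (either level)."""
    states, actions, rewards, next_states = batch         # tensors from the offline dataset
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y = R + gamma * max_a' Q(S', a'; theta')
        targets = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = F.mse_loss(q_values, targets)                   # L(theta) = E[(y - Q(S, A; theta))^2]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                       # theta <- theta - alpha * grad L
    return loss.item()

# The target network is synchronized every fixed number of steps:
# target_net.load_state_dict(policy_net.state_dict())
```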

6. Experimental Results and Analysis

6.1. Experimental Setup

The experimental environment is constructed with the video encoder serving as the environment, while state features, reward functions, and action spaces are defined to verify the hierarchical reinforcement learning-based adaptive initial quantization parameter selection. To validate the effectiveness of the proposed algorithm, we use the JVET common test conditions (CTCs) [43], specifically the low-delay P configuration, in which the first frame is an I-frame followed by P-frames. Initial quantization selection determines the QP values for the I-frame and the first P-frame. For consistent comparison, target bitrates are derived from actual encoding outputs using QP values of 22, 27, 32, and 37. Each video sequence is encoded with 30, 50, 100, and 200 frames, with 16 quantization parameters across these setups, generating a dataset of 1664 samples. Of these, 1165 samples are randomly selected for training and 499 for testing; each sample contains state features, high-level features, high-level QP ranges, low-level features, low-level QP values, and next states. The VVC encoder version used is VTM-13.0 [44], with Python 3.6 employed for the hierarchical DQN training and decision algorithms, and the experiments were run on an Intel(R) Xeon(R) Platinum 8280 CPU. The reinforcement learning hyperparameters are as follows: a state dimension of 9, a batch size of 64 chosen based on the data and hardware resources, an initial epsilon of 0.95 and a final epsilon of 0.01 for the ε-greedy strategy with a decay rate of 300, a discount factor of 0.95, a learning rate of 0.1, and a target network update every 100 steps.
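For reference, the stated hyperparameters can be collected as follows. The exponential form of the ε decay with a time constant of 300 is our assumption, since the paper only reports a "decay rate of 300".

```python
import math

HPARAMS = {
    "state_dim": 9, "batch_size": 64,
    "eps_start": 0.95, "eps_end": 0.01, "eps_decay": 300,
    "gamma": 0.95, "learning_rate": 0.1, "target_update_steps": 100,
}

def epsilon(step, hp=HPARAMS):
    """Exponentially decayed exploration rate for the epsilon-greedy strategy (assumed form)."""
    return hp["eps_end"] + (hp["eps_start"] - hp["eps_end"]) * math.exp(-step / hp["eps_decay"])
```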

6.2. Experimental Results Evaluation

The test sequences comprise classes A, B, C, D, E, and F, as shown in Table 1. The experiment tests the QP values (22, 27, 32, 37) for each sequence using the VTM13.0 [44] encoder. During the actual encoding tests, each video sequence is encoded according to the CTC standard with the frame count set to 30, and the initial QP decision is completed after the first I-frame and P-frame of each sequence are encoded. BD-rate and BD-PSNR calculations take VTM's default rate control algorithm as the reference. The method of Gao et al. [6] is a traditional machine learning approach to initial QP selection in video coding rate control; as it is likewise a single-pass encoding method, it serves as the comparable-algorithm baseline in this paper. The multi-pass Fixed-QP algorithm, which uses constant QPs across frames, provides an optimal baseline for rate-distortion performance, albeit with higher computational complexity. For a fair comparison, the inter-frame and CTU-level rate control algorithms are kept consistent with the default standard.
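BD-rate and BD-PSNR follow the standard Bjøntegaard metric. For reference, a compact numpy sketch of the BD-rate computation over two rate-distortion curves is given below; it is a generic implementation, not the exact script used to produce Table 2.

```python
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Bjontegaard delta rate (%) of a test R-D curve against a reference curve."""
    lr_ref, lr_test = np.log(rates_ref), np.log(rates_test)
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)           # cubic fit of log-rate vs PSNR
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))           # overlapping PSNR interval
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)       # average log-rate difference
    return (np.exp(avg_diff) - 1.0) * 100.0           # percentage rate saving (< 0 is better)
```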
Table 2 presents the comparison results for initial QP selection in rate control. Using single-pass rate control quantization as a baseline, the proposed hierarchical reinforcement learning (Proposed HRL) method achieves an average BD-rate reduction of −10.531% and a BD-PSNR gain of 0.506 dB, significantly outperforming the initial QP approach in VTM13 [44] and approaching the optimal performance of multi-pass Fixed-QP. The Proposed HRL approach demonstrates substantial improvements in encoding quality and efficiency when compared to traditional VTM methods and closely approximates the optimal baseline achieved by Fixed QP.
The main benefits of the Proposed HRL approach over VTM are reflected in the reductions in BD-rate and increases in BD-PSNR. A reduced BD-rate indicates that the Proposed HRL requires less bandwidth at comparable quality levels, enhancing encoding efficiency. Across diverse video types, the Proposed HRL approach reduces the BD-rate by approximately 10%, with the most significant improvements observed in high-motion or complex-background sequences. In terms of BD-PSNR, the Proposed HRL method markedly enhances visual quality, leading to improved stability and lower bitrate fluctuations through adaptive initial quantization parameter selection.
From a category comparison perspective, the Proposed HRL consistently improves performance across video classes A, B, C, D, E, and F, although differences emerge in specific scenes. In the relatively static scenes of category A, the Proposed HRL method reduces the BD-rate by 7.8% and improves BD-PSNR by 0.36 dB on the "Campfire" sequence, demonstrating strong adaptability to low-complexity scenes. Category B, with more complex scenes and motion, sees further improvement, with a BD-rate reduction of 7.1% on "Cactus", highlighting the Proposed HRL's advantage in high-motion contexts. For category C, consisting of medium-complexity videos, the Proposed HRL consistently boosts BD-PSNR by over 0.326 dB with a BD-rate reduction of 7.3%, indicating adaptability to moderate-complexity scenes. In the complex sequences of categories D, E, and F, the Proposed HRL shows strong control over the BD-rate, with reductions of over 10% and BD-PSNR improvements exceeding 0.35 dB, particularly notable in the high-complexity sequences of class F. Overall, these results affirm the Proposed HRL's consistent applicability to low-, medium-, and high-complexity videos, with stable gains in BD-rate reduction and BD-PSNR improvement. The method of Gao et al. [6] is a traditional machine learning approach that uses lightweight linear learning for feature vector extraction and dataset construction. As Table 2 shows, the per-sequence results of the Proposed HRL method are significantly better than those of Gao et al. [6] and are close to the results of the optimal multi-pass encoding method, Fixed QP.
The Proposed HRL method approaches the optimal results achieved by “Fixed QP” in most video sequences, nearly matching the BD-rate and BD-PSNR results of Fixed QP in classes A and B. In category A’s “Campfire” sequence, the Proposed HRL achieves a BD-rate of −7.834%, only 0.589% higher than Fixed QP, demonstrating near-optimal performance in low-complexity static scenes. In the category C “BQMall” sequence, the BD-rate and BD-PSNR results of the Proposed HRL are within 1% of Fixed QP, showing competitiveness in moderate-motion complexity. For certain sequences, the Proposed HRL’s BD-PSNR values are within 0.1 dB of Fixed QP; for example, the “Cactus” and “DaylightRoad2” sequences have BD-PSNR differences of only 0.059 dB and 0.108 dB, respectively. These results suggest that the Proposed HRL closely approximates the optimal performance of Fixed QP, particularly in balancing encoding quality and rate control.
Compared to the multi-pass Fixed QP approach, the Proposed HRL is an efficient single-pass method that achieves near-optimal quality and rate control while reducing computational resource requirements. This efficiency enhances its practical application potential, particularly in real-time scenarios with high encoding demands. Additionally, the Proposed HRL’s consistency and stability across varying content complexities underscore its strong generalization capability, effectively handling diverse encoding scenarios and content. By training high- and low-level policy networks offline, the Proposed HRL achieves intelligent initial quantization parameter selection, improving video encoding efficiency, significantly lowering the BD-rate, and enhancing BD-PSNR, thus achieving near-optimal encoding quality with reduced computational costs.
Figure 2 illustrates the rate-distortion curves, showing the rate-distortion performance achieved by the Proposed HRL across various video sequences. Compared to VTM, the Proposed HRL offers substantial improvements and nearly overlaps with the optimal Fixed QP control, meeting the anticipated rate control targets in initial quantization parameter selection. Figure 2 includes PSNR-bitrate plots for eight video sequences (BQTerrace, BasketballDrive, FourPeople, Johnny, BQMall, PartyScene, BasketballPass, and RaceHorses). The Proposed HRL consistently outperforms VTM, achieving higher PSNR at comparable bitrates, reflecting lower distortion and higher encoding efficiency. Our method also significantly outperforms the method of Gao et al. [6]. Additionally, the Proposed HRL's curve closely aligns with Fixed QP, indicating near-optimal single-pass performance approaching multi-pass optimization.
The Proposed HRL shows adaptability across various sequence types. In high-motion sequences such as RaceHorses and BasketballDrive, HRL effectively manages bitrate and enhances encoding quality, demonstrating suitability for complex scenes. In relatively static sequences, such as FourPeople and Johnny, HRL also achieves near-optimal distortion control, confirming its robustness in low-complexity scenes.
For sequences such as BasketballDrive, BQMall, and RaceHorses, HRL nearly overlaps with Fixed QP, achieving near-optimal encoding quality while conserving bandwidth. Overall, the Proposed HRL consistently surpasses VTM and closely matches the optimal results from Fixed QP, highlighting its adaptability, stability, and the value of hierarchical reinforcement learning in adaptive initial quantization parameter selection.
Figure 3 presents the subjective quality comparison on the BQTerrace sequence, contrasting the VTM13.0 RC default, Fixed-QP, and the Proposed HRL method. The figure shows a typical decoded frame from BQTerrace, revealing finer textures and clearer details in the HRL and Fixed-QP images compared to VTM. This visual clarity, especially in detail reproduction and edge sharpness, demonstrates the effectiveness of the proposed method in enhancing image quality.

6.3. Computational Complexity and HRL Model Convergence Analysis

Computational complexity is a key concern for applying the algorithm in practical systems. Table 3 compares our method with other methods in terms of computational complexity. The time overhead is calculated as $\Delta T = \frac{T_{pro} - T_{VTM}}{T_{VTM}} \times 100\%$, where T_pro and T_VTM are the cumulative encoding times of the evaluated algorithm and of VTM, respectively. The results show that, under conventional encoding conditions, training and testing the encoder control optimization module increases computational complexity by an additional 189.7%. For the HRL model, most of the computational time is spent on training the model parameters; it should be noted that these experiments were performed on a CPU platform rather than a GPU. Because this paper adopts an offline training method, the algorithm execution does not involve training costs. Excluding training time, the computational complexity of our scheme is only slightly higher than that of VTM or of Gao et al.'s method [6], which is negligible, while achieving better R-D performance than traditional methods.
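For clarity, the ΔT overhead in Table 3 is simply the relative increase in cumulative encoding time, as in the following one-line sketch.

```python
def delta_t(t_pro, t_vtm):
    """Encoding-time overhead relative to VTM, in percent."""
    return (t_pro - t_vtm) / t_vtm * 100.0

# Example from Table 3: excluding training time, the proposed method adds about 5.3% on average.
```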
Figure 4 illustrates the comparison of training rewards for the high-level and low-level strategies in a hierarchical reinforcement learning (HRL) algorithm. The high-level reward (blue curve) demonstrates a consistently higher value compared to the low-level reward (orange curve) throughout the training process, reflecting its broader strategic role in guiding the initial quantization parameter (QP) range. Both curves exhibit a gradual increase during the early training iterations, indicating effective learning of the reward function and improvement in decision-making policies.
The convergence of both curves around iteration 800 suggests that the HRL framework achieves stability and optimal policy performance. The high-level strategy converges faster and stabilizes at a higher reward, showcasing its efficiency in determining the coarse-grained QP range. Meanwhile, the low-level strategy stabilizes slightly later, as it refines QP selection within the range defined by the high-level policy. This hierarchical structure effectively balances exploration and exploitation, ensuring robust performance in optimizing rate-distortion trade-offs. These results validate the network’s convergence and the reward function’s ability to drive both layers toward optimal performance.

7. Conclusions

This paper addresses the critical challenge of initial quantization parameter (QP) selection in H.266/VVC rate control by proposing a hierarchical reinforcement learning (HRL) framework. By decomposing the QP selection task into high-level and low-level strategies, the framework effectively handles coarse- and fine-grained QP decisions. The high-level strategy uses global video features and network conditions to determine a rough QP range, reducing the search space. The low-level strategy refines the QP within this range using finer-grained features, enhancing decision accuracy. Experimental results demonstrate that the proposed method significantly improves encoding quality and rate control accuracy compared to traditional approaches. It achieves a higher peak signal-to-noise ratio (PSNR), reduces the BD-rate, and minimizes frame-level quality fluctuations. These results validate HRL’s effectiveness in solving high-dimensional, nonlinear decision-making challenges, particularly in adaptive QP selection. Key advantages of the method include strong adaptability to varying video content and network conditions, high efficiency through search space reduction, and robust performance across diverse scenarios. The approach consistently delivers results close to the multi-pass Fixed-QP method, achieving near-optimal rate-distortion performance even in high-motion and high-complexity scenes. Furthermore, its single-pass design significantly reduces computational resources, making it suitable for real-time applications. In summary, this study highlights the feasibility and effectiveness of HRL for adaptive initial QP selection, demonstrating significant improvements in encoding quality and rate control accuracy. The proposed method offers valuable insights for advancing video encoding technologies while balancing performance and efficiency.

Author Contributions

Conceptualization and methodology, S.H. and C.S.; software, Z.D. and B.J.; validation, S.H., C.S. and Z.D.; formal analysis, B.J.; investigation, S.H.; data curation, C.S. and Z.D.; writing—original draft preparation, S.H. and Z.D.; writing—review and editing, J.L. and Z.D.; visualization, S.T.; supervision, C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Key R&D Program of Hainan Province ZDYF2019010, ZDYF2021GXJS010 and No. WSJK2024MS234, the National Natural Science Foundation of China No. 61562023, and the Major Science and Technology Project of Haikou City No. 2020006.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Y.; Wang, S.; Ip, H.; Kwong, S. Rate distortion optimization with adaptive content modeling for random-access versatile video coding. Inf. Sci. 2023, 645, 119325. [Google Scholar] [CrossRef]
  2. Wei, X.; Zhou, M.; Wang, H.; Yang, H.; Chen, L.; Kwong, S. Recent advances in rate control: From optimization to implementation and beyond. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 17–33. [Google Scholar] [CrossRef]
  3. Liu, D.; Li, Y.; Lin, J.; Li, H.; Wu, F. Deep learning-based video coding: A review and a case study. ACM Comput. Surv. (CSUR) 2020, 53, 1–35. [Google Scholar] [CrossRef]
  4. Li, Y.; Chen, Z. Rate Control for VVC, Document JVET-K0390. In Proceedings of the JVET 11th Meeting, Ljubljana, Slovenia, 10–18 July 2018. [Google Scholar]
  5. Bross, B.; Wang, Y.K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G.J. Overview of the versatile video coding (VVC) standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764. [Google Scholar] [CrossRef]
  6. Gao, W.; Kwong, S.; Jiang, Q.; Fong, C.-K.; Wong, P.H.W.; Yuen, W.Y.F. Data-driven rate control for rate-distortion optimization in HEVC based on simplified effective initial QP learning. IEEE Trans. Broadcast. 2019, 65, 94–108. [Google Scholar] [CrossRef]
  7. Yang, Z.; Gao, W.; Li, G.; Yan, Y. Sur-driven video coding rate control for jointly optimizing perceptual quality and buffer control. IEEE Trans. Image Process. 2023, 32, 5451–5464. [Google Scholar] [CrossRef]
  8. Guo, H.; Zhu, C.; Xu, M.; Li, S. Inter-block dependency-based CTU level rate control for HEVC. IEEE Trans. Broadcast. 2019, 66, 113–126. [Google Scholar] [CrossRef]
  9. Li, Y.; Mou, X. Joint optimization for SSIM-based CTU-level bit allocation and rate distortion optimization. IEEE Trans. Broadcast. 2021, 67, 500–511. [Google Scholar] [CrossRef]
  10. Li, L.; Yan, N.; Li, Z.; Liu, S.; Li, H. λ-domain perceptual rate control for 360-degree video compression. IEEE J. Sel. Top. Signal Process. 2019, 14, 130–145. [Google Scholar] [CrossRef]
  11. Chen, Y.; Wang, M.; Wang, S.; Ni, Z.; Kwong, S. A CTU-level screen content rate control for low-delay versatile video coding. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 5227–5241. [Google Scholar] [CrossRef]
  12. Mao, Y.; Wang, M.; Wang, S.; Kwong, S. High efficiency rate control for versatile video coding based on composite Cauchy distribution. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 2371–2384. [Google Scholar] [CrossRef]
  13. Lin, J.; Huang, A.; Zhao, T.; Wang, X.; Kwong, S. λ-domain VVC rate control based on nash equilibrium. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 3477–3487. [Google Scholar] [CrossRef]
  14. Mao, Y.; Wang, M.; Ni, Z.; Wang, S.; Kwong, S. Neural network based rate control for versatile video coding. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6072–6085. [Google Scholar] [CrossRef]
  15. Wang, T.; Li, F.; Cosman, P.C. Learning-based rate control for video-based point cloud compression. IEEE Trans. Image Process. 2022, 31, 2175–2189. [Google Scholar] [CrossRef]
  16. Chen, Y.; Mao, Y.; Wang, S.; Zhang, X.; Kwong, S. Learning from Coding Features: High Efficiency Rate Control for AOMedia Video 1. IEEE MultiMedia 2023, 30, 16–25. [Google Scholar] [CrossRef]
  17. Zhao, Z.; He, X.; Xiong, S.; He, L.; Chen, H.; Sheriff, R.E. A high-performance rate control algorithm in versatile video coding based on spatial and temporal feature complexity. IEEE Trans. Broadcast. 2023, 69, 753–766. [Google Scholar] [CrossRef]
  18. Liao, J.; Li, L.; Liu, D.; Li, H. Content-adaptive Rate-Distortion Modeling for Frame-level Rate Control in Versatile Video Coding. IEEE Trans. Multimed. 2024, 26, 6864–6879. [Google Scholar] [CrossRef]
  19. Liu, F.; Chen, Z. Multi-objective optimization of quality in VVC rate control for low-delay video coding. IEEE Trans. Image Process. 2021, 30, 4706–4718. [Google Scholar] [CrossRef]
  20. Liu, H.; Zhu, S.; Zeng, B. Inter-frame dependency-based rate control for vvc low-delay coding. IEEE Signal Process. Lett. 2022, 29, 2727–2731. [Google Scholar] [CrossRef]
  21. Gao, W.; Jiang, Q.; Wang, R.; Ma, S.; Li, G.; Kwong, S. Consistent quality oriented rate control in HEVC via balancing intra and inter frame coding. IEEE Trans. Ind. Inform. 2021, 18, 1594–1604. [Google Scholar] [CrossRef]
  22. Yan, T.; Ra, I.H.; Wen, H.; Weng, M.-H.; Zhang, Q.; Che, Y. CTU layer rate control algorithm in scene change video for free-viewpoint video. IEEE Access 2020, 8, 24549–24560. [Google Scholar] [CrossRef]
  23. Chen, Y.; Kwong, S.; Zhou, M.; Wang, S.; Zhu, G.; Wang, Y. Intra frame rate control for versatile video coding with quadratic rate-distortion modelling. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 4422–4426. [Google Scholar]
  24. Zhou, Y.; Xu, G.; Tang, K.; Tian, L.; Sun, Y. Video coding optimization in AVS2. Inf. Process. Manag. 2022, 59, 102808. [Google Scholar] [CrossRef]
  25. Pan, Z.; Yi, X.; Zhang, Y.; Yuan, H.; Wang, F.L.; Kwong, S. Frame-level Bit Allocation Optimization Based on Video Content Characteristics for HEVC. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2020, 16, 1–20. [Google Scholar] [CrossRef]
  26. HoangVan, X. Adaptive quantization parameter estimation for HEVC based surveillance scalable video coding. Electronics 2020, 9, 915. [Google Scholar] [CrossRef]
  27. Chen, Z.; Shi, J.; Li, W. Learned fast HEVC intra coding. IEEE Trans. Image Process. 2020, 29, 5431–5446. [Google Scholar] [CrossRef] [PubMed]
  28. Hu, J.H.; Peng, W.H.; Chung, C.H. Reinforcement learning for HEVC/H.265 intra-frame rate control. In Proceedings of the 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018; pp. 1–5. [Google Scholar]
  29. Smirnov, N.; Tomforde, S. Real-time rate control of webrtc video streams in 5g networks: Improving quality of experience with deep reinforcement learning. J. Syst. Archit. 2024, 148, 103066. [Google Scholar] [CrossRef]
  30. Li, N.; Zhang, Y.; Zhu, L.; Luo, W.; Kwong, S. Reinforcement learning based coding unit early termination algorithm for high efficiency video coding. J. Vis. Commun. Image Represent. 2019, 60, 276–286. [Google Scholar] [CrossRef]
  31. Helle, P.; Schwarz, H.; Wiegand, T.; Müller, K.-R. Reinforcement learning for video encoder control in HEVC. In Proceedings of the 2017 International Conference on Systems, Signals and Image Processing (IWSSIP), Poznań, Poland, 22–24 May 2017; pp. 1–5. [Google Scholar]
  32. Chen, S.; Aramvith, S.; Miyanaga, Y. Learning-Based Rate Control for High Efficiency Video Coding. Sensors 2023, 23, 3607. [Google Scholar] [CrossRef] [PubMed]
  33. Ren, G.; Liu, Z.; Chen, Z.; Liu, S. Reinforcement learning based ROI bit allocation for gaming video coding in VVC. In Proceedings of the 2021 International Conference on Visual Communications and Image Processing (VCIP), Munich, Germany, 5–8 December 2021; pp. 1–5. [Google Scholar]
  34. Zhang, H.; Li, J.; Li, B.; Lu, Y. A deep reinforcement learning approach to multiple streams’ joint bitrate allocation. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 2415–2426. [Google Scholar] [CrossRef]
  35. Zhou, M.; Wei, X.; Kwong, S.; Jia, W.; Fang, B. Rate control method based on deep reinforcement learning for dynamic video sequences in HEVC. IEEE Trans. Multimedia. 2020, 23, 1106–1121. [Google Scholar] [CrossRef]
  36. Hutsebaut-Buysse, M.; Mets, K.; Latré, S. Hierarchical reinforcement learning: A survey and open research challenges. Mach. Learn. Knowl. Extr. 2022, 4, 172–221. [Google Scholar] [CrossRef]
  37. Luo, J.; Xu, C.; Geng, X.; Feng, G.; Fang, K.; Tan, L. Multi-stage cable routing through hierarchical imitation learning. IEEE Trans. Robot. 2024, 40, 1476–1491. [Google Scholar] [CrossRef]
  38. Yuan, H.; Gao, W.; Ma, S.; Yan, Y. Divide-and-conquer-based RDO-free CU partitioning for 8K video compression. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 1–20. [Google Scholar] [CrossRef]
  39. Yuan, H.; Wang, Q.; Liu, Q.; Huo, J.; Li, P. Hybrid distortion-based rate-distortion optimization and rate control for H.265/HEVC. IEEE Trans. Consum. Electron. 2021, 67, 97–106. [Google Scholar] [CrossRef]
  40. Sutton, R.S.; Precup, D.; Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell. 1999, 112, 181–211. [Google Scholar] [CrossRef]
  41. Xie, G.; Li, X.; Lin, S.; Chen, Z.; Zhang, L.; Zhang, K. Hierarchical reinforcement learning based video semantic coding for segmentation. In Proceedings of the 2022 IEEE International Conference on Visual Communications and Image Processing (VCIP), Suzhou, China, 24 August 2022; pp. 1–5. [Google Scholar]
  42. Lee, J.K.; Kim, N.; Kang, J.W. Reinforcement learning for rate-distortion optimized hierarchical prediction structure. IEEE Access 2023, 11, 20240–20253. [Google Scholar] [CrossRef]
  43. Andersson, K.; Enhorn, J.; Sjöberg, R.; Ström, J.; Litwic, L. Addition of a GOP Hierarchy of 32 for Random Access Configuration for VTM, Document JVET-S0180. In Proceedings of the JVET 19th Meeting, Geneva, Switzerland, 22 June–1 July 2020. [Google Scholar]
  44. VVC Software, VTM-13.0. Available online: https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/tags/VTM-13.0/ (accessed on 20 February 2022).
Figure 1. Hierarchical reinforcement learning framework for initial quantization parameter selection.
Figure 2. Rate-distortion curves of rate control methods for various video sequences [6].
Figure 3. Subjective quality comparison.
Figure 4. Comparison of training rewards for the high-level and low-level policies.
Table 1. CTC standard test sequences.

Class | Video Sequence | Resolution | Frame Num. | Frame Rate (fps) | Bit Depth (bit)
A | Campfire | 3840 × 2160 | 300 | 60 | 10
A | ParkRunning3 | 3840 × 2160 | 300 | 50 | 10
A | Tango2 | 3840 × 2160 | 294 | 60 | 10
A | DaylightRoad2 | 3840 × 2160 | 300 | 60 | 10
B | MarketPlace | 1920 × 1080 | 600 | 60 | 8
B | RitualDance | 1920 × 1080 | 600 | 60 | 8
B | Cactus | 1920 × 1080 | 500 | 50 | 8
B | BasketballDrive | 1920 × 1080 | 500 | 50 | 8
B | BQTerrace | 1920 × 1080 | 600 | 60 | 8
C | RaceHorses | 832 × 480 | 300 | 30 | 8
C | BQMall | 832 × 480 | 600 | 60 | 8
C | PartyScene | 832 × 480 | 500 | 50 | 8
C | BasketballDrill | 832 × 480 | 500 | 50 | 8
D | RaceHorses | 416 × 240 | 300 | 30 | 8
D | BQSquare | 416 × 240 | 600 | 60 | 8
D | BlowingBubbles | 416 × 240 | 500 | 50 | 8
D | BasketballPass | 416 × 240 | 500 | 50 | 8
E | FourPeople | 1280 × 720 | 600 | 60 | 8
E | Johnny | 1280 × 720 | 600 | 60 | 8
E | KristenAndSara | 1280 × 720 | 600 | 60 | 8
F | BasketballDrillText | 832 × 480 | 500 | 50 | 8
F | ChinaSpeed | 1024 × 768 | 500 | 30 | 8
F | SlideEditing | 1280 × 720 | 300 | 30 | 8
F | SlideShow | 1280 × 720 | 500 | 20 | 8
Table 2. Rate-distortion performance comparison (VTM13.0).

Class | Video Sequence | Gao et al. [6] BD-Rate (%) | Gao et al. [6] BD-PSNR (dB) | Proposed HRL BD-Rate (%) | Proposed HRL BD-PSNR (dB) | Fixed QP BD-Rate (%) | Fixed QP BD-PSNR (dB)
A | Tango2 | −11.231 | 0.254 | −12.481 | 0.276 | −12.762 | 0.312
A | ParkRunning3 | −6.450 | 0.462 | −6.260 | 0.473 | −7.234 | 0.512
A | Campfire | −7.642 | 0.352 | −7.834 | 0.364 | −8.423 | 0.457
A | DaylightRoad2 | −10.534 | 0.423 | −11.725 | 0.415 | −13.542 | 0.523
A | Average | −8.964 | 0.373 | −9.575 | 0.382 | −10.490 | 0.451
B | Cactus | −6.832 | 0.261 | −7.123 | 0.265 | −10.470 | 0.324
B | BasketballDrive | −18.670 | 0.681 | −17.570 | 0.683 | −22.950 | 0.760
B | BQTerrace | −5.424 | 0.206 | −6.124 | 0.216 | −7.417 | 0.247
B | Average | −10.309 | 0.383 | −10.272 | 0.388 | −13.612 | 0.444
C | RaceHorses | −8.844 | 0.325 | −9.100 | 0.327 | −11.500 | 0.481
C | BQMall | −6.839 | 0.322 | −7.120 | 0.328 | −7.180 | 0.331
C | PartyScene | −4.131 | 0.206 | −4.250 | 0.208 | −4.871 | 0.248
C | BasketballDrill | −8.845 | 0.435 | −8.749 | 0.439 | −6.690 | 0.353
C | Average | −7.165 | 0.322 | −7.305 | 0.326 | −7.560 | 0.353
D | RaceHorses | −9.577 | 0.389 | −9.677 | 0.391 | −11.950 | 0.660
D | BQSquare | −9.770 | 0.412 | −9.888 | 0.415 | −6.638 | 0.371
D | BlowingBubbles | −5.011 | 0.292 | −5.120 | 0.291 | −5.393 | 0.294
D | BasketballPass | −5.788 | 0.357 | −5.873 | 0.361 | −6.566 | 0.378
D | Average | −7.537 | 0.363 | −7.640 | 0.365 | −7.637 | 0.426
E | FourPeople | −6.181 | 0.465 | −6.256 | 0.470 | −9.545 | 0.646
E | Johnny | −15.148 | 0.491 | −15.428 | 0.489 | −17.960 | 0.578
E | KristenAndSara | −10.522 | 0.612 | −11.233 | 0.611 | −15.140 | 0.728
E | Average | −10.617 | 0.523 | −10.972 | 0.523 | −14.215 | 0.651
F | BasketballDrillText | −4.976 | 0.256 | −5.176 | 0.257 | −6.310 | 0.304
F | ChinaSpeed | −4.280 | 0.212 | −5.190 | 0.232 | −5.493 | 0.308
F | SlideEditing | −25.430 | 3.153 | −24.890 | 3.183 | −32.690 | 3.292
F | SlideShow | −33.560 | 0.411 | −34.610 | 0.431 | −41.940 | 0.454
F | Average | −17.062 | 1.008 | −21.563 | 1.282 | −21.608 | 1.090
Total | Average | −9.404 | 0.457 | −10.531 | 0.506 | −11.361 | 0.523
Table 3. Complexity comparison with VTM (ΔT, %).

Sequence Class | Including Training Time: Gao et al. [6] | Including Training Time: Ours | Without Training Time: Gao et al. [6] | Without Training Time: Ours
A | 4.5 | 200.0 | 4.5 | 5.1
B | 4.4 | 198.0 | 4.4 | 5.2
C | 4.3 | 180.0 | 4.3 | 5.3
D | 4.6 | 179.0 | 4.6 | 5.2
E | 4.3 | 192.0 | 4.3 | 5.3
F | 4.2 | 189.0 | 4.2 | 5.4
Average | 4.4 | 189.7 | 4.4 | 5.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
