A Trellis Based Temporal Rate Allocation and Virtual Reference Frames for High Efficiency Video Coding

Abstract: The High Efficiency Video Coding (HEVC) standard has now become the most popular video coding solution for video conferencing, broadcasting, and streaming. However, its compression performance is still a critical issue for its adoption in a large number of emerging video applications with higher spatial and temporal resolutions. To advance the current HEVC performance, we propose an efficient temporal rate allocation solution. The proposed method adaptively allocates the compression bitrate for each coded picture in a group of pictures by using a trellis-based dynamic programming approach. To achieve this task, we trained the trellis-based quantization parameter for each frame in a group of pictures considering the temporal layer position. We further improved coding efficiency by incorporating our proposed framework with other inter prediction methods such as a virtual reference frame. Experiments showed around 2% and 5% bitrate savings with our trellis-based rate allocation method with and without a virtual reference frame compared to the conventional HEVC standard, respectively.


Context and Motivations
The wide growth of multimedia applications always demands more powerful video transmission over the internet with high compression performance as well as low complexity. Therefore, the Joint Collaborative Team on Video Coding (JCT-VC) announced the High Efficiency Video Coding (HEVC) [1] with 50% bitrate reduction at the same perceptual quality as the previous H.264/AVC standard [2]. Despite its success, there is still a demand for beyond HEVC coding to meet emerging requirements of higher resolutions (i.e., 4K, 8K) and diverse contents (screen contents, drone, etc.). In this context, in the newest video compression standard, H.266/Versatile Video Coding (VVC), many coding tools have been proposed, such as partitioning with quad-tree plus binary, adaptive transforms, new intra modes, affine motion estimation, dependent quantization, etc. [2]. Among all, inter coding tools are always bringing the highest bitrate reduction due to the high redundancy between consecutive temporal frames. Besides the affine motion estimation, virtual reference frame and temporal rate control are active research topics to improve the inter coding performance [3].
Rate control is one of the most powerful encoding tools to advance the rate-distortion (RD) performance of any video codec. In general, at a given quality, rates are allocated from coarse to fine: (i) among the frames of a group of pictures (GOP), i.e., temporal rate allocation; then (ii) among coding units, i.e., spatial rate allocation; or (iii) adaptively, to reduce distortion propagation. Adaptive approaches are complicated and demand computation and memory for either pre-analysis or online analysis. As a result, the common test conditions of the most recent video coding standards, including H.266/VVC, still favor a simple hierarchical rate allocation at the frame level [2]. For each GOP, HEVC and VVC assign each frame a temporal index (Tid), and depending on its hierarchical location and frame type (i.e., intra or inter), the quantization parameter (QP) is adjusted by a fixed value. This framework is very simple and makes it easy to validate the performance of coding tools during the standardization process. However, with increasing competition from AOMedia Video 1 (AV1) [4] with an adaptive GOP, an effective temporal rate allocation for MPEG codecs is in demand.
HEVC adopts a simple temporal rate control via its hierarchical coding structure [1]. HEVC controls the bits for each frame in a group of pictures (GOP) through the temporal index associated with its hierarchical reference model, adjusting the quantization parameter of each frame so that high-Tid frames are encoded at a higher QP. However, this simple method is not effective in preventing the propagation of distortion. Researchers have proposed adaptively allocating bitrates to each frame via a rate-distortion optimized scheme [5][6][7]. These methods, however, required modifications in the decoder, which makes them noncompliant with the HEVC standard. In contrast, adapting the quantization parameter at the frame level is an encoder-only tool that is easy to modify in the configuration and can significantly improve the overall compression performance. Being encoder-only, it is often left out of the MPEG (Moving Picture Experts Group) standardization process; thus, there are few works on this research topic.
Virtual reference frame (VRF) is another approach to improving inter coding by (i) generating virtual frames and (ii) utilizing the virtual frames as an additional reference or to guide frame reconstruction [8]. Several works utilized motion estimation or advanced deep learning frameworks to interpolate frames from decoded frames [9][10][11][12][13]. The deep learning approach shows higher gain but also introduces tremendous complexity in both the encoder and decoder [11][12][13]. On the other hand, virtual frames are utilized as additional reference frames for motion estimation or for reducing the syntax transmission overhead as a special merge mode. Although there are many works on VRF, the impact of virtual frames on coding performance has not been sufficiently investigated.

Contributions and Paper Organization
Although the virtual referencing-based method has provided important compression improvement for HEVC, there is still room for further improving HEVC performance. Previous works researched either the virtual reference frame or temporal rate control separately. Firstly, prior VR creation mainly relied on motion estimation and interpolation-based approaches, which may be ineffective for fast-motion content and for content compressed at high QPs or low rates. Secondly, prior VR frames are used for all B-slices, which may not always be effective in terms of rate-distortion optimization (RDO), especially for pictures at low temporal layer positions. Finally, since the quality of the VR frames highly depends on the quality of the existing decoded references, adaptively quantized frames at different layers would impact the performance of inter coding. In this context, this paper proposes: (i) a novel virtual reference frame creation where a multiple hypothesis motion estimation method is used for frame generation; (ii) an efficient rate allocation algorithm in which the trellis coding method is used to learn the temporal rate allocation.
We conducted a rich set of experiments with numerous train and test videos and showed around 5% BD-rate gains on top of the most recent HEVC reference software. The rest of this paper is organized as follows. Section 2 briefly describes the background work on HEVC inter coding and the related works on rate control. Section 3 presents the proposed VRF creation and temporal rate allocation, while Section 4 assesses the coding performance of the proposed methods. Finally, we summarize the contributions of this paper in Section 5.

Background and Related Works
HEVC adopted inter-coding tools to take advantage of the temporal correlation among frames in a GOP. This paper introduces a novel temporal rate allocation and virtual reference frame exploitation for HEVC inter coding; hence, this section briefly describes the background on HEVC inter coding and then analyzes the related works on HEVC rate control.

HEVC Inter Coding
Compared to the prior inter-coding tools of the H.264/AVC standard [2], HEVC introduces several improvements in coding tools and coding structure, notably (i) block partitioning and (ii) motion estimation and reference picture management.
Block partition: HEVC inter coding allows compressing a picture with more block partition shapes than intra coding. The partition modes PART_2N × 2N, PART_2N × N, and PART_N × 2N indicate the cases in which the coding block is not split, split into two equal-size prediction blocks horizontally, and split into two equal-size prediction blocks vertically, respectively [1], while PART_N × N refers to the case in which the coding block is split into four equal-size blocks. In addition, there are four asymmetric motion partitions, PART_2N × nU, PART_2N × nD, PART_nL × 2N, and PART_nR × 2N (see Figure 1). For each coding tree unit (CTU), the rate-distortion optimization (RDO) process is computed recursively to find the optimal coding and prediction units, notably from 64 × 64 to 8 × 8.
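The partition geometry described above can be made concrete with a small lookup. The following is a minimal sketch, not encoder code: the function name `prediction_blocks` is hypothetical, and the mode names are written with a plain `x` instead of `×`. It returns the (width, height) of each prediction block for a coding block of size 2N × 2N, with the asymmetric modes splitting one side at a quarter of the block size.

```python
def prediction_blocks(part_mode, size):
    """Return the (width, height) of each prediction block.

    `size` is the coding block edge length (2N). Hypothetical helper for
    illustration only.
    """
    n = size // 2   # N
    q = size // 4   # N/2, the short side of the asymmetric partitions
    table = {
        "PART_2Nx2N": [(size, size)],
        "PART_2NxN": [(size, n)] * 2,
        "PART_Nx2N": [(n, size)] * 2,
        "PART_NxN": [(n, n)] * 4,
        "PART_2NxnU": [(size, q), (size, size - q)],
        "PART_2NxnD": [(size, size - q), (size, q)],
        "PART_nLx2N": [(q, size), (size - q, size)],
        "PART_nRx2N": [(size - q, size), (q, size)],
    }
    return table[part_mode]


MODES = ["PART_2Nx2N", "PART_2NxN", "PART_Nx2N", "PART_NxN",
         "PART_2NxnU", "PART_2NxnD", "PART_nLx2N", "PART_nRx2N"]
```

For a 32 × 32 coding block, `prediction_blocks("PART_2NxnU", 32)` gives a 32 × 8 block on top of a 32 × 24 block; every mode covers the full coding block area.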

Motion estimation and reference picture management: Similar to prior video coding standards, HEVC inter coding employs motion estimation to find the best-matched block in reference frames to reduce the redundancy between successive frames. For reference picture management, HEVC stores the previously decoded pictures in a decoded picture buffer (DPB). To identify these pictures, a list of picture order count (POC) identifiers is transmitted in each slice header, and the set of reference pictures is called the reference picture set (RPS), as shown in Figure 2.
The HEVC DPB also contains two lists, list 0 and list 1, which are referenced through the reference picture index. For bi-prediction, e.g., in the random access coding configuration, two pictures are selected (one from each list).

HEVC Rate Control
Rate control has been an important research topic in video coding over the years. Exploiting the temporal domain is an effective way to reduce the transmission rate. HEVC uses a fixed hierarchical bit rate allocation. It controls the bits for each frame in a group of pictures (GOP) through the temporal index (Tid) associated with its hierarchical reference model, depicted in Figure 2. By simply adjusting the quantization parameter (QP) of each frame, high-Tid frames are encoded at a higher QP. However, this simple method in HEVC is unable to prevent distortion propagation. Therefore, various rate control algorithms were developed for video coding standards. For HEVC, an early rate control algorithm was proposed using a pixel-wise unified rate-quantization (R-Q) model [14]. This R-Q model is almost the same as the conventional quadratic R-D model in [15]. Afterward, Li et al. [16] proposed an R-λ model, where λ is the Lagrange multiplier based on a frame's complexity. This rate control algorithm allocates target bits to a GOP, a frame, or a CTU. Further extending this work, the authors in [17] proposed a gradient-based R-λ model and inter-frame rate control for HEVC. In order to enhance facial detail for video conferencing, the work in [18] developed a rate control algorithm based on a weighted R-λ model, which can allocate more target bits to important regions in a video frame. Considering the coding performance of intra coding, a structural similarity (SSIM) based game theory approach was introduced in [19] to optimize the CTU-level bit allocation. Similarly, a ρ-domain bit allocation and a rate control algorithm were proposed in [20] to optimize bit allocation for key frames. Recently, the authors in [21] presented a highly parallel hardware architecture for rate estimation in HEVC intra coding.
Although the aforementioned rate control algorithms [14][15][16][17][18][19][20] show promising performance for HEVC, their high computational complexity may not be suitable for a number of emerging video applications. In this work, we focused on developing a frame-level rate control algorithm.
To do so, a learning process requires training data with raw video content at various sizes and resolutions. Using recent learning frameworks such as deep learning [13] would therefore require frame-level features for multiple frames in each GOP and thus demand a tremendous amount of computation, not only in training but also in deployment. Additionally, learning frameworks often rely on a quality metric such as MSE, which does not reflect well the BDBR score commonly used in codec development. As a result, with our limited computing resources, we were not able to use a more expensive learning algorithm. To address this problem, and motivated by RDOQ, we proposed a simple but efficient temporal rate allocation solution for HEVC that follows a greedy method based on the trellis coding approach.
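The fixed hierarchical allocation described in this section, together with the per-Tid adjustment that a frame-level scheme would add on top of it, can be sketched as follows. This is a minimal illustration under stated assumptions: a dyadic GOP of 8 with four temporal layers, intra pictures ignored, and `BASE_OFFSET` values chosen for illustration rather than taken from this paper. The names `temporal_id` and `frame_qp` are hypothetical.

```python
GOP_SIZE = 8
# Assumed fixed QP offsets for Tid 0..3 (illustrative values only).
BASE_OFFSET = [1, 2, 3, 4]


def temporal_id(poc, gop_size=GOP_SIZE):
    """Temporal layer of a picture in a dyadic hierarchical GOP."""
    pos = poc % gop_size
    if pos == 0:
        return 0
    tid, step = 1, gop_size // 2
    while pos % step != 0:   # halve the step until it divides the position
        step //= 2
        tid += 1
    return tid


def frame_qp(poc, base_qp, delta_qp=(0, 0, 0, 0)):
    """Frame QP = base QP + fixed layer offset + learned per-Tid adjustment."""
    tid = temporal_id(poc)
    return base_qp + BASE_OFFSET[tid] + delta_qp[tid]


# Temporal layers of POC 1..8: [3, 2, 3, 1, 3, 2, 3, 0]
layers = [temporal_id(poc) for poc in range(1, 9)]
```

With `delta_qp` left at zero, the sketch reproduces a purely fixed allocation in which higher Tids receive higher QPs; a non-zero `delta_qp` shifts the rate between layers without touching the decoder.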

Proposed HEVC Improvement Tools
In this work, we aimed to develop a simple, non-normative MPEG video coding tool by integrating a VR creation solution into HEVC and refining the existing temporal bitrate allocation scheme. We avoided the pre-analysis approach, which requires looking ahead several frames or GOPs and thus introduces significant complexity, making it unsuitable for real-time applications. To achieve this target, we readjusted the predefined heuristic quantization level by a learned method. Unfortunately, the learning approach usually faces the challenges of the huge computational complexity of video coding, the variety of datasets, and the nonlinearity of the encoding process [12]. We addressed these challenges by using a dynamic programming approach, namely trellis-based rate allocation (TRA). The proposed TRA algorithm was integrated into both the original HEVC and HEVC with virtual reference frames (VRF). Hence, this section first introduces a novel HEVC framework with virtual reference frames, followed by the proposed TRA solution.

Proposed HEVC Architecture
The proposed HEVC encoding architecture is described in Figure 3, with the modified modules highlighted. Virtual reference frames are generated from the decoded reference frames based on multiple hypothesis motion estimation and then placed in the DPB. Both the previously decoded references and the virtual references are used as references for motion estimation of the current frame.

Virtual Reference Frame Creation
To achieve a highly efficient encoder, we created a new virtual reference frame for each inter-coded frame based on a hierarchical motion estimation (HME) and compensation technique. To fully exploit the statistical information from decoded data and the texture correlation between two consecutive references, we proposed an advanced motion-compensated temporal interpolation (MCTI) based VR frame creation, in which the motion vector field is adaptively generated using the hierarchical motion estimation (ME) solution with a coarse block size of 16 × 16 or 32 × 32 initialized in the backward ME. In this stage, the minimization of the regularized mean absolute difference (RMAD) was adopted [22]. Figure 3 highlights the proposed MCTI structure. Since a large block size can hardly capture the motion of small objects or of video taken from a distant camera, a finer block size may be chosen. Afterward, motion vector refinement was applied to refine the motion field. Finally, motion compensation was used to create the VR frame. The VR frame is created as follows:
• Hierarchical ME: First, decoded frames obtained from the two reference lists, list 0 and list 1, are low-pass filtered and used as references in a motion estimation process. Our proposed VR frame creation uses both forward and backward ME to generate the forward interpolated frame and the backward interpolated frame. In these modules, a block matching algorithm is used to estimate the motion between the next and previous decoded frames. To obtain more accurate motion vectors, the motion vector field (MVF) is up-sampled. Because the motion vectors obtained from the previous step have integer precision, jagged edges appear in the interpolated frame. To alleviate this problem, we up-sampled the reference frames by a factor of 2 in both the horizontal and vertical directions, as shown in Figure 4. After up-sampling, the MVF is coarsely refined by considering the 8 neighboring MVs of each block and choosing the one with the smallest error cost.
• MV refinement: To achieve a better motion field, a motion vector refinement (MVR) process is employed. In MVR, the temporal bidirectional ME (BiME) and the spatial weighted vector median filtering (WVMF) are chosen to refine the motion information derived from the hierarchical ME stage [22,23]. In BiME, the motion vectors of each interpolated block are refined in a small search area under the assumption that the motion trajectory between consecutive frames is linear. The spatial WVMF makes the motion field more spatially coherent by searching, for each interpolated block, for a candidate motion vector among the neighboring blocks that better represents the motion trajectory. This filter is also adjustable through a set of weights that control the filter strength and depend on the block distortion of each candidate motion vector. Since the quality of the decoded references and the video content highly affect the final VR frame quality, we adopted a statistical learning-based parameter optimization solution to initialize the block size, search range, and search refinement areas for the proposed MCTI method [24].
• Motion compensation: Finally, the motion compensation process is applied to the decoded frames and the obtained MVs to produce the VR frames.
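The spatial smoothing step in the MV refinement stage can be illustrated with a generic weighted vector median filter. This is a minimal sketch of the general technique and not the exact filter of [22,23]: the function name is hypothetical, and the weights are assumed to be supplied by the caller (for example, derived from the block distortion of each candidate).

```python
import math


def weighted_vector_median(candidates, weights):
    """Pick the candidate MV with the smallest weighted distance to all others.

    candidates: list of (mvx, mvy) from the current and neighboring blocks.
    weights:    one non-negative weight per candidate (assumed given).
    """
    def cost(v):
        return sum(w * math.hypot(v[0] - u[0], v[1] - u[1])
                   for u, w in zip(candidates, weights))
    return min(candidates, key=cost)


neighbors = [(4, 0), (4, 0), (5, 1), (-12, 9)]   # last vector is an outlier
smoothed = weighted_vector_median(neighbors, [1, 1, 1, 1])
```

With equal weights the outlier is rejected and a vector consistent with the neighborhood is kept; raising the weight of one candidate pulls the decision toward it, which is how the filter strength can be controlled.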

Virtual Reference Frame Exploitation
To make better use of this new reference, we conducted exhaustive experiments to identify at which frame positions VR frames should be generated. The generation of the VR frame benefits from a low-complexity motion estimation and compensation approach combined with a data-driven optimal parameter configuration. The proposed VR is applied to B-frames with bi-directional references, as in [25].

Trellis-Based Rate Allocation (TRA)
Trellis search is a type of dynamic programming algorithm used to find the optimal encoded sequence [26]. In video coding, trellis search is used for optimal quantization in the rate-distortion optimized quantization (RDOQ) [27] of HEVC. Starting with an initial scalar quantization level (i.e., x), RDOQ recursively decides the optimal quantization level at each coefficient location among the candidates {x, x − 1} (an additional candidate of 0 can be added) so as to minimize the rate and distortion cost.
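The candidate selection in RDOQ can be illustrated for a single coefficient. The sketch below is a simplification made for illustration: it evaluates the candidates {x, x − 1, 0} independently with a cost D + λR and ignores the context dependency between coefficients that the actual trellis handles. The rate model `toy_rate` and the function name `rd_quantize` are hypothetical.

```python
def toy_rate(level):
    """Hypothetical rate model in bits: larger levels cost more."""
    return 0 if level == 0 else 2 + 2 * level.bit_length()


def rd_quantize(coeff, step, lam, rate_bits=toy_rate):
    """Choose the level among {x, x-1, 0} that minimizes D + lam * R."""
    x = int(abs(coeff) / step + 0.5)            # initial scalar quantization
    candidates = sorted({x, max(x - 1, 0), 0})

    def cost(level):
        distortion = (abs(coeff) - level * step) ** 2
        return distortion + lam * rate_bits(level)

    best = min(candidates, key=cost)
    return best if coeff >= 0 else -best
```

With λ = 0 the choice reduces to plain rounding, while a larger λ pushes small coefficients to zero because the bits saved outweigh the added distortion.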
The overall procedure of our TRA is given in Algorithm 1. We modeled the bitrate allocation as an optimal search path of QP adjustments between temporal IDs (Tid; illustrated in Figure 5). The details are explained as follows.
To avoid the worst case of the greedy algorithm, we (1) initialized the search with $\Delta QP^{0} = [0, 0, 0, 0]$, which is the same as the HEVC configuration, and (2) used BDBR as the loss function to pick the path with the highest BDBR improvement.
Algorithm 1 (steps 5 to 9 only; steps 1 to 4 are missing from the source text)
5 Find exclude candidates: $c_{ex}^{tid} \in \{c_{ex}^{tid},\ \Delta QP^{k}_{tid} \pm k > 1\}$
6 Current candidates: $c = \{\pm k\} \setminus \{c_{ex}\}$
7 Find the best $\Delta QP^{k}_{tid} \in c$ at tid w.r.t. BDBR
8 Add exclude candidates: $c_{ex}^{tid} \in \{c_{ex}^{tid},\ c\}$
9 Output $\Delta QP^{K}$
We then performed an iterative greedy search with at most K phases. In each phase, we searched for the optimal QP offsets from Tid 0 to Tid 3 with respect to the best BDBR reduction. By proceeding from the most important picture (Tid 0) to the least important one (Tid 3), we avoided large fluctuations in the algorithm (i.e., the BDBR gain tends to decrease after each step). An example of the Tid structure in HEVC is shown in Figure 5. The lower the Tid, the more important the picture, since it is referenced by more pictures. In practice, we used only two phases, K = 2.
The best QP offset at iteration k and Tid i (denoted as $\Delta QP^{k}_{i}$) is added to the optimal list $\Delta QP^{K}$. It is selected from the full candidate list {0, ±1, . . . , ±k}, from which the candidates of the previous iteration (i.e., $\Delta QP^{k-1}_{i}$) have already been removed. The maximum offset range was set to k in this work to limit the complexity. Based on the Bjøntegaard delta bitrate (BDBR) on a given video dataset, we selected the best offset for the high- and low-bitrate scenarios.
Given four Tids, a maximum range of ±2 (thus five available options per Tid), and six QPs for the low and high rates, the number of tests for each sequence in the greedy search is
$n_{QP} \times r^{4} = 6 \times 5^{4} = 3750$ (1)
This still requires a tremendous amount of computation. Therefore, we used trellis-based dynamic programming to recursively search for the best offset of each Tid. One round of search (which we call a phase) requires the number of simulations per sequence given in (2), which is significantly smaller. Even though the proposed TRA significantly reduces the search space, it still demands a huge amount of computation given multiple test/train sequences and several iterations. Therefore, we proposed a simplified version of TRA with a two-phase iterative approach based on the reference candidates.
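The count in (1) can be checked by enumerating every joint assignment of offsets to the four Tids. This short sketch only verifies the arithmetic; the variable names are chosen for illustration.

```python
from itertools import product

n_qp = 6                      # six QPs covering the low- and high-rate sets
offsets = [-2, -1, 0, 1, 2]   # r = 5 options per Tid at a maximum range of ±2
n_tid = 4

# Every joint assignment of one offset to each of the four Tids.
combos = list(product(offsets, repeat=n_tid))
exhaustive_runs = n_qp * len(combos)
print(exhaustive_runs)  # 3750
```

The 625 joint assignments grow exponentially with the number of temporal layers, which is what motivates searching one Tid at a time.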
In the first phase, we limited the candidate list to {0, ±1}, which requires half of the experiments of the full TRA. In the second phase, we only tested additional configurations that differ by ±1 from the Phase 1 result. If the optimal offset is +1, then only +2 is evaluated (0 was already evaluated in Phase 1). We skipped the case in which the Phase 1 offset is 0, as its difference to ±2 is larger than 1. Therefore, the number of additional experiments per iteration is bounded, and the combined number of experiments in (3) and (4) is significantly smaller than that of the original TRA in (2).
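The two-phase procedure above can be sketched as a coordinate-wise greedy search. This is an illustrative reading of the text rather than the authors' implementation: `evaluate_bdbr` is a hypothetical callback standing in for the encoder runs over the training set (lower is better), and the toy loss used in the example is invented purely to exercise the code.

```python
def simplified_tra(evaluate_bdbr, n_tid=4):
    """Two-phase greedy search over per-Tid QP offsets (sketch)."""
    delta = [0] * n_tid                    # same as the HEVC configuration
    best = evaluate_bdbr(delta)
    # Phase 1: candidates {0, +1, -1} at each Tid, from Tid 0 to Tid 3.
    for tid in range(n_tid):
        for cand in (1, -1):
            trial = delta[:tid] + [cand] + delta[tid + 1:]
            score = evaluate_bdbr(trial)
            if score < best:
                best, delta = score, trial
    # Phase 2: extend a non-zero Phase 1 offset by one more step only.
    for tid in range(n_tid):
        if delta[tid] == 0:
            continue                       # 0 -> ±2 differs by more than 1
        trial = list(delta)
        trial[tid] += 1 if delta[tid] > 0 else -1
        score = evaluate_bdbr(trial)
        if score < best:
            best, delta = score, trial
    return delta, best


# Toy loss with a known optimum, used only to demonstrate the search.
target = [-2, 0, 1, 0]
toy_loss = lambda d: sum((a - b) ** 2 for a, b in zip(d, target))
learned, loss = simplified_tra(toy_loss)
```

On the toy loss, Phase 1 reaches [-1, 0, 1, 0] and Phase 2 extends only the Tid 0 offset to −2, so far fewer configurations are evaluated than in the exhaustive enumeration.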
To train TRA, a decision is made to adjust the quantization level, or ∆QP, at each Tid according to the RD performance over whole sequences. With many sequences available, we adopted a simple offline training scheme in which each decision is made based on the final BDBR performance. We then applied the learned TRA offsets to the test sequences without fine-tuning to avoid complexity overhead. Therefore, we learned a data-dependent group of quantization offsets $\Delta QP_{i}$, $i = 0, 1, 2, 3$, while previous works independently learn to adjust ∆QP at a given QP setting.

Training and Testing Conditions
To evaluate the coding performance of the proposed methods, Y-BDBR [28] is calculated under the Random Access (RA) configuration of the common test conditions [29]. We used HM 16.20 [30] with a GOP size of 8, an intra period of 32, and two sets of quantization parameters: high rate (QP: 22, 27, 32, 37), which corresponds to the common test conditions in HEVC, and low rate (QP: 32, 37, 42, 45). For trellis-based rate allocation, we chose 16 training sequences at various resolutions and frame rates, while the test sequences are other common HEVC sequences [31] and additional sequences from [32,33]. Only the first 64 frames are used, which reduces the training time while still producing high performance. The training and test sequences can be seen in Table 1, while Figures 6 and 7 illustrate the first frame of each video sequence.
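For readers unfamiliar with the metric, the following sketch shows how a BD-rate figure is commonly obtained from four rate and PSNR points per codec: a cubic is fitted to log-rate as a function of PSNR, and the average gap between the two curves is taken over the overlapping PSNR interval. It is a minimal illustration of the general Bjøntegaard approach and may differ in interpolation details from the tool used for [28]; all function names are hypothetical.

```python
import math


def fit_cubic(xs, ys):
    """Coefficients c0..c3 of the cubic passing exactly through four points."""
    rows = [[x ** p for p in range(4)] + [y] for x, y in zip(xs, ys)]
    for col in range(4):                   # Gauss-Jordan, partial pivoting
        piv = max(range(col, 4), key=lambda r: abs(rows[r][col]))
        rows[col], rows[piv] = rows[piv], rows[col]
        for r in range(4):
            if r != col:
                f = rows[r][col] / rows[col][col]
                rows[r] = [a - f * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][4] / rows[i][i] for i in range(4)]


def integral(coeffs, lo, hi):
    """Definite integral of the polynomial with the given coefficients."""
    anti = lambda x: sum(c * x ** (p + 1) / (p + 1)
                         for p, c in enumerate(coeffs))
    return anti(hi) - anti(lo)


def bd_rate(anchor, test):
    """anchor, test: four (bitrate, psnr) pairs each; returns rate change in %."""
    points = list(anchor) + list(test)
    shift = min(p for _, p in points)      # center PSNR for conditioning
    curves = []
    for pts in (anchor, test):
        psnr = [p - shift for _, p in pts]
        log_rate = [math.log10(r) for r, _ in pts]
        curves.append((fit_cubic(psnr, log_rate), min(psnr), max(psnr)))
    lo = max(c[1] for c in curves)         # overlapping PSNR interval
    hi = min(c[2] for c in curves)
    gap = integral(curves[1][0], lo, hi) - integral(curves[0][0], lo, hi)
    return (10 ** (gap / (hi - lo)) - 1) * 100


anchor = [(1000, 32.0), (2000, 35.0), (4000, 38.0), (8000, 41.0)]
test = [(r * 0.95, p) for r, p in anchor]  # 5% fewer bits at equal PSNR
saving = bd_rate(anchor, test)
```

A negative value indicates a bitrate saving of the tested codec over the anchor at equal quality; in the synthetic example the result is close to −5%.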
To illustrate the effectiveness of the TRA method, its performance on the training dataset in BDBR is shown in Figure 8. It is easy to observe that TRA already achieves good performance even with phase one, which includes ∆QP set of {±1, 0}. Moreover, we observed that one iteration was enough for TRA to converge in both phases. For the 1080p training set, over Phase 1, an additional ∆QP set of {±2, ±1, 0} in Phase 2 showed no BDBR improvement for TRA. Interestingly, the TRA and VRF combination slightly (i.e., −0.28%) and greatly reduced (i.e., 1.54%) rates for TRA + VRF at a high and low rate, respectively. The reason for no improvement on TRA at Phase 2 is due to the change ∆QP at frame level, which will greatly impact overall performance. Therefore, ±2 might depart too much from the conventional rate. However, while combining with the VRF, ∆QP ±2 showed a significant gain of 1% BDBR in the training set on average. The loss in quality at low is compensated by additional virtual frames generated by VRF. Please refer to Table 2 for more results on the different training classes. We observed that using the same ∆QP set for all training data does show a minor quality improvement at a high rate. This is due to the nature of the local optima of dynamic programming. At a low rate, an additional 0.31% Y-BDBR is achieved for low rate but almost identical results for high rate. Therefore, we adopted a simple resolution-based adaptive ∆QP selection. As the resolution is available as the input, the proposed method requires no additional complexity compared to other frameworks. For complex adaptive methods (i.e., fast/slow motion, simple/details scene, etc.), this could be an interesting topic for our future work.  To illustrate the effectiveness of the TRA method, its performance on the training dataset in BDBR is shown in Figure 8. It is easy to observe that TRA already achieves good performance even with phase one, which includes Δ set of {±1, 0}. 
The learned ∆QP set is given in Table 2. Phase 2 only changes the ∆QP of Tid 0 in the 1080p, 720p, and 240p classes. The main reason is related to the GOP size of 8, which has more Tid 0 frames and thus consumes a high rate. We also observed very similar learned ∆QP values with and without VRF, with the exception of 1080p. A more detailed rate distribution can be found in Section 4.3.
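The kind of search that a trellis over ∆QP candidates implies can be sketched with a small Viterbi-style dynamic program. This is only an illustration under stated assumptions: the stages are taken to be the temporal layers, the states are the ∆QP candidates, and `toy_cost` is a hypothetical stand-in for the rate-distortion cost that a real encoder run would provide; it is not the training procedure used in this paper.

```python
def trellis_search(num_layers, candidates, cost):
    """Return (path, total) minimising the accumulated cost over the trellis.

    cost(tid, prev_dqp, dqp) gives the cost of choosing dqp at layer tid when
    the previous layer used prev_dqp (prev_dqp is None for the first layer).
    """
    # best[d] = (accumulated cost, path) of the cheapest path ending in state d
    best = {d: (cost(0, None, d), [d]) for d in candidates}
    for tid in range(1, num_layers):
        nxt = {}
        for d in candidates:
            acc, path = min(
                ((best[p][0] + cost(tid, p, d), best[p][1]) for p in candidates),
                key=lambda item: item[0],
            )
            nxt[d] = (acc, path + [d])
        best = nxt
    total, path = min(best.values(), key=lambda item: item[0])
    return path, total


def toy_cost(tid, prev_dqp, dqp):
    """Hypothetical cost: distance to a made-up per-layer target plus a
    penalty on abrupt changes between neighbouring layers."""
    target = [-1, 0, 1, 1][tid]
    jump = 0 if prev_dqp is None else abs(dqp - prev_dqp)
    return (dqp - target) ** 2 + 0.5 * jump


# Phase-1-like candidate set {±1, 0} over four temporal layers (GOP of 8).
path, total = trellis_search(4, [-1, 0, 1], toy_cost)
```

Because each stage keeps only the cheapest path into every state, the search visits a number of transitions that grows linearly with the number of layers instead of enumerating every ∆QP combination. A Phase-2-like run would simply pass the larger candidate set {±2, ±1, 0}.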

Compression Performance Assessment
The coding performance on the test sequences is evaluated in Table 3 for BDBR and Table 4 for BDPSNR comparisons. We compared our TRA method without VRF (TRA) and with VRF (TRA + VRF) against the most relevant adaptive quantization method [3] as well as the standard HEVC. Similar to the training set, low-rate and high-rate BDBR are used for comparison. Overall, both TRA and VRF provide a consistent improvement over the baseline HEVC. In addition, to explore the effectiveness of the proposed TRA method, we conducted further experiments with relevant frame-based QP adaptation methods, notably the QPλ model proposed in [34], where the relation between QP and λ is modeled as a linear function, and our recent work in [35], where the relation between QP and RD cost is modeled as a polynomial function. The model parameters were selected as in [34,35]. We also slightly modified the TRA to change the QP for Tid 0 only, named TRA_0. The BDBR comparison is given in Table 5. From Tables 3-5, the following conclusions can be drawn:

• Overall, the compression performance of HEVC with the proposed TRA and VRF methods outperforms the HEVC benchmarks both with and without QPA;
• HEVC with the TRA and VRF methods achieved a significant coding improvement for all test sequences, notably 5.2% and 2.7% BDBR savings on average in the low- and high-rate regions, respectively;
• The TRA method provides around 1.55% and 0.82% BDBR savings for the test sequences in the low- and high-rate regions, respectively, while the VRF method provides around 3.32% and 1.86% BDBR savings;
• The proposed methods, both TRA and VRF, achieve better compression performance in the low-rate region than in the high-rate region;
• In most cases, the combination of the TRA and VRF methods achieves even better compression performance than a simple addition of the individual TRA and VRF gains (on average, 1.55% + 3.32% = 4.87% versus 5.2% at the low rate, and 0.82% + 1.86% = 2.68% versus 2.7% at the high rate). This combined effect motivates the use of both TRA and VRF in improving HEVC performance;
• Experimental results also show that no compression gain is achieved for screen content videos such as SlideEditing. This may be because the training set does not include any video with this content, and the VRF creation may also not work well for screen-captured videos;
• Compared to other frame-based QPA algorithms, the proposed method achieved better BDBR (see Table 5). It should be noted that the QPA proposed in [35] was mainly designed for surveillance video content and its high-order polynomial model is highly sensitive to the selected parameters. Similarly, the QPλ linear model proposed in [34] is unable to achieve good BDBR performance, either with the original model parameters used in [34] or with our new parameters;
• A similar compression gain is also observed in the BDPSNR comparison.
In addition, similar to other temporal rate allocation methods, TRA and TRA + VRF show better performance for slow-motion sequences while still greatly improving performance for fast-moving sequences such as BasketballDrive. It should be noted that the testing sequences are different from the training sequences, thus validating the generalization of our TRA scheme.
To further visualize the performance of our proposed method, Figure 9 shows the RD comparison between the original HEVC and the proposed TRA + VRF.

Rate Allocation Assessment
To explore the rate allocation performance of the proposed method, the rate distribution across the temporal layers is measured and shown in Table 6 and Figure 10 for both the low rate and the high rate. The proposed TRA redistributes the rate among temporal layers. It tends to give more bits to pictures at the lower temporal layer indexes, i.e., Tid 0. In fact, the pictures at these positions have a higher impact than the remaining ones, as they can be referenced when coding the pictures at the higher Tid. In general, our proposed TRA tends to reduce the rate while maintaining a similar level of quality.
On the other hand, VRF shows a virtually similar rate distribution compared to the baseline HEVC. Therefore, the rate redistribution is only impacted by our TRA. Details on the rate distribution can be seen in Table 6.
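A per-layer rate distribution of this kind can be computed from the bits spent on each frame. The sketch below assumes a dyadic hierarchical GOP of 8 (GOP boundaries at Tid 0, the middle picture at Tid 1, and so on) and a list of frame sizes indexed by picture order count; both the helper names and the sample numbers are hypothetical.

```python
def temporal_id(poc, gop_size=8):
    """Temporal layer of a picture in a dyadic hierarchical GOP (assumed)."""
    pos = poc % gop_size
    if pos == 0:
        return 0
    tid = gop_size.bit_length() - 1  # log2(gop_size), e.g. 3 for a GOP of 8
    while pos % 2 == 0:
        pos //= 2
        tid -= 1
    return tid


def rate_share_by_tid(frame_bits, gop_size=8):
    """Percentage of the total bits spent in each temporal layer.

    frame_bits[poc] is the number of bits of the picture with that POC.
    """
    totals = {}
    for poc, bits in enumerate(frame_bits):
        tid = temporal_id(poc, gop_size)
        totals[tid] = totals.get(tid, 0) + bits
    total = sum(totals.values())
    return {tid: 100.0 * bits / total for tid, bits in sorted(totals.items())}


# Made-up frame sizes for POC 0..8, for illustration only.
shares = rate_share_by_tid([800, 50, 100, 50, 200, 50, 100, 50, 600])
```

Comparing such shares between the baseline and a modified encoder makes a shift of bits towards Tid 0 directly visible.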

Complexity Assessment
To complete the performance evaluation, we evaluated the encoding time of our proposed TRA with and without VRF. The time increase with the proposed tools was assessed and is shown in Table 7. As shown, the TRA method does not increase the computational complexity of the encoder. On the contrary, TRA can even reduce the encoding time, especially at a high rate. The main reason is the reduction in total rate, which greatly impacts other coding tools. However, the VRF method introduces an encoding time increase of nearly three times to the proposed HEVC. This mainly comes from the creation of additional references, and more reference frames also lead to more encoding time for motion estimation. This complexity-compression performance trade-off may prevent the wide deployment of VRF in video coding applications where the computational complexity is constrained. However, as shown in Table 4, the BDBR with the combination of the TRA and VRF methods is mostly higher than the total BDBR of TRA and VRF when they are individually applied. This effect motivates including in HEVC not only the rate control solution but also other VRF creation approaches in future video coding development. Additionally, the complexity issue in VRF can be addressed by looking for low-complexity VRF creation methods.

VRF Assessment
Finally, although VRF has shown an impressive compression performance when adopted in the HEVC architecture, it also introduced a remarkable complexity overhead. To further explore the creation of VRF in the proposed HEVC, this sub-section will assess the visual quality and the computational complexity of VRF.
As presented in Section 3.1.1, the VRF creation adopted in this paper includes three main steps: (i) hierarchical motion estimation (HME) to initialize the motion vector field, (ii) motion vector refinement (MVR) to mitigate the quantization errors that frequently occur in decoded references, and (iii) motion compensation (MC) to interpolate the VRF. To reveal the contribution of these modules, we compute and report in Table 8 the percentage of computation time (%) measured for VRF over the overall HEVC ($\%\mathrm{VRF}_{\mathrm{HEVC}}$), and for the HME, MVR, and MC over the VRF ($\%\mathrm{HME}_{\mathrm{VRF}}$, $\%\mathrm{MVR}_{\mathrm{VRF}}$, $\%\mathrm{MC}_{\mathrm{VRF}}$), while Figure 11 illustrates the visual quality comparison for the proposed VRF with and without MVR and the related work in [25].
$$\%\mathrm{HME}_{\mathrm{VRF}} = \frac{\mathrm{Time}_{\mathrm{HME}} \times 100}{\mathrm{Time}_{\mathrm{VRF}}}, \qquad \%\mathrm{MC}_{\mathrm{VRF}} = \frac{\mathrm{Time}_{\mathrm{MC}} \times 100}{\mathrm{Time}_{\mathrm{VRF}}}$$
The complexity results in Table 8 show that the VRF consumes around 11.70% of the computational complexity of the overall HEVC encoding. Within VRF, the HME requires the largest share of the computation, i.e., 95.24%. Reducing the HME time is therefore critical to making VRF more suitable for widespread adoption in HEVC.
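The time shares reported in this sub-section are simple ratios, which the following sketch reproduces for made-up timings. It assumes that the VRF time is the sum of the HME, MVR, and MC times and that the total encoder time already includes the VRF time; the function name and all numbers are hypothetical.

```python
def vrf_time_shares(t_hme, t_mvr, t_mc, t_encoder_total):
    """Percentage time shares of VRF in the encoder and of each VRF module.

    Assumes Time_VRF = Time_HME + Time_MVR + Time_MC (an assumption made for
    this sketch) and that t_encoder_total includes the VRF time.
    """
    t_vrf = t_hme + t_mvr + t_mc
    return {
        "VRF_HEVC": 100.0 * t_vrf / t_encoder_total,
        "HME_VRF": 100.0 * t_hme / t_vrf,
        "MVR_VRF": 100.0 * t_mvr / t_vrf,
        "MC_VRF": 100.0 * t_mc / t_vrf,
    }


# Made-up timings in seconds, for illustration only.
shares = vrf_time_shares(t_hme=8.0, t_mvr=1.0, t_mc=1.0, t_encoder_total=100.0)
```

Under these assumptions the three module shares always sum to 100%, which gives a quick sanity check on measured timings.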
As observed in Figure 11, both HME and MVR contribute to the final quality of VRF. Naturally, HME exploits the video content to initialize the motion vector information; a good MV can mitigate the hole and occlusion problems in the interpolated frame. Meanwhile, the MVR can partly reduce the quantization noise and blocking artifacts in interpolated pictures by adaptively refining the MVF obtained in the HME. It should be noted that blocking artifacts typically occur in HEVC compressed videos due to the block-based coding approach, while the translational motion model adopted in BiME may also introduce further blocking artifacts (see the visual quality for VRF w/o MVR in Figure 11). In this case, WVMF can eliminate the "outlier" motion information, resulting in a smoother MVF for the final VRF creation (see the visual quality for the proposed VRF in Figure 11).
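To make the outlier-removal idea concrete, the sketch below implements a generic weighted vector median over a neighbourhood of motion vectors, assuming that WVMF denotes a weighted vector median filter. The uniform weights and the sample vectors are hypothetical; in practice the weights would typically reflect how reliable each neighbouring vector is.

```python
import math


def weighted_vector_median(vectors, weights):
    """Return the candidate vector with the smallest weighted sum of
    Euclidean distances to all vectors in the neighbourhood."""
    best, best_cost = None, math.inf
    for cand in vectors:
        cost = sum(
            w * math.hypot(cand[0] - v[0], cand[1] - v[1])
            for v, w in zip(vectors, weights)
        )
        if cost < best_cost:
            best, best_cost = cand, cost
    return best


# Neighbourhood with one outlier motion vector (made-up values).
neighbourhood = [(1, 0), (1, 1), (2, 0), (1, 0), (30, -25)]
smoothed = weighted_vector_median(neighbourhood, [1.0] * len(neighbourhood))
```

Since the output is always one of the input vectors, an isolated outlier such as (30, −25) is replaced by a vector that agrees with its neighbours instead of being averaged into them.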

Conclusions
This paper presented a novel trellis-based bitrate allocation algorithm for HEVC inter coding where a virtual reference frame is considered. The proposed VRF-HEVC relies on the decoded information available at both the encoder and decoder; thus, no syntax elements need to change in the standard specification, and no bitrate overhead is incurred. To optimize the VRF quality, a statistical learning-based VRF creation is adopted. This paper is the first work that considers the impact of quantization error and the video content on the VRF quality. In addition, to achieve higher HEVC compression performance, a novel set of quantization parameters is introduced for the VRF-HEVC framework based on a dynamic programming-based bitrate allocation approach. Experimental results obtained for a rich set of test sequences revealed that the proposed TRA and VRF based HEVC solutions significantly outperform relevant HEVC improvement methods and the standard HEVC.

Conflicts of Interest:
The authors declare no conflict of interest.