Context-Based Inter Mode Decision Method for Fast Afﬁne Prediction in Versatile Video Coding

Abstract: Versatile Video Coding (VVC) is the most recent video coding standard developed by the Joint Video Experts Team (JVET); it achieves a bit-rate reduction of 50% with perceptually similar quality compared to the previous standard, High Efficiency Video Coding (HEVC). Although VVC provides this significant coding performance, it comes at the cost of tremendous computational complexity in the VVC encoder. In particular, VVC has newly adopted an affine motion estimation (AME) method to overcome the limitations of the translational motion model at the expense of higher encoding complexity. In this paper, we propose a context-based inter mode decision method for fast affine prediction that determines whether AME is performed in the process of rate-distortion (RD) optimization for the optimal CU-mode decision. Experimental results show that the proposed method significantly reduces the encoding complexity of AME, by up to 33%, with unnoticeable coding loss compared to the VVC Test Model (VTM).


Introduction
As the state-of-the-art video coding standard, Versatile Video Coding (VVC) [1] was developed by the Joint Video Experts Team (JVET), which was organized by the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG). VVC provides significantly improved coding performance compared to High Efficiency Video Coding (HEVC) [2]. According to [3], the VVC test model (VTM) [4] accomplished twice the coding efficiency of the HEVC test model (HM) [5] in the random-access (RA) configuration of the JVET common test conditions (CTC) [6]. In order to realize this improvement, VVC has adopted new coding tools for more elaborate inter prediction, such as affine inter prediction [7] and bi-prediction with CU-level weight (BCW) [8]. In addition, VVC has also adopted Adaptive Motion Vector Resolution (AMVR) [9] and Symmetric Motion Vector Difference (SMVD) [10], as well as Advanced Motion Vector Prediction (AMVP), to save the bitrate of the motion parameters. With regard to merge estimation, VVC has adopted Geometric Partitioning Mode (GPM) [11], Merge with Motion Vector Difference (MMVD) [12], Decoder-Side Motion Vector Refinement (DMVR) [13], and Combined Inter/Intra Prediction (CIIP) [14]. Although these tools improve the coding performance, the computational complexity of the encoder increased by up to approximately 10 times under the RA configuration compared to that of HEVC [3].
One of the main differences between HEVC and VVC is the block structure scheme. Both HEVC and VVC specify the largest coding unit as the coding tree unit (CTU), whose size is configurable at the encoder. In addition, to adapt to various block properties, a CTU can be split into four coding units (CUs) using a quadtree (QT) structure. HEVC specifies multiple partition unit types, namely the coding unit (CU), prediction unit (PU), and transform unit (TU). On the other hand, VVC substitutes them with a QT-based multi-type tree (MTT) structure, in which MTT blocks can be partitioned into either binary trees (BTs) or ternary trees (TTs) to provide a variety of CU partitioning shapes. As shown in Figure 1, VVC specifies four MTT splitting types: vertical binary split (SPLIT_BT_VER), horizontal binary split (SPLIT_BT_HOR), vertical ternary split (SPLIT_TT_VER), and horizontal ternary split (SPLIT_TT_HOR). After a CTU is divided into four QTs (SPLIT_QT) by the QT structure, each QT block can be further split into either sub-QTs or MTT blocks, while an MTT block can only be partitioned into sub-MTT blocks, as depicted in Figure 2. Note that a QT leaf node or an MTT leaf node is regarded as a CU, which is used as the basic unit of the prediction and transform processes without any further partitioning. Therefore, a CU in VVC can have either a square or rectangular shape, while a CU in HEVC always has a square shape. This means that the block structure of VVC can provide better coding performance than HEVC by supporting flexible CU partitioning shapes.
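The five split types above determine the child CU geometry deterministically. The following is a minimal sketch (illustrative only, not VTM code) of how each split type divides a CU:

```python
def split_cu(width, height, split_type):
    """Return the list of (width, height) child CUs for a VVC split type."""
    if split_type == "SPLIT_QT":
        return [(width // 2, height // 2)] * 4
    if split_type == "SPLIT_BT_VER":
        return [(width // 2, height)] * 2
    if split_type == "SPLIT_BT_HOR":
        return [(width, height // 2)] * 2
    if split_type == "SPLIT_TT_VER":
        # A ternary split partitions the CU in a 1:2:1 ratio.
        return [(width // 4, height), (width // 2, height), (width // 4, height)]
    if split_type == "SPLIT_TT_HOR":
        return [(width, height // 4), (width, height // 2), (width, height // 4)]
    raise ValueError(f"unknown split type: {split_type}")

print(split_cu(32, 32, "SPLIT_TT_VER"))  # [(8, 32), (16, 32), (8, 32)]
```

As the sketch shows, only SPLIT_QT preserves the square shape of a square parent; the BT and TT splits are what produce the rectangular CUs unavailable in HEVC.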
In this paper, we define the relationship between the upper CU and current CUs.
The upper CU can be either a QT node with a square shape or an MTT node with a square or rectangular shape covering the area of the current CU. For example, the divided QT, BT, and TT CUs can share the same upper CU, which is the 2N × 2N QT node, as illustrated in Figure 2a. Similarly, an MTT node is also regarded as an upper CU for further partitioned sub-MTT nodes, as depicted in Figure 2b.
In general, inter prediction searches for the motion-compensated block in previously decoded reference frames by using conventional motion estimation (CME). CME conducts a block-matching algorithm based on the translational motion model. Because CME cannot efficiently describe complex motions in natural videos, such as zooming and rotation, VVC has newly adopted affine motion estimation (AME) in addition to CME to overcome the limitations of the translational motion model at the expense of higher encoding complexity. VVC provides two affine motion models in the process of AME: the four-parameter affine model (affine_4P) and the six-parameter affine model (affine_6P). Because VVC performs affine prediction in units of 4 × 4 sub-blocks, the affine MVs of each sub-block are derived from a two-control-point MV (CPMV0, CPMV1) or a three-control-point MV (CPMV0, CPMV1, CPMV2) for affine_4P or affine_6P, respectively. As shown in Figure 3, CPMV0, CPMV1, and CPMV2 indicate the motion vectors of the top-left corner, top-right corner, and bottom-left corner control points, respectively. In this paper, the horizontal and vertical MVs of CPMV0, CPMV1, and CPMV2 are denoted as (mv0x, mv0y), (mv1x, mv1y), and (mv2x, mv2y), respectively. For affine_4P, the MV (mvx, mvy) at the center sample location (x, y) of a sub-block is derived as in Equation (1):

$$mv_x = \frac{mv_{1x} - mv_{0x}}{W}x - \frac{mv_{1y} - mv_{0y}}{W}y + mv_{0x}, \qquad mv_y = \frac{mv_{1y} - mv_{0y}}{W}x + \frac{mv_{1x} - mv_{0x}}{W}y + mv_{0y} \quad (1)$$

Similarly, in the case of affine_6P, the MV (mvx, mvy) at the center sample location (x, y) of a sub-block is derived as in Equation (2):

$$mv_x = \frac{mv_{1x} - mv_{0x}}{W}x + \frac{mv_{2x} - mv_{0x}}{H}y + mv_{0x}, \qquad mv_y = \frac{mv_{1y} - mv_{0y}}{W}x + \frac{mv_{2y} - mv_{0y}}{H}y + mv_{0y} \quad (2)$$

where W and H indicate the width and the height of the current CU, respectively. After computing the affine MVs with 1/16 fractional MV accuracy, the predicted block of the current CU is fetched from the reference frames in units of sub-blocks.
While AME enables efficient prediction of complex motions beyond the reach of translational motion, it incurs heavy computational complexity in the VVC encoder. According to the tool-off test of affine prediction [15], the coding loss and the time savings are 3.11% and 20%, respectively. Figure 4a shows the distribution between affine inter mode using AME and AMVP mode using CME under the RA configuration of the JVET CTC. In addition, we investigated the distribution of affine inter mode on the reference and non-reference frames, as shown in Figure 4b. The RA configuration of VVC specifies a group-of-pictures (GOP) size to support the hierarchical encoding structure along different temporal layers (TLs). The frames with the highest TL value are treated as non-reference frames in VVC. Figure 4b indicates that affine inter mode rarely occurs in non-reference frames. Although affine inter mode does not occur frequently during the VVC encoding process, it contributes to improving the coding efficiency. This means that a fast encoding scheme for AME should be carefully designed to minimize the coding loss. In this paper, we propose a fast affine prediction method that reduces the computational complexity of AME with unnoticeable coding loss.
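Under the notation above, the sub-block MV derivation of Equations (1) and (2) can be sketched as follows. This is an illustrative floating-point version; the VTM itself uses fixed-point arithmetic at 1/16-sample precision:

```python
def affine_subblock_mv(cpmv, x, y, w, h):
    """Derive the MV at center sample (x, y) of a 4x4 sub-block from the
    CPMVs of a CU of size w x h. Two CPMVs select the four-parameter model
    (Equation (1)); three CPMVs select the six-parameter model (Equation (2))."""
    (mv0x, mv0y), (mv1x, mv1y) = cpmv[0], cpmv[1]
    if len(cpmv) == 2:  # affine_4P
        mvx = (mv1x - mv0x) / w * x - (mv1y - mv0y) / w * y + mv0x
        mvy = (mv1y - mv0y) / w * x + (mv1x - mv0x) / w * y + mv0y
    else:               # affine_6P, with the bottom-left CPMV2
        mv2x, mv2y = cpmv[2]
        mvx = (mv1x - mv0x) / w * x + (mv2x - mv0x) / h * y + mv0x
        mvy = (mv1y - mv0y) / w * x + (mv2y - mv0y) / h * y + mv0y
    return mvx, mvy
```

Note that when all CPMVs are equal, the model degenerates to a purely translational MV, which is why AME only pays off for CUs with non-translational motion.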
The remainder of this paper is organized as follows. In Section 2, we review the related fast encoding schemes to reduce the computational complexity of motion estimation in HEVC or VVC. Then, the proposed method is described in Section 3. Finally, experimental results and conclusions are given in Sections 4 and 5, respectively.

Related Work
Although the newly adopted VVC coding tools show powerful compression performance compared to HEVC, they cause heavy computational complexity, mainly due to the ME processes, including AME. Therefore, many researchers have studied fast encoding algorithms in the area of ME without noticeable quality degradation. As discussed in [16,17], the increases in coding efficiency and computational complexity depend on how many reference frames are used. Therefore, reference frames should be carefully selected from the perspective of the trade-off between coding loss and complexity reduction. Pan et al. [18] reduced the number of reference frames based on a content-similarity measurement. In HEVC, a directional search pattern method [19], a rotating hexagonal pattern [20], and an adaptive search range [21] showed a reasonable trade-off with the 64 × 64 search range to reduce the computational complexity of ME. VVC inherited several ME algorithms from the HM, such as diamond search, raster search, and refinement processes [22], for a fast ME process.

Another approach is the early termination of the ME process when predefined conditions are satisfied at a certain stage of the ME process. For example, the ME process can be terminated with high probability because the recursive block-partitioning process yields strong correlations between motion parameters. In addition, the bi-directional ME process can be terminated using the PU correlations between QT structures with different CU depths [23,24]. Shen et al. [25] proposed a fast CU split-decision algorithm and a CU depth-range estimation algorithm, which determine the early termination of CU partitioning using the neighboring CU-partitioning structure and motion amounts. In addition, Shen et al. [26] proposed a fast PU decision method using motion activity. Tan et al. [27] addressed a three-step early CU-depth decision method that compares the RD cost of the current CU with the average RD cost of previously encoded CUs. Xiong et al. [28] presented a fast CU decision method using pyramid motion divergence, which exploits the correlation between the motion parameters and the CU split structure. Lee et al. [29] presented a fast CU size-selection method that compares the average RD cost of the SKIP mode with an adaptive threshold.
Since AME for affine prediction was integrated on top of VTM 1.0 in 2018, several works have been developed to enhance affine prediction in the middle of VVC standardization. In particular, Zhou et al. [30] proposed the concept of affine merge estimation to infer the affine motion parameters from neighboring blocks without the explicit derivation of affine MVs. In [31], Zhang et al. proposed omitting the derivation of affine motion parameters for chroma blocks with small block sizes. These methods reduce the total memory bandwidth and the encoding complexity compared to the initial affine prediction in VTM 1.0. Recently, Park et al. [32] reduced the encoding complexity of AME with a two-step method: in the first step, if the best mode of the upper CU is encoded as SKIP mode, the AME of the current CU is omitted to avoid the unnecessary RD computations caused by affine prediction. In the second step, when the best reference frame of CME is the nearest reference frame in the uni-prediction of list 0 (L0), AME is performed only on that same reference frame in the prediction of L0. While this method showed marginal coding loss compared to VTM 3.0, it reduced the computational complexity of AME only slightly.

Proposed Method
In order to design the fast affine prediction, we investigated the context correlation between the upper CU and the current CU with regard to affine prediction. Based on the encoding information of the upper CU, we identified reasonable conditions to skip the affine inter mode of the current CU. Because the posterior probability given by Bayes' theorem shows the effectiveness of the conditions, we computed it from the likelihood p(!U_affine && U_cbf0 | C_affine_inter) and the prior probability p(C_affine_inter), as in Equation (3):

$$p(C_{affine\_inter} \mid {!U_{affine}}\ \&\&\ U_{cbf0}) = \frac{p({!U_{affine}}\ \&\&\ U_{cbf0} \mid C_{affine\_inter}) \cdot p(C_{affine\_inter})}{p({!U_{affine}}\ \&\&\ U_{cbf0})} \quad (3)$$

where C, U, "&&", and "!" denote the current CU, the upper CU, the AND logic operator, and the NOT logic operator, respectively. In addition, cbf0 means that transformed non-zero coefficients do not exist in the CU. In other words, if the current CU is encoded as cbf0, the CU can generally be regarded as a static area with stationary motion, which might not require sophisticated coding tools such as affine prediction. Note that affine includes affine_merge as well as affine_inter in this paper. For example, C_affine_inter indicates that the best prediction mode of the current CU is selected as affine inter mode in terms of rate-distortion optimization (RDO), as in Equation (4):

$$J = D + \lambda \cdot R \quad (4)$$

where D, λ, and R represent the distortion, the Lagrange multiplier, and the required bitrate of the current CU, respectively. We obtained the prior and posterior probabilities from two full-HD (FHD) test sequences encoded by VTM 10.0 under the RA configuration. Note that these statistics were derived with three QP values different from those specified by the JVET CTC, to avoid testing under identical conditions. Table 1 shows that the current CU has a low probability of being encoded as affine inter mode when the upper CU is not encoded as affine and is encoded as cbf0. Because the distribution of affine prediction is affected by the motion properties of the video sequences, the prior probabilities vary with the sequence and the QP value. In contrast, the posterior probabilities maintain similar distributions in Table 1. For example, while the prior probability of the RitualDance sequence increases up to 27%, the posterior probability is much smaller than the prior. These statistics imply that a VVC encoder can avoid many unnecessary and redundant AME processes based on the conditions of the upper CU before encoding the current CU.
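As a toy illustration of the Bayes computation in Equation (3), the posterior can be obtained from simple CU counts collected during encoding. The counts below are hypothetical, not the paper's measured statistics:

```python
def posterior(n_total, n_cond, n_affine, n_cond_given_affine):
    """p(C_affine_inter | !U_affine && U_cbf0) via Bayes' theorem.

    n_total:             number of encoded CUs
    n_cond:              CUs whose upper CU satisfies !affine && cbf0
    n_affine:            CUs whose best mode is affine inter
    n_cond_given_affine: affine-inter CUs whose upper CU satisfies the condition
    """
    prior = n_affine / n_total                    # p(C_affine_inter)
    likelihood = n_cond_given_affine / n_affine   # p(condition | C_affine_inter)
    evidence = n_cond / n_total                   # p(condition)
    return likelihood * prior / evidence

# Hypothetical counts: the condition holds often, yet affine-inter CUs
# rarely fall under it, so the posterior is far below the prior.
p = posterior(n_total=10000, n_cond=6000, n_affine=800, n_cond_given_affine=60)
print(round(p, 4))  # 0.01, versus a prior of 0.08
```

A small posterior under the condition is exactly what justifies skipping AME for such CUs: the chance that affine inter mode would have been chosen anyway is negligible.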
Based on the properties of the described contexts, the proposed method is composed of two steps: one is the early termination of the affine inter mode of the current CU after checking the conditions of the upper CU, and the other is the determination of the optimal affine model between the four-parameter and six-parameter affine models. Because an upper CU can have many current CUs, the proposed method was designed to allow current CUs to exploit the previously encoded information derived from the QTMTT block structure. In this framework, the encoding information of either the upper or the current CU is used to determine whether the AME of the current CU is performed. If the conditions to skip the AME process are not satisfied, the second step can be applied for further complexity reduction by selecting the optimal affine model between affine_4P and affine_6P.

Figure 5 shows the block diagram of the proposed method, which modifies the existing VVC encoder with the gray-shadowed early-termination rules. First, if the upper CU is not encoded as affine mode and is encoded as cbf0, the current CU is likely to have translational motion. In that case, the affine inter mode of the current CU is skipped without any AME process, as shown in Figure 5. In addition, if the current frame is a non-reference frame with the highest TL and the best mode of the current CU is not encoded as affine merge, AME is also skipped, based on the distribution in Figure 4b. Second, we assumed that there are correlations between the current CU and the upper CU in terms of the affine model for AME. Therefore, the current CU performs AME using only affine_4P if the RD cost of affine_4P was smaller than that of affine_6P during the encoding of the upper CU and the best mode of the current CU is encoded as affine merge mode with affine_4P before the AME process.
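The two steps above can be condensed into the following sketch. The flag names and dictionary layout are our own assumptions for illustration, not the VTM encoder interface:

```python
def should_skip_ame(upper_cu, cur_cu, is_nonref_frame):
    """Step 1: return True when affine inter mode (AME) can be skipped."""
    # Rule 1a: upper CU is non-affine and has no non-zero coefficients
    # (cbf0), so the current CU likely contains only translational motion.
    if not upper_cu["affine"] and upper_cu["cbf0"]:
        return True
    # Rule 1b: non-reference frame (highest TL) whose best mode so far is
    # not affine merge; affine inter mode rarely occurs there (Figure 4b).
    if is_nonref_frame and cur_cu["best_mode"] != "affine_merge":
        return True
    return False

def affine_models_to_test(upper_cu, cur_cu):
    """Step 2: restrict AME to affine_4P when the context prefers it."""
    if (upper_cu["best_affine_model"] == "4P"
            and cur_cu["best_mode"] == "affine_merge"
            and cur_cu["merge_model"] == "4P"):
        return ["affine_4P"]
    return ["affine_4P", "affine_6P"]
```

In this framing, step 1 removes the whole AME call for CUs unlikely to choose affine inter mode, while step 2 halves the remaining affine model search when the upper-CU context and the affine merge result both point to the four-parameter model.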

Experiment Results
The proposed method was evaluated under the JVET CTC [6], as presented in Table 2, and we compared it with Park's method [32]. Our experiments were performed on top of VTM 10.0 as an anchor and were run in the experimental environment described in Table 3.
For comparison of the computational complexity, we measured the time saving (TS) of the AME encoding time (AMT), as in Equation (5):

$$TS = \frac{AMT_{org} - AMT_{fast}}{AMT_{org}} \times 100\ (\%) \quad (5)$$

where AMT_org and AMT_fast are the AMT of the anchor and of the fast method under test, respectively. Since the time taken for AMT can fluctuate depending on the QP value, the TS of each test sequence is reported as the average of the AMT results, as well as of the total encoding time (TET), compared with the anchor. Similar to the AMT calculation, the average TS was computed from the four TET results corresponding to the different QP values.

Table 4 shows the complexity reduction of the proposed method and the previous method [32]. The proposed method reduces the AMT of the VTM by 33%, on average, compared to the anchor. The maximum and minimum TS reach 54% and 27% in the BQTerrace and Cactus sequences, respectively, with unnoticeable coding loss. Compared to the previous method, the proposed method is faster by 15% and 3% in terms of AMT and TET, on average, respectively. According to [33], there can be a difference between the percentage of computational complexity and the actual encoding time. Table 5 shows both AMT and TET on top of VTM 10.0, representing the sum of the encoding times over QP 22, 27, 32, and 37.

In order to evaluate the coding loss, we measured the Bjontegaard Delta Bit Rate (BDBR) [34]. In general, a BDBR increase of 1% corresponds to a BD-PSNR decrease of 0.05 dB, where a positive BDBR increment indicates coding loss. Table 6 shows the coding loss of both the proposed method and the previous method compared to VTM 10.0. As shown in Table 6, the coding loss between the anchor and the proposed method is marginal for all test sequences.
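Equation (5) and the per-QP averaging described above can be illustrated as follows; the timing numbers are made up for illustration, not measured results:

```python
def time_saving(amt_org, amt_fast):
    """TS(%) = (AMT_org - AMT_fast) / AMT_org * 100 for a single run."""
    return (amt_org - amt_fast) / amt_org * 100.0

def average_ts(org_times, fast_times):
    """Average TS over per-QP times (e.g. QP 22, 27, 32, and 37)."""
    ts = [time_saving(o, f) for o, f in zip(org_times, fast_times)]
    return sum(ts) / len(ts)

# Hypothetical per-QP AMT values (seconds) for anchor vs. fast encoder.
print(round(average_ts([100, 80, 60, 40], [70, 56, 42, 28]), 1))  # 30.0
```

Averaging the per-QP ratios, rather than the raw times, prevents the slow low-QP runs from dominating the reported saving.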
In particular, BDBR-U or BDBR-V shows coding gains over the anchor in several sequences. For example, the DaylightRoad2 sequence showed a 1.14% coding gain in terms of BDBR-U, and the MarketPlace sequence showed a 0.76% coding gain in the BDBR-V component. This implies that the proposed method keeps the coding loss consistently marginal under our fast affine scheme. In addition, Table 7 shows the percentage of CUs that perform the early termination of AME (affine_inter_off) and AME with only four parameters (affine_inter_4P) when the proposed method is integrated on top of VTM 10.0.

Conclusions
VVC has newly adopted an affine motion estimation (AME) method to overcome the limitations of the translational motion model at the expense of higher encoding complexity. In this paper, we proposed a context-based inter mode decision method for fast affine prediction that determines whether AME is performed in the process of rate-distortion (RD) optimization for the optimal CU-mode decision. Because a fast encoding scheme for AME should be carefully designed to minimize the coding loss, we investigated the context correlation between the upper CU and the current CU with regard to affine prediction. After defining the relation between an upper CU and its current CUs, we identified reasonable conditions to skip the affine inter mode of the current CU using the statistics of the context correlations between the upper and current CUs. The proposed method was evaluated under the JVET CTC, and we compared it with Park's method [32] on top of VTM 10.0 as an anchor. For comparison of the computational complexity, we measured the time saving (TS) of the AME time (AMT). Experimental results show that the proposed method significantly reduced the encoding complexity of AME, by up to 33%, with unnoticeable coding loss compared to the VVC Test Model (VTM). In addition, the AMT of the proposed method is 15% lower than that of the previous method, on average.

Conflicts of Interest:
The authors declare no conflict of interest.