A Fast Gradient Iterative Affine Motion Estimation Algorithm Based on Edge Detection for Versatile Video Coding

Abstract: In the Versatile Video Coding (VVC) standard, affine motion models have been applied to improve the description of complex motion patterns. However, due to the high computational complexity of affine motion estimation, real-time video processing applications face significant challenges. This paper focuses on optimizing affine motion estimation in the VVC environment and proposes a fast gradient iterative algorithm based on edge detection for efficient computation. Firstly, we establish judging conditions during the construction of the affine motion candidate list to streamline the redundant judging process. Secondly, we employ the Canny edge detection method for gradient assessment in the affine motion estimation process, thereby increasing the iteration speed of the affine motion vectors. The experimental results show that the encoding time of affine motion estimation is about 15–35% lower than that of the anchor encoder, the average encoding time of the affine motion estimation part of inter-frame prediction is reduced by 24.79%, and the peak signal-to-noise ratio (PSNR) is reduced by only 0.04 dB.


Introduction
With the increasing amount of video data and the growing demand for high-quality video services, efficient video coding technology plays a crucial role in reducing bandwidth requirements and improving video compression performance. The H.266/VVC standard is the latest video coding standard developed by the Joint Video Experts Team (JVET) [1][2][3]. It aims to provide significantly better coding efficiency than previous standards, in particular high-efficiency video coding (HEVC) [4][5][6][7], that is, a higher compression rate at the same video quality. To achieve this goal, H.266/VVC adopts a series of innovative technologies, including affine motion estimation (AME), advanced motion vector prediction (AMVP), prediction value correction based on the optical flow field, and inter-frame weighted prediction. For bidirectional prediction, decoder-side motion vector refinement (DMVR) [8], bidirectional optical flow (BDOF) [9,10], and affine motion compensation (AMC) [11][12][13] are employed at the decoding end to improve the accuracy of the prediction.
Affine motion estimation is a crucial step in video coding, serving to characterize inter-frame motion and facilitate differential frame encoding. Compared to previous motion estimation algorithms, affine motion models are more suitable for high-definition video content because they can handle complex scenes that involve the translation, rotation, and scaling of objects. However, traditional affine motion estimation algorithms suffer from high computational complexity and insufficient accuracy, which limits the efficiency and quality of video coding. VVC provides two affine models in its affine motion estimation module, namely the four-parameter affine model and the six-parameter affine model. In this module, the current coding unit (CU) is divided into 4 × 4 sub-blocks, and the motion vector of each sub-block is derived from the control point motion vectors (CPMVs) of the selected affine model.
Figure 1 shows the three basic transformations of an image. Panels (a–c) describe the rotation, translation, and scaling of an image, where α denotes the rotation angle, (∆x, ∆y) denotes the translation, and s denotes the scaling factor. The affine motion model describes the motion of an object or image in two-dimensional space under transformations such as translation, rotation, scaling, and shearing. An affine transformation can be expressed as the combination of a linear transformation and a translation. Through affine transformation, an image can be geometrically corrected, scaled, and aligned to improve the accuracy and stability of image processing and analysis. The affine transformation model of an image can be obtained through dimensionality reduction, utilizing the translation properties of the Fourier transform and estimation theory for log-polar transformed images.
Assuming that image f p2 (x, y) is the result of a translation of the image f p1 (x, y) by (∆x, ∆y), the relationship between the two images is shown in Formula (1), with the Fourier transform relationship as Formula (2) and the energy spectrum (amplitude spectrum) represented by Formula (3).
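For reference, Formulas (1)–(3) can be written in the standard form given by the Fourier shift theorem. This is a sketch under two assumptions: that a translation by $(\Delta x, \Delta y)$ moves the image content in the positive direction (the opposite convention only flips the sign of the phase term), and that $F_1$, $F_2$ denote the Fourier transforms of $f_{p1}$, $f_{p2}$ with frequency variables $(u, v)$, a notation introduced here for illustration.

```latex
\begin{align}
f_{p2}(x, y) &= f_{p1}(x - \Delta x,\; y - \Delta y) \tag{1} \\
F_2(u, v)    &= F_1(u, v)\, e^{-j 2\pi (u \Delta x + v \Delta y)} \tag{2} \\
M_2(u, v)    &= M_1(u, v) \tag{3}
\end{align}
```

The translation only changes the phase of the spectrum, so the magnitudes on both sides of (2) are equal, which is what (3) states.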
where $M_2$ and $M_1$ are the energy spectra of $F_2$ and $F_1$. Through spectral transformation, the geometric transformations of the image (rotation and scaling) are converted into translations in the frequency domain. This allows simpler and more efficient operations, because rotation and scaling can then be handled by shifting frequency components, which reduces the complexity of the geometric transformation calculations. Formula (3) implies that the amplitude spectrum is unchanged by image translation, which can be used for the dimensionality reduction estimation of the image transformation. Assuming that the positional relationship of the image satisfies Formula (4), Formula (5) is the relationship expression between the two images.
$$(x_1, y_1) = sR(\alpha)(x, y) + T, \qquad \begin{cases} x_1 = sx\cos\alpha + sy\sin\alpha + \Delta x \\ y_1 = -sx\sin\alpha + sy\cos\alpha + \Delta y \end{cases} \tag{5}$$
where $R(\alpha) = \begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}$, $T = (\Delta x, \Delta y)$, $(x_1, y_1)$ represents the coordinates in the transformed image, $(x, y)$ represents the coordinates in the original image, and $s$ is the zooming factor, describing how the point $(x, y)$ is stretched or shrunk in the horizontal and vertical directions after the rotation. If $s$ is greater than 1, it indicates an enlargement; if $s$ is between 0 and 1, it represents a reduction; if $s$ is negative, it might involve a reflection. $\alpha$ is the rotation angle, describing the degree to which the point $(x, y)$ is rotated around the origin. A positive $\alpha$ indicates a counterclockwise rotation, and a negative $\alpha$ indicates a clockwise rotation. The angle is typically given in radians, and $(\Delta x, \Delta y)$ are the displacements in the x and y directions. $mv^h_{(x,y)}$ and $mv^v_{(x,y)}$ are the horizontal and vertical MVs determined by $a$, $b$, and $(\Delta x, \Delta y)$. Replacing $s\cos\alpha$ and $s\sin\alpha$ with $a$ and $b$ yields a concise form of the MVs.
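As a worked illustration of that substitution, the concise MV form can be derived directly from Formula (5). The sketch below assumes that the motion vector is defined as the displacement $(x_1 - x,\; y_1 - y)$; if the opposite sign is used, or if the constant 1 is absorbed into $a$, the coefficients change accordingly.

```latex
\begin{align}
a &= s\cos\alpha, \qquad b = s\sin\alpha \\
mv^h_{(x,y)} &= x_1 - x = (a - 1)\,x + b\,y + \Delta x \\
mv^v_{(x,y)} &= y_1 - y = -b\,x + (a - 1)\,y + \Delta y
\end{align}
```

Only four unknowns remain, $a$, $b$, $\Delta x$, and $\Delta y$, which is why two control point motion vectors are sufficient to determine the model.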
The nonlinear mapping of the image is realized by bilinear interpolation. In this way, the rotation and scaling transformation of the image are reduced to the translation transformation, and the four-parameter motion model can be obtained.
The motion vectors of the sub-blocks of the current coding block are derived from an affine motion model with either four parameters, using two control point motion vectors CPMV1 and CPMV2, or six parameters, using three control point motion vectors CPMV1, CPMV2, and CPMV3, as shown in Figure 2. In Figure 2, the x-coordinate and y-coordinate denote the horizontal and vertical components of the motion vector, (a) is the four-parameter affine motion model, (b) is the six-parameter affine motion model, and CPMV1, CPMV2, and CPMV3 are the motion vectors of the upper-left, upper-right, and lower-left control points of the current coding unit, respectively. The motion vector of the sub-block centered on $(x, y)$ in Figure 2 is calculated by Formula (7), where $mv^h_{(x,y)}$ and $mv^v_{(x,y)}$ are the horizontal and vertical motion vector components of the current sub-block at $(x, y)$, and $W$ is the width of the current coding unit; thus, $(W - 1)$ is the distance between Control-point 1 and Control-point 2, and $mv^h_1$, $mv^v_1$, $mv^h_2$, and $mv^v_2$ are the MV components of Control-point 1 and Control-point 2. The four parameters $(a, b, T)$ are therefore determined by $mv_1$ and $mv_2$ alone. In this way, the current coding unit can be predicted from the motion vectors of its 4 × 4 sub-blocks. Formula (7) can also be rewritten in matrix form.
The six-parameter affine motion model has one more CPMV than the four-parameter affine model. The motion vectors of the current coding block are derived in a similar way to those of the four-parameter model. In the corresponding formula for the motion vector at position $(x, y)$ of the current block, $mv^h_3$ and $mv^v_3$ are the horizontal and vertical MV components of Control-point 3, $H$ is the height of the block, and $(H - 1)$ is the distance between Control-point 1 and Control-point 3.
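The derivation of sub-block motion vectors from the control points can be sketched in C++ as follows. This is a minimal illustration and not the VTM implementation: the names `Mv`, `affineMv4`, `affineMv6`, and `subBlockMvs4` are hypothetical, motion vectors are kept in floating point rather than fixed-point units, and evaluating each 4 × 4 sub-block at offset (2, 2) from its top-left corner is an assumption. The control point distance is passed in explicitly because the text above uses W − 1 and H − 1, whereas other formulations use W and H.

```cpp
#include <cassert>
#include <vector>

// Motion vector with horizontal (h) and vertical (v) components.
struct Mv {
    double h;
    double v;
};

// Four-parameter model: MV at (x, y) from the upper-left (cp1) and
// upper-right (cp2) control points; spanW is the distance between them.
Mv affineMv4(Mv cp1, Mv cp2, double spanW, double x, double y) {
    assert(spanW > 0.0);
    const double a = (cp2.h - cp1.h) / spanW;  // scaling/rotation term
    const double b = (cp2.v - cp1.v) / spanW;  // rotation term
    return {a * x - b * y + cp1.h, b * x + a * y + cp1.v};
}

// Six-parameter model: adds the lower-left control point (cp3);
// spanH is the distance between cp1 and cp3.
Mv affineMv6(Mv cp1, Mv cp2, Mv cp3, double spanW, double spanH,
             double x, double y) {
    assert(spanW > 0.0 && spanH > 0.0);
    return {(cp2.h - cp1.h) / spanW * x + (cp3.h - cp1.h) / spanH * y + cp1.h,
            (cp2.v - cp1.v) / spanW * x + (cp3.v - cp1.v) / spanH * y + cp1.v};
}

// One MV per 4x4 sub-block of a width x height coding unit (raster order),
// using the four-parameter model and the (W - 1) distance from the text.
std::vector<Mv> subBlockMvs4(Mv cp1, Mv cp2, int width, int height) {
    std::vector<Mv> mvs;
    for (int y = 0; y < height; y += 4) {
        for (int x = 0; x < width; x += 4) {
            // Assumed sub-block reference position: offset (2, 2).
            mvs.push_back(affineMv4(cp1, cp2, width - 1, x + 2, y + 2));
        }
    }
    return mvs;
}
```

When both control points carry the same motion vector, `a` and `b` are zero and every sub-block receives that vector, so the model reduces to ordinary translational motion.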
The six-parameter formula can also be rewritten in matrix form. In Figure 2, $mv_{1x}$, $mv_{2x}$, and $mv_{3x}$ represent $mv^h_1$, $mv^h_2$, and $mv^h_3$, and $mv_{1y}$, $mv_{2y}$, and $mv_{3y}$ represent $mv^v_1$, $mv^v_2$, and $mv^v_3$.
Introducing an affine motion estimation model in VVC can effectively describe complex video content and improve the performance of the encoder. However, the computational complexity of AME is high, and motion estimation in the inter-prediction module accounts for a large share of the encoding time [14]. To achieve better encoding gain at lower cost, many researchers have attempted to reduce the computational complexity of traditional affine motion estimation modules [15][16][17][18][19][20][21]. However, there is still little work on the affine motion estimation module of VVC inter-frame prediction, and considerable room for optimization remains. This paper proposes a fast gradient iterative affine motion estimation algorithm based on edge detection that accelerates the affine motion estimation process of inter-frame prediction and thereby shortens the overall encoding time of the encoder. The method consists of two steps. The first step sets judging conditions in the construction of the affine motion vector candidate set, so that redundant judging is skipped when the candidate set meets the conditions. The second step uses the Canny edge detection operator in affine motion estimation to obtain the gradient, thereby accelerating the gradient iteration. While preserving coding performance and image quality, the method reduces the time spent in the affine motion estimation part of inter-frame prediction and the overall encoding time of the encoder.
The remainder of this paper is organized as follows: Section 2 reviews related research, Section 3 introduces the proposed fast affine motion estimation algorithm in detail, Section 4 presents the experimental analysis and results, and Section 5 concludes the paper.

Related Work
Modern multimedia applications place increasingly high demands on video encoders, requiring both high coding efficiency and low computational complexity to ensure low latency and high transmission speed in real-time applications. To meet this demand, researchers are committed to designing efficient, low-complexity video encoders. Although there has been some research on reducing the computational complexity of VVC encoders, most of it has focused on accelerating early decisions in the partitioning process [15][16][17][18][19][20][21]. Reference [15] proposes a fast partitioning algorithm for intra- and inter-frame encoding. For intra-frame encoding, the Canny edge detection algorithm is used to extract image features, which are then used to determine whether to skip vertical or horizontal partitioning, achieving early termination. For inter-frame encoding, the three-frame difference method is used to determine whether an object is a moving target. Reference [16] proposes a fast texture-based CU partitioning method that evaluates the complexity of the current CU to determine whether to skip subsequent partitioning. At the same time, an improved Canny operator is used to extract edge information to exclude horizontal or vertical partitioning patterns. Reference [17] analyzes the probability of the affine motion estimation mode in bidirectional prediction, explores the mutual exclusion between skip mode and affine mode, and proposes a fast VVC affine motion estimation mode decision based on neighboring coding information. Reference [18] studies a fast motion estimation algorithm with early termination for partial blocks in the CU, using skip mode for CUs that do not require affine transformation. In reference [19], Zhao et al. extract the standard deviation and edge ratio to accelerate CU partitioning. In reference [20], the CU split information and the temporal position of the encoded frame are used to build a low-complexity encoder.
Reference [21] checks whether the optimal prediction mode of the current encoding block is skip mode. If it is, it skips the entire affine motion estimation process and checks the direction of the optimal prediction. The detection results determine whether to reduce the size of the reference sequence, thereby reducing computational complexity. Reference [22] proposes an adaptive affine four-parameter and six-parameter encoding architecture where the encoder can adaptively select between two affine motion models. Reference [23] proposes an affine motion estimation model that iteratively searches for affine motion vectors and a method for constructing an affine advanced motion vector prediction candidate (AAMVP) list, which has been adopted by the H.266/VVC standard. Reference [14] proposes an affine motion compensation based on feature matching that can further improve the efficiency of video coding. Reference [24] carries out affine motion estimation through block division and predicts each pixel using a reference coordinate system to achieve the purpose of predicting affine transformation. Reference [25] proposes an affine motion estimation scheme that does not require additional alternating segmentation and estimation, described by applying a segmented function of the parameter field, and derives a specific splitting optimization scheme at close range. Reference [26] proposes a method of using rate-distortion theory and displacement estimation error to determine the minimum bit rate required for the information transmission of prediction error in the coding process. Reference [27] proposes a method for solving the problem of relative pose estimation by using the affine transformation between feature points. Reference [28] proposes a motion compensation scheme for three-zone segmentation. Based on segmentation information, three motion compensation regions are divided, namely the edge region, foreground region, and background region. 
By using the information from these three regions, the accuracy and encoding efficiency of motion compensation are improved. Reference [29] proposes a method of edge video compression texture synthesis based on a generative adversarial network to obtain the most authentic texture information. Reference [30] proposes an affine parameter model that uses matching algorithms to find and extract feature point pairs from edges in consecutive frames and selects the optimal set of three point pairs to describe global motion. Reference [31] proposes linear applications of traditional intra-prediction modes based on a pattern correlation processing sequence, region-based template matching prediction methods, and neural-network-based intra-prediction modes. Reference [32] proposes a context-based inter-mode judging method that skips affine modes by determining whether affine motion estimation is performed during the rate-distortion optimization process of the optimal CU mode decision. Reference [33] adds momentum parameters to accelerate the iterative process based on the symmetry of the affine motion estimation iterative process.
Overall, most current research focuses on skip decisions for the affine motion estimation module, while few improvements have been made to the affine motion estimation procedure itself. In H.266/VVC, the affine motion estimation algorithm obtains the optimal affine motion vector through gradient iteration. In existing work, however, the gradient iteration still relies on the traditional traversal approach, obtaining gradient information by traversing the image. This approach is poorly suited to high-definition video that contains a large amount of affine motion, so the high computational complexity of the affine motion estimation module itself has not been well addressed. This is the gap that the present work aims to fill.

Materials and Methods
Affine motion estimation is located in the inter-prediction module of the H.266/VVC encoder and searches for motion vectors with a method similar to the inter-prediction motion estimation in H.265/HEVC. The affine motion estimation of the VVC encoder first uses affine advanced motion vector prediction (AAMVP) to obtain the starting candidate list of affine motion vectors for the current coding block. Then, the optimal set of candidate motion vectors in the list is selected as the starting search point, and the optimal motion vector combination for the current coding block is determined through iterative search.

Affine Advanced Motion Vector Prediction
The AAMVP technology is used in the inter-frame prediction of VVC to construct a candidate list of starting vector groups, using judging conditions to select the optimal vector combination as the starting position for the iterative search. In VVC, the length of the AAMVP candidate list is defined as two, and a candidate list is established for each predicted image. Each list contains only unidirectional motion information. Figure 3 illustrates the construction of the AAMVP candidate list. The encoder first checks the available neighbors of the current coding unit for inheritable information in the order of bottom left, bottom, top right, and top left. If an adjacent block uses the affine motion estimation mode, the affine information of that adjacent coded block is inherited directly. If the candidate list is not filled by this operation, the blocks adjacent to each of the three control points of the current coding block are checked separately, and the first non-affine translational motion vector found at each CPMV position is used to construct a motion vector combination.
If the list is still not filled, the temporal translational motion vector and the zero-value MV are used to fill the candidate list. The AAMVP candidate list is thus constructed in five steps, namely: inheritance from spatially adjacent affine-mode CUs; translational construction from spatially adjacent CUs; translational MV filling from spatially adjacent CUs; temporal translational MV filling; and zero-value MV padding. In the inheritance step, a spatially adjacent affine block needs to meet three conditions: first, it must be in inter-frame encoding mode; second, it must be in affine encoding mode; and, finally, its reference image must be the same as the reference image of the current CU. The translational construction from spatially adjacent CUs also needs to meet three conditions: first, inter-frame encoding mode; second, non-affine encoding mode; and, finally, the reference image must be the same as the reference image of the current CU. We define the first set of conditions as Condition_1 and the second set as Condition_2. When the conditions are not met, the judging of the current neighboring block is skipped in advance, and no further complex condition calculations and judging are needed. At the same time, after each step of the construction, a check is made on whether the candidate list is filled. If the list has already been filled in the current step, the subsequent steps are skipped, which terminates the candidate list construction early and reduces the computational burden of the encoder in this step. Figure 4 shows the optimization flowchart of AAMVP.
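The two ideas described above, checking the neighbor conditions in order with an early exit and stopping the construction once the list is full, can be sketched in C++ as follows. This is an illustrative sketch only, not VTM code: `Neighbor`, `CpmvCandidate`, `Stage`, `meetsCondition1`, `meetsCondition2`, and `buildCandidateList` are hypothetical names, and each of the five construction steps is modeled as an opaque callback.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical candidate: up to three CPMVs, each with (h, v) components.
struct CpmvCandidate {
    std::array<int, 6> mv;
};

// Hypothetical summary of a neighboring block.
struct Neighbor {
    bool isInter;
    bool isAffine;
    int refIdx;
};

constexpr std::size_t kMaxCandidates = 2;  // AAMVP list length stated in the text

// Condition_1: inter mode, affine mode, same reference picture.
// Each failed check returns immediately so later checks are skipped.
bool meetsCondition1(const Neighbor& n, int curRefIdx) {
    if (!n.isInter) return false;
    if (!n.isAffine) return false;
    return n.refIdx == curRefIdx;
}

// Condition_2: inter mode, non-affine mode, same reference picture.
bool meetsCondition2(const Neighbor& n, int curRefIdx) {
    if (!n.isInter) return false;
    if (n.isAffine) return false;
    return n.refIdx == curRefIdx;
}

// One construction step (inheritance, construction, filling, ...).
using Stage = std::function<void(std::vector<CpmvCandidate>&)>;

// Runs the steps in order and stops as soon as the list is full.
// Returns the number of steps that were actually executed.
std::size_t buildCandidateList(const std::vector<Stage>& stages,
                               std::vector<CpmvCandidate>& list) {
    std::size_t executed = 0;
    for (const Stage& stage : stages) {
        if (list.size() >= kMaxCandidates) break;  // early termination
        stage(list);
        ++executed;
        if (list.size() > kMaxCandidates) list.resize(kMaxCandidates);
    }
    assert(list.size() <= kMaxCandidates);
    return executed;
}
```

With five steps that each contribute one candidate, only the first two run; the remaining three are never evaluated, which is where the saving comes from.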

The Iterative Search of Affine Motion Vectors
In the VVC standard, the encoder takes the affine motion vector combination selected by AAMVP in the previous step as the starting search motion vector group and obtains the optimal affine motion vector combination for the current coding block through iterative search. Affine motion estimation needs to determine a set of optimal affine motion vectors, usually two or three, so the VVC encoder uses the mean squared error (MSE) as the matching criterion. In the definition of the MSE, $w$ and $h$ are the width and height of the current coding block, $P_{cur}(x, y)$ is the image in which the current coding block is located, and $P_{ref}(x, y)$ is the reference image of the current coding block.
Let $d^i_{MV}$ denote the change in the motion vector after the $i$th iteration. In the expression of the motion vector at the $i$th iteration, $(x, y)$ represents the position of the current coding block, and the change in the motion vector $d^i_{MV}$ is a row matrix whose transpose is used in the derivation. After $i$ iterations, the pixel value of the current reference point is given by Formula (13), where $(x_{i-1}, y_{i-1})$ is the position of the matching block in the previous iteration of the search. Applying a Taylor expansion to Formula (13) and ignoring the higher-order terms yields Formula (14).
To minimize the MSE during the iteration, the reference pixel value $P_{ref}(x, y)$ needs to be as close as possible to the original pixel value $P_{cur}(x, y)$. To this end, the gradient of the error with respect to the motion vector change $d^i_{MV}$ is set to zero. When this gradient is zero during the iteration, the current reference pixel value is closest to the original pixel value and gives the best match. In the encoder model of VVC, the Sobel operator is convolved with the pixel matrix to obtain the gradient.
Considering the complexity of high-definition video content, using the Sobel operator to traverse images to obtain gradients greatly increases computational complexity, and the encoder has a high time consumption. This article uses the Canny edge detection algorithm to optimize the operation of traversing images. In affine motion models, the Canny edge detection algorithm is more efficient than the Sobel operator alone at obtaining the gradient changes of motion vector groups. However, the Canny edge detection algorithm is itself more computationally complex. Because the affine motion estimation model is used only in the inter-prediction module of the VVC encoder, and only when specific conditions are met, the Canny edge detection algorithm is activated only when the encoder chooses the affine motion mode. In addition, when performing affine motion estimation, the Canny algorithm performs a global gradient calculation while processing the first frame and marks the regions that have been calculated. When processing subsequent images, the encoder checks whether the gradient of the current coding block has already been calculated in the first frame. If so, the gradient value is read directly from the cache and no global calculation is needed. The purpose of this operation is to skip image areas with an unchanged background and to focus the processing on the changing areas, that is, on the calculation of motion vectors. Algorithm 1 gives the C++ pseudocode proposed for gradient calculation. The specific steps of Algorithm 1 are as follows:
• Obtain the original data of the image, initialize the variables and the cache, and mark the images whose edges have been calculated.
• Traverse each pixel of the image and calculate the gradient and error at each pixel position. If isCannyComputed is false, the Canny edge image needs to be calculated for the first time; otherwise, this step is skipped.
• Traverse the image and repeat the calculation.
Algorithm 1: Gradient calculation (C++ pseudocode)
    ...
    if isCannyComputed is false then
        Compute Canny edge image (predMat, cannyEdges);
        Set isCannyComputed to true;
    end
    Calculate horizontal gradient using Canny edges;
    Calculate vertical gradient using Canny edges;
end
...
    Compute gradient using Canny edges;
    Update motion vectors using gradient;
    Set isCannyComputed to true;
end
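The compute-once-and-reuse pattern behind Algorithm 1 can be sketched in C++ as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `Image` and `GradientCache` are hypothetical types, and a plain 3 × 3 Sobel filter stands in for the gradient stage, so the smoothing, non-maximum suppression, and hysteresis steps of a full Canny detector are deliberately omitted. Only the `isCannyComputed` flag and the caching behavior follow the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major luma image.
struct Image {
    int width;
    int height;
    std::vector<int> pix;
    int at(int x, int y) const {
        return pix[static_cast<std::size_t>(y) * static_cast<std::size_t>(width) +
                   static_cast<std::size_t>(x)];
    }
};

class GradientCache {
public:
    // Computes the gradient maps on the first call only; later calls
    // return immediately and the cached values are reused.
    void ensureComputed(const Image& pred) {
        if (isCannyComputed_) return;
        assert(pred.pix.size() ==
               static_cast<std::size_t>(pred.width) * static_cast<std::size_t>(pred.height));
        width_ = pred.width;
        gx_.assign(pred.pix.size(), 0);
        gy_.assign(pred.pix.size(), 0);
        // Border pixels are left at zero in this sketch.
        for (int y = 1; y + 1 < pred.height; ++y) {
            for (int x = 1; x + 1 < pred.width; ++x) {
                const std::size_t i = index(x, y);
                // Horizontal gradient: right column minus left column.
                gx_[i] = (pred.at(x + 1, y - 1) + 2 * pred.at(x + 1, y) + pred.at(x + 1, y + 1)) -
                         (pred.at(x - 1, y - 1) + 2 * pred.at(x - 1, y) + pred.at(x - 1, y + 1));
                // Vertical gradient: bottom row minus top row.
                gy_[i] = (pred.at(x - 1, y + 1) + 2 * pred.at(x, y + 1) + pred.at(x + 1, y + 1)) -
                         (pred.at(x - 1, y - 1) + 2 * pred.at(x, y - 1) + pred.at(x + 1, y - 1));
            }
        }
        isCannyComputed_ = true;
        ++computeCount_;
    }

    int gx(int x, int y) const { return gx_[index(x, y)]; }
    int gy(int x, int y) const { return gy_[index(x, y)]; }
    int computeCount() const { return computeCount_; }

private:
    std::size_t index(int x, int y) const {
        return static_cast<std::size_t>(y) * static_cast<std::size_t>(width_) +
               static_cast<std::size_t>(x);
    }
    bool isCannyComputed_ = false;
    int computeCount_ = 0;
    int width_ = 0;
    std::vector<int> gx_;
    std::vector<int> gy_;
};
```

A real encoder would also need a rule for invalidating the cache when the picture content changes; the text does not specify one, so none is modeled here.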

Simulation Setup
To analyze and evaluate the performance of the proposed algorithm, the official H.266/VVC test software VTM10.0 was used as the anchor, and the proposed algorithm was implemented under the JVET Common Test Conditions (CTC) [34]. The code was compiled with VS 2019 on a 64-bit Microsoft Windows 10 operating system. The experiments use the low-delay P configuration, and the quantization parameters (QP) are 22, 27, 32, and 37. Table 1 shows the experimental environment parameters.

Performance and Analysis
The Bjøntegaard delta bitrate (BDBR) [35] is used to compare the bitrate performance of the proposed algorithm with that of the original algorithm. A negative value indicates that the proposed algorithm saves the corresponding amount of data, while a positive value indicates an increase in data volume and poorer performance. In addition, the Bjøntegaard delta peak signal-to-noise ratio (BD-PSNR) is used to compare the image quality achieved by the proposed algorithm and the original algorithm. A positive value represents an improvement in the quality of the processed image, while a negative value indicates additional distortion and poorer performance compared to the original image. Table 2 provides parameter information for the test sequences. Firstly, VTM10.0 and the proposed algorithm are compared. To compare the impact of the proposed algorithm and the algorithm in VTM10.0 on encoding time, the average time of each algorithm is calculated, where $EncT_{all,aff}$ represents either the overall encoding time of the encoder or the affine motion estimation encoding time. Because the measured time changes with QP, the average over the tested QPs is used for evaluation. Table 3 shows the experimental results of the proposed algorithm. With the improved algorithm, the overall encoding time of the encoder is reduced by 6.22%, and the encoding time of the affine motion estimation module is reduced by an average of 24.79%. For the sequences BasketballPass, BQSquare, and KristenAndSara, the reduction in affine motion estimation encoding time exceeds 30% in each case, indicating that the improved algorithm is particularly effective for video sequences with a large amount of affine motion.
In contrast, the bitrate of the BasketballDrive sequence increases noticeably, indicating that the algorithm produces more data when processing certain high-definition videos. To further assess the proposed algorithm, it is compared with current methods from the related work. As shown in Table 4, compared with Ren et al. [33], the proposed algorithm achieves a larger reduction in the overall encoding time of the VVC encoder, with little loss in BDPSNR and little increase in bitrate. The method proposed by Ren et al. [33] effectively reduces the computational complexity of affine motion estimation, but when the CU adopts affine mode, the affine motion estimation process still needs to be executed, although momentum parameters are added to accelerate the iteration. Figure 5 shows the reconstructed and original frames of a video sequence processed by the proposed algorithm. On subjective visual inspection, the distortion of the image is almost invisible. In summary, the proposed algorithm shortens the encoding time while maintaining video quality and bitrate. To show the compression and distortion behavior of the proposed algorithm more intuitively, Figure 6 shows the RD curves of the test sequences KristenAndSara and Cactus. The RD curves of the proposed method nearly coincide with those of the VVC reference software VTM10.0. This means that the proposed algorithm shortens the encoding time required by the encoder while keeping the video quality and bit rate almost unchanged.
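The averaged time measure used in this section can be illustrated with a short C++ sketch. The exact formula behind $EncT_{all,aff}$ is not reproduced here, so the function below assumes a common definition: the relative time saving against the anchor is computed for each QP and then averaged. The name `averageTimeSaving` and the sample timings in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Average relative time saving in percent over all tested QPs.
// anchorTimes[i] and proposedTimes[i] are the encoding times (overall, or
// affine motion estimation only) measured at the same QP.
double averageTimeSaving(const std::vector<double>& anchorTimes,
                         const std::vector<double>& proposedTimes) {
    assert(!anchorTimes.empty());
    assert(anchorTimes.size() == proposedTimes.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < anchorTimes.size(); ++i) {
        assert(anchorTimes[i] > 0.0);
        sum += (anchorTimes[i] - proposedTimes[i]) / anchorTimes[i];
    }
    return 100.0 * sum / static_cast<double>(anchorTimes.size());
}
```

For example, with made-up anchor times of 100, 200, 400, and 800 s at the four QPs and proposed times of 75, 150, 300, and 600 s, every QP saves 25% and the function returns 25.0.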

Conclusions
Due to the complexity of high-definition video content and the addition of multiple optimization algorithms to the encoder, the processing of affine motion estimation in VVC inter-frame prediction needs to be optimized. To address this issue, this paper proposes a fast affine motion estimation algorithm based on edge detection to accelerate the encoder's processing of affine motion patterns. By adding pre-judging conditions to the construction of the affine candidate list, unnecessary judging steps are skipped to achieve acceleration. In the iterative search for affine motion vectors, the Canny edge detection algorithm is used to accelerate the gradient iteration, making full use of image gradients that have already been calculated, avoiding repeated calculations, and making the calculation process more efficient. The experimental results show that BDBR increases by only 0.76% and BDPSNR decreases by only 0.04 dB. Compared with the anchor algorithm, the overall coding time of the proposed algorithm is reduced by 6.22%, and the coding time of the affine motion estimation part of inter-frame prediction is reduced by 24.79%. In future work, we will focus on the overall model of affine motion estimation, which can be combined with hardware to save more encoding time.

Conflicts of Interest:
The authors declare that they have no known competing financial interest or personal relationships that could have appeared to influence the work reported in this paper.