Article

Fast Intra-Prediction Mode Decision Algorithm for Versatile Video Coding Based on Gradient and Convolutional Neural Network

College of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450002, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(10), 2031; https://doi.org/10.3390/electronics14102031
Submission received: 26 March 2025 / Revised: 12 May 2025 / Accepted: 13 May 2025 / Published: 16 May 2025

Abstract

The latest Versatile Video Coding (H.266/VVC) standard introduces the QTMT structure, enabling more flexible block partitioning and significantly enhancing coding efficiency compared to its predecessor, High-Efficiency Video Coding (H.265/HEVC). However, this new structure results in changes to the size of Coding Units (CUs). To accommodate this, VVC increases the number of intra-prediction modes from 35 to 67, leading to a substantial rise in computational demands. This study presents a fast intra-prediction mode selection algorithm that combines gradient analysis and a CNN. First, the Laplace operator is employed to estimate the texture direction of the current CU block, identifying the most probable prediction direction and skipping over half of the redundant candidate modes, thereby significantly reducing the number of mode searches. Second, to further minimize computational complexity, two efficient neural network models, MIP-NET and ISP-NET, are developed to determine whether to terminate the prediction process for Matrix Intra-Prediction (MIP) and Intra-Sub-Partitioning (ISP) modes early, avoiding unnecessary calculations. This approach maintains coding performance while significantly lowering the time complexity of intra-prediction mode selection. Experimental results demonstrate that the algorithm achieves a 35.04% reduction in encoding time with only a 0.69% increase in BD-BR, striking a balance between video quality and coding efficiency.

1. Introduction

Compared to other forms of information, video typically contains a large amount of redundant data. Uncompressed high-definition (HD) video not only requires substantial storage space but also wastes transmission resources. The rise of Ultra-High-Definition (UHD) and Virtual Reality (VR) content has created a pressing need for advanced video compression methods. The existing H.265/HEVC standard struggles to address these increasing demands, prompting the Joint Video Experts Team (JVET) to develop the Versatile Video Coding (VVC) standard [1]. Research indicates that the VVC encoder (VTM) achieves approximately a 50% reduction in bitrate compared to the HEVC encoder (HM) while preserving equivalent video quality. Nevertheless, this enhancement is accompanied by a substantial rise in computational complexity, which significantly restricts the real-world applicability of VVC [2,3]. H.266/VVC introduces the Quad-Tree with nested Multi-Type Tree (QTMT) structure, expanding the partition modes of Coding Units (CUs) to five types: on top of Quad-Tree (QT) partitioning, CUs can additionally be split by Binary Tree (BT) and Ternary Tree (TT) partitioning, which greatly enhances the flexibility of CU block partitioning [4]. This structure enables VVC to better adapt to varying image content, making it particularly effective for ultra-high-definition video, but at the cost of increased encoding time. Research indicates that QTMT partitioning alone accounts for over 90% of VVC’s total encoding time [5].
The introduction of the QTMT structure results in a greater number of irregularly shaped CU blocks. To better accommodate these varying CU sizes and shapes, VVC increases the number of angular intra-prediction modes to 65 (67 modes in total, including Planar and DC). Additionally, new tools such as Matrix Intra-Prediction (MIP) [6], Intra-Sub-Partitioning (ISP) [7], and Multi-Reference Line (MRL) [8] have been introduced. The newly added angular modes allow VVC to better adapt to changes in the CU structure, exploiting the spatial correlation between pixels more effectively at the same bitrate to minimize the amount of data required for transmission. Figure 1 shows the intra-prediction modes of H.266/VVC.
VVC employs Rate-Distortion Optimization (RDO) to identify the optimal intra-prediction mode. However, performing an exhaustive RDO search across all intra-prediction modes incurs substantial computational costs. To mitigate this, a subset of potentially optimal modes is evaluated during the RDO process, referred to as the Candidate Mode List (CML). The process of constructing the CML is illustrated in Figure 2.
The construction of the CML involves two primary stages. The first stage is the Rough Mode Decision (RMD), where the encoder compares the Sum of Absolute Transform Difference (SATD) costs of normal angle modes and MIP modes. Using the SATD competition mechanism, a specific number of modes with the lowest SATD costs and a certain number of MIP modes are selected from the 67 available modes to form the initial CML. The specific comparison formula is as follows:
$$RMD_{COST} = D_{SATD} + \lambda \cdot Bit_{S_m}.$$
where $\lambda$ represents the Lagrange multiplier, $D_{SATD}$ denotes the distortion calculated based on the SATD, and $Bit_{S_m}$ represents the bit cost of the prediction mode.
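For illustration, the following minimal Python sketch evaluates such a rough-mode-decision cost for a single 4 × 4 block; the Hadamard-based SATD, its normalization, and the helper names are simplifying assumptions for this sketch and are not taken from the VTM reference software.

```python
import numpy as np

def satd_4x4(residual: np.ndarray) -> float:
    """SATD of a 4x4 residual block via a Hadamard transform
    (the normalization constant is illustrative only)."""
    h = np.array([[1,  1,  1,  1],
                  [1, -1,  1, -1],
                  [1,  1, -1, -1],
                  [1, -1, -1,  1]], dtype=np.int64)
    coeffs = h @ residual.astype(np.int64) @ h.T
    return float(np.abs(coeffs).sum()) / 2.0

def rmd_cost(original: np.ndarray, prediction: np.ndarray,
             mode_bits: float, lam: float) -> float:
    """RMD cost = D_SATD + lambda * Bit_(S_m), as in the equation above."""
    return satd_4x4(original - prediction) + lam * mode_bits
```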
In the second stage, the Intra-Sub-Partitioning (ISP) modes corresponding to the chosen angular modes, along with the angular modes from the Most Probable Modes (MPM) list, are directly incorporated into the CML. The MPM list is constructed using encoding information from neighboring CUs. VVC expands the number of MPM candidates from 3 (as in HEVC) to 6. Most CUs tend to select a mode from the MPM list as their optimal prediction mode. Utilizing MPM modes reduces the required bitrate, as the prediction mode can then be encoded using only 3 bits. The CML includes both traditional angular modes and newly introduced prediction modes, which are further assessed through the RDO process. The mode with the lowest Rate-Distortion (RD) cost is ultimately chosen as the optimal intra-prediction mode. However, the RDO process is highly time-intensive, and most research efforts have concentrated on minimizing the number of modes in the CML to speed up the encoding process. This paper introduces a fast intra-mode decision algorithm based on gradient analysis and CNNs, which significantly decreases the computational complexity of intra-mode selection by utilizing both gradient information and CNN-based prediction.
The main contributions of this paper are as follows:
  • A fast intra-normal mode selection algorithm based on gradient analysis is proposed. In this paper, the gradient of the current CU block is calculated using the Laplace operator, and the most likely prediction direction is determined based on the image gradient, thereby skipping less probable prediction modes.
  • An early termination algorithm for intra-advanced modes based on CNN is proposed. Two efficient neural network models, MIP-NET and ISP-NET, are developed to assess whether to halt the prediction process for MIP and ISP modes prematurely. By leveraging the neural network’s intelligent decision-making capabilities, redundant mode prediction computations are minimized, thereby enhancing the efficiency of intra-prediction mode selection without compromising coding performance.
The structure of this paper is outlined as follows. Section 2 offers a summary of the background and related studies. Section 3 delves into the data analysis. Section 4 elaborates on the comprehensive algorithm. Section 5 examines the experimental outcomes. Section 6 provides the concluding remarks.

2. Related Works

Prior research has developed a range of methods focused on enhancing the efficiency of VVC encoding by reducing computational complexity. These techniques mainly target two key areas: optimizing CU partitioning and simplifying mode decision-making.

2.1. Fast CU Partitioning Methods in VVC

The introduction of the QTMT structure has led to substantial changes in the CU partitioning process in VVC compared to HEVC. Consequently, numerous researchers have focused on accelerating the CU partitioning process to reduce time complexity.
In [9], a soft-target and restriction-aware neural network (STRANet) was proposed to predict CU partitioning; by combining a CNN with an attention mechanism, its Window Attention Module achieved excellent results. In [10], CTUs were divided into 32 × 32 blocks and a CNN was used to directly predict the partition of 32 × 32 CUs, avoiding unnecessary RDO calculations. In [11], a CNN was employed to predict the partition depth of the CU and combined with Random Forest Classifiers (RFCs) to determine the final partitioning. In [12,13], the CU partition results are mapped to a partition map, and a CNN is used to predict the final outcome, thus replacing the cumbersome RDO process. In [14], an algorithm for fast split mode and directional mode decisions is proposed to reduce encoding complexity. In [15], the authors use the entropy and variance of the image to measure the texture complexity of the current image and thus determine whether to terminate partitioning early. In [16], the authors use partitioning-related features of the CU to build decision trees and employ multiple decision tree classifiers to determine different partitioning types. In [17], an HG-FCN model is proposed to predict the probabilities of all boundaries in 32 × 32 blocks, thereby inferring the most likely partitioning. In [18], CUs are classified into three types based on texture features, and random forest classifiers guide their partitioning. In [19,20], the time complexity of TT partitioning is reduced by early termination of nested TT block structures after quadtree partitioning. In [21], the authors employ support vector machines to determine CU partitioning, thereby reducing RDO calculations. These approaches highlight the varied strategies researchers have employed to accelerate CU partitioning in VVC, utilizing machine learning, deep learning, and heuristic methods to achieve notable reductions in encoding time without compromising coding efficiency.

2.2. Fast Mode Decision Methods in VVC

In VVC, the number of intra-prediction modes has increased from 35 in traditional HEVC to 67, enhancing adaptability to diverse texture features and image content. While this expansion boosts coding efficiency, particularly for complex textures, it also significantly raises computational demands. Consequently, recent research has concentrated on creating efficient fast mode prediction techniques, aiming to minimize unnecessary mode searches through intelligent screening and optimization strategies, thus maintaining coding performance while lowering computational complexity. In [22], pixel value deviation was utilized to predict the texture direction of a coding unit (CU), and an improved fast intra-prediction mode decision scheme was proposed, reducing the number of intra-prediction modes added to the rate-distortion optimization (RDO) mode set. In [23,24], a gradient-based intra-mode selection method was established by leveraging the prediction direction of CUs to skip unlikely modes for each CU. In [25], an improved Lagrange multiplier based on intra-mode decision was proposed for mode prediction of chroma components in video coding. In [26], a fast intra-mode selection algorithm based on gradient descent search was introduced, with an investigation into the optimal initial search points and step sizes. However, these methods only explored the skipping of angular modes and did not consider the newly introduced prediction tools in VVC, resulting in a limited reduction in computational complexity.
In [27], the distribution of newly introduced prediction modes in VVC was analyzed, and a fast intra-mode decision method based on deep multi-task learning was proposed to determine whether to use these new prediction modes. In [16], classifiers based on MRL (Multiple Reference Line) and Hadamard Cost were employed to estimate the probability of prediction modes, and a conditional mode list (CML) was constructed to reduce computational overhead. In [28], a fast video coding algorithm based on mode selection and prediction termination was proposed. A learning-based classifier was designed to intelligently eliminate Intra-Block Copy (IBC) and Intra-Sub-Partition (ISP) modes, and an ensemble decision strategy was used to rank candidate modes, increasing the probability that the top candidates would be optimal, thereby more effectively removing redundant modes. In [29], the correlation between QP (Quantization Parameter), BT (Binary Tree) depth, and the distribution of intra-modes was analyzed, and empirical rules were derived from extensive encoding data to determine the most probable prediction modes based on QP and BT depth. In [30], a multi-task learning (MTL)-based intra-mode decision framework was proposed to reduce the number of final intra-prediction modes, thereby lowering computational complexity. Although these algorithms have achieved significant progress in optimizing VVC encoding, further research is still needed to address challenges such as model complexity, generalization capability, and the trade-off between accuracy and efficiency, to better meet the demands of practical applications.
These methods demonstrate the diverse approaches researchers have taken to optimize mode decision in VVC, leveraging machine learning, deep learning, and heuristic techniques to achieve significant reductions in encoding time while maintaining coding efficiency. Table 1 shows the differences between this study and existing research.

3. Preliminaries: Data Analysis

To further explore the characteristics of intra-prediction modes in VVC, we conducted encoding statistics on 13 common video sequences and collected encoding information. Table 2 provides detailed information and parameters of the encoded videos.

3.1. Modal Distribution Analysis of Angular Mode

Based on the encoding information, we summarized the angular modes selected by all CU blocks in the sequences of Table 2, as shown in Figure 3. It is evident that the distribution of the different angular modes is highly uneven. CUs show a strong preference for the Planar, DC, horizontal (Mode 18), and vertical (Mode 50) modes, which together account for 45% of all selections, while the remaining 63 angular modes share the other 55%.
This trend is even more pronounced in test sequences with moderate motion and uniform texture, such as the ParkScene sequence in Figure 3, where a higher proportion of CUs tend to select the Planar and DC modes. This is likely because uniform sequences contain more flat CU blocks, further demonstrating the strong correlation between prediction modes and texture characteristics.

3.2. Analysis of MIP and ISP Mode Distribution

As an advanced mode in VVC, ISP is an intra-prediction technique designed for the luma component, aiming to enhance prediction accuracy by dividing CUs into smaller sub-blocks. The MIP mode represents a further application of deep learning in VVC, utilizing matrices trained through deep learning to generate prediction signals.
We statistically analyzed the usage percentages of the ISP and MIP modes across all sequences, as shown in Table 3. The statistics show that ISP and MIP are not selected as the best mode in most cases. Nevertheless, the encoder still carries out the full, computationally expensive ISP/MIP evaluation even when these modes are ultimately not chosen, so terminating these redundant calculations in advance can yield a considerable reduction in time complexity.
Experimental results indicate a strong relationship between the distribution of ISP and MIP modes and texture properties. Figure 4 provides an example, with rectangles highlighting regions encoded using ISP mode. As shown in Figure 4 (left), the ISP mode is typically distributed in regions with distinct directional features, such as vertical lines, horizontal stripes, or diagonal edges. These CUs often exhibit clear dividing lines, with similar textures on either side. In such CUs, ISP combined with directional intra-prediction modes can better adapt to local texture orientations.
The MIP mode in VVC represents a new concept for intra-prediction. It first samples adjacent pixels to form a vector, which is then multiplied by a matrix derived from the video sequence to produce a scatter array of partial prediction values, ultimately generating the final prediction. Figure 4 (right) illustrates an example of MIP mode, where it is observed that CUs encoded with MIP mode are not in flat regions and lack a clear directional gradient, instead exhibiting a granular, scattered texture. Therefore, both ISP and MIP modes are highly correlated with image texture features.
Based on the above analysis, it can be concluded that
  • Most angular modes, as well as the advanced ISP and MIP modes, are selected only infrequently.
  • The selection of both angular and advanced modes is highly correlated with image texture.
Therefore, based on the texture features of the image, some angular modes can be skipped early and the advanced modes can be terminated early.

4. The Proposed Method

This section introduces a rapid intra-mode decision algorithm leveraging gradient analysis and CNN, effectively reducing the time complexity of intra-mode decision-making through the integration of gradient data and CNN predictions. The algorithm’s comprehensive workflow is depicted in Figure 5.
The proposed algorithm mainly consists of two parts. First, the texture direction is predicted based on gradient information, and the most likely angular modes are selected. These modes are then combined with the commonly used DC and Planar modes to generate a preliminary candidate mode list (CML) through SATD comparison. Second, a trained CNN model is utilized to determine whether to enable or terminate the advanced prediction modes, MIP and ISP. If the CNN prediction result indicates that MIP or ISP modes should be enabled, the full mode search process is executed, and the results are combined with the preliminary CML to form the final candidate mode list. If the prediction result suggests terminating MIP or ISP modes, the preliminary CML is directly used for subsequent RDO. By integrating gradient analysis with CNN prediction, the proposed algorithm greatly improves the efficiency of intra-mode decision-making without compromising coding performance.
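As a minimal, purely illustrative sketch of this control flow (not the VTM implementation), the function below assembles a candidate mode list from gradient-filtered angular candidates and CNN-gated advanced modes; the list length of 6 and the 0.5 decision threshold are assumed values, and the SATD costs and network probabilities are supplied by the caller.

```python
def build_candidate_list(angular_cands, satd_cost, mip_prob, isp_prob,
                         mip_modes, isp_modes, threshold=0.5):
    """Assemble the candidate mode list (CML) in two stages:
    gradient-filtered angular modes first, then MIP/ISP modes only if
    MIP-NET / ISP-NET predict they are worth testing."""
    # Stage 1: keep the angular candidates with the lowest SATD cost.
    cml = sorted(angular_cands, key=satd_cost)[:6]
    # Stage 2: CNN-gated advanced modes.
    if mip_prob >= threshold:
        cml += list(mip_modes)
    if isp_prob >= threshold:
        cml += list(isp_modes)
    return cml
```

In the actual encoder, the SATD costs come from the rough mode decision stage, and the two probabilities come from MIP-NET and ISP-NET as described in Section 4.2.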

4.1. Intra-Angular Mode Prediction

To address the high time complexity of mode prediction, early skipping has become an effective optimization strategy, significantly reducing computational complexity by skipping, in advance, angular modes that do not match the image texture features. In video coding, there is a close relationship between gradient information and intra-angular prediction modes. Gradients reflect the direction and intensity of pixel value changes in an image, effectively characterizing texture features and edge information. In intra-prediction, angular prediction modes predict pixel values along specific directions, and gradient information can precisely indicate the main direction of image texture, providing crucial guidance for selecting angular prediction modes. This study introduces a rapid mode decision algorithm utilizing the Laplace operator, which decreases the number of RMD (Rough Mode Decision) modes by omitting specific angular modes.
As shown in Figure 6, we broadly categorize video textures into four directions, $D \in \{0°, 45°, 90°, 135°\}$, corresponding to the horizontal, vertical, and two diagonal directions. The distribution of VVC intra-prediction modes is relatively concentrated around these four main directions; a finer division would increase computational complexity while providing limited benefit. The angular modes are likewise divided into four regions based on these directions. The gradient for each direction is calculated using a one-dimensional Laplace operator [31], with the following formulas:
$$L_{0} = \left| 2f(x,y) - f(x-1,y) - f(x+1,y) \right|.$$
$$L_{45} = \left| 2f(x,y) - f(x-1,y+1) - f(x+1,y-1) \right|.$$
$$L_{90} = \left| 2f(x,y) - f(x,y-1) - f(x,y+1) \right|.$$
$$L_{135} = \left| 2f(x,y) - f(x-1,y-1) - f(x+1,y+1) \right|.$$
where $f(x,y)$ represents the pixel value at coordinate $(x,y)$, and $L_D$ represents the gradient of that point in direction $D$. The gradient of the CU block is the sum of the gradients of all its interior points, normalized by the block area:
$$|G_D| = \frac{\sum_{x=1}^{W-1} \sum_{y=1}^{H-1} L_D(x,y)}{H \times W}.$$
where $H$ and $W$ denote the height and width of the CU block, respectively, and $G_D$ represents the gradient of the CU block in direction $D$.
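A compact NumPy sketch of these per-block gradients is given below, assuming `cu` is a 2-D array of luma samples; pixels on the block border are excluded from the sums, matching the summation limits above. This is an illustrative re-implementation, not the encoder's code.

```python
import numpy as np

def directional_gradients(cu: np.ndarray) -> dict:
    """Average 1-D Laplacian response of a CU block in the four
    texture directions (0, 45, 90, and 135 degrees)."""
    h, w = cu.shape
    f = cu.astype(np.int64)
    c = f[1:-1, 1:-1]                                        # interior points f(x, y)
    grads = {
        0:   np.abs(2 * c - f[1:-1, :-2] - f[1:-1, 2:]),     # horizontal neighbours
        90:  np.abs(2 * c - f[:-2, 1:-1] - f[2:, 1:-1]),     # vertical neighbours
        45:  np.abs(2 * c - f[2:, :-2]  - f[:-2, 2:]),       # 45-degree diagonal
        135: np.abs(2 * c - f[:-2, :-2] - f[2:, 2:]),        # 135-degree diagonal
    }
    return {d: g.sum() / (h * w) for d, g in grads.items()}
```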
Gradient information can indicate the primary direction of image texture, and directions with smaller Laplacian gradients are more likely to be the final prediction directions. Based on this characteristic, we designed an angular mode skipping strategy to reduce the time complexity of the intra-prediction mode decision. After calculating the gradient, we follow the process below to skip other angular modes (as shown in the flowchart in Figure 7).
  • If the condition $G_0 < G_{90}$ is met, we further compare the gradient values of $G_{45}$ and $G_{135}$. If $G_{45} < G_{135}$ is also satisfied, Regions 2, 3, and 4 are skipped; otherwise, Regions 1, 3, and 4 are skipped.
  • If the condition $G_0 > G_{90}$ is met, we further compare the gradient values of $G_{45}$ and $G_{135}$. If $G_{45} > G_{135}$ is satisfied, Regions 1, 2, and 3 are skipped; otherwise, Regions 1, 2, and 4 are skipped.
According to the data analysis in Section 3, most CUs tend to select the DC or Planar mode. Therefore, when constructing the most probable reference mode list, we always include the DC and Planar modes in the candidate list to ensure efficient encoding of flat regions. Through the above strategy, we can significantly reduce unnecessary angular mode searches, thereby lowering computational complexity while maintaining high coding efficiency. This method is particularly suitable for CU blocks with strong directional textures, enabling the acceleration of intra-prediction mode selection without significantly affecting coding performance.
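Putting the two comparison rules and the always-kept Planar/DC modes together, a minimal sketch of the stage-1 candidate construction could look as follows. It reuses the `directional_gradients` helper from the previous sketch; the mapping from region indices to concrete angular mode numbers is left as a parameter because it follows Figure 6 and is not reproduced here.

```python
PLANAR, DC = 0, 1   # non-angular modes that are always kept

def select_region(g: dict) -> int:
    """Pick the single angular-mode region to keep (Regions 1-4 as in
    Figure 6), following the comparison rules listed above."""
    if g[0] < g[90]:
        return 1 if g[45] < g[135] else 2
    return 4 if g[45] > g[135] else 3

def candidate_angular_modes(cu, modes_in_region: dict) -> list:
    """Gradient-guided candidate set: the angular modes of the selected
    region plus Planar and DC (always retained for flat content)."""
    region = select_region(directional_gradients(cu))
    return [PLANAR, DC] + list(modes_in_region[region])
```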

4.2. CNN Models for ISP and MIP

To reduce the computational complexity caused by ISP and MIP, this paper designs a CNN-based early termination mechanism aimed at minimizing redundant calculations for the ISP and MIP modes. As shown in Figure 8, the convolutional layers extract features, the extracted features are then fused with the external feature QP through a half-mask operation to improve the robustness of the model to QP, and finally the fully connected layers perform a nonlinear transformation and dimensional mapping of the features, with Dropout used to prevent over-fitting. Since VVC supports multiple CU sizes (17 in total) during the encoding process, we designed a universal network structure capable of adaptively handling all possible CU sizes. The input of this model is an image block composed of the CU and its adjacent reference pixels with dimensions (w + 1) × (h + 1), while the output is the probability of terminating the ISP or MIP mode. The design of this lightweight CNN structure is based on the following considerations: the termination of ISP/MIP modes is essentially a binary classification problem that only requires determining whether to enable the mode, and since the input data consist of local pixel blocks with limited spatial scope, a simple CNN structure is sufficient to effectively capture directional texture features. Meanwhile, to meet the real-time requirements of VVC encoding, we employ an adaptive-depth network design (for instance, four residual units for deep feature extraction on 64 × 64 CUs but only one residual unit for 4 × 4 CUs), significantly reducing computational complexity while maintaining feature extraction capability. This design not only accounts for differences in the feature information carried by CUs of varying sizes but also ensures efficient operation across all CU size categories.
In the model design, we combine the current CU with its adjacent reference pixels to form an image of size (w + 1) × (h + 1) as input, leveraging the significant influence of reference pixels above and to the left of the CU on intra-prediction mode selection. Subsequently, a 2 × 2 convolutional kernel is applied to the expanded CU block to preliminarily extract features from the current CU and its neighboring reference pixels.
To further extract deep features, the feature maps are processed through one or more Residual Units (the specific parameters of the model are shown in Table 4). Residual units were first introduced by He et al. in 2015 in ResNet [32] and have been widely validated for their effectiveness in deep learning. Since CU blocks of different sizes carry varying amounts of feature information, larger CU blocks typically contain more pixel information and thus require deeper network structures to fully extract their texture features. Therefore, we designed more residual units for larger CU blocks to enhance the network’s representational capacity, while using shallower network structures for smaller CU blocks to reduce computational overhead.
Additionally, for CU blocks of different shapes, we designed different residual unit structures. For square CU blocks where w = h, a residual unit structure with a 3 × 3 convolutional kernel and a padding size of 1 × 1 is used to ensure consistent input and output feature map sizes. For rectangular CU blocks where w > h or w < h, 3 × 1 or 1 × 3 convolutional kernels are used, respectively, along with asymmetric padding strategies of (2, 1) or (1, 2). This design helps alleviate the imbalance in the width and height dimensions of CU blocks.
To further optimize computational efficiency, for 64 × 64 and 32 × 32 coding units, a max-pooling layer is introduced after the residual units to reduce the size of the feature maps, thereby avoiding a significant increase in computational complexity due to excessively large feature maps. This design effectively controls the computational overhead of the model while ensuring feature extraction capability. After completing the adaptive feature extraction of the CU block, we convert the extracted feature maps into feature vectors. These vectors are then passed through the fully connected layer F1, followed by the addition of a Dropout layer to reduce the risk of model overfitting. Additionally, considering the significant correlation between encoding results and the quantization parameter (QP, denoted as q), a half-masking operation based on normalized QP values is introduced before and after the fully connected layer F1. This operation enhances the representational capacity of the feature vectors and improves the model’s robustness to QP variations. The relevant formula is presented below.
$$\tilde{q} = \frac{q}{51} + 0.5.$$
In the feature fusion stage, we scale 50% of the features in the vector using weight coefficients q ˜ , while the remaining 50% retain their original values. This approach enhances the model’s sensitivity to key features. Subsequently, the feature vector sequentially passes through fully connected layers, outputting a two-dimensional vector whose components represent the probability distribution of terminating or enabling ISP/MIP modes. It should be noted that, except for the output layers, which use the Softmax function for probability normalization, all convolutional layers and intermediate fully connected layers in the network employ PReLU as the activation function.
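The following PyTorch sketch shows how such an early-termination network could be assembled for a single square CU size. The channel and fully connected widths loosely follow the 16 × 16 row of Table 4, while the dropout rate, the omission of pooling (which is only used for 64 × 64 and 32 × 32 CUs), and other details are simplifying assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """3x3 residual unit with PReLU, as used for square CU blocks."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(x + self.conv2(self.act(self.conv1(x))))

def half_mask(v: torch.Tensor, q_norm: torch.Tensor) -> torch.Tensor:
    """Scale the first half of the feature vector by q~, leave the rest unchanged."""
    half = v.shape[1] // 2
    return torch.cat([v[:, :half] * q_norm, v[:, half:]], dim=1)

class EarlyTerminationNet(nn.Module):
    """Simplified ISP-/MIP-NET sketch for a single square w x h CU size."""
    def __init__(self, cu_size: int = 16, channels: int = 16,
                 f1: int = 128, f2: int = 64):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, kernel_size=2)   # 2x2 conv on the (w+1)x(h+1) input
        self.res = ResidualUnit(channels)
        self.fc1 = nn.Linear(channels * cu_size * cu_size, f1)
        self.drop = nn.Dropout(p=0.5)                       # dropout rate is an assumption
        self.fc2 = nn.Linear(f1, f2)
        self.out = nn.Linear(f2, 2)                         # terminate vs. keep ISP/MIP
        self.act = nn.PReLU()

    def forward(self, block: torch.Tensor, qp: torch.Tensor) -> torch.Tensor:
        q_norm = (qp / 51.0 + 0.5).view(-1, 1)              # q~ from the equation above
        x = self.act(self.stem(block))
        x = self.res(x).flatten(1)
        x = half_mask(x, q_norm)                            # half-mask before F1
        x = self.drop(self.act(self.fc1(x)))
        x = half_mask(x, q_norm)                            # half-mask after F1
        x = self.act(self.fc2(x))
        return torch.softmax(self.out(x), dim=1)

# Example: probabilities of (terminate, keep) for one 16 x 16 CU at QP 32.
net = EarlyTerminationNet()
prob = net(torch.randn(1, 1, 17, 17), torch.tensor([32.0]))
```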

5. Experimental Results and Analyses

5.1. Experimental Configuration

The experiments in this paper were conducted using the AI configuration of the standard test software VTM10.0 (using the encoder configuration file in vtm.cfg with QP values of 22, 27, 32, 37), and the encoding data were the JVET test sequences [33]. We used Bjøntegaard Delta Bit Rate (BDBR) and the time saving rate (TS) to evaluate the performance of the algorithm. The formula for calculating TS, which represents the time saved relative to the original encoder, is as follows:
$$TS(\%) = \frac{T_{\mathrm{VTM10.0}} - T_{\mathrm{proposed}}}{T_{\mathrm{VTM10.0}}} \times 100\%.$$
where $T_{\mathrm{VTM10.0}}$ represents the encoding time of the original encoder, and $T_{\mathrm{proposed}}$ denotes the encoding time of the proposed method in this paper.
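As a hypothetical worked example, a sequence that takes 100 s to encode with VTM 10.0 and 65 s with the proposed method would yield $TS = (100 - 65)/100 \times 100\% = 35\%$.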

5.2. CNN Training

We analyzed the usage percentages of ISP/MIP in the Div2K public image dataset [34], as shown in Figure 9. When encoding the Div2K dataset using the VTM encoder in AI configuration, we obtained over 2.5 million CU instances.
Figure 9 illustrates the distribution of labels in the dataset. It is evident that the ISP/MIP flags are heavily imbalanced regardless of CU block size, which poses a considerable challenge for CNN training. To facilitate training, the data were partially balanced in advance: if the minority class (CUs that use ISP/MIP) contains Min samples, we select Min samples from the Max samples in which ISP/MIP is disabled for training. We employ the Cross-Entropy loss function to quantify the difference between the predicted probability distribution and the true label distribution, thereby optimizing the model’s classification performance. The Cross-Entropy loss is widely used in deep learning because it effectively measures the uncertainty of model outputs in classification tasks and guides the model parameters toward a good solution through gradient descent. The formula is as follows:
$$L_{CE} = -\frac{1}{N} \sum_{n=1}^{N} y_n \log \hat{y}_n.$$
where $N$ is the size of the mini-batch, $y_n$ denotes the true ISP/MIP label of the $n$-th CU, and $\hat{y}_n$ denotes the predicted probability for the $n$-th CU. In our model, the trainable parameters include the weights and biases of all convolutional and fully connected layers, as well as the learnable parameters of the PReLU functions after each layer.
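A condensed training sketch under the same assumptions as the model sketch above is shown below. The undersampling step mirrors the Min-versus-Max balancing described earlier, while the optimizer, learning rate, batch size, and epoch count are illustrative choices rather than the reported training settings; `labels` is expected to be a LongTensor of 0/1 ISP/MIP flags.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def balance(blocks, qps, labels):
    """Undersample the majority class so both labels are equally represented."""
    pos = (labels == 1).nonzero(as_tuple=True)[0]   # CUs that use ISP/MIP
    neg = (labels == 0).nonzero(as_tuple=True)[0]   # CUs that do not
    n = min(len(pos), len(neg))
    keep = torch.cat([pos[torch.randperm(len(pos))[:n]],
                      neg[torch.randperm(len(neg))[:n]]])
    return blocks[keep], qps[keep], labels[keep]

def train(net, blocks, qps, labels, epochs=10):
    blocks, qps, labels = balance(blocks, qps, labels)
    loader = DataLoader(TensorDataset(blocks, qps, labels),
                        batch_size=64, shuffle=True)
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    nll = torch.nn.NLLLoss()   # cross-entropy, since the net already outputs probabilities
    for _ in range(epochs):
        for b, q, y in loader:
            opt.zero_grad()
            loss = nll(torch.log(net(b, q) + 1e-8), y)   # L_CE from the equation above
            loss.backward()
            opt.step()
```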
In practical applications, the accuracy of the model directly affects the probability of misjudgment. If a CU that should terminate ISP/MIP is mistakenly judged as using ISP/MIP, no complexity reduction is obtained [35]; conversely, if a CU that should use ISP/MIP is mistakenly judged as terminating it, the BD-BR increases. Both types of misjudgment degrade encoding performance, making model accuracy crucial. Figure 10 shows the accuracy of ISP-NET/MIP-NET for different CU sizes. It can be observed that, regardless of CU size, the trained models achieve accuracies above 70% in deciding whether to terminate ISP/MIP, which helps ensure the performance of the proposed method.

5.3. Ablation Experiments

To better demonstrate the overall performance and feasibility of the proposed algorithm, experiments were conducted on 18 video sequences from Class A to Class E, encompassing various resolutions and texture complexities. As shown in Table 5, the proposed algorithm achieves an average complexity reduction of 35.04% with only a 0.69% increase in BD-BR. The algorithm can be divided into two parts: the gradient-based angular mode decision and the CNN-based early termination mechanism for ISP/MIP. From Table 5, it can be observed that the first stage alone reduces encoding time by 27.27% with a BD-BR increase of only 0.55%, indicating that it effectively skips unnecessary angular modes. The second stage alone saves 10.39% of encoding time on average, and its BD-BR increase remains negligible, demonstrating that it effectively terminates unnecessary ISP/MIP evaluations and significantly reduces encoding complexity. The data in the table illustrate the performance of each stage of the algorithm. The experiments show that both stages effectively accelerate the video encoding process with negligible increases in BD-BR, and that the two stages do not interfere with each other and can be integrated to achieve excellent performance.

5.4. Comparison with State-of-the-Art Algorithms

Table 6 presents a comparative analysis of the proposed method against other leading-edge techniques. The findings reveal that our algorithm substantially decreases computational complexity, achieving an average reduction of 35.04%, while the BD-BR increase is limited to just 0.69%. In comparison, Li’s algorithm [22] achieves an average time saving of 33.51% with a BD-BR increase of 0.51%, Zouidi’s algorithm [30] achieves an average time saving of 21.69% with a BD-BR increase of 1.77%, and Ni’s algorithm [24] achieves an average time saving of 17.30% with a BD-BR increase of 0.19%. From this comparison, it is evident that the proposed algorithm outperforms the others in terms of average time complexity reduction. Compared to Li’s algorithm, we achieve a slightly larger reduction in computational complexity with a negligible additional BD-BR increase. Compared to Zouidi’s algorithm, we achieve both a larger time saving and a lower BD-BR. Although the BD-BR increase is higher than that of Ni’s algorithm, the improvement in time saving is substantial.
Figure 11 provides a more intuitive comparison of our algorithm with the others across all test sequences. It can be observed that, except for the lower-resolution Class D videos, the distribution of our results is generally better than that of the other algorithms. Based on the above analysis, the performance of the algorithm proposed in this paper is significantly better than that of the methods in references [22,24,30].

5.5. Experimental Results and Fast Coding Discussion

The new-generation H.266/VVC standard integrates a number of innovative coding technologies and has made significant improvements in coding efficiency and video quality. However, these technical improvements lead to a sharp increase in computational complexity, which seriously restricts practical application. To address this problem, this paper focuses on optimizing the VVC intra-coding mode decision and proposes a fast coding algorithm based on machine learning. The experimental results show that the algorithm achieves a 35.04% saving in encoding time while keeping the rate-distortion performance nearly unchanged, significantly improving the practicality of VVC.
The experimental analysis shows that reducing the number of candidate modes in the RDO calculation effectively reduces coding complexity, mainly because the time cost of gradient calculation and CNN inference is far lower than that of the traditional RDO calculation. Further analysis shows that skipping angular modes brings a more significant acceleration than terminating advanced modes such as MIP and ISP, since the angular mode search accounts for a larger share of the computation. Experiments on standard test sequences verify the effectiveness of the proposed algorithm: encoding time savings of 42.62% and 40.15% are achieved on the FoodMarket4 and ParkRunning3 sequences, respectively, which have relatively simple texture features. This shows that the mode decision method based on gradients and CNNs adapts particularly well to video with low-complexity textures.
Although this method has achieved some results, there are still other methods worth verifying:
  • Feature extraction optimization: The current algorithm mainly depends on texture features. In the future, more coding context information (such as the MPM reference list) could be incorporated to improve the accuracy of mode prediction;
  • Extension of machine learning methods: In addition to CNNs, lightweight models such as decision trees and support vector machines (SVMs) could be explored to fit the computing-power constraints of mobile terminals;
  • Multi-stage joint optimization: In addition to intra-mode decision-making, future work can further explore acceleration strategies for coding unit (CU) partitioning, loop filtering, and inter coding to achieve more comprehensive VVC complexity optimization.
This research provides a feasible scheme for VVC real-time coding, and lays a foundation for the subsequent research on video coding optimization based on machine learning.

6. Conclusions

This study introduces a rapid intra-mode decision approach leveraging gradient analysis and CNN to address the high computational demands imposed by the QTMT structure in VVC. Initially, the distribution of the various angular modes and the attributes and utilization rates of the newly introduced ISP and MIP modes in VVC are examined. Subsequently, the correlation between image texture orientation and mode selection is explored, and the gradient of the current CU block is computed using the Laplace operator. Utilizing these calculations, the most probable prediction direction is identified, allowing for the elimination of over half of the redundant candidate modes. To further diminish computational complexity, two CNN models, MIP-NET and ISP-NET, are developed to ascertain whether to halt the prediction process of the MIP and ISP modes, respectively. The architecture and training techniques of the CNN are also elaborated. Ultimately, ablation experiments validate the efficacy of the proposed two-stage algorithm. In comparison to VTM 10.0, the proposed algorithm achieves a 35.04% reduction in encoding time with a mere 0.69% rise in BD-BR. However, the current work still has certain limitations, and some challenges faced in practical applications need to be further addressed. Future research could optimize mode decision accuracy and reduce prediction errors by incorporating more contextual information and combining multi-scale spatiotemporal joint features.

Author Contributions

Conceptualization, N.L. and Z.W.; methodology, N.L.; software, L.H.; validation, N.L., Q.Z., Z.W., L.H. and W.Z.; formal analysis, W.Z.; investigation, Z.W.; resources, Q.Z.; data curation, Z.W. and L.H.; writing—original draft preparation, N.L.; writing—review and editing, N.L. and W.Z.; visualization, N.L.; supervision, Q.Z.; project administration, Q.Z.; funding acquisition, Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China Nos. 61771432 and 61302118, the Basic Research Projects of the Education Department of Henan No. 21zx003, the Key Project of the Natural Science Foundation of Henan No. 232300421150, the Henan Provincial Science and Technology Research Project No. 242102211020, the Scientific and Technological Project of Henan Province No. 232102211014, the Postgraduate Education Reform and Quality Improvement Project of Henan Province No. YJS2023JC08, and the Zhongyuan Science and Technology Innovation Leadership Program No. 244200510026.

Data Availability Statement

The data can be shared upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bross, B.; Wang, Y.K.; Ye, Y. Overview of the Versatile Video Coding (VVC) Standard and Its Applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764. [Google Scholar] [CrossRef]
  2. Wang, M.; Zhang, J.; Huang, L.; Xiong, J. Machine Learning-Based Rate Distortion Modeling for VVC/H.266 Intra-Frame. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar] [CrossRef]
  3. Filipe, J.N.; Tavora, L.M.; Faria, S.M.; Navarro, A.; Assuncao, P.A. Complexity Reduction Methods for Versatile Video Coding: A Comparative Review. Digit. Signal Process. 2025, 160, 105021. [Google Scholar] [CrossRef]
  4. Pfaff, J.; Filippov, A.; Liu, S.; Zhao, X.; Chen, J.; De-Luxan-Hernandez, S.; Wiegand, T.; Rufitskiy, V.; Ramasubramonian, A.K.; Van Der Auwera, G. Intra Prediction and Mode Coding in VVC. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3834–3847. [Google Scholar] [CrossRef]
  5. Tissier, A.; Mercat, A.; Amestoy, T.; Hamidouche, W.; Vanne, J.; Menard, D. Complexity Reduction Opportunities in the Future VVC Intra Encoder. In Proceedings of the 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), Kuala Lumpur, Malaysia, 27–29 September 2019; pp. 1–6. [Google Scholar] [CrossRef]
  6. Huo, J.; Sun, Y.; Wang, H. Unified Matrix Coding for NN Originated MIP in H.266/VVC. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 1635–1639. [Google Scholar] [CrossRef]
  7. De-Luxan-Hernandez, S.; George, V.; Ma, J.; Nguyen, T.; Schwarz, H.; Marpe, D.; Wiegand, T. An Intra Subpartition Coding Mode for VVC. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1203–1207. [Google Scholar] [CrossRef]
  8. Chang, Y.J.; Jhu, H.J.; Jiang, H.Y. Multiple Reference Line Coding for Most Probable Modes in Intra Prediction. In Proceedings of the 2019 Data Compression Conference (DCC), Snowbird, UT, USA, 26–29 March 2019; p. 559. [Google Scholar] [CrossRef]
  9. Sun, T.; Wang, Y.; Huang, Z.; Sun, J. STRANet: Soft-Target and Restriction-Aware Neural Network for Efficient VVC Intra Coding. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11993–12005. [Google Scholar] [CrossRef]
  10. Huang, Y.H.; Chen, J.J.; Tsai, Y.H. Speed Up H.266/QTMT Intra-Coding Based on Predictions of ResNet and Random Forest Classifier. In Proceedings of the 2021 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 10–12 January 2021; pp. 1–6. [Google Scholar] [CrossRef]
  11. Chen, J.J.; Huang, Y.H.; Yu, H.Y.; Tsai, Y.H. A Fast H.266/QTMT Intra Coding Scheme Based on Predictions of Learned Models. J. Chin. Inst. Eng. 2024, 47, 703–718. [Google Scholar] [CrossRef]
  12. Li, T.; Xu, M.; Tang, R.; Chen, Y.; Xing, Q. DeepQTMT: A Deep Learning Approach for Fast QTMT-based CU Partition of Intra-mode VVC. IEEE Trans. Image Process. 2021, 30, 5377–5390. [Google Scholar] [CrossRef]
  13. Feng, A.; Liu, K.; Liu, D.; Li, L.; Wu, F. Partition Map Prediction for Fast Block Partitioning in VVC Intra-Frame Coding. IEEE Trans. Image Process. 2023, 32, 2237–2251. [Google Scholar] [CrossRef] [PubMed]
  14. Huang, Y.; Yu, J.; Wang, D.; Lu, X. Learning-Based Fast Splitting and Directional Mode Decision for VVC Intra Prediction. IEEE Trans. Broadcast. 2024, 70, 681–692. [Google Scholar] [CrossRef]
  15. Si, L.; Yan, A.; Zhang, Q. Fast CU Decision Method Based on Texture Characteristics and Decision Tree for Depth Map Intra-Coding. EURASIP J. Image Video Process. 2024, 2024, 34. [Google Scholar] [CrossRef]
  16. Li, Y.; He, Z.; Zhang, Q. Fast Decision-Tree-Based Series Partitioning and Mode Prediction Termination Algorithm for H.266/VVC. Electronics 2024, 13, 1250. [Google Scholar] [CrossRef]
  17. Wu, S.; Shi, J.; Chen, Z. HG-FCN: Hierarchical Grid Fully Convolutional Network for Fast VVC Intra Coding. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5638–5649. [Google Scholar] [CrossRef]
  18. Chen, L.; Cheng, B.; Zhu, H.; Qin, H.; Deng, L.; Luo, L. Fast Versatile Video Coding (VVC) Intra Coding for Power-Constrained Applications. Electronics 2024, 13, 2150. [Google Scholar] [CrossRef]
  19. Park, S.H.; Kang, J.W. Context-Based Ternary Tree Decision Method in Versatile Video Coding for Fast Intra Coding. IEEE Access 2019, 7, 172597–172605. [Google Scholar] [CrossRef]
  20. Park, S.h.; Kang, J.W. Fast Multi-Type Tree Partitioning for Versatile Video Coding Using a Lightweight Neural Network. IEEE Trans. Multimed. 2021, 23, 4388–4399. [Google Scholar] [CrossRef]
  21. Zheng, W.; Yang, C.; An, P.; Huang, X.; Shen, L. Learning-Based CU Partition Prediction for Fast Panoramic Video Intra Coding. Expert Syst. Appl. 2024, 258, 125187. [Google Scholar] [CrossRef]
  22. Li, M.; Wang, Z.; Zhang, Q. Fast CU Size Decision and Intra-Prediction Mode Decision Method for H.266/VVC. EURASIP J. Image Video Process. 2024, 7, 16–52. [Google Scholar] [CrossRef]
  23. Ding, G.; Lin, X.; Wang, J.; Ding, D. Accelerating QTMT-based CU Partition and Intra Mode Decision for Versatile Video Coding. J. Vis. Commun. Image Represent. 2023, 94, 103832. [Google Scholar] [CrossRef]
  24. Ni, C.T.; Lin, S.H.; Chen, P.Y.; Chu, Y.T. High Efficiency Intra CU Partition and Mode Decision Method for VVC. IEEE Access 2022, 10, 77759–77771. [Google Scholar] [CrossRef]
  25. Li, W.; Fan, C. Intra-Mode Decision Based on Lagrange Optimization Regarding Chroma Coding. Appl. Sci. 2024, 14, 6480. [Google Scholar] [CrossRef]
  26. Yang, H.; Shen, L.; Dong, X.; Ding, Q.; An, P.; Jiang, G. Low-Complexity CTU Partition Structure Decision and Fast Intra Mode Decision for Versatile Video Coding. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 1668–1682. [Google Scholar] [CrossRef]
  27. Liu, Z.; Li, T.; Chen, Y.; Wei, K.; Xu, M.; Qi, H. Deep Multi-Task Learning Based Fast Intra-Mode Decision for Versatile Video Coding. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 6101–6116. [Google Scholar] [CrossRef]
  28. Dong, X.; Shen, L.; Yu, M.; Yang, H. Fast Intra Mode Decision Algorithm for Versatile Video Coding. IEEE Trans. Multimed. 2022, 24, 400–414. [Google Scholar] [CrossRef]
  29. Zouidi, N.; Belghith, F.; Kessentini, A.; Masmoudi, N. Fast Intra Prediction Decision Algorithm for the QTBT Structure. In Proceedings of the 2019 IEEE International Conference on Design & Test of Integrated Micro & Nano-Systems (DTS), Gammarth, Tunisia, 28 April–1 May 2019; pp. 1–6. [Google Scholar] [CrossRef]
  30. Zouidi, N.; Kessentini, A.; Hamidouche, W.; Masmoudi, N.; Menard, D. Multitask Learning Based Intra-Mode Decision Framework for Versatile Video Coding. Electronics 2022, 11, 4001. [Google Scholar] [CrossRef]
  31. Finley, J.P. The Differential Virial Theorem with Gradient- and Laplacian-dependent Operator Formulas. Chem. Phys. Lett. 2017, 667, 244–246. [Google Scholar] [CrossRef]
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar] [CrossRef]
  33. Boyce, J.; Suehring, K.; Li, X. JVET-J1010: JVET Common Test Conditions and Software Reference Configurations. In Proceedings of the Joint Video Experts Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11 10th Meeting, San Diego, CA, USA, 10–20 April 2018; pp. 10–20. [Google Scholar]
  34. Agustsson, E.; Timofte, R. NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1122–1131. [Google Scholar] [CrossRef]
  35. Gysel, P.; Pimentel, J.; Motamedi, M.; Ghiasi, S. Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks. IEEE Trans. Neural Networks Learn. Syst. 2018, 29, 5784–5789. [Google Scholar] [CrossRef]
Figure 1. The 65 angle prediction modes (the blue dotted line indicates the new forecast direction).
Figure 2. Flow chart of intra-mode coding in VVC.
Figure 3. Intra-mode angular mode distribution in VVC.
Figure 4. Example of using ISP mode and MIP mode (ISP on the left and MIP on the right).
Figure 5. Overall algorithm flow.
Figure 6. The angular pattern is divided into four regions by four basic directions {0°, 45°, 90°, 135°}.
Figure 7. Angle mode skip policy.
Figure 8. ISP/MIP-NET model architecture.
Figure 9. ISP/MIP-NET dataset distribution.
Figure 10. ISP/MIP-NET accuracy (on the dataset in Table 1).
Figure 11. Distribution diagram of test sequence results of different algorithms.
Table 1. Performance comparison of different solutions (X stands for no change to the original encoder VTM).

Solution | Angular Modes | Advanced Modes | BDBR (%) | TS (%)
[22] | Pixel value deviation | X | 0.72 | 39.86
[26] | Gradient descent | X | 0.54 | 25.51
[24] | Sobel operator | X | 0.19 | 17.30
[27] | Convolutional neural network | Convolutional neural network | 1.56 | 30.12
[16] | Ensemble learning | Ensemble learning | 0.25 | 12.82
[30] | Multi-task learning | Multi-task learning | 1.77 | 21.69
Proposed | Laplace operator | Convolutional neural network | 0.69 | 35.04
Table 2. Specific information of the encoded video.

VTM version | 10.0
Video sequences | Tango2, FoodMarket4, CatRobot1, ParkRunning3, MarketPlace, RitualDance, Cactus, BasketballDrive, RaceHorses, BQMall, BasketballDrill, FourPeople, Johnny, BasketballDrillText, SlideEditing
Frames | 80
QP | {22, 27, 32, 37}
Table 3. Mode distribution.

Angle Mode | ISP Mode | MIP Mode
70% | 5% | 25%
Table 4. CU size configuration table.

CU Size (W × H) | Layer 1 Channels | Residual I | Residual II | Residual III | Residual IV | F1 | F2
64 × 64 | 32 | P | P- | P | P- | 256 | 64
32 × 32 | 32 | P | P | P | P- | 128 | 64
16 × 16 | 16 | P | P | P | P | 128 | 64
32 × 16 | 16 | H | P | P | | 128 | 64
16 × 32 | 16 | V | P | P | | 128 | 64
32 × 8 | 16 | H | H | | | 128 | 64
8 × 32 | 16 | V | V | | | 128 | 64
32 × 4 | 16 | H | H | | | 128 | 64
4 × 32 | 16 | V | V | | | 128 | 64
16 × 8 | 8 | H | P | | | 96 | 48
8 × 16 | 8 | V | P | | | 96 | 48
16 × 4 | 8 | H | | | | 96 | 48
4 × 16 | 8 | V | | | | 96 | 48
8 × 8 | 8 | P | | | | 64 | 32
8 × 4 | 4 | P | | | | 64 | 32
4 × 8 | 4 | P | | | | 64 | 32
4 × 4 | 4 | P | | | | 64 | 32
Table 5. Performance of the proposed algorithm at each stage.

Class | Test Sequence | Stage 1 BDBR (%) | Stage 1 TS (%) | Stage 2 BDBR (%) | Stage 2 TS (%) | Proposed BDBR (%) | Proposed TS (%)
A1 | Tango2 | 0.76 | 23.56 | 0.28 | 12.7 | 0.92 | 36.84
A1 | FoodMarket4 | 0.62 | 29.27 | 0.32 | 13.1 | 0.90 | 42.62
A1 | Campfire | 0.75 | 31.75 | 0.36 | 11.5 | 0.83 | 38.15
A2 | DaylightRoad2 | 0.59 | 26.81 | 0.26 | 11.2 | 0.68 | 34.24
A2 | ParkRunning3 | 0.61 | 33.32 | 0.28 | 10.9 | 0.64 | 40.15
A2 | CatRobot | 0.56 | 27.89 | 0.30 | 11.4 | 0.78 | 38.42
B | Kimono | 0.60 | 25.96 | 0.25 | 10.5 | 0.59 | 36.18
B | Cactus | 0.62 | 29.05 | 0.16 | 10.7 | 0.71 | 37.61
B | BQTerrace | 0.52 | 26.57 | 0.23 | 11.0 | 0.73 | 35.18
C | BasketballDrill | 0.37 | 23.56 | 0.22 | 8.7 | 0.47 | 29.48
C | PartyScene | 0.42 | 29.42 | 0.25 | 10.6 | 0.64 | 36.52
C | RaceHorsesC | 0.51 | 30.25 | 0.24 | 9.5 | 0.69 | 32.55
D | BasketballPass | 0.46 | 21.57 | 0.17 | 6.3 | 0.49 | 26.17
D | BlowingBubbles | 0.32 | 24.92 | 0.20 | 7.5 | 0.44 | 25.69
D | RaceHorses | 0.48 | 20.23 | 0.13 | 5.7 | 0.61 | 24.62
E | FourPeople | 0.59 | 31.28 | 0.31 | 12.3 | 0.78 | 36.14
E | Johnny | 0.62 | 27.41 | 0.34 | 11.3 | 0.81 | 41.68
E | KristenAndSara | 0.56 | 28.46 | 0.34 | 12.3 | 0.72 | 39.54
Average | | 0.55 | 27.27 | 0.26 | 10.39 | 0.69 | 35.04
Table 6. Performance comparison with state-of-the-art methods.

Class | Test Sequence | Li [22] BDBR (%) | Li [22] TS (%) | Zouidi [30] BDBR (%) | Zouidi [30] TS (%) | Ni [24] BDBR (%) | Ni [24] TS (%) | Proposed BDBR (%) | Proposed TS (%)
A1 | Tango2 | 0.49 | 31.79 | 0.98 | 23.13 | - | - | 0.92 | 36.84
A1 | FoodMarket4 | 0.47 | 35.64 | 0.91 | 22.43 | 0.09 | 17.04 | 0.90 | 42.62
A1 | Campfire | 0.49 | 38.47 | 0.78 | 24.54 | - | - | 0.83 | 38.15
A2 | DaylightRoad2 | 0.51 | 33.78 | 1.59 | 24.64 | - | - | 0.68 | 34.24
A2 | ParkRunning3 | 0.52 | 36.82 | 0.59 | 20.63 | - | - | 0.64 | 40.15
A2 | CatRobot | 0.54 | 32.08 | 1.13 | 23.43 | 0.21 | 22.21 | 0.78 | 38.42
B | Kimono | 0.59 | 36.85 | - | - | 0.08 | 20.18 | 0.59 | 36.18
B | Cactus | - | - | 1.36 | 27.11 | 0.15 | 15.66 | 0.71 | 37.61
B | BQTerrace | 0.45 | 30.99 | 0.49 | 26.94 | - | - | 0.73 | 35.18
C | BasketballDrill | 0.39 | 29.27 | 1.52 | 28.12 | - | - | 0.47 | 29.48
C | PartyScene | 0.53 | 38.71 | 1.24 | 27.97 | 0.18 | 17.03 | 0.64 | 36.52
C | RaceHorsesC | 0.51 | 31.79 | 2.04 | 28.74 | 0.11 | 16.50 | 0.69 | 32.55
D | BasketballPass | - | - | 1.41 | 22.96 | 0.38 | 13.28 | 0.49 | 26.17
D | BlowingBubbles | 0.41 | 31.45 | 1.56 | 26.53 | - | - | 0.44 | 25.69
D | RaceHorses | 0.51 | 31.02 | 2.04 | 28.74 | - | - | 0.61 | 24.62
E | FourPeople | 0.53 | 30.49 | 1.73 | 23.63 | 0.20 | 16.52 | 0.78 | 36.14
E | Johnny | 0.42 | 34.56 | 1.72 | 22.95 | 0.31 | 17.34 | 0.81 | 41.68
E | KristenAndSara | 0.95 | 32.24 | 1.95 | 23.50 | - | - | 0.72 | 39.54
Average | | 0.51 | 33.51 | 1.77 | 21.69 | 0.19 | 17.30 | 0.69 | 35.04
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
