1. Introduction
The continuous expansion of Internet-based video services, together with the widespread adoption of advanced video formats such as ultra-high-definition (UHD) 4K/8K, high frame rate (HFR), wide color gamut (WCG), high dynamic range (HDR), and immersive virtual reality (VR), has imposed increasingly stringent requirements on video compression efficiency. Consequently, the design of video coding algorithms that jointly achieve high compression performance and manageable computational complexity has become a critical research challenge in both academia and industry.
Versatile Video Coding (VVC) [1,2], the most recent international video coding standard, represents a major advancement over previous standards, including Advanced Video Coding (AVC) [3,4] and High Efficiency Video Coding (HEVC) [5,6]. Developed jointly by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) under the Joint Video Experts Team (JVET), VVC was finalized in July 2020 to address the compression requirements of next-generation video applications. By integrating a wide range of advanced coding tools, VVC achieves a substantial bit-rate reduction compared to HEVC for equivalent perceptual quality.
This gain in coding efficiency, however, is accompanied by a significant increase in encoder computational complexity. A major contributor to this complexity is the highly flexible block partitioning structure based on the Quadtree with nested Multi-Type Tree (QTMT) [7]. Unlike HEVC, which relies exclusively on quadtree-based partitioning, QTMT enables a diverse set of partitioning configurations that adapt to local texture characteristics and motion patterns [8]. Starting from a Coding Tree Unit (CTU) of size 128 × 128, the recursive exploration of multiple partitioning structures leads to a large number of candidate Coding Unit (CU) configurations that must be evaluated during rate–distortion optimization (RDO).
In particular, during inter-coding, the encoder performs an exhaustive search over the QTMT decision space to determine the partitioning structure that minimizes the rate–distortion cost for each CU. Although this exhaustive strategy ensures optimal coding decisions, it results in a substantial increase in encoding time, posing a major challenge for real-time applications and resource-constrained encoder implementations. Consequently, reducing the complexity of the QTMT partition decision while preserving rate–distortion performance has become an important research problem in VVC optimization [9].
In this paper, a fast QTMT partition decision algorithm for VVC inter-coding is proposed to address this challenge. The proposed approach focuses on the 64 × 64 CU level and exploits texture information derived from the Gray-Level Co-occurrence Matrix (GLCM). A feature selection process identifies homogeneity as the most relevant descriptor for characterizing CU texture properties. Based on this descriptor, a Gradient Boosting Machine (GBM) model is employed to learn adaptive decision thresholds that guide a homogeneity-driven restriction of QTMT partition candidates. By progressively limiting unnecessary partition evaluations according to CU texture characteristics, the proposed method significantly reduces the reliance on exhaustive RDO checks through a lightweight and content-aware decision strategy, while preserving coding behavior consistent with the VVC reference encoder.
The remainder of this paper is organized as follows. Section 2 reviews related work on VVC complexity reduction. Section 3 presents the statistical analysis motivating the proposed approach. Section 4 introduces the Gradient Boosting Machine model. Section 5 describes the GLCM-based feature extraction and selection process. The proposed QTMT decision algorithm is detailed in Section 6. Experimental results are discussed in Section 7, and Section 8 concludes the paper.
2. Related Work
In recent years, considerable research efforts have been devoted to reducing the computational complexity of VVC encoders [10,11,12,13,14,15,16,17,18,19,20]. Most existing approaches primarily focus on fast partition decision strategies and can be broadly classified into methods targeting intra-coding and those addressing inter-prediction complexity.
For intra-coding, numerous fast QTMT decision algorithms have been proposed to alleviate the high computational burden introduced by the flexible partitioning structure. In [10], a gradient-based QTMT decision approach employing the Scharr operator was proposed to capture local texture variations and enable early termination of partitioning. A multistage QTMT decision framework was introduced in [11], where partition decisions were formulated as a sequence of binary classification problems to dynamically adapt CU sizes. Shang et al. [12] presented a fast CU size decision method that combines coding and texture features to accelerate quadtree and multi-type tree exploration. Lightweight learning-based strategies were further investigated in [13], where a compact neural network was employed to avoid redundant partition checks. Similarly, CNN-based fast intra-partitioning schemes were reported in [14,15], demonstrating the effectiveness of deep learning models for predicting QTMT partition modes.
In contrast, research on complexity reduction for VVC inter-prediction remains comparatively limited. Early studies mainly focused on accelerating motion estimation, such as the bypass zone search (BZS) algorithm proposed in [16], which integrates learning-based concepts with efficient search strategies. More recent works have explored learning-based early QTMT decision schemes for inter-coding. In [17], a multi-information fusion CNN combined with content complexity analysis was proposed to enable early CU termination and accelerate inter-prediction. Tissier et al. [18] employed CNN-based split probability estimation to prune unlikely partition candidates. In [19], a joint classification–prediction framework was introduced, where CTUs were assigned to subnetworks of different complexities based on a partition homogeneity map. Additionally, a GBM-based fast QTMT decision method for inter-coding was presented in [20], using Average Local Variance as a texture descriptor to guide partition decisions.
It is worth emphasizing that most existing learning-based approaches achieve high prediction accuracy at the cost of increased computational complexity. This overhead mainly originates from the use of deep neural networks or the extraction of multiple handcrafted features, which require expensive online inference and substantially increase encoder runtime. Although these methods effectively reduce the partition search space, their high computational burden limits practical deployment, particularly in real-time and resource-constrained encoding scenarios.
In contrast, the proposed method deliberately adopts a lightweight GLCM-based homogeneity descriptor, which can be computed with very low computational cost while still providing sufficient discriminative power for QTMT decision making. This design choice enables a more favorable trade-off between feature extraction overhead and prediction accuracy, thereby making the proposed framework more suitable for practical and real-world encoder implementations.
Although the above methods achieve notable reductions in encoder complexity, fast QTMT decision techniques for VVC inter-coding remain relatively underexplored. Moreover, many existing approaches rely on complex decision pipelines or computationally demanding inference models, which may limit their robustness and practicality in real encoder implementations. Motivated by these limitations, this paper proposes a lightweight QTMT partition decision algorithm for VVC inter-coding based on statistical texture analysis and machine learning-based thresholding. The proposed approach aims to effectively reduce encoding complexity while preserving rate–distortion performance through a conservative and content-aware pruning strategy.
3. Motivation and Statistical Analysis
This section details the motivation for the proposed fast QTMT decision strategy by revisiting the QTMT partitioning process in inter-prediction and by analyzing the statistical behavior of partition modes at the 64 × 64 CU level. The objective is to highlight consistent partitioning patterns that justify the design of a simplified and content-aware QTMT decision mechanism with reduced computational complexity.
3.1. QTMT Partitioning in Inter-Coding
The flexible QTMT partitioning structure constitutes one of the key components enabling the high compression efficiency of new video coding standards. By allowing coding units to be recursively divided using multiple partition shapes, the encoder can effectively adapt to diverse texture distributions and motion characteristics encountered in inter-coded sequences.
In the inter-coding process, each frame is initially partitioned into CTUs of size 128 × 128, which represent the root level of the partitioning hierarchy. At this level, partitioning is restricted to quadtree splitting, and each CTU is therefore mandatorily divided into four 64 × 64 CUs [21]. This initial QT decomposition establishes a uniform and fixed entry point for subsequent partitioning decisions and ensures consistent processing across all coding blocks.
Once the 64 × 64 CU level is reached, the full QTMT decision space becomes available. At this stage, the encoder evaluates whether a CU should remain unsplit or be further partitioned. When further partitioning is considered, different partitioning structures are explored and can be broadly categorized into square and rectangular decompositions. Square partitions are generated through additional QT splits, whereas rectangular partitions are obtained using multi-type tree (MTT) structures. The MTT framework enables both binary and ternary splits along horizontal and vertical directions, allowing the encoder to efficiently represent directional textures and elongated motion patterns [22].
Figure 1 illustrates the complete QTMT partitioning structures available at the CU level, including QT, horizontal and vertical BT, and horizontal and vertical TT modes. While this flexibility provides strong adaptation capability, it also significantly increases the number of candidate configurations evaluated during rate–distortion optimization.
To determine the optimal partitioning configuration, the encoder applies a top-down recursive splitting process until the minimum CU size constraint is reached. For each candidate CU generated during this process, the corresponding rate–distortion cost is evaluated, followed by a bottom-up pruning stage that selects the configuration minimizing the overall cost at the CTU level. Although this exhaustive strategy ensures optimal partitioning from a rate–distortion perspective, it results in a substantial increase in encoding complexity.
In practice, statistical observations indicate that only a limited subset of QTMT partition modes is frequently selected at the 64 × 64 CU level, while many evaluated partition candidates do not contribute to the final optimal structure. This redundancy suggests that early and reliable identification of unfavorable partition candidates can effectively reduce computational complexity without noticeably affecting coding efficiency. These observations form the basis for the fast QTMT decision strategy proposed in this work, which focuses on simplifying partition decisions at the 64 × 64 CU level during inter-prediction.
3.2. Statistical Analysis of QTMT Partitioning
To further investigate the QTMT partitioning behavior in inter prediction and its dependency on quantization strength, a detailed statistical analysis was performed at the 64 × 64 CU level. This level constitutes a critical decision stage in the VVC partitioning hierarchy, as it represents the first depth at which all QTMT partitioning modes are enabled. The analysis was conducted using VTM 23.5 (VVC Test Model) [23] under the Random Access configuration and the Common Test Conditions (CTC) [24]. Four representative sequences, Tango, DaylightRoad2, Cactus, and PartyScene, were selected to cover a wide range of spatial complexity, texture regularity, and motion characteristics.
Figure 2 illustrates the distribution of QTMT partitioning modes (No-Split, QT, BT, and TT) for 64 × 64 CUs under different quantization parameters (QP 22, 27, 32, and 37). Several consistent and sequence-independent trends can be observed.
First, the proportion of No-Split CUs exhibits a monotonic increase as the quantization parameter increases. For instance, in the Tango sequence, the No-Split ratio increases from 14.76% at QP22 to 39.56% at QP37. A similar trend is observed in DaylightRoad2, where the No-Split percentage rises from 13.94% to 34.33%. In the Cactus sequence, No-Split also increases from 17.51% to 24.57%, while in PartyScene it grows from 16.67% to 29.63%. This behavior indicates that higher quantization levels reduce the coding benefit of finer spatial partitioning, encouraging the encoder to preserve larger CUs, especially in relatively smooth or slowly varying regions.
Second, QT partitioning is predominant at lower quantization levels but gradually loses importance as the QP increases. At QP22, QT accounts for a substantial portion of the selected modes across all sequences (e.g., 33.59% for Tango and 70.69% for PartyScene). However, as the quantization strength increases, the QT usage consistently decreases. For example, in Tango, the QT ratio drops from 33.59% at QP22 to 22.76% at QP37, while in DaylightRoad2 it decreases from 25.73% to 15.66%. A similar decreasing trend is observed in Cactus (from 40.73% to 26.97%) and in PartyScene (from 70.69% to 46.19%). This trend suggests that square partitions become less efficient under coarse quantization, particularly when high-frequency texture details are suppressed.
Conversely, the utilization of MTT-based partitions, namely BT and TT, generally increases with the quantization parameter. In the Tango sequence, the BT ratio grows from 31.69% at QP22 to 44.44% at QP37, while TT remains relatively stable with values around 34%. In DaylightRoad2, both BT and TT exhibit a clear upward trend, with TT becoming the dominant mode at high QPs (reaching 48.01% at QP32). In Cactus, TT increases from 36.19% to 43.33%, while in PartyScene BT increases significantly from 9.72% to 29.11%. This shift reflects the encoder’s preference for rectangular and directional partitions when finer texture information is diminished, allowing better adaptation to elongated structures and motion boundaries.
Overall, the statistical evidence reveals a clear and consistent quantization-dependent transition in QTMT partitioning behavior at the 64 × 64 CU level. As the quantization parameter increases, QT partitions are progressively replaced by BT and TT structures, while the proportion of No-Split CUs also increases. These observations indicate that a large fraction of QT evaluations performed at higher QPs are unlikely to be selected by the encoder, highlighting substantial redundancy in the exhaustive QTMT search process [25].
This analysis provides strong empirical justification for the proposed fast QTMT decision strategy. By exploiting the predictable evolution of partitioning behavior with respect to quantization strength and content characteristics, unnecessary QTMT evaluations can be safely pruned, enabling significant complexity reduction with minimal impact on rate–distortion performance.
4. Gradient Boosting Machine Algorithm
Machine learning techniques provide an effective framework for data-driven modeling of complex relationships by automatically learning discriminative patterns from observed data, without relying on explicitly handcrafted decision rules [26,27,28]. Among these techniques, ensemble learning has demonstrated strong performance for both classification and regression tasks by combining multiple weak learners into a single, more robust predictive model. In particular, boosting-based methods iteratively construct a sequence of weak predictors, where each newly added learner is trained to compensate for the prediction errors of the current ensemble [29,30].
Among boosting algorithms, AdaBoost [31] and GBM [32,33] are widely adopted. Compared to AdaBoost, GBM offers greater flexibility by directly optimizing a differentiable loss function through gradient descent in function space [34]. This formulation enables GBM to effectively capture nonlinear relationships between input features and decision variables, making it particularly suitable for learning adaptive decision boundaries in complex coding processes.
In practice, GBM is commonly implemented using an ensemble of regression trees [35,36,37]. The model is constructed in an additive and stage-wise manner, where each weak learner typically corresponds to a shallow regression tree trained to approximate the residual errors of the current ensemble. By sequentially adding such learners, GBM progressively refines the prediction function while preserving good generalization capability. In lightweight configurations, the weak learners may consist of shallow trees or even decision stumps [38,39], which helps to limit computational overhead and facilitates practical integration into real-time systems.
Let the training dataset be defined as
\[
\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N},
\]
where $x_i$ denotes the input feature vector and $y_i$ represents the corresponding class label. The objective of GBM is to learn a predictive function $F(x)$ that minimizes a differentiable loss function $L(y, F(x))$ over the training samples.
At iteration $m$, the model update is obtained by solving
\[
h_m = \arg\min_{h} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + h(x_i)\big),
\]
where $h_m$ denotes the $m$-th weak learner.
Each weak learner is represented as a regression tree composed of $J$ terminal regions and can be expressed as
\[
h_m(x) = \sum_{j=1}^{J} b_{jm}\,\mathbb{1}(x \in R_{jm}),
\]
where $R_{jm}$ denotes the $j$-th terminal region and $b_{jm}$ is the prediction value associated with that region.
Following the formulation introduced by Friedman [32], an individual optimal step size
\[
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i, F_{m-1}(x_i) + \gamma\big)
\]
is computed for each terminal region, leading to the ensemble update
\[
F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm}\,\mathbb{1}(x \in R_{jm}).
\]
Through this iterative optimization process, GBM implicitly explores feature thresholds and decision boundaries that minimize the residual loss, resulting in a compact yet effective decision model.
Within the proposed framework, the GBM model is trained using a single statistical texture descriptor extracted from GLCM analysis. This descriptor exhibits a strong correlation with coding unit partitioning behavior at the 64 × 64 level in inter coding. To explicitly reflect the hierarchical nature of QTMT decisions, two training datasets are constructed. The first dataset is designed to learn the decision boundary between split and no-split cases, while the second dataset focuses on discriminating between QT and MTT partitioning when further splitting is required. For both datasets, the selected texture descriptor serves as the sole input feature, and the corresponding partition decision is used as the target label.
By training the GBM model on these datasets across different quantization parameters, adaptive decision thresholds are learned for each stage of the partitioning process. These thresholds are subsequently exploited to guide fast QTMT decisions during inter coding, enabling early termination or selective partition evaluation without resorting to exhaustive rate–distortion optimization.
In our implementation, the GBM hyperparameters were empirically selected based on preliminary experiments to achieve a good balance between model complexity and generalization capability. Specifically, shallow regression trees with a maximum depth of 3 were adopted as weak learners to avoid overfitting. The learning rate was set to 0.1 to ensure stable convergence, while the number of boosting iterations was fixed to 100. These values were chosen to maintain a lightweight model suitable for practical encoder integration.
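To make this configuration concrete, the following minimal sketch shows how such a model could be trained, assuming scikit-learn's GradientBoostingClassifier as a stand-in implementation; the homogeneity and label arrays are hypothetical placeholders for the per-CU data extracted from the reference encoder.

```python
# Minimal sketch of the GBM configuration described above, using
# scikit-learn's GradientBoostingClassifier as a stand-in implementation.
# `homogeneity` and `split_flag` are hypothetical placeholders for the
# per-CU training data collected from the reference encoder.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
homogeneity = rng.uniform(0.0, 1.0, size=(1000, 1))  # single GLCM feature
split_flag = (homogeneity[:, 0] < 0.6).astype(int)   # toy split/no-split labels

gbm = GradientBoostingClassifier(
    max_depth=3,        # shallow regression trees as weak learners
    learning_rate=0.1,  # step-size shrinkage for stable convergence
    n_estimators=100,   # number of boosting iterations
)
gbm.fit(homogeneity, split_flag)

# Predicted split probability for a new 64x64 CU homogeneity value
print(gbm.predict_proba([[0.45]])[0, 1])
```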
The complete training procedure of the Gradient Boosting Machine is summarized in Algorithm 1.
| Algorithm 1 Gradient Boosting Machine (GBM) |
Input: Training set $\{(x_i, y_i)\}_{i=1}^{N}$, loss function $L(y, F(x))$, number of iterations $T$. Output: Final prediction model $F_T(x)$.
1: Initialize model: $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$
2: for $m = 1$ to $T$ do
3:   Compute pseudo-residuals: $r_{im} = -\big[\partial L(y_i, F(x_i)) / \partial F(x_i)\big]_{F = F_{m-1}}$
4:   Fit regression tree $h_m$ to $\{(x_i, r_{im})\}_{i=1}^{N}$
5:   for each terminal region $R_{jm}$, $j = 1, \ldots, J$ do
6:     Compute optimal step size: $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$
7:   end for
8:   Update model: $F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm}\,\mathbb{1}(x \in R_{jm})$
9: end for
|
The trained GBM model is subsequently exploited in the proposed fast QTMT decision framework to provide adaptive thresholds for hierarchical partition selection in inter coding.
5. Analysis of QTMT Partitioning Features for Inter-Coding Decision Learning
The efficiency of QTMT partitioning in inter coding is strongly influenced by the encoder’s ability to accurately characterize local texture properties of coding units. During inter prediction, the decision to further partition a CU or to preserve its current structure directly affects both rate–distortion efficiency and computational complexity. Therefore, identifying statistical texture descriptors that reliably reflect local structural variations is a key requirement for the design of fast and content-adaptive QTMT decision strategies.
In this section, texture descriptors derived from GLCM analysis are investigated to examine their relationship with CU partitioning behavior at the 64 × 64 level. These second-order statistical features capture spatial dependencies between pixel intensities and provide a richer description of texture characteristics than first-order measures. A correlation-based analysis is subsequently conducted to assess the relevance of each feature with respect to CU splitting decisions, which forms the basis for the learning-based threshold estimation employed in the proposed framework.
5.1. GLCM Feature Extraction
Conventional fast decision approaches in video coding commonly rely on first-order statistical measures, such as variance or gradient magnitude, to estimate texture complexity. While these descriptors are computationally efficient, they only characterize the distribution of gray-level intensities within a region and do not account for spatial dependencies between neighboring pixels. As a consequence, regions exhibiting similar intensity distributions but different spatial arrangements may not be reliably distinguished, which limits the accuracy of coding unit partitioning decisions.
To address this limitation, second-order statistical features derived from GLCM analysis are employed, as extensively reported in the literature for both image and video texture characterization [40,41]. These features capture spatial relationships between pixel intensities and provide a richer and more discriminative representation of local texture structure. The GLCM models the joint probability of occurrence of two pixels with gray levels $i$ and $j$, separated by a spatial distance $d$ along a given direction $\theta$. Let $R$ denote the luminance intensity matrix of a coding unit with dimensions $64 \times 64$, and let $G$ represent the corresponding GLCM [42].
Each pixel is associated with eight neighboring directions, including horizontal, vertical, and diagonal orientations, as illustrated in Figure 3. To limit computational overhead while preserving sufficient discriminative capability for partition decision modeling, the GLCM is computed only along the horizontal direction ($\theta = 0^{\circ}$) with a pixel distance of $d = 1$.
The GLCM element at position $(i, j)$ is defined as
\[
G(i, j) = \#\big\{\big((x_1, y_1), (x_2, y_2)\big) \,\big|\, R(x_1, y_1) = i,\; R(x_2, y_2) = j,\; x_2 = x_1 + d,\; y_2 = y_1\big\},
\]
where $\#$ denotes the counting operator, and $(x_1, y_1)$ and $(x_2, y_2)$ represent pixel coordinates in $R$.
To further reduce computational complexity, the luminance range $[0, 255]$ is uniformly quantized into $L = 8$ gray levels by dividing each pixel value by 32. This quantization strategy significantly reduces the size of the GLCM while preserving essential texture characteristics. The resulting normalized matrix is expressed as
\[
P(i, j) = \frac{G(i, j)}{\sum_{i=0}^{L-1} \sum_{j=0}^{L-1} G(i, j)}.
\]
From the normalized GLCM, four texture features are extracted, namely Homogeneity, Contrast, Entropy, and Angular Second Moment (ASM). The mathematical definitions of these features are provided in [43] and are expressed as follows:
\[
\mathrm{Homogeneity} = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} \frac{P(i, j)}{1 + (i - j)^2},
\]
\[
\mathrm{Contrast} = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} (i - j)^2\, P(i, j),
\]
\[
\mathrm{Entropy} = -\sum_{i=0}^{L-1} \sum_{j=0}^{L-1} P(i, j) \log_2 P(i, j),
\]
\[
\mathrm{ASM} = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} P(i, j)^2.
\]
These features provide complementary and interpretable descriptions of texture characteristics. Homogeneity reflects spatial uniformity and generally assumes higher values in smooth regions. Contrast captures local gray-level variations and increases with texture complexity. Entropy quantifies the degree of randomness in the intensity distribution, while ASM represents texture energy and decreases as structural irregularity increases. Collectively, these descriptors form a robust statistical representation of coding unit texture complexity, making them well suited for learning-based QTMT partitioning analysis in inter coding.
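As an illustration of the extraction pipeline described above, the following sketch computes the horizontal GLCM ($d = 1$, 8 quantized gray levels) and the four descriptors for a 64 × 64 luma block using NumPy; the random input block and the function name are illustrative, not part of the reference implementation.

```python
# Sketch of the GLCM feature extraction described above: horizontal
# direction, distance d = 1, 8 quantized gray levels. `cu` stands for
# the 64x64 luma block of a coding unit.
import numpy as np

def glcm_features(cu, levels=8):
    q = (cu.astype(np.int64) // 32).clip(0, levels - 1)  # quantize [0,255] -> 8 levels
    # Count horizontally adjacent gray-level pairs (i, j)
    G = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(G, (q[:, :-1].ravel(), q[:, 1:].ravel()), 1.0)
    P = G / G.sum()                                      # normalized GLCM
    i, j = np.indices(P.shape)
    homogeneity = (P / (1.0 + (i - j) ** 2)).sum()
    contrast = ((i - j) ** 2 * P).sum()
    nz = P > 0
    entropy = -(P[nz] * np.log2(P[nz])).sum()
    asm = (P ** 2).sum()
    return homogeneity, contrast, entropy, asm

cu = np.random.default_rng(1).integers(0, 256, size=(64, 64))
print(glcm_features(cu))
```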
5.2. GLCM Feature Selection and Correlation Analysis
To evaluate the relevance of the extracted GLCM features for QTMT partitioning decisions, a correlation-based statistical analysis is conducted between each feature and the CU split flag at the 64 × 64 level. Three representative training video sequences are selected to construct the dataset used for the feature selection process, namely BasketballDrillText with a resolution of 832 × 480, SlideEditing with a resolution of 1280 × 720, and SlideShow with a resolution of 1280 × 720. These Class F sequences exhibit different levels of content complexity due to variations in scene structure, number of subjects, and background details, making them suitable for learning texture-driven QTMT partitioning behavior. For each sequence, 50 frames are analyzed, resulting in a diverse and representative training dataset.
Although the training dataset is limited to three Class F sequences, these sequences were deliberately selected to cover a wide range of texture characteristics, including highly textured regions, smooth areas, and mixed-content scenes. This diversity allows the model to capture representative QTMT partitioning patterns despite the limited number of training sequences.
All sequences are encoded using the VVC reference software VTM 23.5 [23]. For every 64 × 64 CU extracted from these frames, four GLCM texture features are computed and paired with the corresponding QTMT partitioning decisions and CU split flags selected by the reference encoder, forming the basis for the subsequent statistical analysis and model training.
In addition, the learned homogeneity thresholds are QP-dependent and are not overfitted to the training data. Their effectiveness is further validated through experiments conducted on standard test sequences from multiple classes (Classes A–E), demonstrating good generalization capability across different spatial resolutions and content types.
The correlation between each texture feature and the CU splitting decision is computed using the CorrelationAttributeEval method implemented in the Waikato Environment for Knowledge Analysis (WEKA) [44], a widely used open-source machine learning and data mining toolkit that provides a comprehensive set of feature evaluation and statistical analysis algorithms.
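For reference, WEKA's CorrelationAttributeEval scores each attribute by the Pearson correlation between the attribute and the class; the following sketch reproduces this ranking with NumPy, using hypothetical feature and split-flag arrays in place of the encoder-derived dataset.

```python
# Sketch of the per-feature relevance ranking performed by WEKA's
# CorrelationAttributeEval (Pearson correlation with the class label).
# `features` (N x 4) and `split_flag` (N,) are hypothetical stand-ins
# for the data extracted from the encoded training sequences.
import numpy as np

def rank_features(features, split_flag, names):
    scores = {
        name: abs(np.corrcoef(features[:, k], split_flag)[0, 1])
        for k, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

names = ["Homogeneity", "Contrast", "Entropy", "ASM"]
rng = np.random.default_rng(2)
features = rng.normal(size=(5000, 4))
split_flag = rng.integers(0, 2, size=5000)
print(rank_features(features, split_flag, names))
```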
The resulting correlation coefficients obtained under different quantization parameters are summarized in Table 1, while their variation trends across QP values are illustrated in Figure 4.
Several observations can be drawn from the statistical results. First, Homogeneity consistently exhibits the strongest correlation with the CU split decision across all tested quantization parameters. At QP 22, its correlation coefficient reaches 0.1093, which is noticeably higher than those of the other features. Although the correlation strength gradually decreases as the quantization parameter increases, Homogeneity remains the most informative descriptor even at QP 37, with a correlation value of 0.0586. This behavior indicates that spatial uniformity plays a dominant role in determining partitioning decisions at the 64 × 64 level.
Second, Contrast shows the second highest correlation among the analyzed features. Its correlation coefficient decreases from 0.0843 at QP 22 to 0.0456 at QP 37, reflecting a progressive reduction in discriminative power as texture details are attenuated under stronger quantization. Nevertheless, the relatively stable ranking of Contrast across all QP values suggests that local gray-level variations remain a useful indicator for partition decision modeling.
In contrast, Entropy and ASM exhibit lower and less discriminative correlation values across all tested quantization parameters. Their correlation coefficients remain below 0.06 and show limited sensitivity to QP variation, indicating a weaker relationship with CU splitting behavior. This suggests that global randomness and texture energy, as captured by these descriptors, are less effective for characterizing QTMT partitioning decisions at the considered CU level.
As illustrated in Figure 4, an overall decreasing trend in correlation strength is observed for all features as the quantization parameter increases. This trend can be attributed to the progressive suppression of fine texture details at higher QP values, which reduces the influence of spatial characteristics on partitioning decisions.
Although using a single feature may theoretically lead to some information loss, the results in Table 1 and Figure 4 clearly indicate that Homogeneity consistently outperforms the other GLCM descriptors across all QP values. The remaining features (Contrast, Entropy, and ASM) exhibit significantly lower and closely clustered correlation values, suggesting limited complementary information. Therefore, retaining only Homogeneity enables a favorable trade-off between model simplicity, computational efficiency, and prediction reliability, while avoiding unnecessary model complexity.
Based on these observations, Homogeneity emerges as the most informative GLCM-based feature for QTMT partitioning analysis in inter coding. Its consistently higher correlation with CU split decisions across different quantization levels provides strong statistical evidence supporting its selection as the primary descriptor for learning adaptive decision thresholds in the proposed fast QTMT decision framework.
6. Proposed Learning-Based QTMT Decision Framework
This section presents a homogeneity-guided learning-based framework for fast QTMT decision in inter prediction. The proposed approach aims to significantly reduce encoder computational complexity while preserving rate–distortion performance. By exploiting the strong correlation between texture homogeneity and QTMT partitioning behavior, the framework adaptively restricts the partition search space at the 64 × 64 CU level. An overview of the proposed framework is illustrated in Figure 5.
To ensure practical integration within the VVC encoder, the proposed framework is organized into two complementary stages: an offline learning stage and an online decision stage. This separation confines all learning-related operations to the offline phase, while keeping the online encoding process lightweight and deterministic.
During the offline learning stage, representative inter-coded training sequences are encoded using the VVC reference software VTM 23.5 [23]. For each 64 × 64 inter CU, homogeneity values are extracted using GLCM-based texture analysis. Based on the correlation analysis presented in Section 5, homogeneity is identified as the most informative descriptor and is therefore retained as the sole feature used in the learning process. Each training sample is represented by a single scalar homogeneity value paired with the QTMT partitioning outcome selected by the reference encoder, resulting in a compact and low-dimensional dataset with strong discriminative capability.
Rather than formulating the learning task as a direct partition classification problem, the collected data implicitly captures how QTMT partitioning behavior evolves with texture uniformity. In particular, the dataset reflects transitions between non-split coding units, square QT partitions, and directional BT and TT structures [7,21]. A GBM model is trained exclusively in the offline stage to learn adaptive decision boundaries in the homogeneity domain. These boundaries are learned independently for each QP and correspond to three ordered thresholds $T_1 < T_2 < T_3$, which partition the homogeneity space into four distinct decision regions.
Importantly, the trained model is not used to directly predict the final partition mode. Instead, it derives three homogeneity thresholds that regulate which QTMT partition candidates are evaluated during encoding.
During the online decision stage, no learning or model inference is performed. For each 64 × 64 inter CU, the homogeneity value $H$ is computed using the same lightweight feature extraction process adopted in the offline stage and compared against the pre-learned thresholds embedded in the encoder decision logic. This comparison enables a hierarchical and content-adaptive restriction of the QTMT search space driven by texture homogeneity.
Specifically, four decision regions are defined:
$H \ge T_3$: only the NoSplit option is evaluated.
$T_2 \le H < T_3$: only BT is enabled, while QT is always retained as a conservative candidate.
$T_1 \le H < T_2$: both BT and TT (MTT) are enabled, while QT is always retained.
$H < T_1$: only QT is evaluated and all MTT partitions are disabled.
Rather than directly selecting a final partition mode, the learned homogeneity thresholds regulate which QTMT partition candidates are evaluated during encoding. The final decision remains governed by rate–distortion optimization, while the adaptive restriction of the QTMT search space significantly reduces the number of partition candidates evaluated during inter prediction. Despite its low runtime complexity, the proposed framework closely follows the QTMT partitioning behavior of the reference encoder, achieving substantial complexity reduction with negligible impact on coding efficiency [7,22].
To further improve clarity and reproducibility, the online decision process of the proposed framework is explicitly described in Algorithm 2.
| Algorithm 2 Proposed learning-based QTMT decision algorithm |
1: Input: 64 × 64 inter CU, learned thresholds $T_1 < T_2 < T_3$
2: Output: Selected QTMT partition mode
3: Compute GLCM $G$ of the CU luma block
4: Extract homogeneity value $H$ using Equation (7)
5: Initialize candidate set $S \leftarrow \emptyset$
6: if $H \ge T_3$ then
7:   $S \leftarrow \{\text{NoSplit}\}$
8: else if $H \ge T_2$ then
9:   $S \leftarrow \{\text{QT}, \text{BT}\}$
10: else if $H \ge T_1$ then
11:   $S \leftarrow \{\text{QT}, \text{BT}, \text{TT}\}$
12: else
13:   $S \leftarrow \{\text{QT}\}$
14: end if
15: Evaluate only the modes in $S$ using RDO
16: Select the best mode according to RD cost
17: return Selected QTMT mode
|
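The candidate-restriction step of Algorithm 2 can be summarized by the following sketch, where the threshold values are illustrative placeholders for the QP-dependent boundaries learned offline.

```python
# Sketch of the online decision stage (Algorithm 2): the learned,
# QP-dependent thresholds T1 < T2 < T3 restrict the QTMT candidate set
# before rate-distortion optimization. Threshold values are illustrative.
def qtmt_candidates(homogeneity, t1, t2, t3):
    if homogeneity >= t3:          # very smooth region
        return {"NoSplit"}
    if homogeneity >= t2:          # mildly textured: QT kept, BT enabled
        return {"QT", "BT"}
    if homogeneity >= t1:          # textured: QT kept, full MTT enabled
        return {"QT", "BT", "TT"}
    return {"QT"}                  # highly complex: quadtree only

# Example with illustrative thresholds for one QP
print(qtmt_candidates(0.60, t1=0.30, t2=0.55, t3=0.80))  # -> {'QT', 'BT'}
```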
Figure 5 summarizes the complete workflow of the proposed approach. It illustrates how GLCM-based texture features are first extracted and statistically analyzed to select the most relevant descriptor, followed by offline GBM training to learn adaptive decision boundaries. These learned boundaries are subsequently embedded into the encoder and used during the online stage to hierarchically regulate QTMT partition candidate activation at the 64 × 64 inter CU level, enabling effective complexity reduction without modifying the core rate–distortion optimization process.
7. Experimental Results
This section evaluates the performance of the proposed fast QTMT decision algorithm in terms of encoding complexity reduction and rate–distortion efficiency. The proposed method was implemented on the VVC reference software VTM 23.5 [23] and evaluated under the Random Access configuration following the Common Test Conditions (CTCs) [24]. Standard Quantization Parameter values of 22, 27, 32, and 37 were used to cover a wide range of compression scenarios. The detailed experimental setup is summarized in Table 2.
The performance is evaluated using the Bjøntegaard delta bit rate (BDBR) and delta PSNR (BDPSNR) metrics [45,46], while encoder complexity reduction is measured using the time-saving ratio defined as
\[
TS = \frac{1}{4} \sum_{QP \in \{22, 27, 32, 37\}} \frac{T_{\mathrm{VTM}}(QP) - T_{\mathrm{prop}}(QP)}{T_{\mathrm{VTM}}(QP)} \times 100\%,
\]
where $T_{\mathrm{VTM}}$ and $T_{\mathrm{prop}}$ denote the encoding times of the original VTM encoder and the proposed method, respectively.
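As a worked example of this metric, the following sketch averages the per-QP time savings over the four CTC quantization parameters; the timing values are illustrative only.

```python
# Illustrative computation of the time-saving (TS) ratio defined above.
# t_vtm and t_prop hold encoding times (seconds) for QP 22, 27, 32, 37.
def time_saving(t_vtm, t_prop):
    # Average the per-QP relative reduction, expressed as a percentage
    return 100.0 * sum((a - b) / a for a, b in zip(t_vtm, t_prop)) / len(t_vtm)

t_vtm = [3600.0, 3500.0, 3400.0, 3300.0]   # hypothetical VTM 23.5 anchor times
t_prop = [2606.0, 2534.0, 2462.0, 2390.0]  # hypothetical proposed-method times
print(f"TS = {time_saving(t_vtm, t_prop):.1f}%")  # ~27.6%, matching the reported average
```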
7.1. Overall Coding Performance
Table 3 presents the coding performance of the proposed method across all tested sequences. The results show that the proposed approach achieves a consistent reduction in encoding time while maintaining rate–distortion performance very close to that of the reference encoder.
On average, the proposed method achieves an encoding time reduction of approximately 27.6%, while introducing only a negligible BDPSNR loss of 0.006 dB and a BDBR increase of 0.19%. These results indicate that a large portion of unnecessary QTMT partition evaluations are effectively avoided through the proposed homogeneity-driven decision rules.
It is worth noting that the achieved complexity reduction is intentionally moderate compared to more aggressive learning-based approaches. This behavior can be attributed to the lightweight computation of the homogeneity feature, which, although simplified, still introduces a limited computational overhead. Nevertheless, this overhead is largely compensated by the reduction in redundant partition tests, resulting in a favorable overall time saving while preserving stable rate–distortion performance.
7.2. Comparison with State-of-the-Art Methods
Table 4 compares the proposed fast QTMT decision method with several representative inter-coding complexity reduction approaches reported in the literature. The comparison focuses on the trade-off between encoding time reduction and rate–distortion efficiency, as reflected by the BDBR and TS metrics.
The method in [17] achieves moderate time saving by employing a convolutional neural network combined with multi-information fusion for early partition termination. However, this gain comes at the cost of a significant rate–distortion degradation, as indicated by the high BDBR increase of 3.18%. This suggests that the aggressive pruning strategy adopted in [17] frequently eliminates beneficial partition candidates, particularly in regions with complex texture or motion.
The approach proposed in [18] relies on CNN-based split probability estimation to prune unlikely QTMT candidates. While this method improves the BDBR performance compared to [17], it still incurs a noticeable coding loss of 1.11% and requires the execution of a relatively heavy inference model at runtime. In contrast, the proposed method avoids complex inference and maintains a significantly lower BDBR increase by relying on simple texture-driven decision rules.
The framework presented in [19] reports the highest time-saving ratio by dynamically assigning CTUs to subnetworks of different complexity based on a partition homogeneity map. Although this strategy achieves substantial complexity reduction, it introduces a considerable BDBR penalty of 1.94%, reflecting the cost of coarse-grained partition classification and network switching overhead. Moreover, the reliance on multiple subnetworks increases memory consumption and implementation complexity.
The GBM-based method in [20] represents the closest approach to the proposed work in terms of learning paradigm. By using Average Local Variance as a texture descriptor, it achieves a favorable balance between time saving and coding efficiency. However, the higher time-saving ratio reported in [20] is obtained through more aggressive partition pruning, which results in a larger BDBR increase compared to the proposed approach.
In contrast, the proposed method deliberately adopts a conservative pruning strategy guided by a lightweight homogeneity descriptor and adaptive thresholding. Although the resulting time-saving ratio is slightly lower than that of some learning-heavy approaches, the proposed method achieves the lowest BDBR increase among all compared techniques. This demonstrates that the proposed design effectively suppresses unnecessary QTMT evaluations while preserving the vast majority of rate–distortion gains offered by the full search.
Overall, the proposed approach offers a more balanced trade-off between encoding complexity reduction and coding efficiency. Its lightweight feature extraction, absence of deep inference models, and stable rate–distortion behavior make it particularly well suited for practical and resource-constrained encoder implementations, where robustness and predictability are critical design requirements.
8. Conclusions
In this paper, a fast QTMT partition decision algorithm for inter-prediction in video coding is proposed to mitigate the high encoder complexity introduced by the flexible partitioning structure. The proposed method is built upon statistical texture analysis and lightweight machine learning, where a homogeneity descriptor derived from the Gray-Level Co-occurrence Matrix is utilized to characterize the spatial uniformity of coding units.
By exploiting the strong correlation between texture homogeneity and partitioning behavior, a Gradient Boosting Machine model is trained offline to learn adaptive decision thresholds that guide the partitioning process at the 64 × 64 coding unit level. Based on these thresholds, a hierarchical decision strategy is employed, in which the encoder first determines whether further partitioning is required and subsequently selects between Quad-Tree and Multi-Type Tree structures only when splitting is beneficial.
This design enables effective pruning of redundant QTMT evaluations while preserving partitioning decisions that are consistent with those selected by the reference encoder. The proposed approach avoids complex inference models and excessive feature extraction, resulting in a stable and computationally efficient solution. Consequently, the method provides a practical framework for reducing inter-prediction complexity in modern video encoders, making it well suited for real-time and resource-constrained implementations.
Although the proposed framework currently focuses on the 64 × 64 CU level to maximize complexity reduction, extending this strategy to smaller CU sizes (e.g., 32 × 32) represents a promising research direction. Such an extension could potentially yield additional complexity savings. However, it also raises new challenges, including increased decision frequency, higher feature extraction overhead, and the need for more fine-grained threshold adaptation. These aspects will be investigated in future work to further improve the scalability of the proposed framework.