Article

Perception-Driven and Object-Aware Fast MTT Partitioning for H.266/VVC: A Saliency-Guided Complexity Reduction Framework

1 Department of Electrical Engineering, National Dong Hwa University, Hualien 974, Taiwan
2 Department of Electrical Engineering, National Taiwan Normal University, Taipei 106, Taiwan
3 Department of Electrical Engineering, National Sun Yat-sen University, Kaohsiung 804, Taiwan
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(1), 133; https://doi.org/10.3390/electronics15010133
Submission received: 30 September 2025 / Revised: 14 December 2025 / Accepted: 20 December 2025 / Published: 27 December 2025

Abstract

The H.266/Versatile Video Coding (VVC) standard was developed to address the growing demand for compressing ultra-high-definition video content, supporting resolutions ranging from 4K to 8K and beyond. H.266/VVC improves coding efficiency by introducing a flexible quadtree with nested multi-type tree (QT-MTT) partitioning and various advanced coding tools. However, these improvements substantially increase the encoding complexity. To address this issue, we propose a perception-driven and object-aware algorithm that accelerates the MTT process in H.266/VVC intra coding. Our method integrates pixel-level saliency detection with object bounding box detection. Specifically, visually distinguishable (VD) pixels are identified using a just noticeable distortion (JND) model based on average background luminance, while detected-object regions are extracted using a YOLO object detection network. These two types of perceptual information are combined to guide adaptive encoding decisions. For each frame, a perception-driven pixel map labeled with VD pixels and a YOLO-based object map are generated. Within the MTT framework, partitioning decisions are determined jointly by standard deviation metrics derived from VD pixels and detected-object region coverage. By incorporating flexible threshold settings, the proposed method can meet different users’ requirements. In this paper, we performed experiments under three threshold settings. The experimental results demonstrate that the proposed method reduces H.266/VVC intra coding time by 27.94% to 43.11%, with BDBR increases of only 1.02% to 1.53%, thus achieving an appropriate trade-off between encoding speed and coding efficiency.

1. Introduction

With the rapid advancement of multimedia and communication technologies, high-definition (HD) video has become ubiquitous across diverse applications, including mobile streaming, video conferencing, HD television, and immersive experiences such as virtual reality and augmented reality. These scenarios increase the demand for video quality, resolution, and efficient compression, especially under constrained bandwidth and storage conditions. To meet these growing requirements, the Joint Video Experts Team (JVET), jointly established by ITU-T VCEG and ISO/IEC MPEG, finalized the H.266/Versatile Video Coding (VVC) standard in July 2020 [1,2,3,4].
H.266/VVC supports ultra-high-definition (UHD) content ranging from 4K to beyond 8K and achieves around 50% bitrate reduction compared to its predecessor, H.265/High Efficiency Video Coding (HEVC) [3,4]. H.266/VVC increases the maximum coding tree unit (CTU) size from 64 × 64 to 128 × 128, improving coding efficiency and adaptability to spatial content variation [5]. One of the major innovations in H.266/VVC is the inclusion of a more flexible block split scheme known as the quadtree with nested multi-type tree (QT-MTT) structure. In contrast to the quadtree partition used in H.265/HEVC, H.266/VVC further allows both binary tree (BT) and ternary tree (TT) splits for a coding unit (CU), applied either horizontally or vertically, as shown in Figure 1a. The CTU partitioning process recursively splits CUs using the QT-MTT partitioning scheme, which includes QT, BT, and TT, until a predefined minimum CU size is reached. The encoder applies rate-distortion optimization (RDO) [5,6,7] to evaluate all candidate partitioning modes and select the one that best balances bitrate and distortion. Figure 1b illustrates a partitioning example generated from an actual video frame. As shown in the figure, regions with higher texture complexity or significant content variation tend to undergo finer partitions, while more homogeneous areas require only coarse partitioning. While these enhancements significantly improve coding efficiency, they also introduce a large number of split candidates, resulting in a substantial increase in encoding complexity.
Various fast algorithms have been proposed to reduce encoding time, typically by relying on structural heuristics or texture-based features. However, these methods commonly overlook perceptual redundancy, leading to unnecessary evaluations in visually insignificant regions and undermining overall efficiency. Furthermore, many existing approaches are designed for a single operating condition, which limits their adaptability and makes it difficult for them to meet diverse user requirements. Learning-based approaches often rely on specific datasets or handcrafted features, limiting generalization across unseen content or quantization settings and causing inconsistent trade-offs between complexity and visual quality. In contrast, human visual perception is highly selective, focusing on salient regions. Based on this observation, our goal is to preserve the detailed quality of salient regions while saving computational resources in visually insignificant regions.
In this paper, we propose a saliency-guided complexity reduction framework for H.266/VVC intra coding that integrates pixel-level visual perception detection and object-level region awareness. For each frame, a perception-driven pixel map labeled with visually distinguishable (VD) pixels and a YOLO-based object map are generated. Within the MTT framework, partitioning decisions are determined jointly by standard deviation metrics of normalized VD pixel counts across CU sub-blocks and detected-object region coverage. Specifically, our method first identifies VD pixels that represent salient content and constructs a detected-object region map via YOLO object detection to locate salient regions. Combining the perception-driven pixel map with object detection enables the framework to capture both pixel-level saliency and high-level semantic importance. Based on the perceptual importance of each CU, fast decision strategies are proposed to reduce unnecessary computation. In particular, the framework selectively skips BT and TT splits based on empirically determined thresholds and the detected-object region distribution. This enables perceptually adaptive complexity reduction, preserving overall coding efficiency and visual quality in perceptually important areas.
The remainder of this paper is organized as follows. Section 2 describes the related work. Section 3 details the proposed method. Afterward, the experimental results are demonstrated in Section 4. Finally, Section 5 concludes the paper.

2. Related Work

To address the high computational complexity introduced by the recursive partitioning structure in H.266/VVC, a variety of fast CU partitioning strategies have been proposed [8,9]. Based on a review of representative studies, these methods can be broadly classified into four categories: probability-based methods [10,11], texture-based methods [12,13,14,15], learning-based methods [16,17,18,19,20,21,22], and visual perception-guided methods [23,24,25,26]. In the following, we review the core concepts and representative approaches of each category.
Probability-based methods aim to accelerate CU partitioning decisions by constructing statistical analysis to estimate the likelihood of specific split patterns. Duan et al. formulated CU partitioning as a binary classification problem and applied Naive Bayes theory based on RD cost statistics to predict whether further splitting is necessary [10]. Park and Kang proposed a Bayesian decision method to skip redundant TT split directions. By comparing the RD costs of BTH and BTV, the method identifies the less likely TT direction to reduce intra coding complexity [11].
Texture-based methods utilize content characteristics such as texture complexity and directional features to define heuristic rules that accelerate the partitioning process. Wu et al. proposed a texture-based CU partitioning algorithm that first determines whether to terminate division early based on brightness flatness. If division is necessary, the MTT division is then decided based on variance statistics [12]. Song et al. proposed a fast CU partitioning method that first computes horizontal-to-vertical gradient ratios to remove improbable splitting directions and then compares sub-block variances to decide between binary and ternary partition patterns [13]. Ni et al. proposed a texture analysis-based BT and TT partition approach, and a gradient-based intra mode decision technique to decrease redundant processing [14]. Liu et al. proposed a fast CU partitioning method that uses cross-block differences in gradient and content to skip unnecessary horizontal and vertical partitions [15].
Learning-based methods employ machine learning or deep learning models to automatically learn partitioning decision rules from large-scale data. He et al. adopted two random forest models to classify CUs based on their texture complexity, training a predictor for simple and complex regions and introducing a separate termination classifier for ambiguous cases [16]. Wu et al. employed a two-stage SVM-based strategy to skip unnecessary partition modes during CU encoding, first using a split or non-split classifier to decide whether to split further, then a horizontal or vertical split model to determine the split direction, thus enabling the encoder to bypass redundant evaluations [17]. Wang et al. proposed a hierarchical support vector machine (SVM)-based partitioning algorithm that uses a fuzzy SVM for early termination and a directed acyclic graph SVM to determine the partition type of the CU [18]. Belghith et al. proposed a Convolutional Neural Network (CNN)-based fast partitioning method for 32 × 32 CUs, where a CNN-TT model learns TT splitting tendencies across three levels. Using predicted probabilities, different thresholds are applied at each level to decide whether to perform horizontal or vertical TT splits [19]. Im and Chan proposed a pre-processing algorithm that incorporates the relationship between the quantization parameter (QP) and discrete cosine transform (DCT) coefficients, and they employed a concatenate-designed CNN to explore DCT features for CU split prediction [20]. Tissier et al. employed a CNN to extract spatial features by generating a vector of probabilities that describes the partition at each 4 × 4 edge. This vector is then utilized by a light gradient boosting machine (LGBM) to predict the most likely splits at each block [21]. Saldanha et al. developed a configurable method based on LGBM, in which five separate classifiers were trained for each split type. Operating points are determined by varying the threshold settings to flexibly balance time savings and coding efficiency [22].
Visual perception-based methods aim to simulate the human eye’s sensitivity to visual information by incorporating visual saliency and perceptual importance into CU partition decisions, thereby enhancing subjective quality and coding efficiency. Chen et al. proposed an efficient decision-making method for MTT partitioning based on random forest models, aimed at rapidly determining horizontal and vertical partitioning modes. The method utilizes the horizontal and vertical distributions of perceptual information as input features for the machine learning model [23]. Tsai et al. designed a fast algorithm by leveraging the variances of perceptual information across different MTT partitioning modes of a CU, combined with machine learning classifiers [24]. Cui and Liang adopted an improved pixel-domain just noticeable distortion (JND)-based perceptual model derived from previous research. They utilized perceptual distortion variance and sub-CU differences to assist early termination and split mode selection, thereby reducing unnecessary partitioning [25]. Li et al. employed a saliency map to determine the visual significance of each region and jointly controlled partition and quantization to preserve quality in perceptually sensitive areas [26].

3. Proposed Method

3.1. Visual Perception

Perceptual visibility in the human visual system has been studied in psychophysics. Early work based on Weber’s law formulated the JND as the minimal luminance increment perceptible relative to the background intensity, establishing the proportional relationship commonly referred to as the Weber fraction [27]. Subsequent studies further examined how this proportionality varies under different luminance conditions and image structures, which inspired luminance-adaptive thresholding strategies widely adopted in later JND models [28]. Complementary research, particularly Barten’s contrast sensitivity function (CSF), analyzed human sensitivity to spatial frequencies, demonstrating that contrast sensitivity depends on spatial frequency rather than remaining constant across frequency bands [29]. In parallel, a physiological response model, the Naka–Rushton equation, described the nonlinear saturation behavior of photoreceptors, reflecting how visual responses gradually approach saturation as stimulus intensity increases [30]. Together, these findings establish the theoretical foundation for perceptual thresholds.
JND-based approaches characterize pixel-level visibility thresholds by explicitly modeling luminance-dependent perceptual sensitivity. Several engineering-oriented JND models have further incorporated additional masking effects. For example, an advanced pixel-domain JND formulation was proposed to account for structural uncertainty and pattern masking, enabling more accurate suppression of texture-related redundancies [31]. However, such approaches require the computation of local structural information and luminance contrast, leading to substantially higher computational complexity compared with luminance-based models.
Beyond objective pixel-level distortions, perceptual sensitivity to luminance variations significantly affects how the human visual system perceives image quality. Specifically, changes in gray levels become noticeable only when they exceed a certain visibility threshold. Instead of relying solely on traditional metrics such as gradients or textures, detecting luminance changes in regions that are truly perceptually sensitive facilitates the identification of visually significant areas. The JND model serves as a quantitative approach to representing perceptual redundancies inherent in the human visual system. In this paper, for computational simplicity, we employ a luminance-based JND model that incorporates average background luminance to identify visually distinguishable (VD) pixels within a CU [28]. Let $I(i,j)$ denote the gray-level intensity of the current pixel. The background luminance $B(i,j)$ for the pixel located at $(i,j)$ is estimated using a $5 \times 5$ mean filter. Based on the model proposed in [28], the visibility threshold $JND(i,j)$ is computed as a function of $B(i,j)$, with the parameters $T_0 = 17$ and $\gamma = 3/128$, as defined in (1). A pixel is classified as visually distinguishable if the absolute difference between its luminance and the local background, $|I(i,j) - B(i,j)|$, is greater than or equal to the corresponding JND threshold [23,24]. This binary decision is represented by $VD(i,j)$, as shown in (2).
$$JND(i,j)=\begin{cases}T_0\left(1-\sqrt{\dfrac{B(i,j)}{127}}\right)+3, & B(i,j)\le 127\\[4pt] \gamma\,\bigl(B(i,j)-127\bigr)+3, & B(i,j)>127\end{cases}\tag{1}$$

$$VD(i,j)=\begin{cases}1, & \text{if } \bigl|I(i,j)-B(i,j)\bigr|\ge JND(i,j)\\ 0, & \text{otherwise}\end{cases}\tag{2}$$
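To make the model concrete, the following minimal sketch computes the JND threshold of (1) and the VD map of (2) for an 8-bit luma frame. It assumes NumPy and SciPy's `uniform_filter` as the 5 × 5 background mean filter, which is an implementation choice rather than part of the original method.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def vd_map(luma: np.ndarray, t0: float = 17.0, gamma: float = 3.0 / 128.0) -> np.ndarray:
    """Label visually distinguishable (VD) pixels following Eqs. (1)-(2)."""
    luma = luma.astype(np.float64)
    # Background luminance B(i, j): 5 x 5 mean around each pixel.
    bg = uniform_filter(luma, size=5, mode="nearest")
    # Luminance-adaptive visibility threshold JND(i, j) from Eq. (1).
    jnd = np.where(
        bg <= 127,
        t0 * (1.0 - np.sqrt(bg / 127.0)) + 3.0,
        gamma * (bg - 127.0) + 3.0,
    )
    # Eq. (2): a pixel is VD when its deviation from the background
    # luminance reaches the visibility threshold.
    return (np.abs(luma - bg) >= jnd).astype(np.uint8)
```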
We use standard deviation metrics derived from the distribution of VD pixels within a CU to support early skip decisions. The standard deviations of normalized VD pixel numbers among sub-CUs for binary and ternary tree splits are calculated in (3) and (4), respectively. In the equations, $VD_{subCU_k}$ represents the normalized number of VD pixels in the $k$-th sub-CU, obtained by dividing the original VD pixel count by the area of the corresponding sub-CU to account for differences in block size. The terms $\mu_{BTH}$, $\mu_{BTV}$, $\mu_{TTH}$, and $\mu_{TTV}$ denote the average normalized VD number across all sub-CUs for BTH, BTV, TTH, and TTV partitioning, respectively. These values indicate the spatial inconsistency of the VD pixel distribution across different partitioning structures and are used to decide whether further BT or TT splitting is necessary.
$$\sigma_{BTH/BTV}=\sqrt{\frac{1}{2}\sum_{k=1}^{2}\left(VD_{subCU_k}-\mu_{BTH/BTV}\right)^{2}}\tag{3}$$

$$\sigma_{TTH/TTV}=\sqrt{\frac{1}{3}\sum_{k=1}^{3}\left(VD_{subCU_k}-\mu_{TTH/TTV}\right)^{2}}\tag{4}$$
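As an illustration, the sketch below evaluates (3) and (4) on the VD map of a CU, reusing `np` from the previous sketch. It assumes that TT splits follow VVC's 1:2:1 sub-block layout; the normalization by sub-CU area matches the definition of $VD_{subCU_k}$ above.

```python
def split_sigma(vd: np.ndarray, split: str) -> float:
    """Std. dev. of normalized VD counts among sub-CUs, Eqs. (3)-(4)."""
    h, w = vd.shape
    if split == "BTH":    # two equal horizontal halves
        subs = [vd[: h // 2], vd[h // 2 :]]
    elif split == "BTV":  # two equal vertical halves
        subs = [vd[:, : w // 2], vd[:, w // 2 :]]
    elif split == "TTH":  # 1:2:1 horizontal ternary split
        subs = [vd[: h // 4], vd[h // 4 : 3 * h // 4], vd[3 * h // 4 :]]
    elif split == "TTV":  # 1:2:1 vertical ternary split
        subs = [vd[:, : w // 4], vd[:, w // 4 : 3 * w // 4], vd[:, 3 * w // 4 :]]
    else:
        raise ValueError(f"unknown split mode: {split}")
    # Normalize each VD count by its sub-CU area to offset size imbalance.
    ratios = np.array([s.sum() / s.size for s in subs])
    # Population standard deviation, as in Eqs. (3) and (4).
    return float(np.sqrt(np.mean((ratios - ratios.mean()) ** 2)))
```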

3.2. Object Detection

YOLO (You Only Look Once) is a single-stage object detection framework that predicts bounding boxes and class probabilities jointly. Its object awareness makes it suitable for identifying salient regions in videos. Recent research has leveraged YOLO-based object detection together with H.266/VVC encoders to provide object-aware saliency cues, enabling the encoder to allocate more bits to regions containing meaningful objects [32]. Similarly, YOLO-based object detection has been used to guide QP selection and bit allocation in an H.266/VVC encoder, showing that object-aware information can be incorporated into bitrate control strategies to improve coding efficiency [33]. In this study, YOLOv7 is employed to extract visually salient regions that correspond to areas likely to draw human visual attention due to innate physiological characteristics of the human visual system [34]. These perceptually important regions are incorporated into the partitioning decision process to guide encoding in areas where visual quality is more likely to be noticed.
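The paper treats YOLOv7 detections as rectangular salient regions; the exact rasterization into a per-frame object map is not specified, so the following is a hedged sketch. The `detections` format (corner coordinates plus confidence) and the `min_conf` cutoff are illustrative assumptions, not YOLOv7 API details.

```python
def object_map(detections, height: int, width: int, min_conf: float = 0.5) -> np.ndarray:
    """Rasterize bounding boxes into a binary YOLO-based object map.

    detections: iterable of (x1, y1, x2, y2, confidence) tuples in pixel
    coordinates, as obtained after typical YOLO post-processing.
    """
    obj = np.zeros((height, width), dtype=np.uint8)
    for x1, y1, x2, y2, conf in detections:
        if conf >= min_conf:
            obj[int(y1) : int(y2), int(x1) : int(x2)] = 1  # box interior = salient
    return obj

def coverage(obj: np.ndarray, x: int, y: int, w: int, h: int) -> float:
    """Fraction of the CU at (x, y) with size w x h covered by detected objects."""
    return float(obj[y : y + h, x : x + w].mean())
```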

3.3. Overall Algorithm

The proposed algorithm accelerates H.266/VVC intra coding by integrating perceptual saliency analysis with object-level awareness. BT/TT splits are selectively skipped based on the standard deviations derived from VD pixels and the detected-object region coverage within each CU. Figure 2 summarizes the global workflow of the proposed method, illustrating the procedure from JND computation and object detection to the final split decision.
Our approach begins by generating a perception-driven pixel map labeled with VD pixels and a YOLO-based object map for each frame. Partitioning decisions within the MTT framework are jointly guided by standard deviation metrics of normalized VD pixel counts across CU sub-blocks and detected-object region coverage, as shown in Figure 3. Specifically, when a CU's overlap with the detected-object region reaches the coverage threshold P, indicating high perceptual importance, all BT splits are evaluated without skipping, regardless of the standard deviation. For TT splits, the standard deviation thresholds are adaptively selected based on detected-object region coverage: stricter thresholds are applied in salient regions to preserve finer partitions, while more relaxed thresholds are used elsewhere to enable early skipping. This adaptive strategy balances visual quality and computational complexity by preserving detail in perceptually significant areas and reducing processing in less important regions. A sketch of this decision logic follows.
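The sketch below is one possible reading of Figure 3, combining `split_sigma` and `coverage` from the earlier snippets. A split is kept for RDO evaluation when its VD standard deviation reaches the corresponding threshold; how the upper (U) and lower (L) TT thresholds of Table 2 map onto salient and non-salient CUs is our assumption here, chosen so that salient regions are harder to skip.

```python
def allowed_splits(cu_vd: np.ndarray, obj_cov: float, thr: dict) -> set:
    """Hypothetical BT/TT skip decision for one CU (cf. Figure 3).

    cu_vd:   binary VD map restricted to the current CU.
    obj_cov: fraction of the CU covered by detected-object regions.
    thr:     thresholds P, ThBT, TTHU, TTHL, TTVU, TTVL (Table 2).
    """
    keep = set()
    salient = obj_cov >= thr["P"]
    for mode in ("BTH", "BTV"):
        # Salient CUs evaluate all BT splits; elsewhere a BT split is kept
        # only if the VD distribution across its sub-CUs is inhomogeneous.
        if salient or split_sigma(cu_vd, mode) >= thr["ThBT"]:
            keep.add(mode)
    # Assumed mapping: salient CUs use the lower bound (harder to skip),
    # non-salient CUs the upper bound (easier to skip).
    if split_sigma(cu_vd, "TTH") >= (thr["TTHL"] if salient else thr["TTHU"]):
        keep.add("TTH")
    if split_sigma(cu_vd, "TTV") >= (thr["TTVL"] if salient else thr["TTVU"]):
        keep.add("TTV")
    return keep
```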

4. Experimental Results

In this study, we utilized the H.266/VVC Test Model (VTM) version 22.0 [35] to evaluate performance under the All Intra coding configuration. The test set includes sequences from Class A1, Class A2, Class B, Class C, Class D, and Class E, encompassing a broad range of content. For each sequence, we performed the experiments using the first 100 frames to maintain consistency. Our experiments follow the Common Test Conditions (CTC) [36], and detailed information about the test sequences is provided in Table 1. All experiments were conducted on a system running Windows 11 64-bit, equipped with an Intel Core i5-14400K CPU at 2.50 GHz and 32 GB of DDR4 RAM at 4800 MHz. In addition, we used the YOLOv7 model [34] for object detection, which was trained on the Microsoft Common Objects in Context (MS COCO) dataset [37].
The proposed decision-threshold mechanism offers flexible configurations by enabling multiple operating points to accommodate diverse application requirements. In this work, the threshold parameters include the object-region coverage threshold P and the standard deviation thresholds ThBT, TTHU, TTVU, TTHL, and TTVL. These thresholds were preconfigured according to the selected operating points before the encoding process began. By adjusting different threshold combinations, the encoder can achieve an appropriate trade-off between reducing encoding time and preserving coding efficiency. In practical scenarios, thresholds can be flexibly tuned based on hardware performance constraints, quality requirements, and real-time demands to determine the most suitable encoding strategy. In this paper, we conducted experiments under three parameter settings as shown in Table 2.
To assess and compare performance with previous works [12,17,25], we adopted a set of commonly used metrics, including Bjøntegaard Delta Bitrate (BDBR) [38], Time Saving (TS) as shown in (5), and the cost performance metric TS/BDBR [39].
$$TS=\frac{1}{4}\sum_{QP\in\{22,27,32,37\}}\frac{T_{VTM22.0}(QP)-T_{proposed}(QP)}{T_{VTM22.0}(QP)}\times 100\%\tag{5}$$
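For instance, a minimal helper for (5), assuming per-QP encoding times have been measured in seconds:

```python
def time_saving(t_vtm: dict, t_proposed: dict) -> float:
    """Average relative encoding-time reduction over the four CTC QPs, Eq. (5)."""
    qps = (22, 27, 32, 37)
    return 100.0 * sum((t_vtm[q] - t_proposed[q]) / t_vtm[q] for q in qps) / len(qps)
```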
The BDBR and TS performance of the proposed algorithm at different operating points is summarized in Table 3. The results demonstrate that the method reduces encoding time while maintaining coding efficiency, with the largest gains observed for high-resolution sequences. S1 employs conservative thresholds to limit the bitrate increase, resulting in reduced time savings. S2 adopts a threshold configuration between S1 and S3, whereas S3 utilizes a lower coverage threshold and looser skipping thresholds, reducing processing time at the cost of increased bitrate. These operating points provide a reference for understanding how different parameter selections affect performance. Table 3 also reports the standard deviations of BDBR and TS across all test sequences. The variations in BDBR remain small, with standard deviations of 0.67%, 0.66%, and 0.64% at operating points S1 to S3, respectively, across sequences of different resolutions and texture characteristics. In contrast, TS exhibits larger variability across content classes, mainly because the amount of time saving depends on scene characteristics. Sequences containing numerous CUs with homogeneous textures, such as those in Class A1 and Class A2, allow a greater number of BT and TT split operations to be skipped and thus lead to substantial time savings. However, the higher skip ratio may also result in a larger BDBR. In contrast, sequences with dense textures or rapidly changing spatial characteristics, such as those in Class C and Class D, inherently require more partitions and therefore yield smaller time savings. More importantly, the algorithm provides flexibility by allowing different threshold combinations at multiple operating points, thereby meeting diverse performance requirements for specific application scenarios.
A comparison between the proposed algorithm at operating point S2 and the methods in [12,17,25] is presented in Table 4. On average, the proposed method achieves a BDBR of 1.24% and a TS of 35.26%, both outperforming [12,25]. Although the TS in [17] is higher than that of the proposed algorithm, our method achieves noticeably better coding efficiency: the average BDBR of 1.24% is 1.47 percentage points lower than the 2.71% reported in [17]. For most sequences, our algorithm yields lower BDBR values than [17]. In addition, the proposed method attains a higher average TS/BDBR ratio of 33.76, compared to 26.22 in [17], indicating that a better trade-off between TS and BDBR can be achieved using the proposed method. Similar results can be observed by testing several 4K sequences from the Ultra Video Group (UVG) dataset [40] in Table 5.
Figure 4 provides a comprehensive comparison between the proposed algorithm and the related works [12,25], illustrating the relationship between TS and BDBR for the related methods as well as the three operating points (S1 to S3) previously defined for our approach. The curve in the figure shows extrapolated results under alternative operating points with varying threshold values. A trend can be observed across the three operating points. As the thresholds are adjusted from S1 to S3, the time saving increases, while the corresponding BDBR rises. This indicates that the proposed threshold mechanism enables the trade-off between complexity and bitrate. Furthermore, the relative positions of the points show that the proposed method achieves superior overall performance compared with the related works. Among the three proposed operating points, S2, whose BDBR is the closest to that of [25], achieves higher time saving. In addition, the approach in [12] shows both higher bitrate overhead and less time saving. The results indicate that the proposed algorithm outperforms the approaches in [12,25] by offering a better trade-off between TS and BDBR and also demonstrating adaptability to different application requirements.
Figure 5 presents the comparison of the proposed method at operating point S2 with [25] in terms of average BDBR and average TS across different sequence classes. Overall, the results show that the proposed method at operating point S2 achieves lower average BDBR than [25] for Class A2, Class B, Class C, and Class E. In addition, it also achieves higher TS values for all classes except Class E.
Figure 6 illustrates the rate-distortion (RD) curve comparison between the proposed algorithm at operating point S2 and VTM 22.0 for the CatRobot1 sequence in Class A2. As shown, the RD curve of operating point S2 closely follows that of VTM 22.0, indicating highly comparable RD performance. Furthermore, our method achieves a TS of 46.14%.
Figure 7 illustrates the outcomes of the proposed algorithm at operating point S2 using a frame from the BasketballPass sequence. Figure 7b presents the corresponding perception-driven pixel map. Figure 7c displays the object detection results obtained using YOLOv7, and Figure 7d shows the YOLO-based object map. Figure 7e,f compare the final partitioning results generated by VTM 22.0 and the proposed method at operating point S2, respectively. It can be observed that our algorithm allocates fine partitions to perceptually important regions (e.g., the players) while simplifying the structure in less relevant areas. The combination of VD pixel and object detection cues identifies perceptually important regions and facilitates the partitioning process.
Figure 8 presents a subjective quality comparison between the original VTM 22.0 coding results and the proposed algorithm at operating point S2 for a frame of the CatRobot1 sequence. As shown, the difference is hardly perceptible to the human eye. Moreover, the Peak Signal-to-Noise Ratio (PSNR) and Multi-Scale Structural Similarity (MS-SSIM) [41] values are very close, demonstrating that the proposed algorithm, which integrates VD pixel analysis with object-level detection, preserves high subjective visual quality while reducing encoding time.
Table 6 reports the computational overhead introduced by generating the perception-driven pixel map and the YOLO-based object map for each test sequence. The results show that the luminance-based JND computation incurs 0.0233–0.4006% overhead across all sequences, with an overall average of 0.0829% relative to the total H.266/VVC intra encoding time. Meanwhile, YOLO detection introduces an overhead ranging from 0.1812% to 5.7824%, with an overall average of 1.6054%. This overhead is higher because object detection is performed once per frame and involves neural network inference. Nevertheless, the cost remains substantially lower than the complexity of evaluating MTT partition candidates. Considering that the proposed method achieves 27.94–43.11% time saving at the three operating points, the time saved by reducing unnecessary BT/TT evaluations far outweighs the overhead of generating the perception-driven pixel map and the YOLO-based object map. In addition, the decision process relies solely on computationally efficient and regular statistical operations, including VD pixel-based metrics and object-coverage evaluation, indicating the potential of the approach for hardware integration in real-time encoder architectures.
The proposed method has limitations. Because the luminance-based JND model reflects only brightness variations, it cannot distinguish pixels or sub-blocks that share similar luminance but differ in chrominance. As a result, VD pixel counts may underrepresent regional complexity, leading to overly aggressive BT/TT skipping. In addition, the YOLO-based object detection is constrained in certain scenes, particularly cluttered ones. Large rectangular bounding boxes can include background that is perceptually insignificant, and objects outside the categories defined in the COCO dataset may not be detected and are treated as background. Future work may address these issues to further improve robustness.

5. Conclusions

In this paper, we address the high complexity of H.266/VVC intra coding by proposing a perception-driven and object-aware block partitioning strategy that integrates pixel-level visual saliency analysis with object detection. In perceptually less significant regions, the partitioning process is simplified to save computation time, whereas in regions containing important objects or fine details, the necessary partition process is preserved to maintain visual quality and coding efficiency. The proposed method incorporates adjustable thresholds, allowing flexible configurations across different operating points. Experimental results demonstrate that, at three operating points, the method achieves time savings ranging from 27.94% to 43.11% with BDBR increases limited to only 1.02% to 1.53%, thereby achieving a practical trade-off between encoding speed and compression efficiency. Notably, the proposed approach achieves greater acceleration on high-resolution sequences, thereby supporting the requirements of 4K/8K real-world video applications. These findings indicate that the integration of visual perception analysis with object detection can contribute to the development of intelligent video coding technologies. In future work, we plan to explore adaptive threshold strategies that dynamically adjust according to local content characteristics to improve robustness and overall coding performance.

Author Contributions

Conceptualization, C.-Y.L. and M.-J.C.; Data curation, C.-Y.L.; Methodology, C.-Y.L. and M.-J.C.; Software, C.-Y.L. and Y.-F.L.; Validation, J.-Y.Y., Y.-C.C. and C.-M.L.; Supervision, M.-J.C. and C.-H.Y.; Writing—original draft, J.-Y.Y., Y.-C.C. and M.-J.C.; Writing—review and editing, C.-Y.L., Y.-F.L., C.-M.L. and C.-H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Science and Technology Council, Taiwan, under Grant NSTC 112-2221-E-259-009-MY3.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Uhrina, M.; Sevcik, L.; Bienik, J.; Smatanova, L. Performance comparison of VVC, AV1, HEVC, and AVC for high resolutions. Electronics 2024, 13, 953.
  2. Mercat, A.; Mäkinen, A.; Sainio, J.; Lemmetti, A.; Viitanen, M.; Vanne, J. Comparative rate-distortion-complexity analysis of VVC and HEVC video codecs. IEEE Access 2021, 9, 67813–67828.
  3. Bross, B.; Chen, J.; Ohm, J.-R.; Sullivan, G.J.; Wang, Y.-K. Developments in international video coding standardization after AVC, with an overview of versatile video coding (VVC). Proc. IEEE 2021, 109, 1463–1493.
  4. Bross, B.; Wang, Y.-K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G.J.; Ohm, J.-R. Overview of the versatile video coding (VVC) standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764.
  5. Huang, Y.-W.; An, J.; Huang, H.; Li, X.; Hsiang, S.-T.; Zhang, K.; Gao, H.; Ma, J.; Chubach, O. Block partitioning structure in the VVC standard. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3818–3833.
  6. Ortega, A.; Ramchandran, K. Rate-distortion methods for image and video compression. IEEE Signal Process. Mag. 1998, 15, 23–50.
  7. Cerveira, A.; Agostini, L.; Zatt, B.; Sampaio, F. Memory profiling of H.266 versatile video coding standard. In Proceedings of the 2020 27th IEEE International Conference on Electronics, Circuits and Systems (ICECS), Glasgow, UK, 23–25 November 2020.
  8. Choi, K. A study on fast and low-complexity algorithms for versatile video coding. Sensors 2022, 22, 8990.
  9. Solovyev, T.; Sauer, J.; Pardo, J.E.F.; Alshina, E. AhG10: On Constrained Encoder Configuration of VTM; Doc. JVET-AL0055-v3; JVET: Geneva, Switzerland, 2025.
  10. Duan, L.; Jiang, X.; Li, W.; Jin, J.; Song, T.; Yu, F.R. VVC coding unit partitioning decision based on Naive Bayes theory. In Proceedings of the 2023 5th International Conference on Image Processing and Machine Vision, Macau, China, 13–15 January 2023; pp. 62–65.
  11. Park, S.-H.; Kang, J.-W. Context-based ternary tree decision method in versatile video coding for fast intra coding. IEEE Access 2019, 7, 172597–172605.
  12. Wu, Z.; Jiang, X.; Song, T.; Liu, J.; Cen, Q. CU partitioning algorithm based on texture complexity in VVC. In Proceedings of the 2024 6th International Conference on Video, Signal and Image Processing, Ningbo, China, 22–24 November 2024; pp. 100–104.
  13. Song, Y.; Cheng, S.; Wang, M.; Peng, X. Fast CU partition for VVC intra-frame coding via texture complexity. IEEE Signal Process. Lett. 2024, 31, 959–963.
  14. Ni, C.-T.; Lin, S.-H.; Chen, P.-Y.; Chu, Y.-T. High efficiency intra CU partition and mode decision method for VVC. IEEE Access 2022, 10, 77759–77771.
  15. Liu, H.; Zhu, S.; Xiong, R.; Liu, G.; Zeng, B. Cross-block difference guided fast CU partition for VVC intra coding. In Proceedings of the 2021 International Conference on Visual Communications and Image Processing, Munich, Germany, 5–8 December 2021.
  16. He, Q.; Wu, W.; Luo, L.; Zhu, C.; Guo, H. Random forest based fast CU partition for VVC intra coding. In Proceedings of the 2021 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, Chengdu, China, 4–6 August 2021.
  17. Wu, G.; Huang, Y.; Zhu, C.; Song, L.; Zhang, W. SVM-based fast CU partitioning algorithm for VVC intra coding. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Daegu, Republic of Korea, 22–28 May 2021.
  18. Wang, F.; Wang, Z.; Zhang, Q. FSVM- and DAG-SVM-based fast CU-partitioning algorithm for VVC intra-coding. Symmetry 2023, 15, 1078.
  19. Belghith, F.; Abdallah, B.; Ben Jdidia, S.; Ben Ayed, M.A.; Masmoudi, N. CNN-based ternary tree partition approach for VVC intra-QTMT coding. Signal Image Video Process. 2024, 18, 3587–3594.
  20. Im, S.-K.; Chan, K.-H. Faster intra-prediction of versatile video coding using a concatenate-designed CNN via DCT coefficients. Electronics 2024, 13, 2214.
  21. Tissier, A.; Hamidouche, W.; Mdalsi, S.B.D.; Vanne, J.; Galpin, F.; Menard, D. Machine learning based efficient QT-MTT partitioning scheme for VVC intra encoders. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4279–4293.
  22. Saldanha, M.; Sanchez, G.; Marcon, C.; Agostini, L. Configurable fast block partitioning for VVC intra coding using light gradient boosting machine. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 3947–3960.
  23. Chen, M.-J.; Lee, C.-A.; Tsai, Y.-H.; Yang, C.-M.; Yeh, C.-H.; Kau, L.-J.; Chang, C.-Y. Efficient partition decision based on visual perception and machine learning for H.266/versatile video coding. IEEE Access 2022, 10, 42141–42150.
  24. Tsai, Y.-H.; Lu, C.-R.; Chen, M.-J.; Hsieh, M.-C.; Yang, C.-M.; Yeh, C.-H. Visual perception based intra coding algorithm for H.266/VVC. Electronics 2023, 12, 2079.
  25. Cui, X.-Y.; Liang, F. Perceptual based fast CU partition algorithm for VVC intra coding. In Proceedings of the 2023 IEEE Region 10 Conference (TENCON), Chiang Mai, Thailand, 31 October–3 November 2023.
  26. Li, W.; Jiang, X.; Jin, J.; Song, T.; Yu, F.R. Saliency-enabled coding unit partitioning and quantization control for versatile video coding. Information 2022, 13, 394.
  27. Norwich, K.H. On the theory of Weber fractions. Percept. Psychophys. 1987, 42, 286–298.
  28. Chou, C.-H.; Li, Y.-C. A perceptually tuned subband image coder based on the measure of just-noticeable-distortion profile. IEEE Trans. Circuits Syst. Video Technol. 1995, 5, 467–476.
  29. Barten, P.G.J. Evaluation of subjective image quality with the square-root integral method. J. Opt. Soc. Am. A 1990, 7, 2024–2031.
  30. Hisamitsu, S.; Okuno, H. An image coding algorithm with color constancy using the Retinex theory and the Naka–Rushton equation. In Proceedings of the 2022 International Conference on Artificial Life and Robotics, Beppu, Japan, 20–23 January 2022; pp. 507–512.
  31. Wu, J.; Lin, W.; Shi, G.; Wang, X.; Li, F. Pattern masking estimation in image with structural uncertainty. IEEE Trans. Image Process. 2013, 22, 4892–4904.
  32. Fischer, K.; Fleckenstein, F.; Herglotz, C.; Kaup, A. Saliency-driven versatile video coding for neural object detection. In Proceedings of the ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1505–1509.
  33. Goto, K.; Katayama, T.; Song, T.; Shimamoto, T. YOLO-based bitrate control algorithm for VVC. In Proceedings of the 2023 International Technical Conference on Circuits/Systems, Computers, and Communications (ITC-CSCC), Jeju, Republic of Korea, 25–28 June 2023; pp. 227–231.
  34. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
  35. JVET. VVC Test Model (VTM) Software, Version 22.0. Available online: https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/-/tree/VTM-22.0?ref_type=tags (accessed on 4 September 2023).
  36. Bossen, F.; Li, X.; Sharman, K.; Seregin, V.; Suehring, K. VTM and HM Common Test Conditions and Software Reference Configurations for SDR 4:2:0 10-bit Video; Doc. JVET-AK2010-v1; JVET: Geneva, Switzerland, January 2025.
  37. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
  38. Bjontegaard, G. Calculation of Average PSNR Differences Between RD-Curves; Doc. VCEG-M33; ITU-T Telecommunication Standardization Sector: Geneva, Switzerland, April 2001.
  39. Kau, L.-J.; Leng, J.-W. A gradient intensity-adapted algorithm with adaptive selection strategy for the fast decision of H.264/AVC intra-prediction modes. IEEE Trans. Circuits Syst. Video Technol. 2015, 25, 944–957.
  40. Mercat, A.; Viitanen, M.; Vanne, J. UVG dataset: 50/120fps 4K sequences for video codec analysis and development. In Proceedings of the ACM Multimedia Systems Conference (MMSys), Istanbul, Turkey, 8–11 June 2020; pp. 297–302.
  41. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; pp. 1398–1402.
Figure 1. Illustration of H.266/VVC QT-MTT partitioning structure.
Figure 2. Overall workflow of the proposed method.
Figure 3. Flowchart of the proposed BT and TT skipping algorithm.
Figure 4. Comparison of BDBR and TS performance between the proposed algorithm and the methods in [12,25].
Figure 5. Comparison of BDBR and TS performance across sequence classes between the proposed algorithm at operating point S2 and the method in [25].
Figure 6. Comparison of RD curves of VTM 22.0 and the proposed method for the Class A2 CatRobot1 sequence.
Figure 7. Results of the proposed method and partition comparison with VTM 22.0 for a frame of the Class D BasketballPass sequence.
Figure 8. Subjective quality comparison for a frame of the CatRobot1 (Class A2) sequence. (a) VTM 22.0, QP 22, PSNR 42.8277 dB, MS-SSIM 0.995698. (b) Proposed, QP 22, PSNR 42.7906 dB, MS-SSIM 0.995321.
Table 1. Description of the test sequences.

| Class | Resolution | Sequence | Frame Rate (fps) | Bit Depth |
|-------|-------------|-----------------|------------------|-----------|
| A1 | 3840 × 2160 | Tango2 | 60 | 10 |
| A1 | 3840 × 2160 | FoodMarket4 | 60 | 10 |
| A1 | 3840 × 2160 | Campfire | 30 | 10 |
| A2 | 3840 × 2160 | CatRobot1 | 60 | 10 |
| A2 | 3840 × 2160 | DaylightRoad2 | 60 | 10 |
| A2 | 3840 × 2160 | ParkRunning3 | 50 | 10 |
| B | 1920 × 1080 | MarketPlace | 60 | 10 |
| B | 1920 × 1080 | RitualDance | 60 | 10 |
| B | 1920 × 1080 | Cactus | 50 | 8 |
| B | 1920 × 1080 | BasketballDrive | 50 | 8 |
| B | 1920 × 1080 | BQTerrace | 60 | 8 |
| C | 832 × 480 | RaceHorses | 30 | 8 |
| C | 832 × 480 | BQMall | 60 | 8 |
| C | 832 × 480 | PartyScene | 50 | 8 |
| C | 832 × 480 | BasketballDrill | 50 | 8 |
| D | 416 × 240 | RaceHorses | 30 | 8 |
| D | 416 × 240 | BQSquare | 60 | 8 |
| D | 416 × 240 | BlowingBubbles | 50 | 8 |
| D | 416 × 240 | BasketballPass | 50 | 8 |
| E | 1280 × 720 | FourPeople | 60 | 8 |
| E | 1280 × 720 | Johnny | 60 | 8 |
| E | 1280 × 720 | KristenAndSara | 60 | 8 |
Table 2. Parameter settings for each operating point.

| Operating Point | P | ThBT | TTHU | TTHL | TTVU | TTVL |
|-----------------|-----|--------|------|------|------|------|
| S1 | 0.7 | 0.0002 | 0.07 | 0.05 | 0.03 | 0.01 |
| S2 | 0.6 | 0.0004 | 0.12 | 0.10 | 0.05 | 0.03 |
| S3 | 0.4 | 0.0008 | 0.16 | 0.14 | 0.12 | 0.10 |
Table 3. BDBR and TS performance of the proposed algorithm at different operating points.

| Class | Sequence | S1 BDBR (%) | S1 TS (%) | S2 BDBR (%) | S2 TS (%) | S3 BDBR (%) | S3 TS (%) |
|-------|----------|-------------|-----------|-------------|-----------|-------------|-----------|
| A1 | Tango2 | 2.53 | 58.13 | 2.59 | 60.33 | 2.65 | 60.82 |
| A1 | FoodMarket4 | 1.96 | 42.80 | 2.06 | 43.30 | 2.07 | 44.62 |
| A1 | Campfire | 1.05 | 42.03 | 1.21 | 49.11 | 1.41 | 52.52 |
| A2 | CatRobot1 | 1.94 | 39.94 | 2.14 | 46.14 | 2.35 | 52.25 |
| A2 | DaylightRoad2 | 1.43 | 47.03 | 1.56 | 51.24 | 1.63 | 55.89 |
| A2 | ParkRunning3 | 0.51 | 38.51 | 0.57 | 44.05 | 0.67 | 52.43 |
| B | MarketPlace | 1.91 | 57.03 | 2.02 | 60.49 | 2.10 | 64.13 |
| B | RitualDance | 2.15 | 30.87 | 2.47 | 36.85 | 2.84 | 43.98 |
| B | Cactus | 0.64 | 28.51 | 0.88 | 36.64 | 1.19 | 44.61 |
| B | BasketballDrive | 0.58 | 24.24 | 0.86 | 32.73 | 1.11 | 37.45 |
| B | BQTerrace | 0.44 | 21.44 | 0.64 | 30.82 | 0.87 | 40.05 |
| C | RaceHorses | 0.52 | 24.20 | 0.67 | 33.38 | 0.88 | 42.54 |
| C | BQMall | 0.49 | 16.49 | 0.70 | 24.68 | 1.14 | 35.20 |
| C | PartyScene | 0.33 | 14.29 | 0.53 | 26.20 | 0.84 | 40.40 |
| C | BasketballDrill | 1.06 | 21.41 | 1.61 | 30.94 | 2.29 | 40.44 |
| D | RaceHorses | 0.37 | 13.62 | 0.49 | 22.89 | 0.84 | 32.66 |
| D | BQSquare | 0.39 | 7.13 | 0.53 | 18.34 | 0.88 | 29.49 |
| D | BlowingBubbles | 0.38 | 10.92 | 0.66 | 21.66 | 0.93 | 34.15 |
| D | BasketballPass | 0.53 | 9.39 | 0.93 | 19.63 | 1.43 | 29.04 |
| E | FourPeople | 1.25 | 20.27 | 1.56 | 27.67 | 2.05 | 38.26 |
| E | Johnny | 0.79 | 21.01 | 1.07 | 28.06 | 1.62 | 37.73 |
| E | KristenAndSara | 1.21 | 25.36 | 1.50 | 30.67 | 1.90 | 39.86 |
| | Average | 1.02 | 27.94 | 1.24 | 35.26 | 1.53 | 43.11 |
| | Standard Deviation | 0.67 | 14.49 | 0.66 | 12.10 | 0.64 | 9.39 |
Table 4. Performance comparison of the proposed algorithm with [12,17,25]. Each method reports BDBR (%), TS (%), and TS/BDBR; "-" indicates results not reported in [12].

| Class | Sequence | [12] BDBR | [12] TS | [12] TS/BDBR | [17] BDBR | [17] TS | [17] TS/BDBR | [25] BDBR | [25] TS | [25] TS/BDBR | S2 BDBR | S2 TS | S2 TS/BDBR |
|-------|----------|-----------|---------|--------------|-----------|---------|--------------|-----------|---------|--------------|---------|-------|------------|
| A1 | Tango2 | - | - | - | 2.42 | 64.45 | 26.63 | 2.11 | 39.23 | 18.59 | 2.59 | 60.33 | 23.29 |
| A1 | FoodMarket4 | - | - | - | 1.47 | 46.93 | 31.93 | 1.23 | 39.17 | 31.85 | 2.06 | 43.30 | 21.02 |
| A1 | Campfire | - | - | - | 2.65 | 64.74 | 24.43 | 1.91 | 37.33 | 19.54 | 1.21 | 49.11 | 40.59 |
| A1 | Average | - | - | - | 2.18 | 58.71 | 27.66 | 1.75 | 38.58 | 23.33 | 1.95 | 50.91 | 28.30 |
| A2 | CatRobot1 | - | - | - | 3.27 | 63.81 | 19.51 | 2.20 | 37.96 | 17.25 | 2.14 | 46.14 | 21.56 |
| A2 | DaylightRoad2 | - | - | - | 2.02 | 70.39 | 34.85 | 1.44 | 38.57 | 26.78 | 1.56 | 51.24 | 32.85 |
| A2 | ParkRunning3 | - | - | - | 1.46 | 55.14 | 37.77 | 1.21 | 39.69 | 32.80 | 0.57 | 44.05 | 77.28 |
| A2 | Average | - | - | - | 2.25 | 63.11 | 30.71 | 1.62 | 38.74 | 25.61 | 1.42 | 47.14 | 43.90 |
| B | MarketPlace | - | - | - | 2.58 | 71.93 | 27.88 | 1.64 | 34.61 | 21.10 | 2.02 | 60.49 | 29.95 |
| B | RitualDance | - | - | - | 4.21 | 64.06 | 15.22 | 2.14 | 37.12 | 17.35 | 2.47 | 36.85 | 14.92 |
| B | Cactus | - | - | - | 2.78 | 66.61 | 23.96 | 1.37 | 36.30 | 26.50 | 0.88 | 36.64 | 41.64 |
| B | BasketballDrive | - | - | - | 2.38 | 67.81 | 28.49 | 2.08 | 36.94 | 17.76 | 0.86 | 32.73 | 38.06 |
| B | BQTerrace | - | - | - | 2.43 | 64.25 | 26.44 | 0.71 | 22.48 | 31.66 | 0.64 | 30.82 | 48.16 |
| B | Average | - | - | - | 2.88 | 66.93 | 24.40 | 1.59 | 33.49 | 22.87 | 1.37 | 39.51 | 34.55 |
| C | RaceHorses | 3.04 | 10.69 | 3.52 | 2.00 | 62.10 | 31.05 | 0.73 | 29.26 | 40.08 | 0.67 | 33.38 | 49.82 |
| C | BQMall | 2.73 | 10.93 | 4.00 | 2.92 | 62.93 | 21.55 | 1.19 | 29.92 | 25.14 | 0.70 | 24.68 | 35.26 |
| C | PartyScene | 0.33 | 13.86 | 42.00 | 1.40 | 58.77 | 41.98 | 0.43 | 19.39 | 45.09 | 0.53 | 26.20 | 49.43 |
| C | BasketballDrill | 0.93 | 16.49 | 17.73 | 5.39 | 65.29 | 12.11 | 1.46 | 25.95 | 17.77 | 1.61 | 30.94 | 19.22 |
| C | Average | 1.76 | 12.99 | 16.81 | 2.93 | 62.27 | 26.67 | 0.95 | 26.13 | 32.02 | 0.88 | 28.80 | 38.43 |
| D | RaceHorses | 3.46 | 12.25 | 3.54 | 1.69 | 58.98 | 34.90 | 0.56 | 19.25 | 34.38 | 0.49 | 22.89 | 46.71 |
| D | BQSquare | 2.00 | 29.78 | 14.89 | 1.68 | 59.98 | 35.70 | 0.21 | 13.20 | 62.86 | 0.53 | 18.34 | 34.60 |
| D | BlowingBubbles | 1.78 | 25.60 | 14.38 | 2.24 | 59.94 | 26.76 | 0.50 | 19.78 | 39.56 | 0.66 | 21.66 | 32.82 |
| D | BasketballPass | 1.50 | 6.42 | 4.28 | 2.34 | 61.15 | 26.13 | 1.06 | 23.84 | 22.49 | 0.93 | 19.63 | 21.11 |
| D | Average | 2.19 | 18.51 | 9.27 | 1.99 | 60.01 | 30.87 | 0.58 | 19.02 | 39.82 | 0.65 | 20.63 | 33.81 |
| E | FourPeople | - | - | - | 4.36 | 67.14 | 15.40 | 1.44 | 36.22 | 25.15 | 1.56 | 27.67 | 17.74 |
| E | Johnny | - | - | - | 4.34 | 67.01 | 15.44 | 1.82 | 36.47 | 20.04 | 1.07 | 28.06 | 26.22 |
| E | KristenAndSara | - | - | - | 3.56 | 66.21 | 18.60 | 1.53 | 32.16 | 21.02 | 1.50 | 30.67 | 20.45 |
| E | Average | - | - | - | 4.09 | 66.79 | 16.48 | 1.60 | 34.95 | 22.07 | 1.38 | 28.80 | 21.47 |
| | Sequence Average | 1.97 | 15.75 | 13.04 | 2.71 | 63.16 | 26.22 | 1.32 | 31.13 | 27.94 | 1.24 | 35.26 | 33.76 |
Table 5. BDBR and TS performance for 4K sequences from the UVG dataset at the S2 operating point.

| Sequence | BDBR (%) | TS (%) | TS/BDBR |
|------------|----------|--------|---------|
| Bosphorus | 0.85 | 36.78 | 43.27 |
| FlowerKids | 1.66 | 44.66 | 26.90 |
| FlowerPan | 1.31 | 47.65 | 36.37 |
| Jockey | 0.84 | 44.31 | 52.75 |
| RiverBank | 0.86 | 32.19 | 37.43 |
| Twilight | 2.50 | 51.65 | 20.66 |
| Average | 1.34 | 42.87 | 36.23 |
Table 6. Computational overhead (%) introduced by generating the perception-driven pixel map and the YOLO-based object map.

| Class | Sequence | Perception-Driven Pixel Map (%) | YOLO-Based Object Map (%) |
|-------|----------|---------------------------------|---------------------------|
| A1 | Tango2 | 0.3024 | 0.8820 |
| A1 | FoodMarket4 | 0.4006 | 1.3430 |
| A1 | Campfire | 0.0732 | 0.2463 |
| A2 | CatRobot1 | 0.1132 | 0.2959 |
| A2 | DaylightRoad2 | 0.1054 | 0.2757 |
| A2 | ParkRunning3 | 0.0572 | 0.1812 |
| B | MarketPlace | 0.0980 | 1.0304 |
| B | RitualDance | 0.0954 | 1.2673 |
| B | Cactus | 0.0371 | 0.4574 |
| B | BasketballDrive | 0.0619 | 0.6577 |
| B | BQTerrace | 0.0304 | 0.3639 |
| C | RaceHorses | 0.0338 | 1.4641 |
| C | BQMall | 0.0327 | 1.4531 |
| C | PartyScene | 0.0240 | 0.9632 |
| C | BasketballDrill | 0.0455 | 1.8403 |
| D | RaceHorses | 0.0291 | 4.2905 |
| D | BQSquare | 0.0233 | 4.1289 |
| D | BlowingBubbles | 0.0251 | 3.7639 |
| D | BasketballPass | 0.0355 | 5.7824 |
| E | FourPeople | 0.0448 | 1.1500 |
| E | Johnny | 0.0815 | 1.7914 |
| E | KristenAndSara | 0.0731 | 1.6893 |
| | Average | 0.0829 | 1.6054 |