Object-Cooperated Ternary Tree Partitioning Decision Method for Versatile Video Coding

In this paper, we propose an object-cooperated decision method for efficient ternary tree (TT) partitioning that reduces the encoding complexity of versatile video coding (VVC). In most previous studies, the VVC complexity was reduced using decision schemes based on the encoding context, which do not apply object detection models. We assume that high-level objects are important for deciding whether complex TT partitioning is required because they can provide hints on the characteristics of a video. Herein, we apply an object detection model that discovers and extracts high-level object features, namely the number and ratio of objects, from frames in a video sequence. Using the extracted features, we propose machine learning (ML)-based classifiers for each TT-split direction that decide whether the TT-split process can be skipped in the vertical or horizontal direction, thereby reducing the encoding complexity of VVC. The TT-split decision of each classifier is formulated as a binary classification problem. Experimental results show that the proposed method decreases the encoding complexity of VVC more effectively than a state-of-the-art ML-based model.


Introduction
Deep learning (DL) [1] techniques are known to outperform non-DL approaches in diverse fields, such as computer vision [2], natural language processing [3], and speech recognition [4]. Among the DL models, object detection models have demonstrated particular success in computer vision but have rarely been deployed in video coding. Instead, most researchers who have investigated video coding use traditional machine learning (ML) approaches to reduce the complexity of the encoders [5]. Herein, we propose that object detection models can reduce the video coding complexity of the next-generation video coding standard known as versatile video coding (VVC) [6].
Unlike previous video coding standards, such as high efficiency video coding (HEVC) [7], VVC introduces a multi-type tree (MTT) block partitioning structure that supports binary tree (BT) and ternary tree (TT) splits in the horizontal and vertical directions. Although these approaches achieve higher coding efficiency than HEVC, they considerably increase the computational complexity [8] because the VVC encoder applies a brute-force method to optimize the partitioning structure.
To reduce the encoding complexity, we must reduce the number of MTT-partitioning steps. Recent complexity reduction methods have achieved fast MTT-partitioning decision in VVC [5]; however, the methods that use object detection for fast TT partitioning in the horizontal and vertical directions are rarely reported.
To bridge this gap, we previously proposed reducing the complexity of the VVC encoder by applying ML models. This research is an extension of work originally presented at ICEIC 2022 [9]. In this paper, we first apply object detection techniques to VVC to decrease its encoding complexity. We propose a new framework that combines DL for object detection in the first stage (feature extraction) and ML for accurate TT-split prediction in the second stage (TT-split decision). In the feature extraction stage, we extract new object-related features via object detection in each frame of a video sequence and obtain the context-based features using the traditional context-based approach. In the TT-split decision stage, we conduct experiments using ML models that take the extracted features as input and make fast TT-partitioning decisions to reduce the encoding complexity of VVC.
Herein, we show that our proposed method decreases the encoding time of the VVC test model (VTM4.0) by up to 60% with an average coding loss of 0.56%. Moreover, our proposed method is flexible and adaptable to applications.
The contributions of this paper are summarized as follows:
• We discover new object-based features that can cooperate with context-based methods. We assume that objects are the key characteristics of videos and that object features can help reduce the computational cost (or complexity) of VVC.
• Our proposed framework newly combines a DL-based model with the traditional ML approach: DL for feature extraction and ML for the final decision scheme. Experimental results show that the proposed combined framework outperforms the state-of-the-art model.
The remainder of this paper is organized as follows. Section 2 describes existing methods related to encoding complexity reduction and overviews DL models for object detection. Section 3 introduces the proposed object-cooperated TT partitioning decision method. The dataset and the associated training process are also presented in Section 3. Section 4 describes the experimental setup and test environment and compares the performances of the MLP-based method and proposed methods with that of the anchor. Finally, Section 5 provides conclusions of this study.

Existing Methods for Encoder-Complexity Reduction
Although VVC can provide powerful coding performance, the computational complexity of the VVC encoder is substantially higher than that of the HEVC encoder. Various approaches based on statistical analysis and neural networks have been studied to reduce the complexity of the VVC encoder. As a statistical analysis-based approach, Park et al. [8] proposed a context-based fast TT decision method using the directional information between BT and TT. H. Yang et al. [10] proposed a fast intra coding algorithm consisting of fast coding unit (CU) partitioning and fast intra mode decision using a combination of binary classifiers. This method exploited the textural complexity of the current CU and the context information from neighboring CUs.
The aforementioned methods evaluated the statistical correlations between a current CU and the neighboring CUs. Recent studies have investigated neural network-based fast decision schemes to avoid redundancy in the search for the optimal VVC block structure. Park et al. [11] designed a fast decision scheme using two lightweight neural networks to determine TT block partitioning. Q. Zhang et al. [12] proposed a fast CU decision algorithm based on DenseNet, which predicts the probabilities of whether the edges of 4 × 4 blocks are boundaries or not. T. Li et al. [13] designed a multi-stage convolutional neural network (CNN) model that predicts the quad-tree and multi-type tree-based CU partition to accelerate the encoding process of intra-mode VVC. S. Wu et al. [14] proposed a hierarchy grid fully convolutional network framework, which can predict the particular hierarchical split structure to automatically control the trade-off between coding efficiency and complexity.

Object Detection
Object detection employs computer vision and image processing technologies to detect object instances of a certain class within an image. Object detection methods can be categorized into ML-based and DL-based approaches. ML-based approaches are frameworks based on Haar features, scale-invariant feature transform (SIFT), and histograms of oriented gradients (HOG) [15], followed by a classification technique such as a support vector machine (SVM). Meanwhile, DL techniques perform end-to-end detection without using specifically defined features and are usually based on CNNs. DL-based methods for object detection can be categorized into two main types: one-stage methods and two-stage methods. The one-stage methods prioritize the inference speed and include You Only Look Once (YOLO) [16], the single shot detector [17], and RetinaNet [18]. The two-stage methods prioritize the detection accuracy and include Faster R-CNN [19], Mask R-CNN [20], and Cascade R-CNN [21].
In this study, we use a model that prioritizes the inference speed to reduce the encoding complexity of VVC. Therefore, we choose the YOLO method for object detection and the YOLOv5 [22] model for experiments. YOLOv5 has lower capacity and faster speed than the other YOLO versions. YOLOv5, a family of object detection architectures and models pre-trained on the COCO 2017 dataset [23], comprises four models: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, which are simply named small, medium, large, and extra-large, respectively. The four models share the same backbone and head but differ in the multiples of model depth and layer width. We use the YOLOv5s model, which has the fastest inference speed among the YOLOv5 pre-trained models.

Proposed Object-Cooperated TT Partitioning Decision Method
Because the VVC encoder cannot know which block should avoid the TT-split process [9], it usually attempts to determine the optimal partitioning structure via brute-force search, which is a time-consuming process. To reduce the encoding time relative to that of brute-force searching, we adopt ML models that decide when a TT-split is required. First, we experimentally identify which ML model can accurately predict whether a TT-split process should be skipped (see Section 4.3). Second, we select the lightweight model that provides the highest accuracy of TT-split decisions with our extracted features compared with the existing method (see Section 4.4). In our proposed method, the model that makes accurate TT-partitioning decisions using the extracted features is applied to each TT-split direction.
As mentioned above, our proposed method for fast TT partitioning comprises two stages: the feature extraction stage and the TT-split decision stage. Figure 1 presents the framework of the proposed object-cooperated TT partitioning decision method. In the feature extraction stage, our object-related features are obtained via DL-based object detection and the context-based features [11] are obtained from the traditional context-based approach for each TT-split direction. The feature extraction and the training and evaluation datasets are explained in Section 3.1. In the TT-split decision stage, two decision tree (DT) classifiers are applied, one for each TT-split direction, in the same manner as in [11]: horizontal TT (TT_H) and vertical TT (TT_V). The encoding-complexity reduction of VVC by the DT model using the extracted features is experimentally demonstrated in each TT-partitioning direction.

Process of Feature Extraction Including Object Detection
To reduce the encoding complexity of VVC, the extracted features forming the input vector of the ML model must allow the model to predict TT-split decisions accurately. Accordingly, we assume that objects are among the main features of the video sequences and that they can be characterized by their number and ratio. As shown in Figure 1, our feature extraction method uses YOLOv5 object detection. We first extracted 11 features, such as the quad-tree depth and the MTT depth (Features F in Figure 1), through context-based approaches. Second, we executed YOLOv5-based object detection on the frames of the video sequences to obtain object-cooperated features (Features O in Figure 1). Figure 2 shows the object detection results of two frames in video sequences. In the first frame (Figure 2a), 23 objects were detected: 9 persons, 12 bicycles, and 2 backpacks. We define the object number as the total number of objects detected in the frame and the object ratio as the percentage of the frame occupied by the detected objects. In this frame, the object number is 23 and the object ratio is 51.58%. The object number and object ratio are the new features obtained from object detection, and we refer to them as object features. By contrast, no objects were detected in the second frame (Figure 2b). In this case, the object number and object ratio are zero. The object ratio is defined as follows:

$$\text{Object ratio} = \frac{\text{bounding box size of the objects detected by YOLOv5}}{\text{frame resolution}} \times 100$$

Features F were extracted during the encoding process. The datasets obtained from the encoding process comprised 11 input features and a binary class indicating whether the TT-split process is required (binary class = 1) or not (binary class = 0). Features F were the same as those reported in [11].
The 11 features are the quad-tree depth (QTD), BT's superiority in rate-distortion (RD) cost view (BTS), a Boolean value indicating the optimal BT direction of the two BTs in RD cost view (BTD), block shape ratio depending on TT direction (BSR), BT/TT depth (MTD), intra prediction mode (IPM), intra subblock partition (ISP), multiple reference lines (MRL), coded block flag (CBF), multiple transform set (MTS), and quantization parameter value for a frame (QP). Details of these 11 features can be found in [11]. Features O represent the object number and object ratio of a frame in a video sequence. We added two columns to the input vector of the ML models to combine the newly extracted Features O with Features F obtained using the existing method [11].
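The way the two object features extend the 11 context-based features can be sketched as follows. This is a minimal illustration, not the authors' code: the box layout `(x_min, y_min, x_max, y_max)`, the function names, and the choice to sum box areas (so that overlapping boxes are counted more than once) are assumptions made here for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def object_features(boxes: Sequence[Box], width: int, height: int) -> Tuple[int, float]:
    """Return (object number, object ratio in percent) for one frame.

    Assumption: the ratio sums the areas of all boxes, so overlapping
    boxes contribute more than once.
    """
    total_area = sum(max(0.0, x1 - x0) * max(0.0, y1 - y0) for x0, y0, x1, y1 in boxes)
    return len(boxes), total_area / (width * height) * 100.0


def build_input_vector(
    context_features: Sequence[float], boxes: Sequence[Box], width: int, height: int
) -> List[float]:
    """Append the two object features to the 11 context-based features."""
    if len(context_features) != 11:
        raise ValueError("expected 11 context-based features")
    number, ratio = object_features(boxes, width, height)
    return list(context_features) + [float(number), ratio]


# Two hypothetical detections in a 3840 x 2160 frame.
boxes = [(0, 0, 1920, 1080), (1920, 1080, 3840, 2160)]
number, ratio = object_features(boxes, 3840, 2160)
vector = build_input_vector([0.0] * 11, boxes, 3840, 2160)
```

With no detections the function returns an object number and object ratio of zero, which matches the behavior described for the frame in Figure 2b.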

Datasets for Training and Evaluation
The training dataset in this study was derived from the Tencent Video Dataset (TVD) [24], which differs from that in [11]. The TVD covers a variety of content within 86 video sequences, each comprising 65 frames with 3840 × 2160 spatial resolution. As the training dataset, the 0th frame (the very first frame of the sequence), the 20th frame, the 40th frame, and the 60th frame of the TVD sequences were used and encoded under an all intra (AI) configuration. The frames used for training were never used for testing. Five QP values covering a wide range (20, 25, 30, 35, and 40) were used for encoding.
We used JVET test sequences [25] to evaluate the proposed method. For encoding, we adopted QP values covering a wide range (20, 25, 30, 35, and 40). The test dataset comprised the encoded 0th frame in each test video sequence. Features were extracted during the encoding process. To evaluate the performance of the ML models, object-features in the 0th frame of each test sequence were extracted via object detection. Moreover, when comparing the performances of the existing and proposed complexity-reduction methods, we used object features obtained via object detection under the common test condition (CTC) recommended by JVET experts [25].

Data Augmentation of the Training Datasets
To improve the generalization of the ML model, we expanded the training dataset through a data augmentation technique called pixel-level transform, which was implemented via OpenCV-Python [26]. This technique adjusts the brightness of the sequences by manipulating the pixel values in the original video sequences. As shown in Table 1, 10 video sequences were adjusted to be brighter (by 39.06%) than the original video sequences and another 10 video sequences were adjusted to be darker (by −19.53%) than the original sequences. Figure 3 shows examples of an original sequence, a brighter sequence, and a darker sequence.
Table 2 lists the number of training samples collected after data augmentation of TVD. The original dataset comprises 86 sequences. The brightness-adjusted dataset consists of 10 sequences that were brightened and 10 sequences that were darkened via data augmentation. A total of 6,665,015 and 6,689,424 samples were thus collected for TT_H and TT_V, respectively.
To demonstrate the effectiveness of the data augmentation technique, we used the Pearson correlation coefficient (PCC) to analyze the correlation between the extracted features and the binary class (Figure 4). Figure 4a,b display the heatmaps of PCCs before and after data augmentation, respectively, for TT_H. The PCC of one of the object-features obtained via object detection, i.e., the object ratio, increased to 0.0085 and 0.012, confirming that the feature became more strongly related to the binary class. In the next experiment, we applied the DT model as the basic ML model with different maximum tree depths (max depth = 5, 6, and 7) to TT_H splitting decisions. The performances of the models with and without data augmentation are compared in Table 3. Because approximately 10% more data were added through the data augmentation technique and the accuracy of the DT model increased as the depth of the DT increased, we applied the data augmentation technique to the training datasets.
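The PCC used for the heatmaps can be sketched in a few lines. This is a minimal illustration of the standard sample correlation between one feature column and the binary class; the function name and the toy data are hypothetical and are not taken from the paper's dataset.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical object-ratio values and TT-split labels for four samples.
object_ratio = [10.0, 20.0, 30.0, 40.0]
tt_split = [0, 0, 1, 1]
r = pearson(object_ratio, tt_split)
```

A value close to zero, as reported for the object ratio, indicates a weak linear relationship with the class, although a DT can still exploit non-linear dependencies that the PCC does not capture.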
In addition, when object detection is performed, it is reported that the effect is better if the data augmentation technique is applied [27]. Thus, the data augmentation technique is applied to the experiment.
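The brightness adjustment can be illustrated with a small sketch. It assumes that the reported percentages are additive offsets relative to the 256-level range of 8-bit samples (39.06% ≈ 100/256 and 19.53% ≈ 50/256); this reading, the function name, and the use of plain Python instead of OpenCV-Python are assumptions, so the sketch does not reproduce the authors' implementation.

```python
def adjust_brightness(plane: bytes, percent: float) -> bytes:
    """Shift every 8-bit sample by `percent` of the 256-level range.

    Assumption: the adjustment is an additive offset, clipped to [0, 255].
    """
    offset = round(percent / 100.0 * 256)
    return bytes(min(255, max(0, p + offset)) for p in plane)


# Four hypothetical 8-bit samples from one frame.
samples = bytes([0, 100, 200, 250])
brighter = adjust_brightness(samples, 39.06)   # offset of +100
darker = adjust_brightness(samples, -19.53)    # offset of -50
```

Clipping keeps the result inside the valid sample range, so very bright or very dark regions saturate rather than wrap around.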

TT Partitioning Decision Stage Based on DT
To decide whether TT partitioning should be avoided, we proposed a framework using two DTs for the TT-split decision stage (Figure 1). DTs were chosen because DTs with a limited maximum depth offer very fast inference and low implementation complexity. A DT is a nonparametric supervised learning algorithm for classification and regression. The model generated using a DT predicts the value of a target by learning simple decision rules inferred from the data features. To predict the value of a target, we used the Gini impurity function [28], which measures how well a node of the DT is split. The Gini impurity is 0 when all elements belong to the same class and approaches 1 when the elements are spread evenly over many classes. For the binary classification considered here, its maximum value is 0.5, which is reached when the two classes are equally represented. Furthermore, a DT learns from the data features and approximates the target with a set of if-then-else decision rules up to the max depth. The deeper the tree is, the more complicated the rules of the DT model are. The DT is usually divided until the class value is perfectly determined or until the data are fewer than the minimum number of samples that can be split; that is, the minimum number of samples required to form a leaf node.
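The Gini impurity of a node with class proportions $p_k$ is $1 - \sum_k p_k^2$. The following sketch computes it for a list of labels and for a candidate split; the function names are hypothetical, and the weighted average over the two children is the usual criterion in CART-style trees rather than something stated in the text.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini_impurity(labels: Sequence[Hashable]) -> float:
    """Gini impurity 1 - sum(p_k^2) of the class labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left: Sequence[Hashable], right: Sequence[Hashable]) -> float:
    """Sample-weighted Gini impurity of the two children of a split."""
    n = len(left) + len(right)
    if n == 0:
        return 0.0
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


pure = gini_impurity([1, 1, 1, 1])        # a single class
balanced = gini_impurity([0, 0, 1, 1])    # two equally frequent classes
perfect_split = split_impurity([0, 0], [1, 1])
```

For the binary TT-split labels, a node is pure at 0 and maximally mixed at 0.5, and a split is preferred when it lowers the weighted impurity of the children.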
Because TT-splits are directional, DTs are trained separately. Thus, the Gini values resulting from the DTs, which determine whether a TT should be split (i.e., TT partitioning), differ between the models. Among the ML models for determining TT partitioning, DT was chosen because it shows the best accuracy. The performances of the ML models in each direction are given in Tables 4 and 5 (see Section 4.3).
We established two models for the two TT-split directions (TT_H and TT_V). In each model, we evaluated three DT models with different max depths (5, 6, and 7). The range of max depth was limited for the following reasons:
• If the max depth is less than 5, the model is oversimplified and provides poor predictions.
• If the max depth is greater than 7, the model becomes too complicated and is prone to overfitting.
During the experiment, the DT with max depth = 7 achieved the best predictions; therefore, the maximum depth was set to 7 in subsequent analyses. Figure 5 shows graphs of the DT models with max depth = 7. Figure 5a,b are the left-side and right-side graphs of the DT, relative to the root node, for TT_H. Figure 5c,d are the left-side and right-side graphs of the DT, relative to the root node, for TT_V. The graphs can be enlarged by running our GitHub code (https://github.com/sujineel/Object-cooperated-Ternary-Tree-Partitioning-Decision-Method-for-Versatile-Video-Coding accessed on 12 June 2022), which is provided online. The output value y of the DT model determines whether TT should be split in each direction. For example, if y = 1, TT is split; if y = 0, TT is not split. We set a threshold α in the mid-range of y (i.e., because y ranges between 0 and 1, α = 0.5) to map the floating-point value to a Boolean value (true/false answer). The entire TT_H or TT_V splitting process is omitted if y is less than 0.5. In the entire TT-partitioning process, y determines the best CU. By avoiding unnecessary TT-splits, the proposed method reduces the encoding complexity of VVC.
The coding efficiency is considerably reduced when a required TT is incorrectly predicted by the DT; that is, when the DT outputs a false negative. To mitigate the coding loss of the predictive DT model, α should be adjusted to suit the encoding application. If the application prioritizes image quality, α should be less than 0.5 even if the complexity reduction is somewhat compromised. Therefore, we propose two threshold values (0.5 and 0.25) that accomplish a reasonable trade-off between coding efficiency and complexity.
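The threshold rule can be sketched as a single comparison. The function name and the example value of y are hypothetical; the rule itself (omit the TT split in a direction when y is below α) follows the description above.

```python
def skip_tt_split(y: float, alpha: float = 0.5) -> bool:
    """Return True when the TT split in one direction can be skipped.

    `y` is the DT output in [0, 1]; the split is skipped when y is below alpha.
    """
    if not 0.0 <= y <= 1.0:
        raise ValueError("y must lie in [0, 1]")
    return y < alpha


# A hypothetical DT output for one CU in the horizontal direction.
y_h = 0.3
skip_fast = skip_tt_split(y_h, alpha=0.5)      # complexity-oriented setting
skip_quality = skip_tt_split(y_h, alpha=0.25)  # quality-oriented setting
```

With the lower threshold, fewer blocks fall below α, so more TT splits are still evaluated by the encoder; this reduces the risk of a false negative at the cost of a smaller time saving.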

Experimental Setup
All the encoding operations were conducted using personal computers with Intel i7-10700 eight-core 2.90-GHz processors and a 64-bit Windows 10 operating system, with the hyper-threading and turbo modes turned off. Experiments were performed without GPUs to reduce the complexity of the ML models. The training and testing of the ML models were conducted in Jupyter Notebook. Visual Studio 2017 was used for the conversion and the experiments in C++.
The model performances were evaluated using the TensorFlow [29] and scikit-learn libraries [30]. The TensorFlow library is an open-source software library for ML and artificial intelligence. TensorFlow can be used for a range of tasks but focuses particularly on the training and inference of deep neural networks. TensorFlow was developed by the Google Brain team for internal Google use in research and production. Scikit-learn is a free software ML library for the Python programming language. Scikit-learn includes various classification, regression, and clustering algorithms, including SVM, random forest (RF), gradient boosting, and k-means. It is designed to inter-operate with the numerical and scientific Python libraries such as NumPy and SciPy.

Evaluation Metrics of the ML Models
ML models are used in the TT-split decision stage of the proposed method. As the performance metrics of the ML models, we used the metrics provided by the TensorFlow and scikit-learn libraries. We measured the total training time and the accuracy of the models. The accuracy is defined as follows:

$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{True Negatives} + \text{False Positives} + \text{False Negatives}}$$

where True Positives denote the correct predictions of actually true answers, False Positives denote the wrong predictions of actually false answers predicted as true, True Negatives denote the correct predictions of actually false answers, and False Negatives denote the wrong predictions of actually true answers predicted as false.
To evaluate the accuracy of our method using the object-features, we added the object-features to the DT with max depth = 7 according to the direction of TT-splits. The results of the experiment are shown in detail in Section 4.4.
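As a small worked example of the accuracy metric, the sketch below counts the four outcomes for binary TT-split labels and computes the fraction of correct predictions. The function names and the label lists are hypothetical.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for binary labels, where 1 means a TT split is required."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correct predictions among all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (tp + tn) / total


# Hypothetical ground truth and predictions for six blocks.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
```

In this setting a false negative (a required TT split that is skipped) affects coding efficiency, whereas a false positive only costs encoding time, so the four counts are worth inspecting alongside the single accuracy value.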

Evaluation Protocol of the Proposed Method: Comparisons with the Anchor
All encoding experiments were conducted using VTM4.0 in the AI coding configuration. To evaluate the performance of the proposed method, coding efficiency was measured in terms of the Bjontegaard delta bit rate (BDBR), which represents the rate saving of a method under the same objective quality, and computational complexity was measured in terms of the encoding time (∆EncT). The BDBR is the bitrate loss, in percentage, over four QPs with respect to the anchor for the same peak signal-to-noise ratio (PSNR). In [11], BDBR is defined as a weighted average of BDBR_Y, BDBR_U, and BDBR_V, which are the BDBRs of the Y, U, and V components, respectively. Using BDBR_Y, we compared the coding efficiency of the proposed and existing methods with respect to the anchor. The encoding times of the proposed and existing methods with respect to the anchor were assessed in each sequence, and ∆EncT is calculated from the encoding times of the evaluated method and the anchor.
To evaluate the model performances, we selected a method based on the traditional context-based approach for early TT partitioning [11]. The BDBR_Y and ∆EncT of the proposed method were evaluated at the α values of 0.5 and 0.25. For a fair comparison, VTM4.0 was applied to the existing and proposed methods. The experimental results are presented in Section 4.5.

Performance of ML Models for Accurate TT-Split Prediction
Tables 4 and 5 present the performance results (accuracy and training time) of the ML models established for the TT_H and TT_V split directions, respectively.
The ML models used in the TT-split decision stage were DT, RF, and multi-layer perceptron (MLP) [9]. We first established three DT models with different maximum depths (max depth = 5, 6, and 7) and then established three RF models with different numbers of DTs (number of DTs = 5, 6, and 7). We finally constructed a fully connected neural network with 13 input nodes, 30 hidden nodes, and 1 output node (the MLP model) and set the number of epochs to 2000 or 3000. The number of hidden nodes was set to 30 to ensure the same accuracy for evaluating the proposed method as that for the existing method [11].
The results show that the DT models achieved higher accuracy in less training time than the other models. The DT model with max depth = 7 achieved the highest accuracy with a short total training time in TT_H decisions. Thus, this model was selected for determining whether a TT-split is required in the TT-partitioning decision stage.

Performance of the Proposed Object-Cooperated TT Partitioning Decision Method
We now compare the performance of the method that inputs only context-based features with that of the proposed method, which additionally inputs object-features. Tables 6 and 7 display the accuracy of the methods per sequence in the horizontal and vertical directions of TT-split, respectively, on the 0th frame of 22 sequences at various video resolutions [25].
As the existing method, we evaluated a DT model with max depth = 7 and only context-based features. In the proposed object-cooperated TT-partitioning decision method, the DT model with max depth = 7 was trained using 13 features comprising 11 context-based features and two additional features (object-features) obtained via object detection with YOLOv5.
As shown in Table 6, our proposed object-cooperated method exhibits higher accuracy than the DT-based method [9] in the worst cases (video sequences with an accuracy of less than 80%). As shown in Table 7, our proposed method also improves the accuracy of five out of the seven worst-case sequences, confirming its effectiveness.

Table 6. Accuracy (%) of the DT-based method [9] and the proposed method per sequence regarding the TT_H decisions.

Table 8 compares the performances of the existing and proposed methods with respect to BDBR_Y and ∆EncT. To demonstrate that our proposed method can be adapted to the needs of applications, we adjusted the α value of the classification in the DT models for TT_H and TT_V. The best ∆EncT was obtained using the proposed method with α = 0.5, which achieved 60% on average compared with the anchor (VTM4.0). Ranked by ∆EncT, the methods are ordered as follows: the proposed method (α = 0.5), the MLP-based method [11], and the proposed method (α = 0.25). We also confirm that our proposed method reasonably reduced the encoding complexity of VVC. Meanwhile, the BDBR_Y value when using the proposed method (α = 0.5) increased by 0.56%, which is 0.01% higher than that obtained using a previously reported model [11]. However, the value obtained using the proposed method (α = 0.25) increased by only 0.11% relative to the anchor, although the ∆EncT value was 75%. Thus, our proposed method achieved a moderate trade-off between encoding complexity and coding efficiency.
The results of the video sequence experiments show that the proposed methods (α = 0.5 and α = 0.25) outperformed the method reported in [11] in terms of ∆EncT and BDBR_Y, respectively. The largest reduction in encoding time was 57%, achieved using our proposed method with α = 0.5 on the RaceHorses (832 × 480), Johnny sequence. On the same sequence, at the resolutions of (832 × 480) and (416 × 240), the existing MLP-based method achieved 61% and 62%, respectively. Comparing the best results, our proposed method (α = 0.5) improved on the MLP-based method by 4% and 5% in terms of ∆EncT, respectively. Table 9 shows the relationship between the bitrate and the average object number and object ratio when α is set to 0.5 in the DT model.
The average object number and the object ratio were determined by object detection on frames of the JVET test sequences. As a result, we confirmed the assumption that object-features can be hints for determining the characteristics of a video. Based on various JVET test sequences [25], it was confirmed that sequences with a low object ratio or a small number of objects are superior to other sequences in terms of bitrate. For example, the BQSquare and PartyScene sequences show a low average object ratio and the best bitrate. The MLP-based method [11] could not be compared in this respect because it uses no object features.
Figures 6 and 7 show the decoded images of the models yielding the best ∆EncT results in Table 8, on the video sequences RaceHorses (832 × 480) and RaceHorses (416 × 240), for QPs of 22 and 37, respectively. The image-quality degradations were not noticeably different among the proposed method, the MLP-based method [11], and VTM4.0. Meanwhile, Figures 8 and 9 show the decoded images of the models yielding the worst ∆EncT results in Table 8 for QPs of 22 and 37, respectively. On the video sequences RitualDance and Cactus, where the proposed method (with α = 0.5) delivered its poorest performance (68% and 71%, respectively), the existing method yielded 72% and 73%, respectively. Comparing the worst results, our proposed method (α = 0.5) improved on the MLP-based method by 4% and 2% in terms of ∆EncT, respectively. Moreover, increasing the QP from 22 to 37 caused no significant difference in the image-quality degradation among the proposed method, the MLP-based method [11], and VTM4.0.

Conclusions
To reduce the encoding complexity of VVC, we proposed a framework combining DL for object detection with ML for accurate TT-split prediction. The framework extracts the image features in the first stage and decides whether to split the TT in the second stage. In the feature extraction stage, we acquired object-features (object number and object ratio) using the object detection model YOLOv5. In the TT-split decision stage, we determined whether to split TT using DT, which showed the highest accuracy in an experimental test on multiple ML models. The experimental results confirmed that our proposed method can be adapted to the purpose of the application. Therefore, the proposed method could be effectively used in a VVC encoder that reduces encoding complexity while somewhat compromising the quality or in an encoder that reduces encoding complexity while ensuring some quality. Furthermore, the extracted object-features and the optimization of the VVC encoder based on object detection may be further investigated to reduce the complexity of high-quality encoders in the near future.