Article

Fast Intra-Coding Unit Partitioning for 3D-HEVC Depth Maps via Hierarchical Feature Fusion

School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450002, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(18), 3646; https://doi.org/10.3390/electronics14183646
Submission received: 14 August 2025 / Revised: 8 September 2025 / Accepted: 13 September 2025 / Published: 15 September 2025

Abstract

As a new-generation 3D video coding standard, 3D-HEVC offers highly efficient compression. However, its recursive quadtree partitioning mechanism and frequent rate-distortion optimization (RDO) computations lead to a significant increase in coding complexity. In particular, intra-frame coding of depth maps, which incorporates tools such as depth modeling modes (DMMs), substantially prolongs the coding unit (CU) partitioning decision process and has become a critical bottleneck in encoding time. To address this issue, this paper proposes a fast CU partitioning framework based on a hierarchical feature fusion convolutional neural network (HFF-CNN). It aims to significantly accelerate the overall encoding process while preserving encoding quality by optimizing depth map CU partitioning decisions. The framework jointly captures the global structure and local details of a CU through multi-scale feature extraction and a channel attention mechanism (SE module). It introduces two external features: a wavelet energy ratio designed to quantify the texture complexity of depth map CUs, and the quantization parameter (QP), which reflects encoding quality; these enhance the model's dynamic perception from different dimensions. Finally, it outputs partitioning predictions corresponding to each depth level through three fully connected branches, strictly adhering to HEVC's quadtree recursive partitioning mechanism. Experimental results demonstrate that, across eight standard test sequences, the proposed method reduces encoding time by 48.43% on average, significantly lowering intra-frame encoding complexity with a BDBR increase of only 0.35%. The model is lightweight, with minimal inference time overhead. Compared with representative existing methods, it achieves a better balance between cross-resolution adaptability and computational efficiency, providing a feasible optimization path for real-time 3D-HEVC applications.

1. Introduction

With the advancement of 3D vision technology, 3D video has garnered significant attention due to its immersive and realistic visual effects. Stereoscopic displays based on planar stereo imaging have become a major research focus [1]. However, the increase in the number of viewpoints, while enhancing the stereoscopic effect, leads to an exponential growth in data volume, posing significant challenges for video storage and transmission [2].
To address these challenges, the Joint Collaborative Team on 3D Video Coding (JCT-3V) [3], established by the ITU’s VCEG and ISO’s MPEG, has developed various 3D video compression standards. The evolution of coding technology has progressed from the H.264-based Multiview Video Coding (MVC) extension standard in 2005 [4] to the MVC+D extension standard in 2013 [5], the 3D-AVC standard at the end of 2013 [6], and the MV-HEVC and 3D-HEVC standards developed between 2014 and 2015 [7,8]. Among these, 3D-HEVC not only efficiently compresses texture images but also introduces depth map coding techniques, leveraging the correlation between texture and depth to achieve high-efficiency 3D video compression.
To enhance the efficiency and robustness of depth map coding, 3D-HEVC introduces several new coding tools, such as Depth Modeling Modes (DMMs), Depth Intra Skip (DIS), Segment-wise DC Coding (SDC), and View Synthesis Optimization (VSO) [9,10]. However, the recursive quadtree partitioning of Coding Units (CUs) and the exhaustive prediction methods significantly increase encoding time and computational complexity. This complexity currently limits the further development of 3D-HEVC. To facilitate its widespread adoption, it is imperative to reduce computational complexity while keeping the coding performance loss negligible. This is crucial for improving practical application efficiency and advancing 3D video technology.
In the exploration of improving 3D-HEVC depth map intra coding, researchers have focused on two key directions: refining CU partitioning and optimizing prediction mode selection. These efforts aim to reduce encoding complexity while maintaining coding quality. In recent years, proposed CU partitioning optimization strategies can be broadly categorized into three types. Statistical decision-based methods: Zou et al. [11] leveraged Bayesian statistical decision theory to compute conditional probabilities and risk costs for splitting versus retaining CUs, providing a rigorous mathematical criterion for early termination of partitioning. In contrast, Wang et al. [12] treat Bayes’ theorem as a classifier, predicting the optimal partition depth through feature patterns learned from data to simplify the decision process. Omran et al. [13] innovatively applied the traditional image processing technique of connected component labeling (CCL) to the CU partitioning problem in depth maps. This method leverages the inherent structural properties of depth maps to achieve highly accurate segmentation in low-bitrate applications. Machine learning-based approaches: Hamout and Elyousfi [14] proposed an optimized CU segmentation method based on the AutoMerge Possibility Clustering (AM-PCM) algorithm. By extracting amplitude-simplified centroid (ASMCV) and variance from depth maps as features, dynamic thresholds were derived through clustering analysis to filter CU sizes requiring evaluation, thereby reducing redundant computations. Subsequently, they proposed a joint optimization method combining clustering and edge detection [15], precisely identifying and reducing redundant computations by dynamically clustering depth map features with edge detection techniques. Su et al. [16] employed Gradient Boosting Machines (GBMs), a powerful ensemble learning algorithm, using Average Local Variance (ALV) as a feature to learn adaptive segmentation thresholds for CUs at different depths. Li et al. [17] transformed CU segmentation into a clustering problem via unsupervised learning, proposing an adjustable early decision scheme; Wang [18] applied IoT data to accelerate 3D-HEVC encoding by extracting CU segmentation-related information through data analysis to optimize decision efficiency. Deep learning-based approaches: Liu et al. [19] proposed a fast deep intra-frame coding method using a deep edge classification network, classifying depth map edge features via CNN to significantly reduce coding complexity; Omran et al. [20] introduced a fast segmentation algorithm based on multi-depth convolutional neural networks (MD-CNNs), training specialized depth map models integrated into the encoder and leveraging inter-view dependencies to lower complexity.
Optimization of intra prediction mode selection is also an important direction to improve 3D-HEVC coding efficiency. Hamout et al. [21] employed a self-organizing map (SOM) network combined with manually selected efficient features to construct a model for skipping redundant intra modes in depth maps; Zhang et al. [22] proposed a global context aggregation intra prediction network, optimizing intra prediction modes for depth video coding by aggregating global and local features; Zhang et al. [23] proposed an intra prediction scheme based on depth region segmentation, using deep networks to learn depth map region features to achieve refined mode classification; Lee et al. [24] proposed a fast depth intra mode decision method based on intra prediction cost and probability, adaptively skipping redundant HEVC prediction modes and DMM by utilizing prediction cost and probability characteristics, effectively reducing coding complexity; Song et al. [25] proposed a content-adaptive mode decision method, dynamically adjusting strategies according to video content characteristics to balance complexity and coding quality; Chen et al. [26] simplified redundant calculations of intra prediction modes and depth modeling modes by analyzing the total sum of squares and variance of prediction unit boundaries, and dynamically adjusting decision thresholds in combination with rate-distortion costs; Huo et al. [27] proposed a fast rate-distortion optimization method for depth maps, introducing a cumulative check mechanism in RD cost calculation, terminating the current mode calculation when the cumulative cost reaches the existing minimum value, and simplifying the selection process.
Although existing methods have achieved significant success in reducing the intra-frame encoding complexity of 3D-HEVC depth maps, there remains room for improvement in the direction of CU segmentation for depth maps: statistical decision-based approaches rely on manual thresholds and heuristic rules, failing to adaptively capture global-local relationships within CUs and struggling to handle complex content; machine learning-based approaches, while capable of data-driven learning of feature-partition relationships, are constrained by feature engineering quality and model capacity, making it difficult to capture the highly nonlinear structure of depth maps.
In recent years, deep learning has been widely applied in video coding, yet significant shortcomings persist in 3D-HEVC depth map coding. For instance, Liu et al. [19] employed single-scale feature extraction, while Omran et al. [20] utilized a multi-channel architecture but did not effectively fuse global and local features, resulting in poor partitioning accuracy in texture-complex regions. Furthermore, existing methods lack a quantization-parameter awareness mechanism, making it difficult to balance computational efficiency and coding performance across different resolutions. To address these issues, this paper proposes the Hierarchical Feature Fusion Convolutional Neural Network (HFF-CNN). Experimental results demonstrate that the proposed method achieves a 48.43% encoding time reduction across eight standard test sequences with a BDBR increase of only 0.35%. The core contributions of this paper are as follows:
  • Construction of external features: we design a wavelet energy ratio that quantifies the texture complexity of depth map CUs and combine it with the QP to serve jointly as external features. These are fused with the network-extracted features, enhancing the model's perception of texture structure and coding quality;
  • Proposal of the HFF-CNN: the network processes the input CU hierarchically and employs a multi-scale feature fusion strategy, combined with a channel attention mechanism (SE module) to adaptively enhance key features, which are then integrated with the external features to enable accurate partitioning predictions.
The structure of this paper is arranged as follows: Section 2 analyzes the coding structure of 3D-HEVC, focusing on the complexity of depth map coding and the CU partitioning process for depth maps to identify the encoding complexity issues and lay the foundation for subsequent research. Section 3 details the proposed algorithm, including the setting of external features, model design, and overall process, to construct a complete framework for fast intra coding of 3D-HEVC depth maps. Section 4 summarizes the experimental results. Finally, Section 5 provides a conclusion to this study and outlines directions for future research.

2. Observations and Analysis

2.1. Complexity Analysis of 3D-HEVC Depth Map Coding

As an extension of the HEVC framework, 3D-HEVC provides an effective solution for three-dimensional video coding. By eliminating redundancy between viewpoints, it significantly improves the compression efficiency of multi-view video and depth maps. Within the 3D-HEVC encoding framework, both multi-view texture and depth maps must be processed. Figure 1 illustrates its encoding structure (from Viewpoint 0 to Viewpoint N-1, where N is the total number of viewpoints). These images capture the scene at the same moment but from different spatial positions. Consequently, viewpoints exhibit both high correlation and variations in parallax and occlusion due to perspective differences. The encoding process advances through two data streams: red arrows convey cross-viewpoint information, leveraging viewpoint similarity to reduce redundancy, while black arrows handle intra-viewpoint data transfer between texture and depth maps. All encoded data are ultimately multiplexed into a 3D bitstream.
Depth maps play a critical role in 3D-HEVC encoding. They convey the key depth information of objects in the scene, which is essential for synthesizing high-quality 3D views. In applications such as virtual reality (VR) and augmented reality (AR), accurate depth map encoding serves as the foundation for providing realistic 3D visual experiences. For example, in a VR environment, depth map information can precisely describe the spatial relationships between objects, enabling users to perceive realistic spatial depth. Different from natural scene videos, the pixel values in depth maps directly represent the depth of objects, featuring unique value ranges and distribution patterns. The edge information in depth maps is crucial for determining the shape and position of objects, and any blurring or distortion of these edges will severely degrade the visual quality of synthesized views.
In terms of encoding time, depth map encoding is a significant time-consuming component. As shown in Figure 2, we statistically analyzed the encoding time distribution of four video sequences at two resolutions: Balloons [28] and Newspaper [29] at 1024 × 768, and Shark [30] and Poznan_Hall2 [31] at 1920 × 1088. The results show that the time required for depth map encoding far exceeds that of texture map encoding. The data indicate that depth map encoding accounts for an average of 78.42% of the total encoding time, while texture map encoding accounts for only 21.58%, with the former taking 3.6 times longer than the latter. This is because, in addition to the conventional intra prediction techniques used for texture map encoding, depth map encoding introduces new tools such as VSO, DMM, and SDC. These tools enhance encoding efficiency but significantly increase computational complexity. Therefore, how to effectively reduce the computational load of depth map encoding while maintaining encoding quality has become a critical issue that needs urgent resolution.

2.2. Complexity Analysis of CU Partitioning for 3D-HEVC Depth Maps

The high encoding complexity of H.265/HEVC [32] stems from its recursive quadtree-based CU partitioning mechanism. 3D-HEVC inherits this structure, where video frames are initially divided into 64 × 64 Coding Tree Units (CTUs) corresponding to depth level 0, as shown in Figure 3. In subsequent encoding stages, these CTUs are further subdivided according to the quadtree structure to form smaller blocks. As the partitioning depth increases, the granularity of CUs becomes finer. Lower-depth CUs typically represent large-scale uniform regions, while higher-depth CUs usually involve more complex detailed regions. Although this layered partitioning approach effectively improves encoding accuracy, it also leads to a substantial increase in computational load during the encoding process—especially in depth map encoding, where the complexity of CUs sharply increases with higher depth levels.
The quadtree partitioning of CTU adopts a hierarchical decision-making mechanism, which is divided into two stages: top-down RD-cost calculation and bottom-up partitioning decision. In the top-down stage, CUs at each level (from depth 0 to 3) are traversed in the order of increasing depth, and RD-cost is calculated layer by layer; in the bottom-up stage, optimal partitioning is determined through cross-layer comparison: if the RD-cost of a CU at depth n is higher than the sum of its four child partitions at the sub-layer (n = 0, 1, 2), partitioning is enforced; otherwise, the CU is retained. For a 64 × 64 CTU, the full traversal mode requires 85 CU partitioning validations, 11,935 SATD (Sum of Absolute Transformed Differences) cost calculations, and 2623 RD cost calculations to obtain the final partitioning result. This process balances distortion and bitrate cost through the RDO algorithm, where the RD-cost calculation formula is as follows:
$$RD_{cost} = SSE_{luma} + \omega_{chroma} \times SSE_{chroma} + \lambda_{mode} \times R_{mode}$$
Here, $\omega_{chroma}$ represents the chroma weighting parameter, while $\lambda_{mode}$ denotes the Lagrange multiplier. Specifically, the objective of RDO is to determine the optimal coding mode by evaluating the total cost of each possible partitioning. Although this method excels in compression efficiency, its computational complexity remains high, particularly in depth map encoding, where the recursive partitioning process significantly increases the computational burden.
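To make this two-stage decision concrete, the following is a minimal Python sketch of the bottom-up comparison described above. Here, rd_cost and split are placeholders standing in for the encoder's RDO evaluation of a single CU and the quadtree split into four sub-CUs; they are illustrative assumptions, not HTM functions.

```python
# Illustrative sketch of the recursive quadtree decision: a CU at depth n is
# split only when the summed RD cost of its four children is lower than its
# own RD cost. 'rd_cost(cu, depth)' and 'split(cu)' are placeholder callables.

MAX_DEPTH = 3  # 64x64 CTU at depth 0 down to 8x8 CUs at depth 3


def best_partition(cu, depth, rd_cost, split):
    """Return (cost, tree), where tree = (depth, is_split, children)."""
    own_cost = rd_cost(cu, depth)              # top-down: evaluate the whole CU
    if depth == MAX_DEPTH:
        return own_cost, (depth, False, [])    # smallest CUs are never split
    children = split(cu)                       # four (size/2 x size/2) sub-CUs
    results = [best_partition(c, depth + 1, rd_cost, split) for c in children]
    split_cost = sum(cost for cost, _ in results)
    if split_cost < own_cost:                  # bottom-up: keep the cheaper option
        return split_cost, (depth, True, [tree for _, tree in results])
    return own_cost, (depth, False, [])
```

Even in this simplified form, the recursion visits 1 + 4 + 16 + 64 = 85 CUs per CTU, matching the number of partitioning validations quoted above.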
To quantitatively analyze the depth distribution of CUs in depth maps, we used the reference software HTM16.3 [33] in All-Intra mode [34], encoded the JCT-3V standard test sequences [35] (including Balloons, Newspaper, Shark, and Poznan_Hall2) with multiple QP combinations (25–34, 30–39, 35–42, 40–45), and statistically analyzed the depth map CU partitioning results. Table 1 clearly shows the distribution of depth map CUs across different depths after encoding. Observation reveals that CUs at depths 0 and 1 dominate in most cases, with their proportion significantly increasing at high QP or in static scenes. Conversely, divisions at depths 2 and 3 concentrate in high-motion regions or complex edge scenes at low QP, exhibiting overall lower proportions. This result indicates that traditional exhaustive coding modes involve substantial redundant computations. By identifying the depth level of the current CU and skipping unnecessary CU size calculations, approximately two-thirds of the pattern search and RDO computations can be effectively avoided. This significantly reduces computational complexity, providing a practical optimization path for real-time 3D video coding systems.

3. The Proposed Algorithm

3.1. Construction of CU Complexity Features Based on Wavelet Energy Ratio and Quantization Parameter

Although convolutional neural networks can effectively extract features, their limited receptive field makes it challenging to comprehensively support complex CU partitioning predictions. To further improve the partitioning prediction accuracy of the HFF-CNN model, we introduce two key external features: a wavelet energy ratio that reflects the texture complexity of depth map CUs, and the quantization parameter, which reflects coding quality. These two features provide supplementary information from their respective dimensions; fused with the deep features extracted by the model, they jointly enhance the understanding of CU characteristics to achieve more accurate partitioning prediction. In the proposed model, these two features are added as external features and participate in training together.
In traditional image processing, the wavelet energy ratio is often used as an analytical tool for distinguishing between simple and detailed regions. In contrast, targeting the depth map CU partitioning task, this study designs a dedicated wavelet energy ratio calculation scheme and converts it into a quantitative feature that drives the partitioning decision. To capture the texture information of CU blocks from the macro to the micro level, we adopt the 2D Discrete Wavelet Transform (2D-DWT) to perform a two-level decomposition of each CU block, and select the Daubechies-2 wavelet basis from the Daubechies family (whose low-pass and high-pass filters have predefined fixed structures of length 4) to construct the filter bank, balancing computational complexity and detail extraction capability. Subsequently, for the sub-bands obtained after decomposition, we calculate the ratio of the total energy of the high-frequency sub-bands to the total energy of all sub-bands, which yields the wavelet energy ratio feature used to quantify the texture complexity of depth map CUs. Figure 4 illustrates the detailed two-level decomposition process, where $A \in \{LL_1, LH_1, HL_1, HH_1, LL_2, LH_2, HL_2, HH_2\}$ and $A(i, j)$ denotes the coefficient of sub-band $A$ at position $(i, j)$, generated by convolution with the selected filters followed by downsampling and used for the subsequent energy calculation.
First, conduct the first-level decomposition: For an input CU block of size 64 × 64, first apply the low-pass filter $h_{low}$ and the high-pass filter $h_{high}$ to each row of pixels in the horizontal direction. As the filters slide along each row, they are multiplied with the corresponding pixels and the results are summed, generating low-frequency (smooth) and high-frequency (detail) information in the row direction. Then, through sub-sampling at every other point, the column dimension is halved from 64 to 32, yielding two intermediate matrices of size 64 × 32: $H_{low}$ (horizontal low-frequency) and $H_{high}$ (horizontal high-frequency). Immediately afterwards, repeat the same filtering and sub-sampling operations on each column of these two intermediate matrices in the vertical direction. The row dimension is halved to 32, and finally four sub-bands of size 32 × 32 are generated: the low-frequency sub-band $LL_1$ (vertical low-pass + horizontal low-pass), $LH_1$, which captures horizontal details (vertical low-pass + horizontal high-pass), $HL_1$, which captures vertical details (vertical high-pass + horizontal low-pass), and $HH_1$, which captures diagonal details (vertical high-pass + horizontal high-pass).
Subsequently, perform the second-level decomposition on the low-frequency sub-band $LL_1$ (32 × 32) obtained in the first level. The process is the same as that of the first level: first, perform horizontal filtering and downsampling to reach 32 × 16, then perform vertical filtering and downsampling to reach 16 × 16, resulting in the second-level low-frequency sub-band $LL_2$ and three high-frequency sub-bands $LH_2$, $HL_2$, and $HH_2$. The energy of each sub-band is calculated as the sum of squares of all coefficient values $A(i, j)$ within the sub-band, with the following formula:
$$E_A = \sum_{i=1}^{W_A} \sum_{j=1}^{H_A} A(i, j)^2$$
Here, $W_A$ and $H_A$ represent the width and height of sub-band $A$, respectively. The total energy of the high-frequency sub-bands ($E_{high}$) and the overall energy across all sub-bands ($E_{total}$) are calculated as follows:
$$E_{high} = \sum_{k \in \{LH_1, HL_1, HH_1, LH_2, HL_2, HH_2\}} E_k$$
$$E_{total} = E_{LL_1} + E_{LL_2} + E_{high}$$
To quantify the texture complexity of a CU, the ratio of the total energy of the high-frequency sub-bands to the total energy of all sub-bands is calculated to obtain the wavelet energy ratio $R_{wavelet}$, defined as follows:
$$R_{wavelet} = \frac{E_{high}}{E_{total}}$$
After this feature is introduced into the proposed HFF-CNN model, it effectively enhances the model's partitioning sensitivity in complex texture regions and reduces the risk of over-partitioning in flat regions.
Another external feature is the quantization parameter, which serves as a crucial metric for assessing encoding quality, directly influencing the compression rate and partitioning strategy of CU blocks. Normalization not only mitigates biases caused by different QP scales but also enhances the model’s sensitivity to the impact of encoding quality. In this study, the QP value is normalized to the range [0, 1] using the following formula:
$$QP_{norm} = \frac{QP - QP_{min}}{QP_{max} - QP_{min}}$$
Here, $QP \in [0, 51] \cap \mathbb{Z}$, and $QP_{min}$ and $QP_{max}$ represent the minimum and maximum QP values within the dataset, respectively. Through normalization, the values are standardized to a uniform scale, enabling the model to more reliably learn the impact of the quantization parameter on partitioning decisions. This process provides comprehensive data support for partitioning prediction.
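As a concrete illustration, the following Python sketch computes both external features for a 64 × 64 depth-map CU stored as a NumPy array, using the PyWavelets library. The 'periodization' boundary mode is an assumption chosen so that each level exactly halves both dimensions (64 → 32 → 16), matching Figure 4, and the QP bounds in normalize_qp reflect the depth QPs used in this paper's dataset rather than a general setting.

```python
import numpy as np
import pywt


def wavelet_energy_ratio(cu: np.ndarray) -> float:
    """Two-level db2 decomposition of a CU and the high-frequency energy ratio."""
    # First level: LL1 plus three high-frequency sub-bands (labels follow the paper).
    ll1, (lh1, hl1, hh1) = pywt.dwt2(cu, 'db2', mode='periodization')
    # Second level: decompose LL1 only.
    ll2, (lh2, hl2, hh2) = pywt.dwt2(ll1, 'db2', mode='periodization')

    def energy(band: np.ndarray) -> float:
        # Sum of squared coefficients of one sub-band.
        return float(np.sum(band.astype(np.float64) ** 2))

    e_high = sum(energy(b) for b in (lh1, hl1, hh1, lh2, hl2, hh2))
    e_total = energy(ll1) + energy(ll2) + e_high
    return e_high / e_total if e_total > 0 else 0.0


def normalize_qp(qp: int, qp_min: int = 34, qp_max: int = 45) -> float:
    """Min-max normalization of the depth-map QP to [0, 1]."""
    return (qp - qp_min) / (qp_max - qp_min)
```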

3.2. CU Partitioning Prediction Based on HFF-CNN

To efficiently predict the partitioning of CUs in depth maps, this model incorporates a framework that integrates multi-scale feature extraction, squeeze and excitation (SE) module enhancement, and external feature fusion, as illustrated in Figure 5. The model takes a 64 × 64 CU block as input, processes it through multiple channels and feature fusion operations, and ultimately outputs the partitioning probabilities for both the CU and its sub-blocks.
First, the frames to be encoded of the depth video are cropped into 64 × 64 CTUs and input into the network after preprocessing. Among the three input channels, one remains at the original 64 × 64 size, while the other two are reduced to 32 × 32 and 16 × 16 respectively through average pooling. This multi-scale design aims to better adapt to the hierarchical structure of CU partitioning, enabling the model to extract features of coding units hierarchically from global to local levels.
Each scale of feature maps undergoes three convolutional layers, gradually extracting high-level features and compressing the feature map size to focus on important regions. The first layer uses 4 × 4 convolutional kernels with a stride of 4, outputting 16 feature maps, and the size of the feature maps is reduced to one-fourth of the input size; the next two layers use 2 × 2 convolutional kernels with a stride of 2, outputting 24 and 32 feature maps respectively. Each channel focuses on features of different levels: the 64 × 64 feature map extracts global low-level information, the 32 × 32 feature map focuses on medium-scale local information, and the 16 × 16 feature map concentrates on high-level abstract structures. A ReLU activation function is applied after each convolutional layer to enhance nonlinear expression capability, and its mathematical expression is as follows:
$$R_m(CU_n) = \begin{cases} CU_n, & m = 0 \\ \max\left(0,\; W_m * R_{m-1}(CU_n) + B_m\right), & 1 \le m \le M \end{cases}$$
Here, $CU_n$ represents the n-th input CU block, while $W_m$ and $B_m$ denote the convolutional weights and bias term of the m-th layer, respectively.
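For reference, the following is a Keras sketch of one such convolutional branch, with the layer sizes inferred from the description above; 'valid' padding is an assumption, and the same branch structure would be applied to the 64 × 64, 32 × 32, and 16 × 16 inputs.

```python
from tensorflow.keras import layers


def conv_branch(inputs):
    # Three convolutions producing 16, 24 and 32 feature maps, ReLU after each.
    x = layers.Conv2D(16, kernel_size=4, strides=4, activation='relu')(inputs)
    x = layers.Conv2D(24, kernel_size=2, strides=2, activation='relu')(x)
    x = layers.Conv2D(32, kernel_size=2, strides=2, activation='relu')(x)
    return x


# With these strides, the 64x64 branch ends at 4x4x32, the 32x32 branch at
# 2x2x32, and the 16x16 branch at 1x1x32 before the features are merged.
branch_64 = conv_branch(layers.Input(shape=(64, 64, 1)))
```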
The joint feature maps generated by the convolutional layers are fed into the SE module for feature enhancement, where Squeeze and Excitation operations are performed sequentially. During the Squeeze operation, global average pooling is applied to the output feature map from the convolutional layer (with dimensions H × W × C , where H and W represent the height and width of the feature map, respectively, and C denotes the number of channels). Specifically, global average pooling computes the spatial average over H and W, resulting in a 1 × 1 × C vector. This operation effectively compresses spatial information and extracts global features for each channel, providing essential global context for the subsequent Excitation operation. The calculation formula is as follows:
$$Z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$$
Here, $Z_c$ represents the globally aggregated feature of channel c, and $x_c(i, j)$ (i = 1, 2, …, H; j = 1, 2, …, W; c = 1, 2, …, C) denotes the pixel values of the input feature map. After obtaining the results of the Squeeze operation, the Excitation operation is performed. This process involves two fully connected (FC) layers. The first FC layer compresses the 1 × 1 × C vector to a smaller dimension (with a reduction ratio of 4), and the compressed vector is passed through a ReLU activation function to introduce nonlinearity, thereby enhancing the network's representational capability. Subsequently, the second FC layer restores the compressed vector to its original dimension of 1 × 1 × C. Finally, a Sigmoid activation function is applied to constrain the output values to the range of 0 to 1, generating the channel-wise attention weights $s_c$:
$$s_c = \sigma\left(W_2\, \delta\left(W_1 z_c\right)\right)$$
Here, $\delta(x) = \max(0, x)$ represents the ReLU activation function, while $\sigma(x) = \frac{1}{1 + e^{-x}}$ denotes the Sigmoid function. Next, the channel-wise attention weights are multiplied element-wise with the original feature map to achieve feature reweighting:
$$\tilde{x}_c(i, j) = s_c \cdot x_c(i, j)$$
Here, $\tilde{x}_c(i, j)$ represents the weighted pixel value. This process adaptively adjusts the weights of different channels to enhance important features while suppressing redundant ones, thereby optimizing the feature representation. Through this weighting mechanism, the model can focus more precisely on critical features, ultimately improving overall performance.
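The squeeze, excitation, and reweighting steps map directly onto a few standard layers; the following is a minimal Keras sketch of the SE block with reduction ratio 4, written as an assumed implementation rather than the authors' exact code.

```python
from tensorflow.keras import layers


def se_block(feature_map, reduction: int = 4):
    channels = feature_map.shape[-1]
    # Squeeze: spatial average over H and W gives one value per channel.
    z = layers.GlobalAveragePooling2D()(feature_map)
    # Excitation: bottleneck FC (ReLU) then expansion FC (sigmoid) -> weights in (0, 1).
    s = layers.Dense(channels // reduction, activation='relu')(z)
    s = layers.Dense(channels, activation='sigmoid')(s)
    s = layers.Reshape((1, 1, channels))(s)
    # Reweight: scale each channel of the original feature map by its weight.
    return layers.Multiply()([feature_map, s])
```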
The features weighted by the SE module are fed into the fully connected (FC) layers. Each FC layer progressively extracts features at varying hierarchical levels. FC1-1 and FC1-2, with fewer neurons, extract the most critical information to generate Output 1. This output corresponds to the probability value of whether the 64 × 64 CU block input to the model will be further partitioned. FC2-1 and FC2-2, containing more neurons, extract deeper features to produce Output 2. Output 2 corresponds to the respective partition probabilities of the four 32 × 32 sub-CUs generated if the parent 64 × 64 CU is partitioned. FC3-1 and FC3-2, possessing the highest neuron count, process the finest-grained features for Output 3. The values of Output 3 respectively denote the partitioning probabilities of the resulting 16 × 16 sub-CUs when the corresponding 32 × 32 sub-CU is partitioned. At the feature extraction stage of each hierarchy (i.e., after the operations of FC1-1, FC2-1, and FC3-1), the wavelet energy ratio and the normalized QP value are combined with the output of the first fully connected layer of the hierarchy to additionally provide the wavelet energy ratio reflecting the texture complexity and the normalized QP information reflecting the coding quality requirement for the partitioning decision of the corresponding hierarchy. In the subsequent fully connected operations, the fused feature vector is further abstracted and refined through the weighted calculation of neurons, so that the model can make more reasonable partitioning decisions. Finally, each output layer outputs the corresponding partitioning probability value through the sigmoid activation function. The partitioning decision process is recursive. By comparing with the threshold of 0.5, if the output probability value is greater than 0.5, the corresponding CU continues to be partitioned to the next depth size; if it does not exceed 0.5, the partitioning of the corresponding CU stops. According to the outputs of the three fully connected layers, the complete partitioning decision for the input 64 × 64 CU can be realized.
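The recursive thresholding step can be summarized in a few lines; in the sketch below, p64 is the scalar from Output 1, p32 holds the four probabilities from Output 2, and p16 holds the sixteen probabilities from Output 3 (four per 32 × 32 sub-CU). The variable names and return structure are illustrative assumptions.

```python
def decide_partition(p64: float, p32, p16, threshold: float = 0.5):
    """Return False (keep the 64x64 CU) or a list of four per-sub-CU decisions."""
    if p64 <= threshold:                   # keep the whole 64x64 CU
        return False
    decisions = []
    for i in range(4):                     # each 32x32 sub-CU
        if p32[i] <= threshold:
            decisions.append(False)        # keep this 32x32 sub-CU
        else:
            # Split to 16x16; True means the 16x16 sub-CU is split further to 8x8.
            decisions.append([p16[4 * i + j] > threshold for j in range(4)])
    return decisions
```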
Since the binary cross-entropy (BCE) loss function is commonly used in binary classification tasks and the proposed model in this study is essentially a binary classification problem, the binary cross-entropy loss function is employed during model training. This function evaluates each predicted partition probability against its corresponding ground truth label. It is defined as follows:
$$L = -\frac{1}{N} \sum_{n=1}^{N} \left[\, l_n \ln(Y_n) + (1 - l_n) \ln(1 - Y_n) \,\right]$$
Here, $l_n$ represents the ground-truth partition label for the n-th sample, and $Y_n$ denotes the predicted probability generated by the model. The loss function effectively optimizes the classification performance of the model.

3.3. Overall Algorithm Flow

In practical application scenarios, this study applies the proposed HFF-CNN to the HTM encoder to replace the original depth map CU partitioning method, with the overall encoding process shown in Figure 6. During the depth map CU partitioning stage, the system invokes the trained HFF-CNN model to predict the partitioning mode of the CU. Leveraging the collaborative effects of multi-scale feature extraction, channel attention mechanism, and external auxiliary features (wavelet energy ratio and normalized quantization parameter), the HFF-CNN model accurately analyzes the characteristics of the current CU and outputs the partitioning probabilities for corresponding hierarchical levels. After obtaining the prediction results, a pre-set threshold is used to determine whether to continue partitioning the CU. Once CU partitioning is completed, subsequent encoding operations proceed strictly according to the established rules of the HTM encoder until the final bitstream is generated, completing the entire encoding process and achieving efficient encoding of 3D-HEVC depth maps.

4. Experimental Results and Analyses

4.1. Dataset Construction and Experimental Setup

We selected six sequences from the JCT-3V standard test sequences [35] (1024 × 768 resolution: Outdoor, Lovebird, BookArrival; 1920 × 1088 resolution: Dancer, Cafe, CarPark), whose content covers various types including static backgrounds, indoor and outdoor scenes, fine details, and complex motions. Based on the reference software HTM16.3 and its built-in configuration file encoder_intra_main.cfg, we encoded the depth maps under different QP values (34, 39, 42, 45) to extract the corresponding 64 × 64 CU partitioning results, which will be used as ground truth labels to construct the dataset. To avoid overlap between training and testing data, training and testing frames were completely separated by at least 50 frames.
The experiments were conducted on a computer equipped with an AMD Ryzen 9 8900X3D CPU (Advanced Micro Devices, Inc. (AMD), Santa Clara, CA, USA), 32 GB RAM, and the Windows 11 64-bit operating system. Both the training of the proposed HFF-CNN model and the encoding tests for the proposed method and the reference software (HTM16.3) were performed in this environment. The CPU affects only the absolute encoding time; the subsequent experiments report relative metrics that are unaffected by the CPU model or performance and remain stable across hardware variations, which improves the generalizability of the conclusions. The CNN is implemented with the TensorFlow library, and training is accelerated with a GPU (used exclusively during the model training phase). Table 2 summarizes the hardware and software environment used in the experiments. For model training, we set a series of hyperparameters to ensure the efficiency and stability of the depth map CU partitioning task: the batch size was set to 64, and the initial learning rate was set to 0.005, decayed by a factor of 5 every 50 epochs to guide the model toward smoother convergence. Training was run for at most 120 epochs with an early stopping mechanism that terminates training if the validation loss does not improve for 15 consecutive epochs. The Adam optimizer was selected to suit the multi-label classification task, and the binary cross-entropy loss function was used, consistent with the design of the model's output layers, to ensure the rationality of the predicted probabilities. Regarding the training cost of the HFF-CNN model, in the described environment, training on a dataset containing four QP values and the corresponding depth map CTU partitions takes approximately 12 h. This one-time training investment is entirely acceptable given the significant acceleration the model provides during encoding.
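Expressed as code, the stated hyperparameters correspond to a Keras training setup along the following lines; the model and data arrays are passed in as arguments, and this is an assumed reconstruction of the configuration rather than the authors' released script.

```python
import tensorflow as tf


def lr_schedule(epoch, lr, initial=0.005, factor=5.0, step=50):
    # Piecewise-constant decay: divide the initial rate by 5 every 50 epochs.
    return initial / (factor ** (epoch // step))


def train_hff_cnn(model, x_train, y_train, x_val, y_val):
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
                  loss='binary_crossentropy')
    return model.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        batch_size=64,
        epochs=120,
        callbacks=[
            tf.keras.callbacks.LearningRateScheduler(lr_schedule),
            tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=15),
        ],
    )
```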

4.2. Analysis of Experimental Results

To evaluate the effectiveness of the proposed method, we conducted experiments on the 3D-HEVC reference software HTM16.3. The trained model was integrated into HTM16.3, and tests were performed using eight standard test sequences officially provided by JCT-3V [35], with resolutions covering 1024 × 768 and 1920 × 1088. Table 3 provides further details on the test video sequences. Since this study primarily focuses on reducing intra-frame encoding complexity, experiments employed the All Intra [34] mode. The configuration file used was baseCfg_3view+depth_AllIntra.cfg (included in the HTM16.3 software package), with four sets of QP values applied for texture and depth encoding: (25–34, 30–39, 35–42, 40–45). The encoding time savings (TS) represents the reduction in intra encoding time compared to the reference software HTM16.3, serving as a metric for the extent of encoding complexity reduction. TS is calculated as follows:
$$TS = \frac{1}{4} \sum_{i=1}^{4} \frac{T_{HTM}(QP_i) - T_{pro}(QP_i)}{T_{HTM}(QP_i)}$$
$T_{pro}$ represents the encoding time of the proposed algorithm, and $T_{HTM}$ refers to the encoding time of the reference model HTM16.3. Additionally, BDBR (Bjøntegaard Delta Bit Rate) [36] is used as a standard metric to evaluate the coding efficiency gains achieved by different methods while maintaining a consistent target quality.
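As a worked illustration of this metric, the small sketch below averages the relative time reduction over the four QP pairs; the timing values are hypothetical placeholders, not measured results.

```python
def time_saving(t_htm: dict, t_pro: dict) -> float:
    """TS (%) averaged over the QP pairs shared by both timing dictionaries."""
    ratios = [(t_htm[qp] - t_pro[qp]) / t_htm[qp] for qp in t_htm]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical encoding times (seconds) for the four QP pairs of one sequence.
t_htm = {'25-34': 4200.0, '30-39': 3600.0, '35-42': 3100.0, '40-45': 2700.0}
t_pro = {'25-34': 2200.0, '30-39': 1850.0, '35-42': 1600.0, '40-45': 1400.0}
print(f"TS = {time_saving(t_htm, t_pro):.2f}%")
```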
Table 4 presents the full experimental results of the proposed method. Verified through ablation experiments on eight standard test sequences, the model without the proposed wavelet energy ratio feature achieved an average coding time saving of 46.13% compared with HTM16.3. The complete HFF-CNN framework incorporating this feature achieved better overall performance: it not only realized an average coding time saving of 48.43%, but also incurred an extremely low coding performance loss (only a 0.35% increase in BDBR and a 0.02 dB decrease in BD-PSNR). The experimental results validate the effectiveness of the proposed method from two aspects. On the one hand, the 2.30 percentage point improvement in time saving directly demonstrates the contribution of the designed wavelet energy ratio feature: computed with low-cost linear convolution operations, it significantly improves the model's accuracy in judging the partitioning of low-texture regions and avoids the growth of RDO computation caused by excessive CU partitioning due to misjudgment. On the other hand, the final results show that the proposed HFF-CNN achieves an optimized balance between coding efficiency and quality through hierarchical feature fusion, the channel attention mechanism, and the integration of external features.
We conducted a comprehensive evaluation of the computational complexity and actual inference overhead of the HFF-CNN model: Using the ptflops tool, we measured that processing a single 64 × 64 CU requires only 1.15 MFLOPs (including external feature extraction and the entire network forwarding process). In actual encoding, the average encoding duration for the eight sequences is 3114.1 s, while the average inference time of the model is 10.9 s, accounting for only 0.35% of the total encoding time. The lightweight design of the model ensures that the inference overhead it introduces is significantly offset by the 48.43% encoding acceleration. With negligible impact on coding quality, this fully demonstrates the deployment feasibility and practical value of the proposed method in a pure CPU environment.

4.3. Comparison with Other Algorithms

Table 5 presents the comprehensive comparison results of intra coding optimization methods for 3D-HEVC. Different research teams have developed unique acceleration strategies via differentiated designs for coding unit decision mechanisms and prediction mode selection. The technical evolution path of the Hamout team is particularly clear: the early proposed optimization strategy [14] achieved a 40.2% time acceleration with a 0.21% BDBR increment, laying the foundation for intra coding efficiency optimization; the subsequent improved method based on clustering algorithm and edge detection [15] further balanced performance, increasing the time acceleration rate to 45.8%. Although the BDBR increment slightly increased to 0.32%, the overall performance is better, reflecting the progress of the team’s technical iteration. In addition, the method proposed by Lee et al. [24] based on intra prediction cost and probability achieved a 39.26% time saving with a 0.68% BDBR increment, verifying the feasibility of prediction cost-driven optimization. The method proposed in this paper ranks the best in the TS metric with 48.43%. Although the BDBR increment (0.35%) is slightly higher than that of [14] (0.21%), the difference is only 0.14 percentage points, which is completely within the acceptable range of coding efficiency loss in practical applications; in terms of time efficiency, it has achieved significant improvement compared with existing comparison methods. This characteristic of significantly improving computational efficiency within acceptable efficiency loss makes it more suitable for scenarios with strict timeliness requirements such as real-time 3D video transmission.
A deeper analysis of the impact of resolution on coding time savings reveals significant differences among the methods, with the advantages of the proposed method being particularly prominent. As shown in Figure 7, on the first three sequences at 1024 × 768 resolution, the TS of the early method [14] by the Hamout team is only 33.47%, the subsequently proposed method [15] increases it to 41.03%, and that of Lee et al. [24] is 38.83%, all lower than the 47.66% of the proposed method. On the latter five sequences at 1920 × 1088 resolution, the TS of [14] increases to 44.30% and [15] further reaches 50.70%, but the cross-resolution fluctuation of [15] is as high as 9.67 percentage points, indicating poor stability. In contrast, the proposed method achieves a TS of 48.89% at the higher resolution with a cross-resolution fluctuation of only 1.23 percentage points. It not only maintains efficient acceleration but also achieves excellent stability, fully verifying the algorithm's adaptability to video features at different spatial scales and providing more reliable technical support for the flexible application of 3D-HEVC in multi-resolution scenarios.

4.4. Comparison of Synthesized Views

To comprehensively evaluate our method, we selected two representative sequences each from the 1024 × 768 and 1920 × 1088 resolutions. In Balloons, the camera moves uniformly in a parallel direction, the foreground characters are highly dynamic, and the depth map contains both complex boundaries and locally flat regions; in Newspaper, the camera is fixed, the foreground objects move slightly against a completely static background, and the depth map shows fine boundary details and some smooth regions; in Shark, the camera follows the moving objects, the foreground objects undergo diverse motion transformations, and the depth map contains a small amount of regular boundary information and many flat regions. GT_Fly contains a large number of buildings; as the camera moves slowly forward, the background buildings change from far to near, and its depth map contains a large amount of complex boundary information and concentrated flat regions. According to the definition of virtual viewpoints in 3D-HEVC, the position between adjacent viewpoints is parameterized as a continuous value in [0.0, 1.0]; in this experiment, the geometric center position n = 0.5 (the midpoint between View0 and View1) is selected as the core observation point. In addition, considering that the 35–42 QP combination used in mainstream streaming applications lies close to the point at which subjective distortion becomes noticeable, this QP group is used for testing. As shown in Table 6, a side-by-side comparison of the original reference frames, the HTM16.3 synthesized frames, and the synthesized frames of the proposed method reveals no perceptible degradation in sensitive regions such as the contours of the moving characters in Balloons and the text edges in Newspaper.
To quantitatively verify the above conclusions, Table 7 presents the objective perceptual quality evaluation results of the aforementioned sequences under the same conditions. The average SSIM and VMAF values of the synthesized videos generated by the proposed method are 0.947 and 92.4, respectively, which are extremely close to the synthesis results of HTM16.3. These objective measurement results are consistent with the minimal coding loss indicated by the only 0.35% BDBR increase, indicating that the proposed method effectively maintains the visual fidelity of synthesized views while achieving coding acceleration.

5. Conclusions

This paper addresses the high computational complexity of 3D-HEVC depth map intra-coding by proposing a fast CU partitioning framework based on a Hierarchical Feature Fusion Convolutional Neural Network (HFF-CNN). The core contribution lies in the collaborative design of multi-scale feature extraction and a channel attention mechanism (SE module), effectively capturing both global structural features and local details of coding units. By fusing the designed wavelet energy ratio and normalized quantization parameters, the model significantly enhances the robustness of CU partitioning prediction. Experimental results demonstrate that the proposed method reduces encoding time by an average of 48.43% across multiple resolution scenarios while maintaining a negligible BDBR increase of 0.35%, achieving an efficient balance between computational efficiency and coding quality. Additionally, the model’s cross-resolution adaptability validates its generalization capabilities under diverse video contents and encoding parameters, offering a versatile technical foundation for practical deployment of 3D video coding systems.
Future work will focus on enhancing the efficiency and scene adaptability of 3D-HEVC depth map coding: On one hand, we will explore joint optimization strategies with intra-frame prediction mode selection, reducing redundant computations during encoding through feature sharing and decision coordination. On the other hand, we will deepen lightweight model design by integrating dynamic network compression techniques to lower computational resource demands without compromising encoding quality. Furthermore, we will expand experimental validation by adapting the proposed method to Random Access (RA) and Low Delay (LD) encoding configurations. By fine-tuning external features to accommodate diverse scenario requirements, we will validate the method’s adaptability across multiple applications, providing a more comprehensive technical pathway for widespread real-time deployment of 3D-HEVC.

Author Contributions

Conceptualization, F.L. and H.Z.; methodology, F.L.; software, Q.Z.; validation, H.Z. and Q.Z.; formal analysis, F.L.; investigation, F.L.; resources, Q.Z.; data curation, H.Z.; writing—original draft preparation, H.Z. and Q.Z.; writing—review and editing, F.L. and Q.Z.; visualization, F.L.; supervision, Q.Z.; project administration, Q.Z.; funding acquisition, Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (Nos. 61771432 and 61302118), the Key Projects of the Natural Science Foundation of Henan (232300421150), the Zhongyuan Science and Technology Innovation Leadership Program (244200510026), and the Scientific and Technological Project of Henan Province (232102211014).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Artois, J.; Van Wallendael, G.; Lambert, P. 360DIV: 360° video plus depth for fully immersive VR experiences. In Proceedings of the 2023 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 6–8 January 2023; pp. 1–2.
  2. Li, T.; Yu, L.; Wang, H.; Kuang, Z. A bit allocation method based on inter-view dependency and spatio-temporal correlation for multi-view texture video coding. IEEE Trans. Broadcast. 2020, 67, 159–173.
  3. Cai, Y.; Wang, R.; Gu, S.; Zhang, J.; Gao, W. An adaptive pyramid single-view depth lookup table coding method. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1940–1944.
  4. Mallik, B.; Sheikh-Akbari, A.; Bagheri Zadeh, P.; Al-Majeed, S. HEVC based frame interleaved coding technique for stereo and multi-view videos. Information 2022, 13, 554.
  5. Li, T.; Yu, L.; Wang, H.; Kuang, Z. An efficient rate–distortion optimization method for dependent view in MV-HEVC based on inter-view dependency. Signal Process. Image Commun. 2021, 94, 116166.
  6. Hamout, H.; Elyousfi, A. Fast depth map intra-mode selection for 3D-HEVC intra-coding. Signal Image Video Process. 2020, 14, 1301–1308.
  7. Khan, S.N.; Khan, K.; Muhammad, N.; Mahmood, Z. Efficient prediction mode decisions for low complexity MV-HEVC. IEEE Access 2021, 9, 150234–150251.
  8. Jeon, G.; Lee, Y.; Lee, J.-K.; Kim, Y.-H.; Kang, J.-W. Robust spatial-temporal motion coherent priors for multi-view video coding artifact reduction. IEEE Access 2023, 11, 123104–123116.
  9. Du, G.; Cao, Y.; Li, Z.; Zhang, D.; Wang, L.; Song, Y.; Ouyang, Y. A low-latency DMM-1 encoder for 3D-HEVC. J. Real-Time Image Process. 2020, 17, 691–702.
  10. Zhang, H.; Yao, W.; Huang, H.; Wu, Y.; Dai, G. Adaptive coding unit size convolutional neural network for fast 3D-HEVC depth map intracoding. J. Electron. Imaging 2021, 30, 041405.
  11. Zou, D.; Dai, P.; Zhang, Q. Fast depth map coding based on Bayesian decision theorem for 3D-HEVC. IEEE Access 2022, 10, 51120–51127.
  12. Wang, X. A fast 3D-HEVC video encoding algorithm based on Bayesian decision. In Proceedings of the International Conference on Algorithms, High Performance Computing, and Artificial Intelligence (AHPCAI 2023), Yinchuan, China, 18–19 August 2023; Volume 12941, pp. 937–942.
  13. Omran, N.; Kachouri, R.; Maraoui, A.; Werda, I.; Belgacem, H. 3D-HEVC fast CCL intra partitioning algorithm for low bitrate applications. In Proceedings of the 2025 IEEE 22nd International Multi-Conference on Systems, Signals and Devices (SSD), Sousse, Tunisia, 17–20 February 2025; pp. 749–755.
  14. Hamout, H.; Elyousfi, A. A computation complexity reduction of the size decision algorithm in 3D-HEVC depth map intracoding. Adv. Multimed. 2022, 2022, 3507201.
  15. Hamout, H.; Elyousfi, A. Low 3D-HEVC depth map intra modes selection complexity based on clustering algorithm and an efficient edge detection. In Proceedings of the International Conference on Artificial Intelligence and Green Computing, Hammamet, Tunisia, 15 January 2023; pp. 3–15.
  16. Su, X.; Liu, Y.; Zhang, Q. Fast depth map coding algorithm for 3D-HEVC based on gradient boosting machine. Electronics 2024, 13, 2586.
  17. Li, Y.; Yang, G.; Qu, A.; Zhu, Y. Tunable early CU size decision for depth map intra coding in 3D-HEVC using unsupervised learning. Digit. Signal Process. 2022, 123, 103448.
  18. Wang, X. Application of 3D-HEVC fast coding by Internet of Things data in intelligent decision. J. Supercomput. 2022, 78, 7489–7508.
  19. Liu, C.; Jia, K.; Liu, P. Fast depth intra coding based on depth edge classification network in 3D-HEVC. IEEE Trans. Broadcast. 2021, 68, 97–109.
  20. Omran, N.; Werda, I.; Maraoui, A.; Kachouri, R.; Belgacem, H. 3D-HEVC fast partitioning algorithm based on MD-CNN. In Proceedings of the 2024 IEEE International Conference on Artificial Intelligence & Green Energy (ICAIGE), Sousse, Tunisia, 10–12 October 2024; pp. 1–6.
  21. Hamout, H.; Hammani, A.; Elyousfi, A. Fast 3D-HEVC intra-prediction for depth map based on a self-organizing map and efficient features. Signal Image Video Process. 2024, 18, 2289–2296.
  22. Zhang, J.; Hou, Y.; Peng, B.; Pan, Z.; Li, G. Global-context aggregated intra prediction network for depth video coding. IEEE Trans. Circuits Syst. II Exp. Briefs 2023, 70, 3159–3163.
  23. Zhang, J.; Hou, Y.; Zhang, Z.; Jin, D.; Zhang, P.; Li, G. Deep region segmentation-based intra prediction for depth video coding. Multimed. Tools Appl. 2022, 81, 35953–35964.
  24. Lee, J.Y.; Park, S. Fast depth intra mode decision using intra prediction cost and probability in 3D-HEVC. Multimed. Tools Appl. 2024, 83, 80411–80424.
  25. Song, W.; Dai, P.; Zhang, Q. Content-adaptive mode decision for low complexity 3D-HEVC. Multimed. Tools Appl. 2023, 82, 26435–26450.
  26. Chen, M.J.; Lin, J.R.; Hsu, Y.C.; Ciou, Y.S.; Yeh, C.H.; Lin, M.H.; Kau, L.J.; Chang, C.Y. Fast 3D-HEVC depth intra coding based on boundary continuity. IEEE Access 2021, 9, 79588–79599.
  27. Huo, J.; Zhou, X.; Yuan, H.; Wan, S.; Yang, F. Fast rate-distortion optimization for depth maps in 3-D video coding. IEEE Trans. Broadcast. 2022, 69, 21–32.
  28. Lin, J.R.; Chen, M.J.; Yeh, C.H.; Chen, Y.C.; Kau, L.J.; Chang, C.Y.; Lin, M.H. Visual perception based algorithm for fast depth intra coding of 3D-HEVC. IEEE Trans. Multimed. 2021, 24, 1707–1720.
  29. Pan, Z.; Yi, X.; Chen, L. Motion and disparity vectors early determination for texture video in 3D-HEVC. Multimed. Tools Appl. 2020, 79, 4297–4314.
  30. Lin, J.R.; Chen, M.J.; Yeh, C.H.; Lin, S.D.; Sue, K.L.; Kau, L.J.; Ciou, Y.S. Vision-oriented algorithm for fast decision in 3D video coding. IET Image Process. 2022, 16, 2263–2281.
  31. Yao, W.; Wang, X.; Yang, D.; Li, W. A fast wedgelet partitioning for depth map prediction in 3D-HEVC. In Proceedings of the Twelfth International Conference on Graphics and Image Processing (ICGIP 2020), Xi’an, China, 13–15 November 2020; Volume 11720, pp. 257–266.
  32. Wen, W.; Tu, R.; Zhang, Y.; Fang, Y.; Yang, Y. A multi-level approach with visual information for encrypted H.265/HEVC videos. Multimed. Syst. 2023, 29, 1073–1087.
  33. Bakkouri, S.; Elyousfi, A. Early termination of CU partition based on boosting neural network for 3D-HEVC inter-coding. IEEE Access 2022, 10, 13870–13883.
  34. Hamout, H.; Elyousfi, A. Fast 3D-HEVC PU size decision algorithm for depth map intra-video coding. J. Real-Time Image Process. 2020, 17, 1285–1299.
  35. Bakkouri, S.; Elyousfi, A. Machine learning-based fast CU size decision algorithm for 3D-HEVC inter-coding. J. Real-Time Image Process. 2021, 18, 983–995.
  36. Chen, J.; Wang, B.; Liao, J.; Cai, C. Fast 3D-HEVC inter mode decision algorithm based on the texture correlation of viewpoints. Multimed. Tools Appl. 2019, 78, 29291–29305.
Figure 1. 3D-HEVC encoding structure: red and black arrows for cross-view and intra-view data flows.
Figure 2. Encoding time distribution of depth maps and texture maps.
Figure 3. Schematic diagram of hierarchical partitioning of CTU.
Figure 4. Two-level decomposition and sub-band generation based on 2D-DWT.
Figure 5. The proposed HFF-CNN model.
Figure 6. Flowchart of the proposed method.
Figure 7. Comparison of encoding time savings at different resolutions. Methods: Lee (2024) [24], Hamout (2022) [14], Hamout (2023) [15], and proposed.
Table 1. Distribution of CU depth levels in depth maps.

Sequence | QP | Depth 0 (%) | Depth 1 (%) | Depth 2 (%) | Depth 3 (%)
Balloons | 34 | 33.1 | 30.9 | 23.8 | 12.2
Balloons | 39 | 44.3 | 36.7 | 15.1 | 3.9
Balloons | 42 | 57.4 | 34.6 | 6.7 | 1.3
Balloons | 45 | 73.6 | 23.9 | 2.1 | 0.4
Newspaper | 34 | 12.5 | 29.1 | 32.8 | 25.6
Newspaper | 39 | 25.1 | 40.4 | 23.8 | 10.7
Newspaper | 42 | 41.7 | 39.8 | 14.8 | 3.7
Newspaper | 45 | 61.4 | 31.4 | 6.4 | 0.8
Shark | 34 | 30.4 | 35.4 | 19.7 | 14.5
Shark | 39 | 53.0 | 30.2 | 11.3 | 5.5
Shark | 42 | 68.2 | 23.8 | 6.3 | 1.7
Shark | 45 | 80.6 | 16.4 | 2.8 | 0.2
Poznan_Hall2 | 34 | 71.4 | 20.9 | 5.8 | 1.9
Poznan_Hall2 | 39 | 83.4 | 13.3 | 2.8 | 0.5
Poznan_Hall2 | 42 | 91.0 | 7.6 | 1.2 | 0.2
Poznan_Hall2 | 45 | 95.8 | 3.8 | 0.4 | 0.0
Average | — | 57.7 | 26.1 | 11.0 | 5.2
Table 2. Experimental environment.

Hardware
  CPU: AMD Ryzen 9 8900X3D
  GPU: RTX 4070Ti Super
  RAM: 32 GB
  OS: Windows 11, 64-bit
Software
  Reference software: HTM16.3
  Configuration file: encoder_intra_main.cfg
  QP (depth): 34, 39, 42, 45
Table 3. JCT-3V standard test sequences and related information.

Sequence | Resolution | Frames | Frame Rate | 3-View Input | Scene Characteristics
Kendo | 1024 × 768 | 300 | 30 | 1-3-5 | Multiple overlapping objects
Balloons | 1024 × 768 | 300 | 30 | 1-3-5 | High-dynamic motion
Newspaper | 1024 × 768 | 300 | 30 | 2-4-6 | Static text, dynamic person
GT_Fly | 1920 × 1088 | 250 | 25 | 9-5-1 | CG city geometry
Shark | 1920 × 1088 | 300 | 30 | 1-5-9 | Complex biological contours
Poznan_Hall2 | 1920 × 1088 | 200 | 25 | 7-6-5 | Large smooth areas
Poznan_Street | 1920 × 1088 | 250 | 25 | 5-4-3 | Complex street dynamics
Undo_Dancer | 1920 × 1088 | 250 | 25 | 1-5-9 | Rapid complex motion
Table 4. Comparison of encoding effects between the proposed method and the original HTM16.3.

Sequence | Resolution | BD-PSNR (dB) | BDBR (%), w/o $R_{wavelet}$ | TS (%), w/o $R_{wavelet}$ | BDBR (%), Proposed | TS (%), Proposed
Kendo | 1024 × 768 | −0.01 | 0.15 | 44.32 | 0.13 | 46.68
Balloons | 1024 × 768 | −0.02 | 0.29 | 46.81 | 0.27 | 49.33
Newspaper | 1024 × 768 | −0.03 | 0.38 | 44.12 | 0.36 | 46.97
Average (1024 × 768) | | −0.02 | 0.27 | 45.08 | 0.25 | 47.66
GT_Fly | 1920 × 1088 | −0.02 | 0.42 | 46.52 | 0.39 | 48.23
Shark | 1920 × 1088 | −0.01 | 0.61 | 49.53 | 0.58 | 52.23
Poznan_Hall2 | 1920 × 1088 | −0.02 | 0.46 | 43.71 | 0.44 | 45.93
Poznan_Street | 1920 × 1088 | −0.03 | 0.40 | 44.92 | 0.41 | 46.57
Undo_Dancer | 1920 × 1088 | −0.02 | 0.22 | 49.14 | 0.21 | 51.49
Average (1920 × 1088) | | −0.02 | 0.42 | 46.76 | 0.41 | 48.89
Average (overall) | | −0.02 | 0.37 | 46.13 | 0.35 | 48.43
Table 5. Comparison of encoding effects between the proposed method and other methods under HTM16.3.

Sequence | [24] BDBR (%) | [24] TS (%) | [14] BDBR (%) | [14] TS (%) | [15] BDBR (%) | [15] TS (%) | Proposed BDBR (%) | Proposed TS (%)
Kendo | 0.6 | 39.5 | 0.17 | 35.2 | 0.43 | 44.6 | 0.13 | 46.68
Balloons | 1.4 | 39.0 | 0.12 | 32.9 | 0.27 | 38.7 | 0.27 | 49.33
Newspaper | 1.4 | 38.0 | 0.08 | 32.3 | 0.20 | 39.8 | 0.36 | 46.97
GT_Fly | 0.3 | 38.3 | 0.08 | 35.0 | 0.15 | 55.7 | 0.39 | 48.23
Shark | 0.5 | 39.4 | 0.26 | 44.0 | 0.14 | 42.6 | 0.58 | 52.23
Poznan_Hall2 | 0.6 | 41.4 | 0.39 | 51.6 | 0.40 | 52.7 | 0.44 | 45.93
Poznan_Street | 0.4 | 39.9 | 0.26 | 41.6 | 0.62 | 52.6 | 0.41 | 46.57
Undo_Dancer | 0.2 | 38.6 | 0.29 | 49.3 | 0.44 | 49.9 | 0.21 | 51.49
Average | 0.68 | 39.26 | 0.21 | 40.2 | 0.32 | 45.8 | 0.35 | 48.43
Table 6. Comparison of synthesized views: original reference frame, HTM16.3 synthesized view, and synthesized view of the proposed method (the corresponding images are shown in the original article).

Sequence | Resolution
Balloons | 1024 × 768
Newspaper | 1024 × 768
Shark | 1920 × 1088
GT_Fly | 1920 × 1088
Table 7. Perceptual quality assessment of synthesized video sequences.

Sequence | SSIM (HTM16.3) | SSIM (Proposed) | VMAF (HTM16.3) | VMAF (Proposed)
Balloons | 0.948 | 0.945 | 92.7 | 91.9
Newspaper | 0.962 | 0.960 | 94.3 | 93.8
Shark | 0.935 | 0.931 | 91.5 | 90.6
GT_Fly | 0.956 | 0.953 | 93.8 | 93.1
Average | 0.950 | 0.947 | 93.1 | 92.4


