Low-Complexity and Hardware-Friendly H.265/HEVC Encoder for Vehicular Ad-Hoc Networks

Real-time video streaming over vehicular ad-hoc networks (VANETs) has been considered as a critical challenge for road safety applications. The purpose of this paper is to reduce the computation complexity of high efficiency video coding (HEVC) encoder for VANETs. Based on a novel spatiotemporal neighborhood set, firstly the coding tree unit depth decision algorithm is presented by controlling the depth search range. Secondly, a Bayesian classifier is used for the prediction unit decision for inter-prediction, and prior probability value is calculated by Gibbs Random Field model. Simulation results show that the overall algorithm can significantly reduce encoding time with a reasonably low loss in encoding efficiency. Compared to HEVC reference software HM16.0, the encoding time is reduced by up to 63.96%, while the Bjontegaard delta bit-rate is increased by only 0.76–0.80% on average. Moreover, the proposed HEVC encoder is low-complexity and hardware-friendly for video codecs that reside on mobile vehicles for VANETs.


Introduction
Vehicular ad-hoc networks (VANETs) can provide multimedia communication between vehicles with the aim of providing efficient and safe transportation [1]. Vehicles with different sensors can exchange and share information for safely breaking, localization and obstacle avoiding. Moreover, the sharing of traffic accident's live video can improve the rescue efficiency and alleviate traffic jams. However, video transmission has been considered as a challenging task for VANETs, because video transmission over VANETs can significantly increase bandwidth [2]. This work focuses on the development of video codec that supports real-time video transmission over VANETs for road safety applications.
The demanding challenges of VANETs are bandwidth limitations and opportunities, connectivity, mobility, and high loss rates [3]. Because of the resource-demanding nature of video data in road safety applications, bandwidth limitations is the bottleneck for real-time video transmission over VANETs [4,5]. Moreover, due to the limited vehicle node's battery lifetime, video delivery over VANETs remains extremely challenging. Essentially, the low-complexity video encoder can accelerate the video transmission in real-time, and achieve low delay for video streaming. Hence, in order to transmit real-time video with bandwidth constraint, it is vital to develop an efficient video encoder. A video encoder with high encoding efficiency and low encoding complexity is the core requirement of VANETs [6].
Concurrently, the main video codecs are High Efficiency Video Coding (HEVC, or H.265) that are developed by the Joint Collaborative Team on Video Coding (JCT-VC) Group, and AV1 that is developed by Alliance for Open Media (AOMedia) [7]. Nevertheless, video codec is faced with several challenges. AV1 is a newer codec, royalty free and open sourced. However, the hardware implementation of AV1 encoder will take a long time. H.265/HEVC is the state-of-the-art standardized video codecs [8]. Compared with H.265/HEVC, AV1 increases the complexity significantly without achieving an increase in coding efficiency. When supporting the most available (resource-limited/mobile) devices or having a need for real-time, low latency encoding, it would be better to stick to H.265/HEVC. However, the encoding complexity of H.265/HEVC encoder increases dramatically due to its recursive quadtree representation [9]. Although previous excellent works have been proposed for reducing H.265/HEVC encoder complexity [10][11][12][13][14], most of them balance the encoding complexity and encoding efficiency unsuccessfully.
To address this issue, spatial and temporal information is widely used to reduce the computation redundancy of the H.265/HEVC encoder. However, the spatiotemporal correlation between the coding unit and the neighboring coding unit is not better used. To the author's best knowledge, complexity reduction in real-time coding with the available computational power at the VANET nodes has not been well studied, especially from the viewpoint of hardware implementation. The key contributions of this work are summarized as follows: • We propose a low-complexity and hardware-friendly H.265/HEVC encoder. The proposed encoder allows the encoding complexity to be reduced significantly so that low delay requirements for video transmission in power-limited VANETs nodes are satisfied. • A novel spatiotemporal neighboring set is used to predict the depth range of the current coding tree unit. The prior probability of coding unit splitting or non-splitting is calculated with the spatiotemporal neighboring set. Moreover, the Bayesian rule and Gibbs Random Field (GRF) are used to reduce the encoding complexity for H.265/HEVC encoder with the combination of the coding tree unit depth decision and prediction unit modes decision. • The proposed algorithm can balance the encoding complexity and encoding efficiency successfully.
The encoding time can be reduced by 50% with negligible encoding efficiency loss, and the proposed encoder is suitable for real-time video applications.
The rest of paper is organized as follows. Related works are reviewed in Section 2. Section 3 discusses background details. In Section 4, the fast CU decision algorithm is presented. Simulation results are discussed in Section 5. Section 6 concludes this work.

Video Streaming in Vehicular Ad-Hoc Networks
Real-time Video transmission over VANETs can improve emergency responses' effectiveness for road safety applications. The high rates and low delay are the most challenging aspects of video transmission over VANETs [15]. Recently, some works have been proposed to solve this problem. Different video applications over VANETs need different resources. The collaborative vehicle to vehicle communication approach is presented to enhance the scalable video quality in intelligent transportation systems (ITS), and different methods are developed to enhance the quality of experience and quality of service during scalable video transmission over VANETs [16]. Meanwhile, the use of redundancy for video streaming over VANETs has been analyzed, and a selective additional redundancy approach is proposed to improve the video quality [17]. Moreover, in Ref. [18], a vehicle rewarding method for video transmission over VANETs using real neighborhood and relative velocity is presented to optimize video transmission. Although previous works have studied the issues of video streaming over VANETs, the performance of video codec to support real-time video transmission is missing.

Low Complexity Algorithm for H.265/HEVC Encoder
To reduce the encoding complexity of the H.265/HEVC encoder, some fast algorithm of coding unit (CU) size decision and prediction unit (PU) mode decision is presented. In previous works, the main spatiotemporal parameters that are used for fast CU size decisions include the neighboring CU depth, rate-distortion (RD) cost, motion vector (MV), coded block flag (cbf), and sample-adaptive-offset (SAO) information. Moreover, some other statistical learning based CU selection methods are proposed include Bayesian classifier, support vector machine (SVM), decision tree (DT), AdaBoost classifier and artificial neural network (ANN).
Jiang et al. presented a fast encoding complexity method based on the probabilistic graphical model [19,20]. These proposed algorithms consist of CU early termination and CU early skip methods to reduce the redundant computing of inter-prediction in H.265/HEVC. However, these methods cannot achieve better trade-off between the encoding efficiency and encoding complexity. Refs. [21,22] focused on decreasing the CU depth to reduce the encoding complexity of the H.265/HEVC encoder. In Ref. [23], the unimodal stopping model-based early skip mode decision was used to speed up the process of mode decision. This proposed early skip mode decision method can reduce encoding time significantly. In Ref. [24], a fast algorithm for the H.265/HEVC encoder was based on the Markov Chain Monte Carlo (MCMC) model and Bayesian classifier. Even though the above fast CU size decision methods utilized the spatiotemporal correlations, the fast PU mode decision methods are ignored.
Tai et al. introduced three novel methods including early CU split, early CU termination and search range adjustment to reduce the computation complexity for H.265/HEVC [25]. This proposed algorithm can outperform previous works with respect to both the speed and the RD performance. In Ref. [26], a fast inter CU decision was proposed based on the latent sum of absolute differences (SAD) estimation. This proposed algorithm achieved an average of 52% and 58.4% reductions of the encoding time. Refs. [27,28] focused on CU size decision and PU mode decision, the fast encoding algorithms based on statistical analysis were proposed to reduce the encoding complexity for the H.265/HEVC encoder. This method can reduce about 57% and 55% of the encoding time of the H.265/HEVC encoder. The above methods can significantly reduce the encoding complexity with the joining of the CU depth and PU modes prediction, however, these previous works cannot balance the encoding complexity and encoding efficiency successfully. Moreover, the cost of hardware implementation is higher for previous works.
All in all, this paper focus on the development of video encoder that supports real-time video streaming over VANETs. We design a low-complexity and hardware-friendly encoder to allow video transmission to adapt to the VANETs environment. In addition, compared with the current literature, the proposed encoder can achieve better performance trade-off.

H.265/HEVC
H.265/HEVC standard was released in 2013 by JCT-VC, which can reduce bit-rates by about 50% over H.264. In addition, H.265/HEVC adopts hybrid video compression technology, and the typical structure of H.265/HEVC encoder is shown in Figure 1. The main modules of H.265/HEVC encoder include: (1) Intra-prediction and inter-prediction, (2) transform (T), (3) quantization (Q) and (4) context-adaptive binary arithmetic coding (CABAC) entropy coding. Moreover, inter-prediction and intra-prediction modules are used to decrease the spatial and temporal redundancy. Transform and quantization modules are used to decrease visual redundancy. The entropy coding module is used to decrease the information entropy redundancy. It is noted that the inter-prediction module is the most critical tool, which consumes about 50% computation complexity. Then, in order to achieve real-time coding, the computation complexity of H.265/HEVC encoder should be reduced by decreasing spatiotemporal redundancy.

Input Video
Inter-prediction The video frame is divided into a lot of coding tree units (CTUs) in H.265/HEVC standard. A CTU includes a coding tree block (CTB) of the luma samples, two CTBs of the chroma samples, and associated syntax elements. The CTU size can be adjusted from 16 × 16 to 64 × 64. Each CTU can be divided into four square CUs, and a CU can be recursively divided into four smaller CUs. A CU consists of a coding block (CB) of the luma samples, two CBs of the choma samples, and the associated syntax elements. The CU size can be 8 × 8, 16 × 16, 32 × 32, or 64 × 64. Figure 2 shows an example of the CTB structure for a given CTU. The CTU in Figure 2a is divided into different sized CUs. Correspondingly, the CTB structure is shown in Figure 2b. In each depth of CTB, the rate-distortion (RD) cost of each node is checked until the RD cost is minimum. The prediction unit (PU) can be transmitted in the bitstream, which identifies the prediction mode of CU. A PU consists of a prediction block (PB) of the luma, two PB of the chroma, and associated syntax elements. Figure 3 shows the eight partition modes that may be used to define the PUs for a CU in H.265/HEVC inter-prediction. For a CU configured to use inter-prediction, all eight partitions include four symmetry modes (2N * 2N, 2N * N, N * 2N, N * N) and four asymmetric modes (2N * nU, 2N * nD, nL * 2N, nR * 2N).
A CU can be recursively divided into transform units (TUs) according to the quadtree structure, and CU is the root of the quadtree. The TU is a basic representative block having residual or transform coefficients. In TU, one syntax element named coded block flag (cbf) indicates whether at least one non-zero transform coefficient is transmitted for the whole CU. When there is a non-zero coefficient, cbf is equal to 0. When there is no non-zero coefficients, cbf is equal to 1. Moreover, cbf is an important factor for the CU size decision [14]. The advantage of block partitioning structure is that the arbitrary size of CTU enables the codec to be readily optimized for various contents, applications, and devices. However, the recursive structure of coding block causes lots of redundant computing. In order to support the real-time video transmission over VANETs, the redundant computing of the H.265/HEVC encoder should be decreased significantly.

The Novel Spatiotemporal Neighborhood Set
The object motion is regular in video sequences and there is some continuity in the depth between adjacent CUs. If the depth range of the current CU can be inferred from the encoded neighboring CU, then some hierarchical partitioning is directly skipped or terminated. Therefore, the computational complexity has been reduced, significantly.
In order to utilize the spatiotemporal correlation, the four neighborhood set G is defined as Set G is shown in Figure 4, where CU L , CU TL , CU TR , and CU CO denote the left, top-left CU, top-right, and collocated of the current CU, respectively.

CTU Depth Decision
For video compression techniques, a smooth coding block popularly has the smaller CU depth. By contrast, the larger depth value is suitable for a complex area. Previous works show that the object motion in the same frame remains directional, and the motion and texture of the neighboring CUs are similar. In this work, the depths of neighboring CTU in the set G are used to predict the depth range of current CTU, and the predicted depth of current CTU is calculated as where k is the index of neighboring CTU in set G, Dep k is the depth of neighboring CTU in the set G, and θ k is a weight factor of neighboring CTU's depth, respectively, in the set G. In the H.265/HEVC standard, the range of CTU is depth 0, 1, 2, and 3. Hence, the calculated depth of the current CTU ( Dep CTU ) satisfies In Equation (3), Dep k ≤ 3. Therefore, weight factor θ k satisfies If the range of current CTU is depth 0, 1, 2, and 3, then the sum of weight factor θ k is 1 in this work. Moreover, Zhang's work confirms that, when the weight factor of the spatial neighboring CTU's depth is more than the weight factor of the temporal neighboring CTU's depth, the calculated CTU depth is closer to the actual depth of the current CTU [29]. In this work, each weight factor of the spatial neighboring CTU's depth is equal, and the weight factor of spatial neighboring CTU's depth is more than the weight factor of temporal CTU's depth. Then θ k satisfies However, the calculated value of Dep CTU is a non-integer most of the time. It is not suitable to directly predict the depth of current CTU by the value of Dep CTU . Therefore, the rule of CTU depth range has been formulated as Table 1, and the depth range of current CTU can be generated with the value of Dep CTU . (1) when the predicted depth of current CTU Dep CTU satisfies Dep CTU ≤ 1.5, it means that the motions of neighboring CTUs are smooth and the depths of neighboring CTUs are small. The current CTU belongs to the still or homogeneous motion region and is classified as type T1.
In this case, the minimum depth of current CTU Dep min is equal to "0", and the maximum depth of current CTU Dep max is equal to "2". (2) when the predicted depth of current CTU Dep CTU satisfies 1.5 < Dep CTU ≤ 2.5, it means that the depths of neighboring CTUs are middle. The current CTU belongs to the moderate motion region and is classified as type T2. In this case, the minimum depth of current CTU Dep min is equal to "1", and the maximum depth of current CTU Dep max is equal to "3". (3) when the predicted depth of current CTU Dep CTU satisfies 2.5 < Dep CTU ≤ 3, it means that the motions of neighboring CTUs are intense and the depths of neighboring CTUs are high. The current CTU belongs to the fast motion region and is classified as type T3.
In this case, the minimum depth of current CTU Dep min is equal to "2", and the maximum depth of current CTU Dep max is equal to "3".

PU Mode Decision
The CU splitting or non-splitting is formulated as a binary classification problem ω i , where i = 0, 1. In this work, ω 0 and ω 1 respectively represent CU non-splitting and CU splitting, and the variable x represents the RD-cost of the PU. According to the Bayes' rule, the posterior probability p(ω i |x) can be calculated as follows: According to Bayesian decision theory, the prior probability p(ω i ) and the conditional probability p(x|ω i ) values must be known. Therefore, CU non-splitting (ω 0 ) will be chosen if the following condition holds true: Otherwise, CU splitting (ω 1 ) will be chosen. The conditional probability p(x|ω 0 ) and p(x|ω 1 ) are the probability density function of the RD cost, and they are approximated by normal distributions. Defining the mean values and covariance of RD cost of CU non-splitting and splitting as N(µ 0 , σ 0 ) and N(µ 1 , σ 1 ), the normal function can be given by The prior probability p(ω i ) is modeled with Gibbs Random Fields (GRF) model in set G [30], and p(ω i ) will always have the Gibbsian form where Z is a normalization constant, and E(ω i ) is cost function. k is the index of set G, and ω k denotes the non-splitting or splitting value of the neighborhood k-CU (ω k = −1, 1). The CU size decision deals with the binary classification problem (ω i = −1, 1), and the clique potential ϕ(ω i , ω k ) obeys the Ising model [31]: where the parameter γ is the coupling factor, which denotes the strength of current CU correlation with neighborhood k-CU in set G. In this work, γ is set to "0.75". Then, the prior p(ω i ) can be written in the factorized form: At last, the Equation (6) can be written as Finally we can define the final CU decision function as S(ω i ), which can be written in the exponential form It should be noted that the statistical parameters p(x|ω i ) are estimated by using a non-parametric estimation with online learning, and are stored in a lookup table (LUT). The frames used for online updating of the values of (µ 0 , σ 0 ) and (µ 1 , σ 1 ) are shown as in Figure 5. In each group of pictures(GOP), the 1st frame that can be encoded by using the original H.265/HEVC coding will be used for the online update, while the successive frames are coded by using the proposed algorithm.
Original HEVC coding frame Fast decision coding frame Through the above analysis, the proposed PU decision based on Bayes' rule includes the CU termination decision (inter 2N * 2N) and CU skip decision (inter 2N * 2N, N * 2N, 2N * N). In the case of the CU termination decision, the current CU is not divided into sub-CUs in the sub-depth. In the case of the CU skip decision, the current PU mode in current CU depth is determined at the earliest possible stage. Therefore, the flowchart of the proposed PU mode decision is described as follows.
(1) At the encoding time for inter prediction, first of all, look up the statistical parameters in LUT. Then, the RD cost of the inter 2N * 2N PU mode is checked. If the condition satisfies S(ω 0 ) > S(ω 1 ) and cbf = 0, the CU termination decision is processed. Otherwise, if the condition is satisfying S(ω 1 ) > S(ω 0 ) and cbf = 1, a CU skip decision is made. (2) RD cost of the inter 2N * N PU mode is checked. If the condition is satisfying S(ω 1 ) > S(ω 0 ) and cbf = 1, a CU skip decision is made. (3) RD cost of the inter N * 2N PU mode is checked. If the condition is satisfying S(ω 1 ) > S(ω 0 ) and cbf = 1, a CU skip decision is made. (4) Other PU modes are checked according to the H.265/HEVC reference model.

The Overall Framework
Based on the above analysis, the proposed overall algorithm incorporates the CTU depth decision and the PU mode decision algorithms to reduce the computation complexity of the H.265/HEVC encoder. The flowcharts are shown in Figures 6 and 7  It is noted that the maximum GOP size is equal to "8" in this work, and the value of (µ 0 , σ 0 ) and (µ 1 , σ 1 ) are updated every GOP for PU mode decision. Figure 8 shows the core architecture of the H.265/HEVC with mode decision. By using the architecture, inter-frame prediction is used to eliminate the spatiotemporal redundancy. The proposed CU decision method can accelerate the inter-prediction module before fast rate-distortion optimization (RDO). The novel spatiotemporal neighboring set is used to reduce the complexity of inter encoder which leads to a very low-power cost. Moreover, video codec on mobile vehicles for VANETs need to be more energy efficient and more reliable, so reducing the complexity of the video encoder is important. Then, the proposed low-complexity and hardware-friendly H.265/HEVC encoder can ensure the reliability of the video codec for VANETs significantly. Moreover, as a benefit of the high complexity reduction rate, the energy consumption can be reduced for hardware design, significantly.

Experimental Results
To evaluate the performance of the proposed low-complexity and hardware-friendly H.265/HEVC encoder for VANETs, this section shows the experimental results by implementing the proposed algorithms with the H.265/HEVC reference software [32]. The simulation environments are shown in Table 2. The Bjontegaard delta bit-rate (BDBR) is used to represent the average bit-rate [33], and the average time saving (TS) is calculated as where Time HM16.0 (QP i ) and Time proposed (QP i ) denote the encoding time of using HM16.0 and the proposed algorithm with different QP. In this work, the scenarios have been chosen carefully. This work focuses on the development of a video codec that supports real-time video transmission over VANETs for road safety applications. The common test conditions (CTC) are provided to conduct experiments [34]. The test sequences in CTC have different spatial and temporal characteristics and frame rates. Furthermore, the video sequences of traffic scenarios including 'Traffic' and 'BQTerrace' (as in Figure 9) are tested in this work. Moreover, we selected low delay (LD) configuration to reflect the real-time application scenario for all encoders.  Tables 3 and 4 show the performance results of the CTU depth decision, PU mode decision and the overall (proposed) methods, compared to H.265/HEVC reference software in random access (RA) and low delay (LD) configurations. From the experimental results on Table 3, It can be seen that the encoding time can be reduced by 15.59%, 55.79%, and 50.96% on average for CTU depth decision, PU mode decision, and overall methods, while the BDBR can be incremented by only 0.11%, 0.96%, and 0.80%, respectively. From the experimental results in Table 4, the encoding time can be reduced by 14.05%, 50.28%, and 50.23% on average for CTU depth decision, PU mode decision, and overall methods, while the BDBR can be incremented by only 0.15%, 0.79%, and 0.76%, respectively. For high-resolution of sequences such as "BQTerrace", and "Vidyo4", the time saving is particularly high. Therefore, the overall (proposed) algorithm can significantly reduce the encoding complexity and rarely affects encoding efficiency. Moreover, the proposed method can achieve the trade-off between the encoding complexity and the encoding efficiency. In addition, the optimal tradeoff of encoding performance can be adjusted by the coupling factor γ. Therefore, the optimal tradeoff of encoding performance is that the encoding complexity can be reduced significantly with less than or equal to 0.8% encoding efficiency, and less low delay (LD) and random access (RA) configuration. In order to find the optimal tradeoff with coupling factor γ, the γ is set to "0.5", "0.75" and "0.85" under the same simulation environments. The compared results of the average efficiency and time saving are shown as in Table 5. From this table we can see that, in this case of γ = 0.75, the encoding performance is optimal in this work. Table 3. Performance comparison of different parts of the proposed method (random access (RA)).  Video objective quality evaluation can be expressed by rate-distortion (R-D) curve. The R-D curve is fitted through four data points, and PSNR/bit-rate are assumed to be obtained for QP = 22, 27, 32, 37. In addition, when an error on the predicted depth of the current CTU occurs, the bit-rate will increase. In this paper, the video objective quality is evaluated by using bit-rate and PSNR. Then the lower the accuracies of the predicted depth of the current CTU algorithm, the more the bit-rate increases. Figure 10 shows the R-D curve of the proposed method, compared with the H.265/HEVC reference software. It can be noticed that the enlarged part of the figure shows the proposed algorithm is close to HM16.0 under the LD and RA configurations. In addition, Figure 11 shows the time saving of the sequences "Cactus" and "BlowingBubbles". It is noted that the encoding time can be reduced under different configurations.

Sequence BDBR(%) TS(%) BDBR(%) TS(%) BDBR(%) TS(%)
The performance comparison of the proposed method is shown in Table 6, compared to previous works [12][13][14][24][25][26]. Goswami's work is based on Bayesian decision theory and Markov Chain Monte Carlo model (MCMC). Zhang's work is based on the Bayesian method and Conditional Random Fields (CRF). Tai's algorithm is based on depth information and RD cost. Zhu's algorithm is based on the machine learning method. Ahn's work is based on spatiotemporal encoding parameters. Xiong's work is based on the latent sum of absolute differences (SAD) estimation. However, the proposed approach is based on Bayesian rule and Gibbs Random Field. Although Zhu's method can achieve a 65.60% encoding time reduction, the BDBR is higher than the proposed method. Moreover, the increasing of the BDBR is smaller than state-of-the-art works, while the time saving is more than 50% on average. Compared with previous works [19,20], the proposed work can trade-off the encoding complexity and encoding efficiency successfully.  Figure 11. Time savings of the proposed method for "Cactus" and "BlowingBubbles". Table 6. Performance comparison of the proposed method compared to previous works.

Conclusions
In order to develop the low-complexity and hardware-friendly H.265/HEVC encoder for VANETs, based on a novel spatiotemporal neighborhood set, the Bayesian rule and Gibbs Random Field are used to reduce the encoding complexity for the H.265/HEVC inter-prediction in this work. The proposed algorithm consists of CTU depth decision and PU mode decision methods. Experimental results demonstrate that the proposed approach can reduce the average encoding complexity of H.265/HEVC encoder by about 50% for VANETs, while the increasing of BDBR is less than or equal to 0.8% on average.