A Hardware-Friendly and High-Efficiency H.265/HEVC Encoder for Visual Sensor Networks

Visual sensor networks (VSNs) have numerous applications in fields such as wildlife observation, object recognition, and smart homes. However, visual sensors generate vastly more data than scalar sensors, and storing and transmitting these data is challenging. High-efficiency video coding (HEVC/H.265) is a widely used video compression standard. Compared to H.264/AVC, HEVC reduces the bit rate by approximately 50% at the same video quality; it can therefore compress visual data at a high compression ratio, but at the cost of high computational complexity. In this study, we propose a hardware-friendly and high-efficiency H.265/HEVC accelerating algorithm that overcomes this complexity for visual sensor networks. The proposed method leverages texture direction and complexity to skip redundant processing in CU partitioning and to accelerate intra prediction for intra-frame encoding. Experimental results revealed that the proposed method reduced encoding time by 45.33% while increasing the Bjontegaard delta bit rate (BDBR) by only 1.07% compared with HM16.22 under the all-intra configuration. Moreover, the proposed method reduced the encoding time for six visual sensor video sequences by 53.72%. These results confirm that the proposed method achieves high efficiency and a favorable balance between the BDBR and encoding time reduction.


Introduction
Rapid technological developments have increased the demand for sensor networks, including multimedia data sensors. Sensor networks whose nodes are equipped with cameras are known as visual sensor networks (VSNs). VSN nodes can capture and send visual data for monitoring applications such as security surveillance, wildlife observation, and object recognition [1][2][3]. Although visual data can enrich monitoring, visual sensors generate vastly more data than scalar sensors, and storing, transmitting, and processing these visual data is challenging due to limitations in storage, computing power, and transmission bandwidth [4][5][6]. Therefore, a high compression rate and low complexity are both key requirements of VSNs [7][8][9].
High-efficiency video coding (HEVC/H.265) [10] was developed by the Joint Collaborative Team on Video Coding (JCT-VC), a joint effort by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). The newest video compression standard, versatile video coding (VVC/H.266) [11], was finalized in July 2020, but the compression complexity of H.266 is substantially greater than that of HEVC. Compared with H.264/AVC [12], HEVC can achieve the same video quality at approximately 50% of the bit rate, making it a highly efficient means of compressing visual data. Hence, given the limitations on storage, computing power, and transmission bandwidth, H.265 is appropriate for VSNs. Additionally, packet loss may occur during transmission, and the resulting corrupted data can cause error propagation through inter prediction and motion compensation. Intra-frame encoding plays a vital role in preventing such error propagation because it does not reference previously coded frames [13].
Moreover, due to the large number of computations associated with motion estimation in inter prediction, the HEVC inter coding profile may not be adopted in video applications with a low complexity requirement [14]. HEVC includes many tools to improve the compression efficiency of intra frames, such as coding tree units (CTUs), intra prediction, and rate distortion optimization (RDO). Although these technologies can achieve this high compression rate, they also increase the complexity of compression. Hence, applying HEVC in VSNs requires improvements in encoding efficiency.
To solve this problem, we first analyzed the complexity of intra prediction with various coding unit (CU) sizes by using standard test sequences. We then analyzed various video characteristics to inform the design of the proposed algorithm. Videos captured by visual sensor cameras usually contain a single scene and a restricted directional field of view, and their content usually includes many homogeneous regions, such as the background. In intra prediction, these homogeneous regions tend to be encoded as larger CUs. On the basis of this analysis, we propose a low-complexity and hardware-friendly accelerating algorithm for HEVC intra encoding.
The key contributions of this work are summarized as follows:
• A hardware-friendly and high-efficiency H.265/HEVC encoder for intra frames is proposed. The proposed method can also be parallelized because it uses only information from the current CU. It significantly reduces computational complexity while achieving a high compression rate, satisfying the requirements for VSN video transmission.
• Four projection directions are used to predict the depth range of the current CTU and to eliminate impossible intra prediction modes. Moreover, to reduce the effects of noise, the average intensity of each CU is normalized to generate a generalized threshold.
• The proposed method achieves high-efficiency encoding: it delivers more consistent encoding time savings across all test sequences with only a slight increase in the Bjontegaard delta bit rate (BDBR) compared with the HEVC test model.
In this study, we provide a hardware-friendly and high-efficiency HEVC encoder to reduce computational complexity for VSN applications. The remainder of the article is organized as follows. Section 2 describes well-known HEVC acceleration methods. Subsequently, we present a preliminary analysis of the complexity of intra prediction and CU partitioning with various CU sizes using standard test sequences. A hardware-friendly and high-efficiency method developed on the basis of this analysis is described in Section 3. Section 4 reveals that the proposed method achieved high efficiency and a favorable balance between the BDBR [15] and encoding time reduction on the test sequences of the HEVC test model.

Related Work
Some recent studies have attempted to reduce the computational complexity of HEVC using a variety of methods, such as fast CU size decision algorithms and mode prediction methods. Several studies have presented texture-feature- or machine-learning (ML)-based techniques to reduce redundancy in HEVC encoding. Works exploiting texture features include [16][17][18][19]. In [16], Min et al. used global and local texture complexity and four local edge complexity metrics for each block to determine partitioning. The information of neighboring CUs was considered in [17][18][19][20]. Shen et al. [17] applied the most probable mode (MPM) method, which compares the current CU depth with those of neighboring CUs and exploits texture complexity to reduce redundant processing. Le et al. [18] used four previously encoded, spatially neighboring CUs to predict the optimal depth. In [19], Lu et al. used the average depth instead of the maximum depth of neighboring CUs to predict a depth range for the current CU. Fengwei et al. [16] proposed an early termination algorithm for CU partitioning based on statistical analysis and a fast mode selection algorithm based on the distribution characteristics of the best modes.
In addition to these texture-based approaches, ML methods have also been proposed. Refs. [21][22][23][24][25] used a support vector machine (SVM) to reduce encoding complexity. Liu et al. [21] used texture complexity, direction complexity, sub-CU information, and the quantization parameter (QP) as features to determine the CU depth. Zhang et al. [22] used a two-stage SVM method: in the first stage, a three-output classifier trained offline enabled early decisions on the size of the current CU or early termination of the depth check; in the second stage, a binary classifier trained online on previously encoded frames further refined the CU size determination. Werda et al. [23] designed a fast CU partition module based on the SVM approach and a gradient-based fast intra prediction mode module. In [24], an SVM was used to classify intra prediction modes, which significantly reduces the number of candidate modes. Amna et al. [25] built an online SVM-based method to predict the CU partition. Convolutional neural networks (CNNs) have also been applied to accelerate intra mode decision-making. Yi et al. [26] used a CNN to make the intra mode decision based on CU features. In [27], a CNN was used to predict the depth of a CTU: the CNN takes a 64 × 64 CTU as input, and the predicted depth of each 4 × 4 block is represented in a 16 × 16 matrix. According to this depth matrix, redundant partitioning is skipped.
Several studies have attempted to accelerate HEVC encoders for VSNs [28] or vehicular ad hoc networks [29]. In [28], Pan et al. analyzed the content properties of CUs to reduce the encoding complexity of an HEVC encoder for VSNs. In [29], first, an initial coding tree unit depth decision algorithm was developed to control the depth search range; second, a Bayesian classifier was used to predict unit decisions for inter prediction, with the prior probability calculated using a Gibbs random field model.
Although these methods can accelerate CU partitioning, the correlation between CU and texture features has rarely been exploited, and some proposed algorithms are not suitable for hardware implementation. Therefore, methods that strike a better trade-off between complexity reduction and encoding loss can still be formulated. In the next section, we formally examine intra prediction and CU partitioning complexity and then present a hardware-friendly and high-efficiency method.

Proposed Method
In this section, the original HEVC encoding process is introduced, and the proposed accelerating algorithm for reducing its computational complexity is then described.

Encoding Process in HEVC
HEVC is based on a block-based hybrid coding architecture. Each frame of an input video is divided into numerous blocks, called CTUs, and each CTU is divided into smaller blocks called CUs. The size of a CTU is 64 × 64, and it can be split using a quadtree; this partitioning is displayed in Figure 1. CUs can have one of four depths, corresponding to sizes of 64 × 64, 32 × 32, 16 × 16, and 8 × 8; in total, 85 CUs may be examined while encoding one CTU. As presented in Figure 2, for each CU, intra and inter prediction must be performed before rate-distortion optimization (RDO) [30] is executed to calculate the rate-distortion cost (RD cost). Finally, the encoding scheme with the minimal RD cost is selected as the optimal encoding method. The RD cost is expressed as RDcost = D + λ × R (1), where λ is the Lagrange multiplier, R is the number of encoding bits, and D is the reconstruction distortion.
To select the optimal encoding scheme, all possible depth levels and prediction modes must be exhaustively checked. The recursive structure of CUs results in many redundant computations, which restricts the use of HEVC in complexity-constrained applications.
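As a concrete illustration, the exhaustive selection described above amounts to evaluating Equation (1) for every candidate scheme and keeping the cheapest one. The following minimal Python sketch shows this selection rule; it is not the HM implementation, and all distortion/bit values are hypothetical:

```python
# Illustrative sketch of RDO-based selection, not the HM implementation.
def rd_cost(distortion, bits, lam):
    """RD cost as in Equation (1): RDcost = D + lambda * R."""
    return distortion + lam * bits

def select_best_mode(candidates, lam):
    """candidates: list of (mode, distortion, bits) tuples.
    Returns the mode with the minimal RD cost."""
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lam))[0]

# Hypothetical candidate values for illustration only.
candidates = [
    ("intra_planar", 1200.0, 300),  # low rate, higher distortion
    ("intra_dc", 1100.0, 380),
    ("angular_26", 900.0, 520),     # high rate, lowest distortion
]
best = select_best_mode(candidates, lam=0.5)  # -> "angular_26"
```

Every depth level of the quadtree repeats this search over its own candidates, which is exactly why the recursion is expensive.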

A Hardware-Friendly and High-Efficiency H.265/HEVC Encoder for Visual Sensor Networks
To accelerate these recursive computations, we propose an algorithm for CU partitioning and intra prediction. The proposed algorithm builds on the key property of intra prediction: each frame can be encoded and decoded independently, without referencing information from other frames. Additionally, referencing information from the previous frame would require extra designs in the hardware architecture; considering this hardware cost, referencing information from only the current frame is more hardware friendly. Hence, the proposed algorithm does not use information from the previous frame. The proposed method has three steps: edge feature extraction, projected gradient normalization, and finally, fast CU partition and mode decision. These steps are detailed in the following sections.

Edge Feature Extraction
Due to the limitations of the VSN and of hardware implementation, we adopted filters for extracting features instead of applying machine learning. We adopted an edge detection operator and calculated the gradient in four directions (0°, 45°, 90°, 135°) by projecting G(i, j). These four gradients are denoted D0, D45, D90, and D135. G0(i, j), G90(i, j), and G(i, j) were calculated using Equations (2)-(4), where (2) and (3) operate on the 3 × 3 intensity matrix of the current block A centered on the pixel currently being computed. G(i, j) is the gradient magnitude calculated from G0(i, j) and G90(i, j); i and j represent the row and column of the current center pixel, respectively.
In general, G0(i, j) and G90(i, j) can be used to compute the four directions by applying θ(i, j), as expressed in Equation (5). The gradient in each direction, obtained by projecting G(i, j) using θ(i, j), can be calculated with Equations (6)-(9).
D45 can be derived as shown in Figure 3. Its projection formula can be expressed as G(i, j) × sin(θ − 45°) or G(i, j) × sin(45° − θ); because the absolute values of these expressions are identical, we adopt only one of them. D135 can similarly be derived from Figure 4; its projection formula can be expressed as G(i, j) × cos(θ − 45°). To reduce computational complexity, D0 and D90 can be simplified using Equations (10), (11), and (4). Hence, Equations (6) and (8) can be rewritten as Equations (12) and (13), and through their application, Equations (7) and (9) can be rewritten as Equations (14) and (15). These manipulations greatly reduce the computational complexity. The above steps are summarized in Algorithm 1: the gradients of the four projected directions (D0, D45, D90, D135) can all be calculated from G0 and G90.
Algorithm 1 Projection of each pixel.
where W represents the width of the block, and H represents the height of the block. The direction with the greatest sum is the main direction of the block.

Projected Gradient Normalization
As mentioned in the previous section, the main direction of the block can be calculated using Equation (16); however, the four directions are correlated. For example, a texture with G0 = 10 and G90 = 0 has projection values of 10, 7.07, 0, and 7.07 for the directions 0°, 45°, 90°, and 135°, respectively. If all directions are considered, the main direction of the block might not be the direction with the greatest sum, as demonstrated in Figure 5 and Table 1: the texture of the block appears vertical or horizontal, yet Equation (16) yields a different main direction. To solve this problem, only some of the directions are considered instead of all of them. Equation (17) is introduced to analyze the relationship between the directions by calculating the difference Max_d between the greatest projection value MD1 and the second-greatest projection value MD2. To identify the threshold for determining which group an angle belongs to when it is not near any main direction, we performed a statistical analysis of the projections of each direction with intensity G(i, j) = 10. For example, the projections on 0° and 45° are almost identical when the angle lies approximately halfway between them, making the grouping ambiguous. Hence, we take only MD1 as the main direction if Max_d > 0.12; otherwise, both directions are considered. The relevant equations are Equations (18) and (19), where d1 and d2 represent the directions with the greatest and second-greatest projection values, respectively. Moreover, to reduce the effect of noise, if the intensity of D0(i, j), D45(i, j), D90(i, j), or D135(i, j) is too small, it is set to 0.
After the projection values are adjusted, Equation (16) is used to calculate the magnitude of each direction, and the magnitudes are sorted from largest to smallest as M1, M2, M3, and M4. The direction with the greatest magnitude is the main direction of the block.
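The disambiguation rule above can be sketched as follows. Note that the normalization of Max_d (here, the difference between the top two projection values divided by the largest) is our assumption about Equation (17); the paper's empirical threshold of 0.12 is used as given, and the function name is ours:

```python
# Hedged sketch of the main-direction disambiguation (Equations (17)-(19)):
# keep only the top direction when it clearly dominates, otherwise keep both.
def main_directions(proj, threshold=0.12):
    """proj: dict mapping direction (0, 45, 90, 135) -> accumulated
    projection magnitude. Returns one or two candidate main directions."""
    ranked = sorted(proj, key=proj.get, reverse=True)
    d1, d2 = ranked[0], ranked[1]
    md1, md2 = proj[d1], proj[d2]
    if md1 == 0:
        return [d1]  # flat block: no meaningful second direction
    max_d = (md1 - md2) / md1  # assumed normalization of Max_d
    return [d1] if max_d > threshold else [d1, d2]
```

With the earlier example (projections 10, 7.07, 0, 7.07), the top direction dominates and only 0° is kept; when the top two projections are nearly equal, both directions survive.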

Fast CU Partition and Mode Decision
Generally, CTU partitioning follows the complexity and distribution of the texture; hence, texture complexity can be used to decide whether to halt splitting of the current CU. For example, if a block contains more texture, its edges are more pronounced and its average gradient is larger. Typically, the gradient G(i, j) is computed as in Equation (4); however, both the hardware implementation cost and the computational complexity of this formulation are too large. We therefore adopted absolute values to approximate the gradient [31], rewriting Equation (4) as Equation (20).
To judge whether a block is homogeneous, the average intensity is used to represent the texture complexity of the CU; it is calculated as in Equation (21), where N represents the size of the CU.
To select a general threshold, we normalized the average intensity of each CU of the same size in the same frame to the range 0 to 255, as in Equation (22), where Gmax represents the maximum average intensity among all CUs of the same size in the same frame.
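The approximation and normalization steps above can be sketched as follows (variable and function names are ours; the |G0| + |G90| approximation follows Equation (20), the average over the N × N CU follows Equation (21), and the 0-255 scaling follows Equation (22)):

```python
# Sketch of Equations (20)-(22): approximate gradient, CU average, and
# per-frame normalization. Not the HM implementation.
def avg_gradient(g0s, g90s, n):
    """Average approximate gradient |G0| + |G90| over an N x N CU.
    g0s, g90s: per-pixel gradient components in scan order."""
    total = sum(abs(a) + abs(b) for a, b in zip(g0s, g90s))
    return total / (n * n)

def normalize(g_avg, g_max):
    """Normalize a CU's average intensity to 0..255, where g_max is the
    maximum average intensity among same-sized CUs in the frame."""
    return 0.0 if g_max == 0 else 255.0 * g_avg / g_max
```

The absolute-value form avoids the square root and multiplications of Equation (4), which is what makes it attractive for hardware.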
After the normalized gradient is obtained, a threshold is set to determine whether the texture is flat or complex and, hence, whether CU splitting should be halted. The quantization parameter (QP) affects CU partitioning: a smaller QP preserves more detail in the video and tends to cause more CU splitting. Several video sequences were used to study the relationship between the threshold, QP, encoding time reduction, and BDBR. The experimental results are presented in Figure 6; the orange line represents time reduction, and the blue line represents the BDBR. A threshold of 0.4 × QP strikes the best trade-off between BDBR and time reduction. If the Gnor of the current CU is smaller than this threshold, splitting of the current CU is halted.

If splitting of the current CU is not halted, it must be determined whether intra prediction at the current depth can be skipped. In the proposed method, this determination is based on the directions of the four sub-CUs. Because the main direction of each CU has already been calculated, the number of sub-CUs whose direction differs from that of the current CU can be counted. First, sub-CUs are filtered by texture intensity: the direction of a sub-CU is considered only if its M1 is greater than 0.25 times the M1 of the current CU. We then compare the main direction of each qualifying sub-CU with that of the current CU; if they differ, the sub-CU is counted. If the count is greater than 2, more than half of the sub-CUs have a direction different from the current CU, and intra prediction is skipped for the current CU. Figure 7 presents the algorithm for CU partitioning.

Subsequently, the intra mode decision-making method is introduced. The accumulated magnitudes of the four CU directions can be obtained from Equation (16) in Section 3.2.2. To determine the main direction of the current CU, Equation (23) is used to calculate the ratio Pi of M1 and M2.
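The two partition rules described above can be sketched as follows. The thresholds (0.4 × QP, 0.25 × M1, and a count greater than 2) follow the text; the function names and data layout are ours:

```python
# Hedged sketch of the fast CU partition decisions, not the HM implementation.
def halt_split(g_nor, qp):
    """Halt CU splitting when the normalized gradient is below 0.4 * QP,
    i.e., the block is considered flat."""
    return g_nor < 0.4 * qp

def skip_intra(cu_dir, cu_m1, sub_cus):
    """sub_cus: list of (direction, m1) for the four sub-CUs. A sub-CU
    votes only if its M1 exceeds 0.25 * M1 of the parent CU; intra
    prediction at the current depth is skipped if more than two voting
    sub-CUs disagree with the parent's main direction."""
    votes = sum(1 for d, m1 in sub_cus if m1 > 0.25 * cu_m1 and d != cu_dir)
    return votes > 2
```

For example, a parent CU with main direction 0° whose four sufficiently textured sub-CUs point at 45°, 90°, 135°, and 0° collects three disagreeing votes, so intra prediction at the parent depth is skipped.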
If M2 is similar to M1 (Pi > 0.2), the texture of the current CU is considered to contain two directions. After the main direction of the current CU is determined, Tables 2 and 3 are used to obtain the corresponding modes for intra prediction. In addition to these modes, the intra modes of the neighboring CUs, the direct current (DC) mode, and the planar mode are also added to the mode candidate list. Figure 8 presents the algorithm for building the mode candidate list. After obtaining the mode candidate list, following the method of [32], we reordered the modes in the candidate list by their sum of absolute transformed differences (SATD) cost and selected the three modes with the lowest cost as the candidates that undergo the time-consuming RDO process. The SATD cost is calculated with Equation (24), where D_SATD is the SATD of the residual, λ is the Lagrange multiplier, and Bits_m is the number of bits for signaling the prediction mode. If all angular modes have the same SATD cost, the most suitable mode for the current block will be planar or DC; therefore, in that case, all angular modes are removed from the candidate list. A flowchart of this process is presented in Figure 9.
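The SATD-based shortlisting step can be sketched as follows. Candidate values are illustrative; the cost formula follows Equation (24), and representing angular modes as integers and planar/DC as strings is our convention:

```python
# Sketch of SATD-based candidate reordering (Equation (24)), not HM code.
def satd_cost(d_satd, bits_m, lam):
    """SATD cost: D_SATD + lambda * Bits_m."""
    return d_satd + lam * bits_m

def shortlist(candidates, lam, keep=3):
    """candidates: list of (mode, d_satd, bits_m); angular modes are ints,
    planar/DC are strings (our convention). If all angular modes tie on
    SATD cost, they are dropped, since planar or DC will be best. The
    'keep' cheapest modes proceed to full RDO."""
    angular = [c for c in candidates if isinstance(c[0], int)]
    if angular and len({satd_cost(d, b, lam) for _, d, b in angular}) == 1:
        candidates = [c for c in candidates if not isinstance(c[0], int)]
    ranked = sorted(candidates, key=lambda c: satd_cost(c[1], c[2], lam))
    return [m for m, _, _ in ranked[:keep]]
```

Only the surviving three candidates pay the price of the RDO pass, which is where most of the mode-decision time savings come from.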

Experimental Results
We evaluated the proposed method using the HEVC test model software (HM) and compared the results with several related works to validate its efficiency. We employed the HM encoder to obtain a fair comparison: to our knowledge, HM is the recognized reference implementation and is employed in most recent studies instead of x265. The proposed algorithm was implemented in HM 16.22 to evaluate its overall performance.

Experimental Environment and Conditions
We used the most recently released version of the HEVC test software to evaluate the algorithms. All tests were performed using the all-intra configuration. The test sequences recommended by JCT-VC [33], from class A to class E, were used to evaluate our algorithm in terms of BDBR and time reduction. Time reduction was determined using Equation (25), where QPs represents the QP set {22, 27, 32, 37}, T_ori is the total encoding time of the HM encoder, and T_mod is the total encoding time of the HM encoder with our algorithms. BDBR was determined based on the YUV-PSNR and bit rate. The testing machine had an Intel Core i7-8700 CPU clocked at 3.20 GHz running Windows 10 (64 bit).
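The time-reduction metric can be sketched as follows: the per-QP saving (T_ori − T_mod)/T_ori is averaged over the four QPs of the set, which is our reading of Equation (25). The timing values in the test are illustrative:

```python
# Sketch of Equation (25): average encoding time reduction over the QP set.
QPS = (22, 27, 32, 37)

def time_reduction(t_ori, t_mod):
    """t_ori, t_mod: dicts mapping QP -> total encoding time (seconds) for
    the anchor encoder and the modified encoder. Returns the average
    percentage of encoding time saved."""
    return 100.0 * sum((t_ori[q] - t_mod[q]) / t_ori[q] for q in QPS) / len(QPS)
```

A method saving 40%, 50%, 45%, and 45% at the four QPs would thus report an overall time reduction of 45%.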

Experimental Results
To evaluate the performance of each individual method, several test videos were used. The results are summarized in Tables 4 and 5. Table 4 presents the results for the proposed method with and without normalization. The normalization is effective when the brightness of the video is low or the complexity of the content varies considerably, as with Mobisode2, Keiba, and Johnny in Table 4; with normalization, the proposed method obtains a better BDBR with little loss in time savings. A video with average brightness, such as BlowingBubbles in Table 4, is not affected by normalization. Table 5 presents the results for acceleration of CU partitioning alone and for acceleration of both CU partitioning and intra mode decision. The acceleration of CU partitioning reduced the complexity of the encoding process by approximately 40% on average, with a slight increase in BDBR. The results also reveal that the acceleration of the intra mode decision can reduce encoding time by approximately a further 10% with a negligible effect on BDBR. Tables 6 and 7 present the results for the proposed method and previous methods [22,24,25,34,35]. The proposed method reduced the encoding time by 45.33% on average and increased the BDBR by 1.07% when compared to HM 16.22. The symbol * indicates that some frames of a sequence were used in the training set in [22]. As indicated in Tables 6 and 7, the time savings of the algorithms of Zhang et al. [22], Jamali et al. [34], Sulochana et al. [24], Amna et al. [25], and Yin et al. [35] were 48.02%, 47.0%, 31.9%, 47.0%, and 32.6% on average, respectively, and their average BDBR increased by 1.39%, 1.44%, 0.83%, 1.5%, and 0.87%, respectively. The BDBR of the proposed method was lower than those of [22,25,34], indicating that the proposed method predicts most CUs correctly. In addition, to balance the performance of BDBR and TS, we adopted the ratio TS/BDBR for a better overall evaluation.
This evaluation metric is also used in [36][37][38], enabling an intuitive comparison of the results. Table 8 presents this evaluation metric for the proposed method and previous methods [22,24,25,34,35].
The results reveal that, for the same increase in BDBR, the time savings of our proposed method are the best.

In addition to these test video sequences, the characteristics of VSNs were considered: VSN cameras capture videos of distant objects/scenes from a certain direction [3]. Accordingly, six video sequences from [28] were used as visual sensor videos to evaluate the proposed algorithm: FourPeople, Johnny, KristenAndSara, Vidyo1, Vidyo3, and Vidyo4. The six video sequences are displayed in Figure 10; each has a resolution of 1280 × 720 and a frame rate of 60 fps. Table 9 presents the results for the proposed method and for [22,34]. The proposed method reduces the encoding time by 53.72% and increases BDBR by 1.13% on average compared with HM 16.22. Although the time reduction of Zhang et al. [22] is 8% higher than that of the proposed method, its BDBR is twice as high. Moreover, both the BDBR and the time savings of the proposed method are superior to those of the algorithm of Jamali et al. [34]. Table 9 demonstrates that the proposed method achieved higher efficiency and a better balance between BDBR and time reduction for VSNs than previous algorithms. Figure 11 illustrates the splitting results for the default HM16.22 algorithm and the proposed method with QP set to 22: CU partitioning is skipped when a block is flat, and the splits closely follow the textures when a block is complex. Note: The symbol * indicates that some frames of a sequence were used in the training set in [22].

Conclusions
In this paper, a hardware-friendly and high-efficiency H.265/HEVC encoder for VSNs is proposed. The proposed method exploits the gradient of the texture to skip redundant CU partitioning processes and to facilitate efficient intra prediction. The experimental results reveal that the proposed method can reduce the encoding time by 45.33% while increasing BDBR by only 1.07% compared with HM16.22. Moreover, the performance of the proposed method on six visual sensor video sequences was superior to that of previous algorithms. In summary, our proposed method achieves high-efficiency encoding with more consistent encoding time reductions across all test sequences and only a small increase in BDBR.
HEVC is a block-based hybrid coding architecture that provides inter prediction in addition to intra prediction. Based on the experience gained in developing the proposed method, acceleration algorithms for inter prediction or for other block-based hybrid coding architectures can be developed in future work.