Early CU Depth Decision and Reference Picture Selection for Low Complexity MV-HEVC

: The Multi-View extension of High Efﬁciency Video Coding (MV-HEVC) has improved the coding efﬁciency of multi-view videos, but this comes at the cost of the extra coding complexity of the MV-HEVC encoder. This coding complexity can be reduced by efﬁciently reducing time-consuming encoding operations. In this work, we propose two methods to reduce the encoder complexity. The ﬁrst one is Early Coding unit Splitting (ECS), and the second is the Efﬁcient Reference Picture Selection (ERPS) method. In the ECS method, the decision of Coding Unit (CU) splitting for dependent views is made on the CU splitting information obtained from the base view, while the ERPS method for dependent views is based on selecting reference pictures on the basis of the temporal location of the picture being encoded. Simulation results reveal that our proposed methods approximately reduce the encoding time by 58% when compared with HTM (16.2), the reference encoder for MV-HEVC.


Introduction
High Efficiency Video Coding (HEVC) is the latest video compression standard, introduced in 2013.When compared to H.264, HEVC offers almost double the compression ratio with nearly the same video quality [1].Moreover, HEVC supports resolutions up to 8192 × 4320 pixels, which is essential to important applications, such as augmented reality, live streaming, and HD video conferencing.With the development of fast Internet, mobile networks, video-enabled devices, and video-sharing websites, the data traffic is rapidly shifting towards video content.By the year 2022, 82% of all the data traffic is expected to be video data [2].Therefore, video will play a dominant role in deciding the future of Internet and consumer devices.Video content generates a huge amount of data, especially when the resolution of the video is high.As video is a sequence of frames/pictures being displayed at a certain frequency, there is a high similarity between the pictures and within a picture in video data.Utilizing these characteristics of the video data, compression algorithms have been developed [3].An encoder generally exploits the spatial and temporal correlation among the video data to reduce the number of bits in which the video information is stored.Video data in compressed form are sent over the network, and the receiver has to decode the compressed data to get the original video.Therefore, both the sender and receiver need to agree upon an encoding algorithm.To address the aforementioned issue, video compression standards have been developed [4].These standards define the compression procedure, so that the consumer devices at the receiver end can reliably decode the compressed video.There are two main research fields in video coding, which are (i) the compression efficiency, which is a measure of the number of bits required to send a certain amount of video content.More information in a lesser number of bits corresponds to higher compression efficiency of a standard and (ii) encoder complexity, which is a measure of time required to encode a certain amount of video.Encoder complexity becomes very important in scenarios where the video information is required to be sent in a shorter span of time.With the rapid development of stereoscopic and auto-stereoscopic displays [5,6], multi-view and 3D video is gaining more popularity.Hence, recent standards also address this type of video data.H.264 has a multi-view extension known as Multi-view Video Coding (MVC) [7], and HEVC has MV-HEVC.The multi-view extensions of the standards use the multi-view video concept in which multiple cameras capture the same scene.The concept of Free View point Video (FVV) [8,9] also needs multi-view videos for virtual view synthesis [10].In addition to the main standard, in the multi-view extension, the inter-view similarity of the video content is also exploited to get more compression.This further increases the encoding complexity.Therefore, more time is required to encode the videos in this format.Our work is one of the latest additions in the aforementioned domain in which we propose two techniques to reduce the encoding complexity of the encoder.The main contribution of our works are listed below.

•
We propose an Early Coding Unit (CUs) Splitting (ECS) method, which is based on Coding Unit (CU) splitting information available from the base view.Moreover, the neighborhood of the co-located Coding Tree Unit (CTU) of the base view is used to derive the threshold depth for current CU in dependent views.This threshold depth value is further improved in the Temporal Level (TL)-4 of the pictures.

•
We present an Efficient Reference Picture Selection (ERPS) method, in which we avoid the pictures in the reference pictures' list, which are least probable for prediction.The proposed ERPS method also works in TL-4 because the reference pictures available have minimum temporal distance at this stage.

•
We believe that the proposed ECS and the ERPS strategies provide much better performance for various sequences than previous published works.
The rest of the paper is structured as follows.Section 2 presents a brief overview of the related works, while Section 3 familiarizes readers about the fundamentals of the splitting and mode selection procedures of the basic coding and prediction blocks.Section 4 explains the complexity measuring parameters for the encoder.Section 5 describes the proposed ECS and the ERPS methods and their initial evaluations for the defined parameters.Section 6 lists detailed results and comparisons with recent state-of-the-art methods.Finally, Section 7 concludes the work and hints towards future research directions.

Literature Review
Below, we briefly consider some important methods that have been proposed to reduce the complexity of HEVC.Shen et al. [11] used different methods, such as the Rate Distortion (RD) cost [12], Motion Vector (MV) distribution [13], and the Coded Block Flag (CBF) [14], to achieve low complexity.Utilizing the observation that the depth of the CUs is generally similar to its neighborhood, the depth of neighboring CUs is used to decide the depth of the CU that is currently being encoded [15][16][17][18][19][20][21].Later, Shen et al. [15] proposed a complexity reduction solution by selecting prediction modes according to three regions based on motion activity.In their work, complexity was reduced by studying the correlation between the different levels of CUs and the neighborhood.In [18], the spatio-temporal correlation between the depth of CTUs was used to reduce the depth level of the current CU accordingly.In [19], the published work used the information of the consecutive frames and CU levels from co-located CUs to avoid certain depth levels.A fast CU decision method, based on the Markov Random Field (MRF), was proposed in [20], which uses the variance of the absolute difference-based feature for CU selection.The aforementioned works are based on the observation that the neighboring decisions are mostly similar in a picture and also in co-located regions of available reference pictures.It is important to state here that all the pictures do not have similar partition decisions.Therefore, there are some regions normally referred to as boundaries, around which the partition decisions are not the same [13].Few researchers have used RD cost for the early decision of CUs [22,23].Lee et al. [22] proposed an RD cost-based fast CU decision method, which decides about the skip mode and CU splitting.Shen et al. [23] proposed three adaptive decision methods, which are early skip mode decision and two decision modes.One is based on prediction size correlation information, while the other is based on RD cost correlation.The aforementioned methods are well cited in the literature.However, they are not designed for multi-view videos.
In MV-HEVC, extra information from the previously-encoded view is also available for prediction.This aspect of multi-view videos of MV-HEVC was not exploited in the above-mentioned methods.MV-HEVC also considers inter-view prediction from co-located pictures of the base view.Therefore, its coding efficiency is further increased at the cost of increased encoding complexity.To the best of our knowledge, very few researchers have worked on reducing the complexity of MV-HEVC.In [24], the researchers proposed an inter-view prediction method for the HEVC-based multi-view video.The aforesaid work mostly used the maximum depth of the co-located CTU of the base view as the threshold for corresponding CUs of dependent views.Moreover, the researchers also encoded the multiple views separately as simulcast HEVC.The proposed method presented a nice idea for exploiting the inter-view correlation between the views, but it was limited to only the co-located CTU in the base view.Due to the disparity between the views, using only the co-located CTU is not a good option.We address this shortcoming in our proposed method by increasing the prediction window size in the base view.In [25], the authors introduced a fast CU decision algorithm.It predicts the depth of the current CU by determining a threshold depth value, which is calculated from the depth values of 1 inter-view co-located, 1 temporally co-located, and 3 neighboring CUs.MV information of the neighborhood CUs of current CU is used to correct threshold values calculated earlier.Due to the disparity between the views, the co-located CU of the base view is not a good option for predicting CU depth for dependent views.We address this issue in our proposed method, by using a larger area around the co-located CU.The temporally co-located CU used in [25] becomes ineffective due to motion and the temporal distance between the pictures.The three adjacent CUs used in [25] are only good when CUs (the CU being predicted and the CU being used for prediction) belong to the same object in the scene.We do not use adjacent CUs for predicting the depth threshold because this does not work when all the CUs involved do not belong to a single object.We also do not use temporally co-located CU; instead, we made use of temporal levels in our method.Recently, in [26], the published work predicted the CU depth of the dependent view through the depth decision made in the same region of the base view.Moreover, the prediction of the depth threshold is achieved by nine CTUs instead of using the CU neighborhood.It addresses the weakness of the method in [24].In our proposed method, we further improve it by adding temporal levels to the threshold prediction.Whereas in [27], the authors suggested methods to implement encoding processes on parallel computing platforms, they also proposed a CU splitting termination method, which is based on Quantization Parameter (QP) values.Their algorithm is based on the observation that, with the increase in the QP value, the chances of CU splitting decrease.
The aforementioned works are nice efforts in the domain of MV-HEVC.However, we observed that still, many aspects of the correlation between the splitting and mode decision of a CU are not exploited.Therefore, our study mainly handles the aforementioned issue and exploits some of these areas.Below, we briefly introduce some basics of the standard, so that the reader can become familiar with the terminology used in the following sections.

Fundamentals
A typical prediction structure for three-view MV-HEVC is shown in Figure 1 in which View-0 is the base view that is encoded first without using inter-view prediction information.View-1 and View-2 are dependent views, as these use the inter-view prediction from the base view to improve the coding efficiency.However, this increases the complexity of the encoder [28,29].MV-HEVC uses the same basic prediction method and partitioning structure of the HEVC with the addition of using inter-view prediction.To be precise, the HEVC uses the quad tree partitioning structure [30][31][32][33].The basic unit used for compression in HEVC is termed CTU, which can have a maximum size of 64 × 64.This maximum size can be controlled through the configuration file.
Figure 2a shows the splitting of a CTU into Coding Units (CUs), their processing order, and corresponding depth levels.Each CTU may contain a single CU or it may be split into four equally-sized square-shaped CUs.This process is continued until it reaches the minimum CU size of 8 × 8.Each CU can be either predicted from previously-encoded reference pictures, that is inter-prediction, or from the same picture being encoded, which is intra-prediction.The CU can be further divided into a single or more Prediction Units (PUs).With a 2N × 2N size of CU, the possible PU sizes are shown in Figure 2b.Intra-prediction only uses sizes 2N × 2N and N × N for PUs, whereas inter-prediction uses all sizes for PUs.The use of size N × N is limited to the minimum CU size.The values for N: (32,16,8,4) for the depth levels D0, D1, D2, and D3, respectively as shown in Figure 2c.The CTU partitioning process of the HEVC is very complex.Therefore, most of the researchers have targeted simplifying this process.For example, Choi et al. [34] proposed an Early CU (ECU) decision method, which decides splitting on the basis of the current best mode decision of a CU depth.If the current best mode is determined to be skip mode, then further splitting is not done.Yang et al. [35] proposed a method that detects the skip decision early.This method is known as Early Skip Detection (ESD) in the related literature.Gweon et al. [36] proposed the CBF-based fast mode decision method.It stops splitting PUs when after the evaluation of a 2N × 2N CU, the root CBF is zero.These methods are available in the HEVC reference encoder (HTM) as optimizing tools for low complexity HEVC. Figure 3 shows the CU and PU decisions' process.
First, the cost for MODE-INTERwith a 2N × 2N partition size is calculated.If it is activated, then the early skip condition is checked.If it is true, then the bit cost is updated, and other modes are left unchecked for the current depth.If it is not true, then the CBF-fast condition is checked.If this condition is true, then the bit cost is updated, and the best current mode is set to MODE-INTER with a 2N × 2N partition.If this condition is not true, then the cost for MODE-SKIPis calculated and the mode cost is updated.Then, MODE-INTER with partition size N × N is calculated, after checking the depth and CU size requirements.Then, first, the mode cost for partition size N × 2N is calculated.Then, the mode cost for partition size 2N × N is calculated and then Asymmetric Motion Partition (AMP) mode evaluation, according to Table 1.After every inter-partition, the CBF-fast method is applied.Meanwhile, inter-modes and intra-modes 2N × 2N and N × N are checked.Mode N × N is only checked at the maximum depth.Then, before going to further splitting, the early CU condition is checked.The optimal depth selection for a CU and its thorough mode decision make the encoder very complex, which has attracted many researchers to find ways to reduce its complexity.The merge/skip mode is selected more than 88% [37] on average.Therefore, most researchers have targeted the early merge mode and early skip mode decision to achieve low complexity.Merge mode finds a PU from spatial and temporal merge candidates, whose motion information can be used for the current PU.If for the current PU, there exists such a PU in its merge candidates, whose transform coefficients and motion vectors are negligible, then it is coded with the skip mode.It is important to state here that the prediction residual is not transmitted in skip mode.The video sequence of pictures is partitioned into coded video sequences, which can be decoded independently.Picture Order Count (POC) is the identification of a picture in the order in which the picture has been generated.Encoding Order (EO) is the order in which the pictures are encoded.Only the pictures that have already been encoded can be used for prediction.The coding structure defines the sequence of pictures, their POC, the EO, and the dependence between the pictures for prediction.The sequence of pictures in the coding structure is generally called the Group Of Pictures (GOP).In the HEVC literature, this set is normally referred to as the Structure Of Pictures (SOP).Figure 4 shows a GOP of eight pictures, with EO and POC.Pictures are grouped into different Temporal Levels (TLs).This division of pictures into TLs is based on the EO of the pictures and the POC of the available reference pictures.At TL-4, the availability of adjacent pictures for prediction is maximum.The decoded picture buffer contains the decoded pictures.Reference pictures are used from these decoded pictures.The reference picture set contains all the pictures that may be used for reference.For the prediction of the current picture, the reference pictures are stored in the reference picture list.There are two types of reference picture lists used in HEVC, which are List-0 and List-1.List-0 is used when the slice is of the P-type.For the B-type slice, both List-0 and List-1 can be used.For uni-prediction, List-0 or List-1 is used, and for bi-prediction, both Lists are used.Inside a reference picture list, the pictures are identified by the reference picture index.The pictures used as a reference from the reference picture lists are selected on the basis of finding the best available match.As the content of the video is mostly similar between consecutive pictures, there is a strong correlation between the pictures, which are located at a short temporal distance from each other.In the next section, we briefly overview the related works.In the next section, we briefly describe the complexity measuring parameters that have also been used in our proposed methodologies and evaluations.Table 1.Fast Asymmetric Motion Partition (AMP) mode evaluation [38].

Conditions Actions
Best mode is (2N × N) Try nL × 2N and nR × 2N Best mode is ((2N × 2N) && (!merge mode)&& (!skip mode)) Try all AMP modes Parent CU is AMP mode Only try merge mode for all AMP modes Parent CU is ((2N × 2N) && (!skipped)) Only try merge mode for all AMP modes Parent CU is ((intra) && (best mode is 2N × N)) Only try merge mode for 2N × nU and 2N × nD Only try merge mode nL × 2N and nR × 2N CU size is 64 × 64 No AMP modes are evaluated

Complexity Measuring Parameters
The computational complexity is due to the calculations of arithmetic functions used during the encoding process.According to the findings in [39], the functions, such as the Hadamard (HAD) transform, Sum of Absolute Difference (SAD), and Sum of Squared Error (SSE), are the main cause of the computational complexity.To measure the complexity reduction of our proposed method, we investigated these parameters.For a block, being encoded at the current size, its best match is searched, spatially, temporally, and in the base view within a search window.This process is repeated for each block size, all possible prediction modes, and for all possible search locations.The best match implies a minimum difference between the current block and its match from all the available options.Equation (1) shows how the difference between the current block and the predicted block of the same size is calculated.
where i and j represent the location of the pixel.SSE and the SAD as described in Equations ( 2) and ( 3), respectively, are used to find the error between the blocks being encoded with its possible matches.
This process also needs to consider bit cost, which is the number of bits required to describe the splitting and prediction information.The overall cost functions for the prediction parameter and mode decisions are described in Equations ( 4) and (5), respectively.This is generally called Rate Distortion (RD) cost.
where B mode and B pred are the bit costs for the mode and prediction decisions, respectively.λ pred and λ mode are Lagrange multipliers, and lu and ch stand for luma and chroma, respectively.w ch is the weighting factor for the chroma part of SSE.These processes are called each time, and a comparison is made to get the best match.The number of times a process is called during the encoding gives insight about the complexity of the encoding.In our comparison, we used these parameters to demonstrate the reduction in encoding complexity.To compare the complexity of the encoder in its original form with our proposed method, we use Equations ( 6)-( 8) to get the percent reduction in these operations.
∆H ADs(%) = H ADs (Orig.)− H AD (Prop.)H ADs (Orig.)× 100% ( 6) where "Orig."means the values for the the original configuration of the reference encoder and "Prop."means the values for the proposed method implemented in the reference encoder.

Proposed Methods
In this section, we explain our proposed complexity reduction methods in detail.In Section 5.1, initially, the proposed ECS method is explained step by step.Then, it is compared with the reference encoder in terms of percentage reduction in the selected operations.In Section 5.2, first, the proposed ERPS method is discussed followed by a comparison with the reference encoder.In all of our observations, analysis, and implementation, we used the standard test sequences as shown in Table 2.
In places where the QP value is not mentioned, averaged values for different QP values (25,30,35,40) are used.

Proposed ECS Approach
The selection of the best CU size is a time-consuming process because the encoder has to go through all possible combinations of sizes and available reference pictures for the selection of the best match on the basis of the minimum RD cost.If the CU size selection process can be reduced, the overall encoding time can be reduced considerably.Therefore, we target the early termination of CU splitting, based on the inter-view and temporal information available.The immediate question that arises is: is there any room for further optimization of the CU size selection process, i.e., if we can predict the maximum CU size early, would we be able to reduce the encoding time?In order to find an answer to this question, we want to know about the percentage relation between CTU depth levels.Table 3 shows the percentage relation between the depth levels (D 0 − D 3 ).As it is evident in Table 3, a high percentage of CUs are best matched at the D 0 level, which is due to the fact that the major portion of a picture is similar.Table 3 gives us an answer to our question that, yes, there is a very small percentage (1.5%) of CUs that are best matched at Depth Level 3, and on average, 76.3% of CUs are matched at Depth Level 0, which means that if the maximum depth level of the majority of the CUs is predicted correctly, then the time-consuming splitting and matching process for higher depth levels can be saved.
In multi-view video coding, different cameras capture the same scene, which means that the content of the videos is almost the same.The slight difference is due to the change in angle of capturing between the scene and the cameras.In other words, the decisions made by the encoder for multiple video streams of multiple cameras should also be very similar.This can also be observed from the CU splitting decisions shown in Figure 5.The CU splitting decisions in the same region between the pictures are very similar.This implies that much correlation exists between the views and corresponding regions of the picture.This is our motivation to use the CU splitting decisions of the same region of the base view to predict the CU depth threshold of the dependent views.By the same region in the base view, we mean a square area/window, centered onthe co-located CU of the dependent view.Adding this with the observations in Table 3, we build our ECS method.To get the depth threshold value for the CU being encoded in the dependent view, we used a window of 3 × 3 CTUs around the co-located CU in the base view.This is graphically shown in Figure 6.The CU, which is currently encoded as shown in Figure 6a, and its co-located CTU along with its neighborhood are depicted in Figure 6b.To get a broader observation area, the co-located CTU and its eight surrounding CTUs were considered in our analysis.CTUs of the base view were assigned the highest depth level of the CUs they contained after encoder decisions.As an example, the co-located CTU of the base view in Figure 6a  Table 4 gives us an idea of how accurately we can predict the depth threshold, in terms of "hit" and "miss".If the depth of the current CU is higher than our assumed depth, then we call it a "miss", otherwise it is called a "hit".It can be seen from the results that the hit percentage is very high, which means that the predicted depth threshold D r1 is highly accurate.Table 5 summarizes the percentages of depths of CUs in the case of a hit.If the CU depth is predicted at D r1 = 0, then the chance to reduce complexity is highest, because the matching process for higher depths is avoided.Similarly, if a depth level D r1 = 3 is predicted for a CU, then it does not reduce the encoding complexity because the encoder has to go through all the depth levels.
After utilizing the inter-view correlation, now we move on to the temporal domain.It can be seen in Figure 4, at TL-4, that the adjacent pictures are available as reference pictures.Therefore, in this case, there is a very high probability that the scene has not changed and that the best match can be found at lower depth levels as compared with lower TL pictures.Table 6 summarizes the relation between depth levels of CTUs and TLs.Here, CTUs are considered instead of CUs because CTUs are of the same size and have a constant number in each picture.Therefore, a CTU is considered at depth level D 3 when it contains at least one CU of depth level D 3 , and a CTU is considered to be at depth level D 2 when it has at least one CU having depth D 2 and no CU with depth D 3 .It can also be seen that at TL-4, above 95% of CTUs are at depth level D 0 and 98.1% of CTUs are at depth levels D 1 and D 2 .Only 1.87% of CTUs have depth level D 2 and D 3 .Therefore, for TL-4, we used D r2 as the depth threshold, shown in Equation (10).
Table 7 shows the average CU depth relation with TLs in the case of the hit scenario.Here, we can see that a high percentage of CUs are at depth level D 0 .These results are very attractive, but when we go through the details of the encoding process, the possible reduction in encoding complexity is not that much, as can be seen from these results.At TL-4, we find that most of the lower depth CUs are encoded in skip mode, which means that these CUs do not go through the time-consuming splitting and matching process for depth levels.Therefore, detecting these lower depths earlier does not play a significant role in the complexity reduction as the percentages of Table 7 are suggesting.We propose an algorithm for early splitting of the CU.We call this new method ECS.Our algorithm is based on CU splitting information of the base view and CU splitting information related to the TLs of the pictures.The flow of the proposed ECS algorithm is shown in Figure 7.In our proposed method, we aim to reduce the complexity of dependent view encoding, where the base view is used for gathering information.Our proposed ECS method can be divided into two main parts.In the first part, we only deal with pictures that do not belong to TL-4.When the encoding for the dependent view starts, we first check whether the current CU belongs to a boundary CTU.If it belongs to a boundary CTU, then the normal encoding process is used.If it is a part of non-boundary CTU, then we compare it with the maximum depth threshold D r1 , as shown in Equation ( 9), which we have calculated for this CU from the base view.If the current CU depth is equal to the calculated depth threshold, then further splitting to higher depth levels of the CU is not done.In the second part, we deal with pictures that belong to TL-4.As shown in Table 6, at TL-4, the percentage of depth level D 0 is very high as compared with lower TLs.On the basis of this observation, we modified our method for TL-4.Therefore, we used depth threshold D r2 Equation (10) for TL-4.
Operations with respect to sizes of the blocks are used as a comparison tool.It can be seen in Table 8 that there is a significant decrease in the number of SSE, SAD, and HAD operations.To show the effect of resolution, average values of 1024 × 768 and 1920 × 1088 pixels sequences are separately calculated.It can be seen that in the case of the SAD64 and SSE64 operations, the percentage reduction in operations is very low, which means that the CUs at depth level D 0 are not reduced.The highest percentage decrease in SAD operations can be observed at SAD4 and SAD8, which is on average 76.2% and 75.6%, respectively.These are followed by SAD16 and SAD32, which on average are 70.8% and 56.8%, respectively.The percentage reduction in operations SAD12 and SAD24 is comparatively low.A similar pattern can be observed in the percentage reduction of SSE and HAD operations, where the percentage reduction in operations decreases with the increase in the size of the block.This happens because we are trying to reduce the splitting process, and in comparison with the original encoder, the lower depth levels of CUs are mostly avoided in our proposed method.We get the higher percentage reduction in operations, which are related to higher depth levels.It can also be seen that this phenomenon is independent of the video content.These results give us a general view of the complexity reduction of the encoder, as we can see for the percentage reduction of the operations.The size of the operation block is directly proportional to the amount of time the operation takes.Therefore, the high percentage reduction in smaller block size operations might not be reflected as much in the overall encoding time.

Proposed ERPS Approach
Since video is a sequence of pictures captured in discrete time intervals, these pictures contain many similar contents.The encoder uses this aspect to compress the video data.The encoder maintains a set of encoded frames/pictures as reference pictures for the picture being encoded.This set of reference pictures is selected on the basis of EO and temporal distances.Some pictures in reference lists may be temporally near and some may be far from the picture being encoded.As video is a sequence of discrete pictures in the time domain, the content similarity between the pictures of the video is inversely proportional to the temporal distance between the pictures.Based on this characteristic of video and the video encoder, we built our reference picture selection method.At TL-4 for a picture being encoded, the immediate previous and next pictures are available in the reference picture list.There is a very high probability that these adjacent pictures might be selected as reference pictures rather than a picture, which is at some temporal distance from the picture being encoded.
To further strengthen our argument, the correlation among these pictures, the reference picture selection in terms of the reference indices of both reference lists is analyzed.Moreover, we want to know which pictures are referenced mostly for encoding each picture, so that we can avoid the matching process for the reference pictures that are least expected to be selected as the reference picture.We performed the analysis for both the reference picture lists.We divided our analysis result on the basis of TLs, as shown in Figure 4. Table 9 illustrates our analysis of reference indices in terms of the percentage for List-0 and List-1 for various temporal levels.It can be observed from these results that for List-1, at TL-4, the selection of both reference Index-1 and Index-2 is less than 2.5%, while Index-0 is selected 97.5%.Now, our logical argument is also backed up by practical results.Reference picture selection is dependent on the temporal relation between the picture being encoded and the picture being referenced for prediction.Now, using the observations in Table 9, we can reduce the computation complexity by avoiding the search and matching process for reference Index-1 and Index-2 in List-1.During the encoding process when the TL of the picture being encoded is 4, then we do not use the reference pictures indexed as 1 and 2 in reference picture List-1, only the reference picture, which is indexed as 0 in List-1, used as the reference picture option.For pictures that do not belong to TL-4, normal encoding process is followed.
Algorithm 1 shows the pseudo-code of the proposed ERPS method in the encoding process, which processes the TL of the picture being encoded and outputs List-1, which contains Index-0, Index-1, and Index-2, as shown in Algorithm 1. Table 10 shows the results' comparisons obtained by our proposed ERPS algorithm with HTM (16.2).We compare both on the basis of the number of SAD, SSE, and HAD operations done in each configuration of the encoder.The results in Table 10 show the percentage reduction in the number of these operations by our proposed ERPS method with respect to the HTM (16.2) encoder configuration using Equation (7).In the case of the SAD operation, the percentage reduction for sizes 8, 16, 32, and 64 generally shows a similar trend.This means that reference picture selection does not affect a particular size operation, as was observed in the case of the ECS.The percentage reduction for these sizes is also observed to be independent of the video content.It can be seen that the percentage decrease in the SAD and the HAD operations is somewhat similar, but the percentage decrease in the SSE operations is very low.From Equations ( 4) and ( 5), we see that the SAD operation is called in the cost function for the prediction parameter decision, and the SSE operation is called in the cost function for the mode decision.The prediction parameter decision process is simplified due to our proposed ERPS method.Therefore, the effect can be seen in the percentage reduction in the SAD and the HAD operations; whereas a slight percentage decrease can be observed in the SSE operation because it is used to calculate the cost function for the mode decision.We can see that there is a noticeable reduction in these operations due to our proposed algorithm, but it is not as much as one would have expected.The reason for that is that at TL-4, a huge majority of the CUs are encoded in skip mode.Table 11 shows the results for the case when both the ECS and ERPS algorithms are applied to the encoder.These results only show the percentage reduction of operations.A general trend similar to the results shown in Table 8 can be observed.Since ERPS is only applied on TL-4 pictures, it does not play the dominant role in operation reduction of the overall proposed method.One main contribution when compared with ECS results can be observed in the percentage reduction of the HAD64 operation.The complexity of the encoder has definitely decreased, but at this point, we do not know how much complexity has decreased because the results do not show the comparison in terms of encoding time.Apart from encoding time, we also need to check other parameters like bitrate, the PSNR, and the Bjøntegaard Delta Bit Rate (BDBR) [40].On the basis of these standard comparison parameters, both methods are compared in the next section.Table 9. Reference indices selected as a percentage for both List-0 and List-1 for different TLs.

Results and Comparisons
We performed detailed simulations according to the test conditions as shown in Table 12.For all the test sequences, we show the result comparisons with HTM (16.2), the reference encoder.Table 13 shows that the proposed ECS and ERPS methods significantly reduced the encoding time of the encoder with very little trade-off.To show the effect of resolution on the proposed methods, average results for the 1024 × 768 and 1920 × 1088 resolution sequences are also shown.The performance parameters, such as percentage change in the bit rate (∆bitrate (%)), percentage change in encoding time (∆time (%)), and the change in PSNR (∆ PSNR (dB)), are calculated using Equations ( 11)- (13).
∆Time(%) = Time (Orig.) − Time (Prop.)Time (Orig.) × 100% ( 12) Table 13 shows the comparative results of the proposed ECS and ERPS methods with the reference encoder.Some general observations regarding the proposed methods are discussed below.

•
From our results, we observed that the proposed ECS and ERPS methods generally showed a similar pattern for all test sequences, which essentially means that our methods are independent of the video content.

•
On average, for high resolution videos, our methods performed slightly better as compared with low resolution videos.This is understandable because as the resolution of the video increases, more pixels define an object, and therefore, larger CU sizes are assigned.

•
A relatively larger area is taken as a source to predict the CU threshold depth.Therefore, there is a very small probability of increase in the error or bitrate.At the same time, considerable reduction in the encoding time was achieved.

•
Another important aspect is the motion in video.Fast motion content video, for example, "Kendo".This case had a slightly lower gain in terms of complexity when compared with other similar resolution videos.This behavior is also understandable because we were thresholding the CU depth on the basis of the neighborhood, which may not be as predictable as in the case of slow-moving or still objects.

•
The window size used to predict the threshold depth can be made adaptive in the future by using motion information and the disparity between the views.

•
Generally, it has also been observed that the CU depth is directly propositional to the motion in the picture.Fast-moving parts of the picture are filled with higher depth level CUs, while parts with lower motion have a lower depth level.Therefore, the PUs along with their motion information can be a good parameter for further improvement.

•
Similarly in reference picture selection, only those pictures in the reference picture list were avoided, which were already available in the other reference list or which were very rarely referenced.Therefore, by avoiding such reference pictures, the encoder's complexity was reduced without compromising its compression efficiency.
In Table 14, we compare our overall results with sate-of-the-art methods [24][25][26].The results shown are average values for different QP values for all the test sequences of Table 2.The discussion below sheds detailed light on the compared works listed in Table 14.

•
As shown in Table 14, the results of our proposed methods when applied simultaneously showed a similar trend for all the test sequences.The encoding times for high resolution videos, such as 1920 × 1088, were improved slightly more than low resolution videos of 1024 × 768.

•
Our reduction in encoding time in comparison with the reference encoder was on average above 58% with very minor loss in other parameters.When applied together, our methods reduced the encoding time by 58.17% on average, with the percentage change in the bitrate of −0.14, the change in PSNR of 0.06, and BDBR of 0.91.

•
In [24], the authors did not use the MV-HEVC reference encoder HTM; instead, they used the reference encoder for HEVC.They first encoded the base view (using the HEVC reference encoder) and then stored the maximum depth level for each CTU.Later, they encoded the dependent views one by one.Moreover, they also restricted the independent views at the maximum CU depth to the maximum depth of the corresponding CTU in the base view.Therefore, their method did not consider the disparity among the views while deciding the maximum depth.The published results are not very appealing.

•
In [25], although the CU splitting process was reduced, the researchers used only a single CU, i.e., the co-located CU from the base view.Due to the disparity among the views, the co-located CU of the dependent view in the base view may represent a different object.As the object may not be located in the same position of the picture in the dependent view, using the co-located CUs' depth information barely gives useful depth information.Moreover, [25] also used the temporally co-located CUs' depth information.If the reference picture is closer to the picture being encoded and the CU being encoded does not belong to a fast-moving object, then the temporally co-located CUs' depth information can be useful.We empirically observed that just using this information is not a solid base to obtain sound results.Furthermore, the main focus of this work was based on using the previously-encoded adjacent CUs to get the depth information for the current CU.The adjacent CU decisions were very similar apart from some boundary areas.In boundary areas, this method might not work.Therefore, the authors used MV information of adjacent CUs to improve the results.Much work has been done to reduce the complexity of HEVC by utilizing the adjacent and temporally-co-located CU splitting information.For MV-HEVC, one also expects to see the utilization of information from the base view as the main source of complexity reduction.

•
In [26], the authors considered a large neighborhood of eight surrounding CTUs of a co-located CTU in the base view to focus on the disparity among views.Their published results were a bit improved in terms of the encoding time, ∆Bitrate, ∆PSNR, and BDBR.

Conclusions and Future Work
We proposed Early CU Splitting (ECS) and Early Reference Picture Selection (ERPS) to reduce the complexity of the MV-HEVC encoder.Our main focus was to exploit the correlation between the views to reduce complexity, which is simple, and they can be used together.We also aimed to focus on the correlation between the base view and dependent views and developed methods that were able to simplify the encoding process of MV-HEVC using inter-view similarity.In the proposed ECS method, we limited the splitting of CUs based on the CU splitting information from the base view in such a way that disparity between the views did not affect our method.We also took advantage of the availability of temporally-adjacent reference pictures, TL-4, to improve our ECS results further.In ERPS, we avoided the prediction search from the least probable reference pictures.Our main focus was not to lose much, while reducing the encoding time.We avoided reference indices, which were referenced very rarely in TL-4.Our proposed methods provided a simple and effective solution.We improved the results by utilizing the TL relation with CU depth decisions.In particular, our proposed ERPS method reduced the encoding time with almost no loss in other parameters.As can be seen in Table 14, in general, our results produced by our proposed methodology showed a similar trend for all the test sequences.In particular, the encoding times for high resolution videos were improved slightly more than low resolution videos.Moreover, our reduction in encoding time, in comparison with the reference encoder HTM (16.2), was on average a little above 58%, with very minor loss in other parameters.When applied together, our methods reduced the encoding time by 58.17% on average, with the percentage change in the bitrate of −0.14, the change in PSNR of 0.06, and BDBR of 0.91.
In the future, we aim to improve the reference picture selection decisions of dependent views on the base view.We believe it can be further refined for View-2 after View-1 has already been encoded.Moreover, we also aim to further improve the CU splitting method.Furthermore, correlation between multiple views and distortion values with the CU splitting can also be investigated to further reduce the encoding time.

Figure 1 .
Figure 1.A typical prediction structure for three view MV-HEVC.

Figure 2 .
Figure 2. The Coding Tree Unit (CTU) partitioning into Coding Units (CUs), processing order with depth information, and possible Prediction Units (PUs).(a) CTU partitioning into CUs with processing order, (b) possible PUs for a CU, and (c) corresponding CUs with depth and size information.

Figure 3 .Figure 4 .
Figure 3. CU splitting and PU mode decision steps, as well as CBFfast, early skip, and early CU tools for HTM (16.2).For prediction modes sizes, please see Figure 2b.

2 Figure 5 .
Figure 5. POC 1 of test sequence "Kendo", showing all three views.The CU splitting decisions can also be seen for all three views.

Figure 6 .
Figure 6.(a) Neighborhood of the co-located CTU in View-0; (b) The current CTU in the dependent view.

Figure 7 .
Figure 7. Flow of the proposed Early Coding unit Splitting (ECS) algorithm.

Table 2 .
Test sequences used in the analysis and results.

Table 3 .
Average CU depth percentages of dependent views.

Table 5 .
Relation between the hit scenario and CU depths.

Table 6 .
Relation of average CTU depth percentages with TLs of the pictures in dependent views.

Table 7 .
Hit scenario relation of average CTU depths with TLs of the pictures in dependent views.

Table 8 .
ECS results in terms of percentage reduction of SAD, SSE, and Hadamard (HAD) operations.

Table 10 .
Efficient Reference Picture Selection (ERPS) results in terms of the percentage reduction of SAD, SSE, and HAD operations.

Table 11 .
ECS + ERPS results in terms of percentage reduction of SAD, SSE, and HAD operations.

Table 12 .
Experimental setup and encoder configurations.

Table 13 .
Results relative to the original encoder are shown for the proposed ECS, ERPS and ECS + ERPS = Combined(COM) result.

Table 14 .
Comparisons among different proposed methods in the literature and our proposed (Prop.)methods.