Frame-Based and Subpicture-Based Parallelization Approaches of the HEVC Video Encoder



Introduction
The Joint Collaborative Team on Video Coding (JCT-VC), composed of experts from the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG), has developed the most recent video coding standard: High Efficiency Video Coding (HEVC) [1]. The emergence of this new standard makes it possible to deal with current and future multimedia market trends, such as 4K- and 8K-definition video content. HEVC improves coding efficiency in comparison with the H.264/AVC [2] High profile, yielding the same video quality at half the bit rate [3]. However, this improvement in compression efficiency comes with a significant increase in computational complexity.
Several works on complexity analysis and parallelization strategies for the HEVC standard can be found in the literature [4][5][6]. Many of the parallelization research efforts have been conducted on the HEVC decoding side. In [7], the authors present a variation of Wavefront Parallel Processing (WPP), called Overlapped Wavefront (OWF), for the HEVC decoder, where the decoding of consecutive pictures is overlapped. In [8], the authors combine tiles, WPP, and SIMD (Single Instruction, Multiple Data) instructions to develop an HEVC decoder which is able to run in real time. Nevertheless, the HEVC encoder's complexity is several orders of magnitude greater than the HEVC decoder's complexity.
A number of works can also be found in the literature about parallelization on the HEVC encoder side. In [9], the authors propose a fine-grain parallel optimization in the motion estimation module of the HEVC encoder that allows the motion vector prediction to be computed in all Prediction Units (PUs) of a Coding Unit (CU) at the same time. The work presented in [10] focuses on the intra prediction module, removing data dependencies between sub-blocks and yielding interesting speed-up results. Some recent works focus on changes in the scanning order. For example, in [11], the authors propose a frame scanning order based on a diamond search, obtaining a good scheme for massive parallel processing. In [12], the authors propose to change the HEVC deblocking filter processing order, obtaining time savings of up to 37.93% over manycore processors, with a negligible loss in coding performance. In [13], the authors present a coarse-grain parallelization of the HEVC encoder based on Groups of Pictures (GOPs), especially suited for distributed memory platforms. In [14], the authors compare the encoding performance of slices and tiles in HEVC. In [15], a parallelization of the HEVC encoder at slice level is evaluated, obtaining speed-ups of up to 9.8x for All Intra coding mode and 8.7x for Low-Delay B, Low-Delay P, and Random Access modes, using 12 cores with a negligible rate/distortion (R/D) loss. In [16], two parallel versions of the HEVC encoder using slices and tiles are analyzed. The results show that the parallelization of the HEVC encoder using tiles outperforms the parallel version that uses slices, both in parallel efficiency and coding performance.
In this work, we study two different parallelization schemes: the first one follows a coarse-grain approach, where parallelization is based on frames, and the other one follows a fine-grain approach, focused on parallelization at subpicture level (tile-based and slice-based). Through an analytical study of the presented approaches using the HEVC reference software, called HM (HEVC test Model), we identify the data dependencies in order to determine the most adequate data distribution among the available coding processes. The proposed approaches have been exhaustively analyzed to study both the parallel and the coding performance.
The rest of the paper is organized as follows. In Sections 2 and 3, subpicture-based and frame-based parallelization schemes are presented, respectively. In Section 4, experimental results are presented and analyzed. Finally, several conclusions are drawn in Section 5.

Subpicture-Based Parallel Algorithms
In this section, two parallel algorithms, one based on tile partitioning and the other on slice partitioning, are analyzed. Slices are fragments of a frame formed by consecutive (in raster scan order) Coding Tree Units (CTUs) (see Figure 1). In the developed slice-based algorithm, we divide a frame into as many slices as the number of available parallel processes. All the slices of a frame have the same number of CTUs, except for the last one, whose size corresponds to the number of CTUs remaining in the frame. The process used to calculate the size of each slice is shown in Algorithm 1. This algorithm provides load balances above 99% in most of the performed experiments. If the height (or width) of the frame (in pixels) is not a multiple of the CTU size, there will be some incomplete CTUs (see lines 3 and 4 in Algorithm 1). In addition, the size of the last slice of a frame will always be equal to or less than the size of the other slices (see line 7).
Figure 2 shows the assignment of the slices to the encoding processes performed in our parallel slice-based algorithm. Each parallel process (P i) encodes one slice (S i) of the current frame (F k). This coding procedure is synchronous: a synchronization point is located after the slice encoding procedure. As depicted in Algorithm 1, the size of the last slice is often smaller than the size of the other slices, which theoretically decreases the maximum achievable efficiency. In the worst cases in our experiments, the theoretical efficiency reaches 94.5% (a drop of 5.5%) for a video resolution of 832 × 480, 98.9% for a video resolution of 1280 × 720, 99.4% for a video resolution of 1920 × 1080, and 99.2% for a video resolution of 2560 × 1600. In Figure 1, we show the slice partitioning of a Full HD frame (1920 × 1080 pixels), divided into 8 slices. Considering a CTU size of 64 × 64 pixels, the frame height is 17 CTUs, the frame width is 30 CTUs, and the total number of CTUs per frame is 510. Following Algorithm 1, the size of each slice is equal to 64 CTUs, except for the last one, which has a size of 62 CTUs; this entails a theoretical load balance of 99.6%.
The other subpicture-based parallel algorithm presented in this work is based on tiles. Tiles are rectangular divisions of a video frame which can be independently encoded and decoded. This is a new feature included in the HEVC standard, which was not present in previous standards. In our tile-based parallel algorithm, depicted in Figure 3, a set of encoding processes (P i) encode a single frame in parallel. Each frame (F k) is split into as many tiles (T i) as the number of processes to use, in such a way that each process encodes one tile. When every process has finished the encoding of its assigned tile, synchronization is carried out in order to properly write the encoded bitstream and thus proceed with the next frame. In all the performed experiments, each CTU covers an area of 64 × 64 pixels and each tile consists of an integer number of CTUs. For a specific frame resolution, we can obtain multiple and heterogeneous tile partition layouts. For example, the partition of a frame into 8 tiles can be done by dividing the frame into 1 column by 8 rows of CTUs (1 × 8), 8 columns by 1 row (8 × 1), 2 columns by 4 rows (2 × 4), or 4 columns by 2 rows (4 × 2). In addition, the width of each tile column and the height of each tile row can be set up independently, so a wide variety of symmetric and asymmetric tile partition layouts is possible. In the examples shown in Figure 4, a Full HD frame is divided into 8 tiles in different partitions: four homogeneous partitions (1 × 8, 8 × 1, 2 × 4, 4 × 2) and one heterogeneous partition (4 × 2 heterogeneous).
In our experiments, we have used the tile column widths and the tile row heights that produce the most homogeneous tile shapes. Even though we have selected the most homogeneous partitions, every tile must have an integer number of CTU rows and columns, so a tile layout where all the processes encode the same number of CTUs is not always possible. For example, in Figure 4b, the two leftmost tile columns have a width of 3 CTUs, whereas the rest of the tile columns have a width of 4 CTUs. Moreover, even when we get a perfectly balanced layout, in which each process encodes the same number of CTUs, we cannot guarantee an optimal processing work balance, because the computing resources needed to encode each CTU may differ, as different tiles belonging to the same frame may have different spatial and temporal complexities.
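The near-homogeneous split described above can be sketched in a few lines of Python. This is only an illustration: the function name split_even and the choice of placing the narrower tile columns first are assumptions made here so that the result matches the Figure 4b description; they are not taken from the HM software.

```python
def split_even(n_ctus, parts):
    """Split n_ctus CTU columns (or rows) into `parts` tile columns (or rows).

    Sizes differ by at most one CTU. Placing the smaller sizes first is an
    assumption made to reproduce the Figure 4b description.
    """
    if not 0 < parts <= n_ctus:
        raise ValueError("each tile needs at least one CTU column/row")
    base, extra = divmod(n_ctus, parts)
    return [base] * (parts - extra) + [base + 1] * extra


# Full HD frame: 30 CTU columns divided into 8 tile columns (8 x 1 layout).
print(split_even(30, 8))  # [3, 3, 4, 4, 4, 4, 4, 4]
```

With 30 CTU columns and 8 tile columns, the sketch yields two columns of 3 CTUs followed by six columns of 4 CTUs, as in the example from the text.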
Additionally, we have to take into account that different layouts produce different encoded bitstreams, with different R/D performance, because the existing redundancy between nearby CTUs belonging to different slices or tiles cannot be exploited. In Figure 5, the marked CTUs do not have the same neighbor CTUs available for prediction that they would have in a frame without slice partitions. For a Full HD frame divided into 8 slices, there are 217 CTUs (out of 510) in this situation, which represents 42.5% of the total. This percentage increases as the number of slices does. The R/D performance depends on the number of CTU neighbors which are not available to exploit the spatial redundancy. This effect can be seen in Figure 6a, where lighter gray squares are CTUs to be coded, crossed darker gray squares are non-available neighbor CTUs, and darker gray squares are neighbor CTUs available to perform the intra prediction. In this figure, the acronyms AL, A, AR, and L correspond, respectively, to Above-Left, Above, Above-Right, and Left neighbor CTUs. In the worst case (CTUs marked with a Γ symbol), no neighbor CTUs are available for prediction.
The tile layout also affects both the parallel and the encoder performance. In order to obtain a good parallel performance, we need a balanced computational load. In Table 1, we can see the computational workload, in number of CTUs, assigned to each process, for all the tile layouts presented in Figure 4. The maximum theoretical efficiency that can be achieved is 94%, for the layout based on columns (8 × 1). A CTU is predicted using previously encoded CTUs, and we have to consider that (a) the search range is composed of the neighboring adjacent CTUs, (b) the CTUs are encoded in raster scan order inside a slice or tile partition, and (c) the adjacent neighbors from other slice or tile partitions cannot be used. Considering all these conditions, we can conclude that all CTUs which are located at a vertical or horizontal tile border have their encoding modified, because they cannot be predicted using all the desirable neighboring CTUs. Table 2 shows the total number of CTUs that are affected by this effect due to tile partitioning in a Full HD frame divided into 8 tiles. Square-like tile partitions (2 × 4 and 4 × 2) have fewer CTUs affected than column-like (8 × 1) or row-like (1 × 8) tile layouts. The heterogeneous tile layout (4 × 2 h) is affected in a similar way to the square-like tile layouts, but this type of heterogeneous layout should only be used in heterogeneous multiprocessor platforms, where the computing power differs between the available processors.
Looking at the values in Table 2, we could think that the row-like layout may have a better R/D performance than the column-like layout, because the row-like layout has fewer CTUs affected by the absence of some neighbors for the prediction. However, we should take into consideration that the encoding performance decrease produced in each CTU depends on its position inside the slice or tile. In the intra prediction procedure, the neighbor CTUs used, if available, are the AR, the A, the AL, and the L neighbor CTUs (see Figure 6). Figure 6b,c shows the available CTUs to perform the prediction. The number of available CTUs for the prediction procedure ranges from 0 to 4. As shown in Table 2, the number of CTUs affected in the row-like tile layout (1 × 8) is 210, whereas, for the column-like tile layout (8 × 1), this number rises to 231. In the row-like tile layout (Figure 7a), most of the affected CTUs can perform the intra prediction using only 1 CTU neighbor (see O symbol in Figure 6b), whereas, in the column-like tile layout (Figure 7b), approximately half of the affected CTUs can use up to 3 neighboring CTUs (see Ω symbol in Figure 6c). In Section 4, we will analyze the effect of tile layout partitioning on both R/D and parallel performance.
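The counts quoted above (217 CTUs for 8 slices, 210 for the 1 × 8 layout, and 231 for the 8 × 1 layout) can be reproduced with a short Python sketch. The counting rule used here is an interpretation of the text, not the authors' code: a CTU is considered affected when at least one of its AL, A, AR, or L neighbors exists in the frame but belongs to a different slice or tile. All function names are hypothetical, and the tile sizes assume a near-even split with the smaller tiles first.

```python
from itertools import accumulate


def split_even(n, parts):
    """Near-even split of n CTUs into `parts` groups (smaller groups first)."""
    base, extra = divmod(n, parts)
    return [base] * (parts - extra) + [base + 1] * extra


def tile_map(fw, fh, tile_cols, tile_rows):
    """Tile id of every CTU for a (tile_cols x tile_rows) layout."""
    col_ends = list(accumulate(split_even(fw, tile_cols)))
    row_ends = list(accumulate(split_even(fh, tile_rows)))
    col_tile = [sum(c >= e for e in col_ends) for c in range(fw)]
    row_tile = [sum(r >= e for e in row_ends) for r in range(fh)]
    return [[row_tile[r] * tile_cols + col_tile[c] for c in range(fw)]
            for r in range(fh)]


def slice_map(fw, fh, slice_size):
    """Slice id of every CTU when slices are cut in raster scan order."""
    return [[(r * fw + c) // slice_size for c in range(fw)] for r in range(fh)]


def affected_ctus(part):
    """Count CTUs with an in-frame AL, A, AR or L neighbor in another partition."""
    fh, fw = len(part), len(part[0])
    count = 0
    for r in range(fh):
        for c in range(fw):
            neighbors = [(r - 1, c - 1), (r - 1, c), (r - 1, c + 1), (r, c - 1)]
            if any(nr >= 0 and 0 <= nc < fw and part[nr][nc] != part[r][c]
                   for nr, nc in neighbors):
                count += 1
    return count


# Full HD frame with 64 x 64 CTUs: 30 columns by 17 rows of CTUs.
print(affected_ctus(slice_map(30, 17, 64)))    # 217 (8 slices)
print(affected_ctus(tile_map(30, 17, 1, 8)))   # 210 (1 x 8 layout)
print(affected_ctus(tile_map(30, 17, 8, 1)))   # 231 (8 x 1 layout)
```

Note that this sketch only counts the affected CTUs; it does not model how many neighbors each of them loses, which is the aspect the text identifies as decisive for R/D performance.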
Both slice and tile parallel algorithms include synchronization processes in such a way that only one process reads the next frame to be encoded and stores it in shared memory, which reduces both disk accesses and memory requirements. A synchronization process before writing or transmitting an encoded frame is also necessary. Therefore, both subpicture parallel approaches are designed for shared memory platforms, since all processes share both the original and the reconstructed frames.

Frame-Based Parallel Algorithms
We now describe the parallel algorithm for the HEVC encoder at frame level, named DMG-AI, which has been specifically designed for the AI coding mode (Figure 8). A full description of the algorithm can be found in [13]. In the current paper, the DMG-AI algorithm is tested on a heterogeneous memory framework, managed by the Message Passing Interface (MPI) [17], consisting of N computing nodes (distributed memory architecture). Each node has a shared memory architecture. At each one of the N available nodes, R MPI processes are executed (or mapped) in such a way that every MPI process (P i,j (i = 1 ... N, j = 1 ... R)) of the N × R available processes encodes a different frame. Firstly, each coding process (P i,j) sends an MPI message to the coordinator process (P coord) requesting a frame to encode. Note that, in the AI encoding mode, all frames are encoded as I frames, i.e., without using previously encoded frames, making use only of the spatial redundancy. The coordinator process is responsible for assigning the video data to be encoded to the rest of the processes, and for collecting both statistical and encoded data in order to compose the final bitstream. The coordinator process distributes the workload by sending a different frame to each coding process. When the coding process P i,j finishes the encoding of its first received frame (F i,j,0), it sends the resulting bitstream to the coordinator process, which assigns a new frame (F i,j,1) to the coding process. This procedure is repeated until all frames of the video sequence have been encoded. In this algorithm, when a coding process becomes idle, it is immediately assigned a new frame to encode, so there is no need to wait until the rest of the processes have finished their own work. This provides a good workload balance and, as a consequence, excellent speed-up values are obtained.
This frame-based algorithm is completely asynchronous, so the order in which each encoded frame is sent to the coordinator process is not necessarily the frame rendering order, and the coordinator process therefore must keep track of the encoded data to form the encoded bitstream in the suitable rendering order. The coordinator process is mapped onto a processor that also runs one coding process, because the computational load of the coordinator process is negligible. It should be noted that the DMG-AI algorithm generates a bitstream that is exactly the same as the one produced by the sequential algorithm; therefore, there is no R/D degradation.
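The on-demand frame assignment and the reordering performed by the coordinator can be illustrated with a small simulation. This is a hedged sketch, not the DMG-AI implementation: it replaces MPI processes with Python threads from the standard library, and encode_frame is a hypothetical stand-in for the intra encoding of one frame. It only shows that frames finished out of order can be reassembled into the same bitstream as a sequential run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def encode_frame(index, frame):
    """Hypothetical stand-in for the intra encoding of a single frame."""
    return f"[I{index}:{frame}]".encode()


def encode_sequence(frames, n_workers):
    """Coordinator loop: idle workers take the next frame; output is reordered."""
    bitstream = []
    finished = {}        # encoded frames received out of order
    next_to_write = 0    # next frame index expected in rendering order
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # The pool hands a new frame to a worker as soon as it becomes idle.
        futures = {pool.submit(encode_frame, i, f): i
                   for i, f in enumerate(frames)}
        for future in as_completed(futures):
            finished[futures[future]] = future.result()
            # Flush every frame that is now contiguous in rendering order.
            while next_to_write in finished:
                bitstream.append(finished.pop(next_to_write))
                next_to_write += 1
    return b"".join(bitstream)


frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
sequential = b"".join(encode_frame(i, f) for i, f in enumerate(frames))
print(encode_sequence(frames, n_workers=3) == sequential)  # True
```

Because every frame is encoded independently in AI mode, the reordering step is the only thing needed for the parallel output to match the sequential one in this simulation.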
The previously described DMG-AI algorithm presents the following drawbacks: (a) a queuing system to manage the distributed resources must be installed in order to correctly map the MPI processes, (b) the number of MPI messages increases as the number of MPI processes does and, depending on the video frame resolution, the messages are often quite small, and (c) the distribution of processes between the computing nodes (i.e., the number of MPI processes per node) is performed before the beginning of the execution time through the queuing system, which in our case is the Sun Grid Engine. As previously stated, the DMG-AI algorithm includes an automatic workload balance system, but the algorithm itself is not able to change the process distribution pattern during execution time.
In order to avoid the aforementioned drawbacks, we propose a new hybrid MPI/OpenMP algorithm named DSM-AI, which does not need a specific queuing system and which outperforms the pure MPI proposal (DMG-AI). The DSM-AI algorithm, depicted in Figure 9, follows a structure that is similar to the DMG-AI algorithm, but only one MPI process is mapped onto each computing node, regardless of the number of computing nodes and/or the number of available cores of each computing node. In the DSM-AI parallel algorithm, the intra-node parallelism is exploited through the fork-join thread model provided by OpenMP. The number of created threads can be specified by a fixed parameter or can be obtained during execution time, depending on the current state of the multicore processor (computing node), as shown in Algorithm 2.
In Algorithm 2, when the parameter "NoT" (Number of Threads) is greater than zero, both the DSM-AI and the DMG-AI algorithms work in a similar way and their coding mapping procedures are the same, but the number of MPI communications is lower in the DSM-AI algorithm than in the DMG-AI algorithm. Moreover, no specific queuing system is required by the DSM-AI algorithm. When "NoT" is greater than zero, each MPI process encodes "NoT" consecutive frames. For that purpose, "NoT" threads are created in an OpenMP parallel section. On the other hand, when the parameter "NoT" is equal to zero, the number of threads created to encode consecutive frames depends on the current state of the multicore processor (i.e., the number of available threads). In this case, the current optimal number of threads ("CNoT") is requested from the system, and this number is added to the MPI message that requests the new block of frames to be encoded. Therefore, each MPI process encodes a block of consecutive frames, but the size of these blocks is set by the coding process instead of the coordinator process.
The DSM-AI algorithm is not totally asynchronous, because the processing of the OpenMP threads is synchronous. Note that the MPI coding process sends an MPI message to the coordinator process, and this message includes the encoded data provided by all the OpenMP threads, i.e., the OpenMP parallel section must be already closed. However, the bitstream generated by the DSM-AI algorithm is exactly the same as the one generated by the DMG-AI algorithm and by the sequential algorithm.
The DSM-AI algorithm has been designed in order to make an efficient use of hardware platforms where other processes may be running (i.e., non-exclusive use computing platforms). However, when the optimal number of threads per computing node is large, the synchronization processes may reduce the parallel efficiency. Furthermore, in order to reduce disk reading contention, the frame reading processes are serialized. To limit the efficiency loss caused by a large number of threads, we have developed Algorithm 3, which is used jointly with Algorithm 2. Algorithm 3 limits the number of OpenMP threads per MPI process. When the "NoT" parameter is negative, the absolute value of "NoT" sets the maximum number of OpenMP threads in each MPI process. In this way, we can reduce the number of threads per MPI process. Moreover, with the use of a queuing system, we can correctly map more than one MPI process onto each computing node, depending on the "NoT" value.
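The role of the "NoT" parameter across Algorithms 2 and 3 can be summarized with the following Python sketch. It is only an interpretation of the description above: the function name, the system_threads argument (standing for the "CNoT" value reported by the system), and the use of the minimum in the negative case are assumptions, since the algorithm listings themselves are not reproduced here.

```python
def threads_for_block(not_param, system_threads):
    """Number of threads (and frames per block) for one MPI coding process.

    not_param > 0  : fixed number of threads.
    not_param == 0 : use the number currently reported by the system (CNoT).
    not_param < 0  : use the system value, capped at abs(not_param)
                     (assumed reading of Algorithm 3).
    """
    if system_threads < 1:
        raise ValueError("system_threads must be at least 1")
    if not_param > 0:
        return not_param
    if not_param == 0:
        return system_threads
    return min(system_threads, -not_param)


print(threads_for_block(6, 12))    # 6: fixed by the parameter
print(threads_for_block(0, 12))    # 12: taken from the system
print(threads_for_block(-4, 12))   # 4: capped by |NoT|
```

In this reading, the value returned would also be the size of the block of consecutive frames that the coding process requests from the coordinator.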

Results and Discussion
In this section, we present the evaluation of the parallel algorithms detailed in Sections 2 and 3, in terms of parallel and R/D performance. In order to implement the algorithms presented in this work, we have modified the HEVC reference software HM v16.3 [18]. For the subpicture-based parallel algorithms, the OpenMP API v3.1 [19] has been used, whereas for the frame-based algorithms we have used MPI v2.2 [17]. The parallel platform used in our experiments is an HP Proliant SL390s G7 (a distributed memory multiprocessor with 14 nodes). Each node is equipped with two Intel Xeon X5660 processors and 48 GB of RAM. Each X5660 includes six processing cores at 2.8 GHz. QDR Infiniband has been used as the communication network. The video sequences used in the experimental tests and their characteristics are shown in Table 3. Four different values of the Quantization Parameter (QP) have been used in the experiments, ranging from low compression rates to high compression rates (22, 27, 32, and 37).
First, we will analyze both the parallel and the coding performance of the parallel algorithm based on slice partitioning. As stated in Section 2, slice partitioning may slightly limit the parallel efficiency that can be achieved. Regarding the coding performance, in most cases, the horizontal borders inserted by the slice partitioning cause a reduction in the prediction searching area of up to 75%. Moreover, each slice inserts a data header in the bitstream, which reduces the R/D performance.
In Table 3, the speed-up results for the slice-based parallel algorithm are shown. As can be seen, good speed-up values are achieved, but the parallel efficiency slightly decreases as the number of processes increases. Note that, when the searching area is reduced for most of the CTUs that belong to a slice, the computational load associated with that slice decreases. As shown in Figure 5, the computational load for the first slice remains unaltered. The difference in computational load increases as the size of slices decreases, i.e., as the number of processes increases. Table 3 presents the coding performance in PSNR and bit rate for the slice-based parallel algorithm. As expected, the PSNR values worsen as the number of slices increases. The inclusion of a data header per slice has a variable impact on the bit rate depending on the number of processes, the video resolution, and the compression rate. As the size of the slice header is fixed, the bit rate increment becomes greater as the number of processes increases. Additionally, for the configurations which produce low bit rates (low resolution and high compression rates), the percentage of bit rate increase is greater than for the rest of the configurations. As shown in Table 3, the worst case produces a bit rate increase of 10.68% for the PartyScene video sequence (the lowest resolution video sequence tested), 10 processes (the maximum number of processes used), and a QP value of 37 (the highest compression rate evaluated).
Now, we analyze the performance of the tile-based parallel algorithm in order to see how the tile partitioning affects both the speed-up and the coding efficiency. First of all, we will analytically study the influence of the frame partition into tiles, i.e., how the nature of the video data may affect the algorithm performance. As stated in Section 2, we have selected the most homogeneous tile partitions for each number of processes that we have tested (2, 4, 6, 8, 9, and 10). In Table 4, we show the tile partitions (layouts) used for the 4 different video resolutions tested (2560 × 1600, 1920 × 1080, 1280 × 720, 832 × 480). The AvgCTU column indicates the average number of CTUs per tile. For the 1P layout, this column indicates the total number of CTUs per frame. The MaxCTU column shows the number of CTUs of the biggest tile for every layout. With the values of these two columns, we can obtain the percentage of workload balance (Bal %) of each partition. A load balance percentage of 100% indicates that all the tiles of one frame have the same number of CTUs, so the workload is perfectly balanced for all processes. Low values in this column (e.g., in the 1 × 10 layout for 1280 × 720 resolution) mean a heavily unbalanced workload distribution, which will probably lead to low parallel efficiencies. As we will see later, the choice of an unbalanced layout may lead to underwhelming results. An N/A value in the table means that the corresponding layout is not possible for that resolution. For example, a division of a frame using a 1 × 10 layout (1 column, 10 rows) is not possible for the 832 × 480 resolution, where there are only 8 rows of CTUs in a frame.
Now we will verify whether the above analysis is empirically consistent. In Tables 5 and 6, we present the encoding speed-up evolution for Traffic, ParkScene, FourPeople, and PartyScene video sequences, respectively. A speed-up of up to 9.35x is obtained when 10 processes are used. Note that, for a particular number of parallel processes, different speed-ups are obtained. This is mainly due to the fact that some tile partition layouts produce an unbalanced processing workload. For example, for a video resolution of 2560 × 1600, a frame consists of 40 × 25 CTUs of 64 × 64 pixels. If we divide the frame using the 10 × 1 layout, then each of the 10 processes will have to encode the same number of CTUs (4 × 25 = 100 CTUs). This means a perfectly balanced workload (Bal% = 100%). But if we divide the frame using the 1 × 10 layout, then 5 processes will have to encode 40 × 2 = 80 CTUs, and the other 5 processes will have to encode 40 × 3 = 120 CTUs, i.e., 50% more CTUs (Bal% = 83%). Using the 1 × 10 layout, a speed-up of 7.77x is obtained for the Traffic video sequence, whereas for the 10 × 1 layout we obtain a speed-up of 9.35x. Generally, tile partitioning layouts based on columns of CTUs or on square tiles obtain better parallel performance.
Regarding R/D performance, in Figure 10 we show the BD-rate results for each tile partitioning layout using the Bjontegaard method [20]. This value measures the bit rate overhead that the tile-based parallel algorithm introduces when compared with the sequential version. As can be seen, the bit rate overhead increases as the number of processes does. This is an expected result because no information from other previously encoded tiles is available to perform the intra prediction. Square-like tile partitioning layouts have a better R/D performance than the rest of the layouts. This is mainly because, in square-like tile layouts, every single CTU has more neighbors inside the same tile than in row-like or column-like tile layouts. Therefore, the redundancies of nearby CTUs can be exploited. We can conclude, from the information shown in Figure 10, that row-like tile layouts, i.e., layouts formed by the division of a frame into 1 column by N rows (1 × N layouts), always have the worst R/D performance.
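The workload balance figures discussed above can be checked with a short Python sketch. It assumes that Bal% is the ratio between the average and the maximum number of CTUs per tile (AvgCTU / MaxCTU), which is consistent with the 100% and 83% values quoted for the 2560 × 1600 example, and that tile columns and rows are split as evenly as possible; the function names are hypothetical.

```python
def split_even(n, parts):
    """Near-even split of n CTUs into `parts` groups."""
    base, extra = divmod(n, parts)
    return [base] * (parts - extra) + [base + 1] * extra


def tile_balance(width, height, tile_cols, tile_rows, ctu=64):
    """Return (AvgCTU, MaxCTU, Bal%) for a tile_cols x tile_rows layout."""
    fw = -(-width // ctu)    # ceiling division: frame width in CTUs
    fh = -(-height // ctu)   # ceiling division: frame height in CTUs
    max_ctu = max(split_even(fw, tile_cols)) * max(split_even(fh, tile_rows))
    avg_ctu = fw * fh / (tile_cols * tile_rows)
    return avg_ctu, max_ctu, 100.0 * avg_ctu / max_ctu


print(tile_balance(2560, 1600, 10, 1))  # (100.0, 100, 100.0)
avg, biggest, bal = tile_balance(2560, 1600, 1, 10)
print(avg, biggest, round(bal))         # 100.0 120 83
avg, biggest, bal = tile_balance(1920, 1080, 8, 1)
print(avg, biggest, round(bal))         # 63.75 68 94
```

The last call reproduces the 94% maximum theoretical efficiency mentioned for the 8 × 1 layout of a Full HD frame.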

Frame-Based Parallel Algorithms Analysis
In order to evaluate both the DMG-AI and the DSM-AI algorithms, we have run our experiments with 10, 16, 20, and 24 parallel processes. The hardware platform described before consists of 14 nodes (Distributed Memory (DM) architecture), with 12 cores in each one (Shared Memory (SM) platforms). Table 7 shows the different combinations tested, where N denotes the number of computing nodes (DM architecture) and R is the number of processes used in each computing node, so N × R is the total number of parallel processes. When N is equal to 1, we have a pure SM platform, and when R is equal to 1, we have a pure DM system. In this framework, the different setups tested are transparent to the DMG-AI parallel algorithm because all the processing units are MPI processes, regardless of the memory arrangement. On the contrary, in the DSM-AI algorithm, an N × R configuration means that each one of the N MPI processes generates R threads (OpenMP processes).
In Table 8, we show the parallel efficiencies obtained for the DMG-AI parallel algorithm using up to 24 processes for the four QP values considered in our experiments and for BasketballDrill, Traffic, Kristen&Sara, and Tennis video sequences. This table illustrates the good parallel performance of the DMG-AI algorithm, which is even better for the lowest resolution video sequence. In the worst case, 24P (4 × 6), an average speed-up of 18.2x is obtained for the Traffic video sequence, which corresponds to an efficiency of 76%. The efficiency of the DMG-AI algorithm always remains above 75%, reaching a maximum efficiency of 98%. The disk access (both for reading the raw video data and for writing the compressed bitstream) is a sequential operation, so disk operations may become a bottleneck. The worst situation occurs when a high number of processes is used and the amount of video data to read is large (high-resolution video sequences).
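The relation between the speed-up and efficiency figures used throughout this section can be made explicit with a minimal Python sketch. It assumes the usual definition of parallel efficiency as the speed-up divided by the number of processes, which is consistent with the 18.2x and 76% values quoted above; the function name is hypothetical.

```python
def efficiency(speed_up, processes):
    """Parallel efficiency, in percent, assuming efficiency = speed-up / processes."""
    return 100.0 * speed_up / processes


# Worst case reported for DMG-AI: 18.2x with 24 processes (Traffic).
print(round(efficiency(18.2, 24)))     # 76
# Best tile-based speed-up reported: 9.35x with 10 processes.
print(round(efficiency(9.35, 10), 1))  # 93.5
```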
Table 9 shows the efficiencies obtained by the DSM-AI parallel algorithm; the results are quite similar to those obtained for the DMG-AI algorithm. In general, the DSM-AI algorithm slightly improves the results obtained by the DMG-AI algorithm, except in the case where there is a high number of OpenMP threads with respect to the number of MPI processes. In this case, the synchronization processes performed in the OpenMP parallel sections cause a slight parallel degradation.
Finally, we will compare both subpicture-based and frame-based algorithms in terms of parallel and R/D performance. It must be noted that subpicture-based parallel algorithms have been tested using up to 10 processes, whereas frame-based parallel algorithms have been tested using from 10 up to 24 processes. We have compared both parallel algorithms using a fixed number of available processing units (10). The results provided for the tile-based algorithm are the average of the efficiencies obtained with the selected tile layouts. On the other hand, the results provided for the DMG-AI algorithm are calculated as the average of the efficiencies obtained with the different tested setups (N × R processes).
Figure 11 shows the comparison of the efficiency results. As can be seen, the results obtained by the frame-based parallel algorithm (DMG-AI) are always better, and taking into account that the frame-based parallel algorithms do not affect the R/D performance, we can conclude that frame-based parallel algorithms are always preferable.
In Figure 12, we show the difference in efficiency between the DSM-AI and the tile-based approaches. The DSM-AI efficiency values are always better than the ones obtained by the tile-based proposal. For high-resolution video sequences, the DSM-AI parallel algorithm shows a slight improvement, but when the video resolution decreases, the improvement is quite significant (up to 37%). In most cases, the best efficiency values are obtained when a low value for the QP is used. In these cases, the parallel efficiency increases as the workload does.
As far as scalability is concerned, frame-based algorithms clearly outperform subpicture-based ones. On the one hand, scalability is limited by the resolution of the video sequence in subpicture-based algorithms, because the video resolution sets the maximum number of parallel processes that can be used. This effect does not occur in frame-based approaches. On the other hand, for subpicture-based algorithms, the higher the number of tiles or slices per frame, the higher the BD-rate penalty. Therefore, subpicture-based algorithms do not scale well with regard to R/D performance. As mentioned before, frame-based proposals do not suffer any BD-rate increment because they produce the same bitstream as the sequential algorithm.
As shown in Figures 11 and 12, the proposed DMG-AI and DSM-AI algorithms outperform the analyzed subpicture-based algorithms. Note that the DMG-AI algorithm is specially designed for heterogeneous memory platforms, whereas the new proposal, the DSM-AI algorithm, although also suitable for heterogeneous memory platforms, has been designed to optimize execution inside multicore processors, including the use of a single multicore.
In [21], the authors propose a two-stage parallelization speed-up scheme exploiting CTU level parallelism in order to perform an efficient HEVC Intra encoding. That proposal pursues two main goals: maximizing encoding speed and minimizing compression performance loss. It obtains an average speed-up of 5.02x using 8 processes, i.e., an average efficiency of 62.8%. Recently, the authors in [22] presented an improved version of the mean directional variance in the sliding window algorithm, applied to the intra-prediction process, which detects the texture orientation of a block of pixels and allows parallelization at block level. In this approach, the maximum speed-ups obtained are equal to 3.1x and 3.7x using 5 and 10 processes, respectively, i.e., the efficiencies are equal to 62% and 37%. Finally, in [23], the authors propose a collaborative scheduling-based parallel solution, named CSPS, for HEVC encoding, which includes adaptive parallel mode decision, asynchronous frame-level pixel interpolation, and multi-grained task scheduling. This recent proposal has been applied to low delay coding modes and obtains speed-up values of 18.7x, 15.2x, 11.42x, and 7.78x using 24 processes for TRAFFI (2560 × 1600), PARKSC (1920 × 1080), FOURPE (1280 × 720), and PARTSC (832 × 480), respectively. Our new proposed parallel algorithm, the DSM-AI algorithm, obtains better speed-up values for the same video sequences, equal to 19.0x, 21.6x, 22.8x, and 22.5x, with efficiencies equal to 79%, 90%, 94%, and 95%, respectively.
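The efficiency figures quoted for the related works follow from dividing the speed-up by the number of processes. The short Python sketch below reproduces that arithmetic for the values reported for [21] and [22]; the function name `parallel_efficiency` is ours and is used only for illustration.

```python
def parallel_efficiency(speed_up: float, processes: int) -> float:
    """Return parallel efficiency as a fraction: speed-up divided by process count."""
    if processes <= 0:
        raise ValueError("processes must be positive")
    return speed_up / processes


# Speed-up values and process counts quoted in the text for [21] and [22].
reported = [("[21]", 5.02, 8), ("[22]", 3.1, 5), ("[22]", 3.7, 10)]

for ref, speed_up, processes in reported:
    eff = parallel_efficiency(speed_up, processes)
    print(f"{ref}: {speed_up}x on {processes} processes -> {eff:.1%}")
```

An efficiency close to 1 (100%) means that nearly every added process contributes a proportional reduction in encoding time.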

Conclusions
In this paper, we compared two parallelization approaches for the HEVC encoder. The first one is based on subpicture partitions (tiles or slices) and is especially suited for shared memory platforms. It obtains good speed-up values, although parallel scalability decreases for low resolution sequences. Moreover, the R/D performance decreases as the number of subpicture partitions increases. The other approach, which is based on frames, is suitable for both shared and distributed memory architectures. It yields good parallel performance, obtaining efficiency values of up to 97%. Moreover, it outperforms subpicture-based proposals, especially when low resolution video sequences are encoded by a high number of processes. The frame-based approaches have been tested using up to 24 processes, showing good scalability without varying the R/D performance. Therefore, we can conclude that the proposed frame-based parallel algorithms for AI mode outperform parallel proposals based on subpicture partitions in terms of both parallel performance and R/D performance. It is worth noting the good scalability of the frame-based approaches and that, if the final application requires tiles or slices, our frame-based parallel proposals do not prevent their use. Both frame-based parallel algorithms obtain similar parallel performance, but the DSM-AI

Algorithm 1
Computation of the size of the slices
1: Obtain the number of processes: p
2: Obtain the width (or height) of a CTU: s
3: Frame width (number of CTUs): fw = ceil(width/s)
4: Frame height (number of CTUs): fh = ceil(height/s)
5: Total number of CTUs per frame: t = fw * fh
6: Regular slice size: ns = ceil(t/p)
7: Last slice size: t − (ns * (p − 1))
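The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the encoder's actual code: the function name `slice_sizes` is hypothetical, and the CTU size of 64 pixels used in the example is an assumption.

```python
import math


def slice_sizes(width: int, height: int, ctu_size: int, processes: int):
    """Follow Algorithm 1: return (fw, fh, t, ns, last) measured in CTUs."""
    fw = math.ceil(width / ctu_size)    # frame width in CTUs
    fh = math.ceil(height / ctu_size)   # frame height in CTUs
    t = fw * fh                         # total number of CTUs per frame
    ns = math.ceil(t / processes)       # regular slice size
    last = t - ns * (processes - 1)     # last slice takes the remainder
    return fw, fh, t, ns, last


# Full HD frame, 8 processes, assuming 64 x 64 CTUs.
fw, fh, t, ns, last = slice_sizes(1920, 1080, 64, 8)
print(f"{fw} x {fh} CTUs, {t} in total; regular slice: {ns}, last slice: {last}")
```

Under these assumptions a Full HD frame holds 30 × 17 = 510 CTUs, so seven slices receive 64 CTUs each and the last one receives 62. Note that the formula for the last slice assumes the number of processes is small relative to the number of CTUs; otherwise the remainder can become zero or negative.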

Figure 5. Partitioning of a Full HD frame into 8 slices (marked Coding Tree Units (CTUs) do not have all their neighbors available for prediction).

Figure 6. Search area used to perform the CTU prediction, depending on the position of the borders in a Full HD frame. (a) 8 slices; (b) 1 × 8 tiles; (c) 8 × 1 tiles.

Figure 7.

Table 1. Workload, in number of CTUs, for the partitioning of a Full HD frame into 8 tiles.

Table 2. Number of CTUs with their encoding modified, for the partitioning of a Full HD frame into 8 tiles.

Algorithm 3
Improved setting the number of threads in the DSM-AI algorithm
1: In coding MPI process
2: {
3: Read parameter NoT (Number of Threads)
4: if NoT < 0 then
5: Set NoT = |NoT| frames
6: Request the current optimal number of threads (CNoT)
7: if CNoT > NoT then

Table 3. Test video sequences.

Table 4. Speed-up for the slice-based parallel algorithm.

Table 4. Coding performance of the slice-based parallel algorithm.

Table 4. Tile partitions and percentage of load balance.

Table 5. Speed-up evolution for high-resolution video sequences for the tile-based parallel algorithm.

Table 6. Speed-up evolution for low resolution video sequences for the tile-based parallel algorithm.

Table 7. Arrangement of processing units.

Table 8. Efficiency of the DMG-AI parallel algorithm.

Table 9. Efficiency of the DSM-AI parallel algorithm.