Load Balancing Strategies for Slice-Based Parallel Versions of JEM Video Encoder

The proportion of video traffic on the internet is expected to reach 82% by 2022, mainly due to the increasing number of consumers and the emergence of new video formats with more demanding features (depth, resolution, multiview, 360°, etc.). Efforts are therefore being made to constantly improve video compression standards in order to minimize the necessary bandwidth while retaining high video quality levels. In this context, the Joint Video Exploration Team (JVET) has been analyzing new video coding technologies to improve the compression efficiency with respect to the HEVC video coding standard. A software package known as the Joint Exploration Test Model (JEM) has been proposed to implement and evaluate new video coding tools. In this work, we present parallel versions of the JEM encoder that are particularly suited to shared memory platforms and can significantly reduce its huge computational complexity. The proposed parallel algorithms are shown to achieve high levels of parallel efficiency; in particular, in the All Intra coding mode, the best of our proposed parallel versions achieves an average efficiency of 93.4%. They also exhibit high levels of scalability, achieved through the inclusion of an automatic load balancing mechanism.


Introduction
The High Efficiency Video Coding (HEVC) standard [1] was developed by the Joint Collaborative Team on Video Coding (JCT-VC) in 2013, and replaced the previous H.264/Advanced Video Coding (AVC) standard [2]. The HEVC standard achieves bit rate savings of almost 50% with the same visual quality as the previous H.264/AVC standard. However, this reduction is obtained at the expense of a huge increase in the computational complexity of the encoding process [3].
Recently, Cisco released a report called "Forecast and Trends: 2017-2022 White Paper" [4], in which it states that IP video traffic will form 82% of all IP traffic by 2022, representing a four-fold increase between 2017 and 2022. In this scenario, a million minutes of video content travel through the network every second. The report also predicts a constant increase in novel services such as video-on-demand (VoD), live internet video, virtual reality (VR) and augmented reality (AR). VoD traffic is expected to double by 2022, mainly due to the increasing numbers of consumers and higher video resolutions (4K and 8K), bringing the amount of VoD traffic to the equivalent of 10 billion DVDs per month. The impact of user devices on global traffic is even more significant when we consider popular services such as ultra-high-definition (UHD) video streaming: the bit rate for 4K video is about 15 to 18 Mbps, more than double the bit rate for HD video and nine times that of standard definition (SD) video. The Cisco report estimates that by 2022, nearly 62% of installed flat-panel TV sets will be UHD.
In order to deal with this increase in IP video traffic, new video coding techniques are required to obtain higher compression rates. Since the release of HEVC, both the ITU-T Video Coding Expert Group (VCEG) and the ISO/IEC Moving Picture Expert Group (MPEG) have been studying and analyzing new video coding technologies in order to improve the compression capability compared to that obtained by HEVC. To achieve this, a framework of collaboration has been created called the Joint Video Exploration Team (JVET).
The compression enhancements studied by the JVET have been implemented in a software package known as the Joint Exploration Test Model (JEM) [5]. Its main purpose is to explore new coding tools that can provide significant improvements at the video coding layer. Following an analysis of the new coding tools proposed within the last few years, JVET has begun developing a future video coding standard called Versatile Video Coding (VVC) [6]. The main goal of this coding standard is to achieve bit rate savings of between 25% and 30% compared to HEVC [7,8].
Preliminary results obtained with the new model (JEM 3.0) show an 18% reduction in bit rate using the All Intra (AI) coding mode configuration [9]. However, this is achieved at the expense of an extremely large increase in computational complexity (around 60×) with respect to HEVC.
This increase requires the introduction of acceleration techniques that leverage hardware architectures to reduce encoding time. Since JEM is an exploration model, only a few articles have been published on the subject, and most of them are focused on rate distortion (R/D) comparisons between the JEM, HEVC and AV1 codecs [10]. Recently, the authors of [11] proposed a pre-analysis algorithm designed to extract motion information from a frame in order to accelerate the motion estimation (ME) stage. Their proposal showed that around 27% of the reference frames could be skipped, and that a time saving of more than 62% was achieved on the integer ME operation, with a negligible impact of 0.11% on the Bjøntegaard delta rate (BD rate) [12]. The authors of [13] proposed parallel algorithms based on the group of pictures (GOP) structure, although these increase the BD rate when temporal redundancy is exploited.
In this paper, we present two JEM parallel encoder versions that are specifically designed for shared memory platforms, in order to speed up the encoding process for the All Intra (AI) coding mode, as this coding mode is especially useful for video editing. We performed several experimental tests to illustrate the behavior of the parallel versions in terms of their parallel efficiency and scalability. The first is a synchronous algorithm, named JEM-SP-Sync, which performs a domain decomposition in which the data are almost equally distributed, although the computational load is not balanced. The second, named JEM-SP-ASync, is an asynchronous algorithm, also based on a domain decomposition, but able to balance the load automatically.
The rest of this paper is organized as follows. In Section 2, we present a brief description of the coding tools introduced in the new JEM video coding standard. Section 3 describes the parallel algorithms developed for the AI coding mode of the JEM standard. Experimental numerical results are presented in Section 4, and in Section 5, some conclusions are drawn.

Description of the New Characteristics of the Joint Exploration Test Model (JEM)
The JEM codec is based on the HEVC reference software, called HM, meaning that the overall architecture of both codecs is quite similar since they share a hybrid video codec design. However, some of the coding stages are modified in the JEM implementation in order to improve the previous standard [14,15]. The R/D performance of JEM is better than in HEVC due to the use of these techniques, but this is achieved at the expense of an increased computational cost for the intra-prediction stage. This section describes the main improvements offered by JEM in comparison to the previous standards, as these could lead to a load imbalance when using parallel algorithms such as those proposed in this work.

Picture Partitioning
The way in which a video frame is split into a set of non-overlapping blocks is called picture partitioning. These non-overlapping blocks are arranged into a quadtree structure, where the root is called a coding tree unit (CTU) [16], and each CTU is further partitioned into smaller blocks. Figure 1 shows the division of a 1280 × 720 pixel frame into 240 CTUs, split into 20 columns by 12 rows, where the last row is composed of incomplete CTUs. The complete CTUs are composed of 64 × 64 pixels. Two of the major differences between HEVC and JEM are the way in which a CTU is further partitioned and the size of the CTU itself. In HEVC, the maximum CTU size is 64 × 64 pixels, and there is the option to further recursively partition it into four square coding units (CU) whose sizes range from 64 × 64 pixels (i.e., no partitioning) to 8 × 8 pixels. The leaf blocks in a CU quadtree form the roots of two independent trees that contain prediction units (PUs) and transform units (TUs).
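The CTU grid arithmetic described above can be sketched as follows. This is a Python illustration, not part of the JEM code base; edge CTUs are counted even when incomplete, so both dimensions are rounded up.

```python
import math

def ctu_grid(frame_width, frame_height, ctu_size):
    """Number of CTU columns/rows needed to cover a frame.

    Incomplete edge CTUs still count as CTUs, hence the rounding up.
    Illustrative helper, not taken from the JEM sources.
    """
    cols = math.ceil(frame_width / ctu_size)
    rows = math.ceil(frame_height / ctu_size)
    return cols, rows, cols * rows

# The 1280 x 720 example of Figure 1 with 64 x 64 CTUs:
# 20 columns x 12 rows = 240 CTUs, with the last row incomplete
# because 720 = 11 * 64 + 16 is not a multiple of 64.
```

The same helper reproduces the 128 × 128 CTU dimensions used later in the experiments (a 1280 × 720 frame gives 10 × 6 = 60 CTUs).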
A PU can have the same size as the CU, or can be further split into smaller PUs as small as 8 × 8 pixels. The PUs store the prediction information in the form of motion vectors (MVs). In intra-prediction mode, HEVC uses a quadtree structure with only square PUs, while in inter-prediction mode, asymmetric splitting of PUs is possible, giving up to eight possible partitions for each PU block: 2N × 2N, 2N × N, N × 2N, N × N, 2N × nU, 2N × nD, nL × 2N and nR × 2N. Figure 2 shows an example of a CTU partition in HEVC and the relationship between CU partitioning, PU partitioning and TU partitioning. The picture partitioning schema is modified in JEM in order to simplify the prediction and transform stages, and further partitioning of CUs to form PU and TU trees is avoided. The JEM partitioning schema, called quadtree plus binary tree (QTBT), offers a better match with the local characteristics of each frame [17]. The highest level is a CTU, as in HEVC, but the main change is that block splitting at the lower levels of the tree follows a binary partition, whose leaves are the final coding blocks.
The size of the CTU is larger than in HEVC, with a maximum of 256 × 256 pixels, and only the upper levels of the tree must use square (quadtree) partitions; at the desired level, a binary tree completes the partitioning schema. There are two types of splitting in the binary tree: symmetric horizontal and symmetric vertical. The binary tree leaf node is the CU, which is used for prediction and transformation with no further partitioning. Hence, in most cases, the CU, PU and TU have the same size. An example of QTBT is shown in Figure 3; here, the quadtree has two levels (continuous lines), after which the binary tree starts (dotted lines). In JEM, a CU can have either a square or a rectangular shape, and consists of coding blocks (CBs) of different color components; for example, a CU may contain one luma CB and two chroma CBs in the YUV420 chroma format.
In HEVC, inter prediction for small blocks is restricted in order to reduce memory accesses for motion compensation, i.e., bi-directional prediction for 4 × 8 and 8 × 4 blocks is not allowed, and 4 × 4 inter-prediction is also disabled. In QTBT, these restrictions are removed, which increases the computational cost of the JEM codec.
The CUs are not partitioned further for the transform or prediction stages unless the CU is too large for the maximum transform size. The maximum transform size is 128 × 128 pixels, which improves the coding efficiency for higher-resolution video, e.g., 1080p and 4K sequences.
The following parameters are defined in order to obtain efficient partitioning in a QTBT tree:
• CTU size: The root node size of the quadtree; the same concept as in HEVC.
• MinQTSize: The minimum allowed size of a quadtree leaf node.
• MaxBTSize: The maximum allowed size of the binary tree root node.
• MaxBTDepth: The maximum allowed depth of the binary tree.
• MinBTSize: The minimum allowed size of a binary tree leaf node.
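As an illustration of how these parameters constrain the partitioning, the following sketch checks which splits remain legal for a given block. The default parameter values are placeholders rather than the JEM common-test-conditions settings, and the simplified rule that all quadtree splits precede any binary split is an assumption of this sketch.

```python
def allowed_splits(width, height, qt_still_active, bt_depth,
                   min_qt=8, max_bt=64, max_bt_depth=4, min_bt=4):
    """Illustrative check of which QTBT splits remain legal for a block.

    Parameter names mirror the list above (MinQTSize, MaxBTSize,
    MaxBTDepth, MinBTSize); the defaults are placeholder values.
    """
    splits = []
    # Quadtree splits are only available before any binary split,
    # on square blocks larger than MinQTSize.
    if qt_still_active and width == height and width > min_qt:
        splits.append("QT")
    # Binary splits require the block to fit under MaxBTSize and the
    # binary depth to stay below MaxBTDepth.
    if max(width, height) <= max_bt and bt_depth < max_bt_depth:
        if width > min_bt:
            splits.append("BT_VER")   # symmetric vertical split
        if height > min_bt:
            splits.append("BT_HOR")   # symmetric horizontal split
    return splits
```

For example, a 128 × 128 block above MaxBTSize can only be quadtree-split, while a 64 × 64 block admits all three split types under these placeholder settings.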
MaxBTSize and MinQTSize are two factors that are critical to the R/D performance and the encoding time. In JEM, for P and B slices only, these two parameters are set adaptively for the current slice: they are made larger when the average CU size of the previously encoded picture is larger, and vice versa [17].
At the transform stage, only the lower-frequency coefficients are maintained for transform blocks with sizes (width or height) larger than or equal to 64. For example, for an M × N transform block (where M is the width and N the height), when M is larger than or equal to 64, only the left M/2 columns of transform coefficients are retained. Similarly, when N is larger than or equal to 64, only the top N/2 rows of the transform coefficients are retained. This behavior can be skipped using skip mode for large blocks.
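The coefficient-retention rule above can be sketched as follows. This is an illustrative Python version operating on plain lists of rows; the real implementation works on the JEM transform-block data structures.

```python
def retain_low_freq(coeffs):
    """Zero out high-frequency coefficients of a large transform block.

    For an M x N block (M = width = number of columns, N = height),
    only the left M/2 columns are kept when M >= 64, and only the top
    N/2 rows when N >= 64, as described in the text.
    """
    n = len(coeffs)        # height (number of rows)
    m = len(coeffs[0])     # width (number of columns)
    keep_cols = m // 2 if m >= 64 else m
    keep_rows = n // 2 if n >= 64 else n
    return [[coeffs[r][c] if (r < keep_rows and c < keep_cols) else 0
             for c in range(m)]
            for r in range(n)]
```

For a 64-wide, 4-high block, only the left 32 columns survive, while all 4 rows are kept.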
The proposed QTBT approach in JEM uses more partition types than HEVC in order to adapt the resulting partition tree to the contents of the scene. It is guided in this task by a trade-off between rate reduction and distortion reduction, and as we will see in the next section, this is a computationally expensive task. The whole video frame is first partitioned into equally sized (up to 256 × 256) CTUs, and each CTU is then further partitioned into CUs based on the scene contents, i.e., the time needed to process a whole CTU depends on the complexity of the underlying scene in each CTU. A summary of the main differences between HEVC and JEM related to picture partitioning is shown in Table 1.

Spatial Prediction
In order to be able to capture the finer edge directions presented in natural videos, the directional intra-modes in JEM have been extended from 33, as defined in HEVC, to 65. The addition of planar and DC modes gives a total of 67 different prediction modes for JEM. These denser directional intra-prediction modes (see Figure 4) are applied to all PU sizes and both luma and chroma intra-predictions.
The partitioning schema described in the previous section is directed by a rate-distortion optimization (RDO) algorithm that recursively searches for the best possible partitioning schema in terms of an R/D estimation. This algorithm tries all directional intra-modes for each of the possible partitions (i.e., no partitioning, vertical partitions, horizontal partitions and quadtree partitions) at each recursion level, in order to find the one with the lowest cost. For a CTU in which the underlying scene is smooth, this recursion ends rapidly, and the number of trials of the 67 directional modes is therefore much lower than if the CTU belongs to a highly textured area of the scene. Thus, the computational effort is not evenly distributed over the CTUs of a video frame, and depends on the content of the scene.
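The content-dependent cost of this recursive search can be illustrated with a heavily simplified sketch, in which `cost` is a stand-in for the real R/D estimation and a four-way split stands in for the full set of QTBT partitions; the number of cost evaluations depends on how deep the recursion goes, which is what makes the effort content dependent.

```python
def rdo_search(x, y, size, cost, min_size=8, num_modes=67):
    """Toy RDO recursion: pick the cheaper of (best intra mode,
    no split) vs. splitting into four quadrants and recursing.

    `cost(x, y, size, mode)` is a placeholder for the real R/D
    estimation; this sketch is not the JEM search, only its shape.
    """
    best_no_split = min(cost(x, y, size, m) for m in range(num_modes))
    if size <= min_size:
        return best_no_split
    half = size // 2
    split_cost = sum(
        rdo_search(x + dx, y + dy, half, cost, min_size, num_modes)
        for dx in (0, half) for dy in (0, half))
    return min(best_no_split, split_cost)
```

With a flat cost function the recursion terminates immediately at the top level, mimicking a smooth CTU; a cost function favoring small blocks would force the full recursion, mimicking a textured CTU.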
In JEM, the list of most probable modes (MPMs) is extended from the three used in HEVC to six, and the selection procedure for these modes is also changed. In HEVC, the method proposed in [18] was adopted in the standard for building the MPM list: the 35 intra-modes are divided into two groups, three MPMs and 32 remaining modes, and the three MPMs are derived from the modes of the PUs to the left of and above the current PU. The new procedure followed in JEM [19] uses five neighbors of the current PU: left, above, above left, below left and above right, as shown in Figure 5. The improvements in JEM described in the following paragraphs, which increase the complexity of the encoder, can also lead to a load imbalance when the slices of a frame are processed in parallel.
In addition to the changes in the JEM encoder mentioned above, there are also differences between HEVC and JEM in the entropy coding of the MPM list, as explained in [17,19]; these reduce the number of contexts used in the entropy encoder to signal the MPM index from nine to three, corresponding to the vertical, horizontal and non-angular MPM mode classes.
The interpolation filters are also changed in JEM with respect to HEVC [15]. In HEVC, a two-tap linear interpolation filter is used to generate the intra-prediction block in the directional prediction modes (i.e., excluding the planar and DC predictors). In JEM, four-tap intra-interpolation filters are used for directional intra-prediction filtering: cubic interpolation filters for blocks smaller than or equal to 64 samples, and Gaussian interpolation filters for larger blocks. The filter parameters are set based on the block size, and the same filter is used for all modes.
Another improvement to JEM is made in the boundary prediction filters [15]. In HEVC, after the prediction block has been generated for the vertical or horizontal intra-modes, the leftmost column or the top row of the predicted block are adjusted further using the values of the boundary samples. In JEM, the number of boundary samples is increased from one to four (rows or columns) in order to obtain the predicted value using a two-tap filter (for the first and last angular modes, corresponding to intra-modes 2 and 34 in HEVC) or a three-tap filter (for modes between intra-modes 3-6 and 30-33 in HEVC), as shown in the example in Figure 6. In JEM, the results of the intra-prediction planar mode are also improved by including a position-dependent intra-prediction component (PDPC) method that processes a specific combination of the unfiltered boundary reference samples with the filtered ones, thus improving the perceived quality of the predicted block when the planar mode is used. This process uses different weights and filter sizes (three-tap, five-tap, seven-tap) based on the block size.
To reduce some of the redundancy that remains between the luma and chroma components after the prediction process, JEM uses cross-component linear model (CCLM) prediction. In this process, the chroma samples are predicted from the reconstructed down-sampled luma samples of the same CU, using a linear model for square blocks. For non-square blocks, additional down-sampling is needed to match the shorter boundary. There are two CCLM modes: the single-model CCLM mode and the multiple-model CCLM mode (MMLM). In the single-model CCLM mode, JEM employs only one linear model to predict the chroma samples, while in MMLM there can be two models, built from two groups of boundary samples that serve as a training set for deriving the linear models [15]. A summary of the main differences between HEVC and JEM related to spatial prediction is shown in Table 2.
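The single-model CCLM idea can be sketched as a least-squares line fit from boundary samples. This illustration ignores the luma down-sampling and the integer arithmetic used in JEM; it only shows the shape of the model derivation and prediction.

```python
def cclm_params(luma_boundary, chroma_boundary):
    """Least-squares fit of chroma = alpha * luma + beta from
    boundary samples (sketch of the single-model CCLM derivation)."""
    n = len(luma_boundary)
    sx = sum(luma_boundary)
    sy = sum(chroma_boundary)
    sxx = sum(l * l for l in luma_boundary)
    sxy = sum(l * c for l, c in zip(luma_boundary, chroma_boundary))
    denom = n * sxx - sx * sx
    alpha = (n * sxy - sx * sy) / denom if denom else 0.0
    beta = (sy - alpha * sx) / n
    return alpha, beta

def cclm_predict(luma_samples, alpha, beta):
    """Predict chroma from reconstructed luma using the fitted model."""
    return [alpha * l + beta for l in luma_samples]
```

The MMLM variant would fit two such models, one per group of boundary samples, and select between them per sample.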

Parallel Approaches
Slices are fragments of a frame formed by consecutive (in raster scan order) CTUs (see Figure 7). These are regions of the same frame that can be decoded (and also encoded) independently, which offers a valid parallelization approach for video encoding and decoding processes. However, slice independence has a drawback, in that the existing redundancy between data belonging to different slices cannot be exploited, and thus the coding efficiency in terms of the R/D performance decreases. Moreover, slices are composed of a header and data, and although this is useful in terms of providing an encoded video sequence with error resilience features (since the loss of a single slice does not prevent the rest of the slices in the same frame from being properly decoded), the inclusion of a header in each slice also causes a decrease in the R/D performance.
Based on the slice partitioning of the JEM video encoder, each video frame is divided into as many slices as threads to be spawned. The number of threads in the parallel region can be set as a parameter or can be derived from the current state of the computer system. Hence, the number of threads (and consequently the number of slices) and the slice size are computed before starting the encoding process. In this work, we have developed two algorithms, the first of which requires synchronization processes, while the second is completely asynchronous.

Synchronous Algorithm: JEM-SP-Sync
Algorithm 1 shows the parallel algorithm, called JEM-SP-Sync, which includes synchronization processes. In Algorithm 1, the size of the slice is first computed in numbers of CTUs. To do this, we initially compute the number of horizontal (FrWidth) and vertical (FrHeight) CTUs, and the total number of CTUs (NoCTUs) available in a frame, which will depend on the video resolution. It is worth noting that both the right-hand and bottom CTUs in the frame may be incomplete (see lines 4 and 8). Furthermore, since the slices may not all have the same number of CTUs, the algorithm sets the size of the last slice as equal to or smaller than the size of the rest of the slices in order to achieve a better load balance (lines 13-18). Before starting the encoding process of the whole video sequence, each thread computes the CTUs of the slice assigned to that thread, which always remains the same throughout the encoding process (lines 20-26). The encoding process starts by reading the frame to be encoded and storing it in memory. This initial process is performed by a single thread, and a synchronization point is therefore needed to ensure that each process waits until the frame is available (line 30). In a similar way, the reconstructed frame is also stored in the shared memory. This task is carried out by each individual thread after encoding the assigned slice, meaning that another synchronization point is required before applying the "loop filter" process (line 32). The encoded video data stream (i.e., the bit stream) is organized into network abstraction layer units (NALUs), where each NALU is a packet containing an integer number of bytes. To finish the encoding process, the NALUs corresponding to each slice must be written in the correct order to form the final bit stream (line 35).
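The slice-sizing step of Algorithm 1 can be sketched as follows. This is a Python illustration whose variable names follow the text; the rounding behavior (dimensions rounded up for incomplete edge CTUs, last slice equal or smaller) matches the description above.

```python
import math

def slice_sizes(frame_width, frame_height, ctu_size, num_threads):
    """Sketch of the slice sizing in Algorithm 1 (one slice per thread).

    Edge CTUs may be incomplete, so the frame dimensions in CTUs are
    rounded up; every slice gets ceil(NoCTUs / num_threads) CTUs
    except the last, which takes whatever remains.
    """
    fr_width = math.ceil(frame_width / ctu_size)    # FrWidth
    fr_height = math.ceil(frame_height / ctu_size)  # FrHeight
    no_ctus = fr_width * fr_height                  # NoCTUs
    per_slice = math.ceil(no_ctus / num_threads)
    sizes = [per_slice] * (num_threads - 1)
    sizes.append(no_ctus - per_slice * (num_threads - 1))
    return sizes
```

For a 1280 × 720 frame with 128 × 128 CTUs (60 CTUs) and 8 threads, this gives seven slices of 8 CTUs and a last slice of 4; with 11 threads, the last slice gets 0 CTUs, the idle-thread case discussed later.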
It is worth mentioning that the slice-based parallel strategy for HEVC proposed in [20] obtained good speed-ups when all slices had the same number of CTUs, or differed by a maximum of a single CTU. However, this behavior changes greatly when the JEM encoder is used. As explained in Section 2, the changes to the coding procedure introduced in JEM with respect to HEVC result in significant differences in computational cost when encoding different CTUs. Note that this load imbalance is mainly due to the intrinsic characteristics of the video content. Figure 8 shows graphically the structure of the synchronous parallel algorithm JEM-SP-Sync. As explained in Algorithm 1, each thread (Tx) always processes the same slice (Sx) in all frames. Before starting the processing of a new frame, the threads must synchronize (Sync) to compose the bitstream of the last frame (line 35 of Algorithm 1).

Asynchronous Algorithm: JEM-SP-ASync
To overcome this load imbalance, we designed a parallel algorithm called JEM-SP-ASync, shown in Algorithm 2, which does not use any type of synchronization during the overall encoding process. The calculation of both the number of CTUs per frame and the number of CTUs per slice (line 3) is identical to that in Algorithm 1 (lines 2-18). Before starting the parallel region, the dimensions of all slices are calculated (lines 3-10) and stored in memory, which is configured as shared memory (the iniCTUs and endCTUs arrays), since these values will be used by all threads. At the beginning of the parallel region, the first slice to be encoded by each thread is set (line 12) based on the thread identifier (Tid). However, the mapping of slices to threads changes for every new frame, following a round-robin-like schedule (lines 24-27). For this reason, each thread must update the CTUs to be encoded when starting the encoding of a new frame (lines 15 and 16). Since there are no synchronization points, the encoded NALUs must be stored in shared memory. When the encoding process for a slice is complete, each thread checks whether there is a frame for which all of the slices have been encoded; if so, this thread stores the complete encoded frame in the bit stream. This procedure is performed within a parallel critical region (lines 20-23). The proposed mapping of slices to threads in Algorithm 2 provides an automatic load balancing mechanism without the need for synchronization points or a coordinating process. By alternating the slice encoded by each thread from one frame to the next, the computational cost per thread tends to balance, with a greater probability as the number of frames to be encoded increases.
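The round-robin-like mapping of slices to threads can be sketched in one line. This is an illustration of the rotation described above, under the assumption (made throughout the paper) that the number of slices equals the number of threads.

```python
def slice_for_thread(tid, frame_idx, num_threads):
    """Round-robin-like mapping of slices to threads across frames.

    Thread `tid` starts on slice `tid` in frame 0 and advances to the
    next slice (modulo the number of slices) on each new frame, so
    that over many frames each thread visits every slice position
    roughly equally often, which is what balances the load.
    """
    return (tid + frame_idx) % num_threads
```

For example, with 4 threads, thread 0 encodes slices 0, 1, 2, 3 over the first four frames, so a persistently expensive slice is shared among all threads rather than pinned to one.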
In the asynchronous algorithm, shown in Figure 9, there are no synchronization points. To compose the bit stream, after processing a slice, each thread checks whether all the slices of the frame following the last fully encoded one have already been encoded; if so, the thread composes that frame and removes the stored data. This process (line 17 in Algorithm 2) does not involve any synchronization point. Furthermore, each thread (Tx) processes a different slice (Sx) in each frame.
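The frame-completion check each thread performs can be sketched as follows. Here `encoded` stands in for the shared-memory bookkeeping of finished (frame, slice) pairs; in the real encoder this check and the subsequent write run inside the critical region of Algorithm 2. This is a simplified, one-frame-at-a-time sketch.

```python
def next_complete_frame(encoded, num_slices, last_written):
    """Return the index of the frame following the last fully written
    one if all of its slices have been encoded, else None.

    `encoded` is a set of (frame_idx, slice_idx) pairs; frames are
    written to the bit stream strictly in order, so only the frame
    immediately after `last_written` is a candidate.
    """
    candidate = last_written + 1
    if all((candidate, s) in encoded for s in range(num_slices)):
        return candidate
    return None
```

Because frames must appear in the bit stream in display order, a frame whose slices finish early still waits until every earlier frame has been written.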

Experimental Results
The reference software used in our experiments was JEM-7.0rc1 [21], and the parallel algorithms were developed and tested using the GCC v.4.8.5 compiler [22] and OpenMP [23]. The shared memory platform used consisted of two Intel XEON X5660 hexacores, with up to 2.8 GHz and 12 MB cache per processor, and 48 GB of RAM. The operating system used was CentOS Linux 5.6 for x86/64-bit systems.
The proposed algorithms were tested on the video sequences listed in Table 3, each of which has a different resolution and frame rate. In all of the experiments, we used the same number of frames to be encoded (120) for all video sequences; hence, the number of seconds encoded varied depending on the frame rate of the video sequence tested. We used a small number of frames in order to evaluate the load balancing capability of the JEM-SP-ASync algorithm (Algorithm 2). Note that as the number of frames to be encoded increases, the automatic load balancing method of Algorithm 2 is expected to improve, since it is statistically more likely that the computational cost per thread will be balanced. The values of the quantization parameter (QP) used were 37, 32, 27 and 22.
Before addressing the efficiency of the parallel algorithms described in Section 3, we analyze the theoretical load balance, which depends on both the resolution of the video sequence and the number of slices (i.e., the number of threads). In all of the experiments, the CTU size was set to 128, based on common testing conditions [24]. Table 4 shows the dimensions of the different video resolutions tested, in numbers of CTUs of 128 × 128 pixels. As mentioned above, there may be incomplete CTUs at the right-hand and bottom edges of the video sequence; this occurs when the number of horizontal/vertical CTUs is not an integer, as shown in Table 4. This is the first source of potential load imbalance, and it applies even if the computational costs associated with different CTUs are similar. Table 5 shows the theoretical size of the slices, in numbers of CTUs, required to obtain a balanced load. As previously explained for Algorithm 1, when the number of CTUs per slice is not an integer, the size is rounded up to the next integer value; otherwise, the number of slices would not match the number of threads.
Table 6 shows the size differences, in numbers of CTUs, between the last slice and the other slices in the same frame. As can be seen, when 12 slices are used in the HD video sequence, the difference reaches nine CTUs. This is the second source of potential load imbalance, and as in the previous case, it holds even if the computational costs associated with different CTUs are similar. In addition, the use of certain numbers of threads is not recommended in some cases. For example, when 11 threads are used for the 1280 × 720 video sequence (60 CTUs in Table 6), all slices have six CTUs while the last has none, meaning that the last thread will remain idle throughout the encoding process.
Table 6. Difference in size of the last slice (in CTUs).
The third and final source of potential load imbalance depends on the encoding complexity of JEM, which depends not on the number of CTUs but on the intrinsic characteristics of the intra-prediction, which are affected by the CTU contents. As explained in Section 2, this load imbalance cannot be predicted, as it depends on the intrinsic characteristics of the video content to be encoded, which modify the amount of processing required to encode each CTU. To show that domain decomposition using slices does not ensure that the load is balanced, we experimentally obtained the computational cost of each slice for the sequences listed in Table 3. Tables 7-10 show the experimental percentages of the computational cost assigned to each slice for the Park Scene, Four People, Party Scene and BQ Square video sequences, respectively. These tables include different numbers of slices per frame for each video sequence, and show the relative computation time required by each slice at different compression rates (QPs). As can be observed, none of these schemes achieves correct load balancing, regardless of whether or not the volume of data assigned to each process is balanced.
Table 11 shows the computation times (in seconds) needed to encode 120 frames using the AI coding mode for all video sequences tested, and the average computation time per frame for all QPs tested. As can be seen, the sequential algorithm required an average of 14 min to encode a frame of the Park Scene video sequence, meaning that 28.2 h would be needed to encode the 120 frames of the video sequence (five seconds of video). Tables 12 and 13 show the parallel efficiency and the computation times, respectively, of the JEM-SP-Sync algorithm for the BQ Square, Party Scene, Four People and Park Scene video sequences encoded at four QP values (22, 27, 32 and 37).
The computation times were obtained using OpenMP functions and the parallel efficiency, in percentage, was calculated according to Equation (1).
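A common definition of parallel efficiency, consistent with the reported percentages, divides the sequential time by the number of threads times the parallel time; this sketch assumes that definition for Equation (1).

```python
def parallel_efficiency(t_seq, t_par, num_threads):
    """Parallel efficiency as a percentage:

        E = T_seq / (p * T_par) * 100

    where T_seq is the sequential time, T_par the parallel time and p
    the number of threads. Assumed form of Equation (1), not quoted
    from the paper.
    """
    return 100.0 * t_seq / (num_threads * t_par)
```

For example, a sequence that takes 1000 s sequentially and 125 s on 8 threads yields an efficiency of 100%, while 160 s on 8 threads yields 78.1%.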
The JEM-SP-Sync algorithm, which includes synchronization points, generally obtains good parallel efficiency when few threads are used (set by the NoT value), but does not have good scalability, meaning that its efficiency degrades as the number of threads increases. The high computational cost of the JEM video encoder means that if the load is not perfectly balanced, the parallel performance is strongly affected. In this case, and as mentioned above, the load imbalance may be caused by differences in slice sizes, by some slices containing incomplete CTUs, or by differences in the intrinsic complexity of coding each CTU in JEM. As shown in Tables 7-10, the computational cost is not balanced despite the quasi-balanced domain decomposition in terms of the volume of data assigned to each processor; the load imbalance is due to the nature of the data, i.e., the content of each CTU. Since the JEM-SP-Sync algorithm is synchronous, when a thread has completed the processing of its assigned CTUs for one frame, it must wait in an idle state until the rest of the threads have also completed their assigned CTUs, which decreases the parallel efficiency, as shown in Table 12.
The second algorithm, JEM-SP-ASync, was developed to avoid the use of synchronization processes, and can improve the parallel efficiency if load balancing is achieved. The process implemented in this algorithm to balance the load assigns a different slice to each thread depending on the frame to be encoded, thus providing automatic load balancing. Tables 14 and 15 show the parallel efficiency and the computation times obtained by the JEM-SP-ASync algorithm for all video sequences tested here. The results confirm that the automatic load balancing process implemented in the JEM-SP-ASync algorithm works correctly and shows very good scalability, especially compared to the JEM-SP-Sync algorithm (Table 12).
Note that in the sequential encoder, the internal coding state variables evolve from one slice to the next, whereas in our parallel algorithms, these variables have the same initial values for all slices.

Conclusions
Some of the most important features of the JEM video encoder in relation to intra-prediction have been briefly described here. These features focus on reducing the bit rate in order to minimize the bandwidth required for transmission, but they cause a dramatic increase in the computational cost of encoding compared with previous video coding standards, including HEVC. The parallel algorithms developed here make use of slices to implement a domain decomposition; however, if the domain decomposition does not allow the volume of data assigned to each process to be perfectly balanced, the high computational cost causes significant load imbalances. Moreover, a perfect balance of the data to be encoded by each process does not ensure load balancing either, unlike in the case of the HEVC encoder: in JEM, a perfect domain decomposition is not guaranteed to give rise to a perfect computational load balance. The JEM-SP-ASync parallel algorithm was proposed to solve these drawbacks, which, as explained above, did not arise in previous standards. The automatic computational load balancing mechanism included in the design of the proposed parallel algorithms was validated by the experimental results, which show that the average efficiency rose from 71.3% to 93.4% when the JEM-SP-ASync algorithm was used instead of the JEM-SP-Sync algorithm. The parallel scalability also improved significantly; for example, the average efficiency when coding the Four People video sequence using 12 processes increased from 51.6% to 88.8%. These results were obtained by encoding only 120 frames (corresponding to 2.4 or 5 s, depending on the frame rate of the video sequence), and demonstrate correct load balancing even for short video sequences.

Conflicts of Interest:
The authors declare no conflict of interest.