Optimization of the Generative Multi-Symbol Architecture of the Binary Arithmetic Coder for UHDTV Video Encoders

Pastuszak, Grzegorz

doi:10.3390/electronics12224643

by

Grzegorz Pastuszak

Faculty of Electronics and Information Technology, Warsaw University of Technology, 00-665 Warsaw, Poland

Electronics 2023, 12(22), 4643; https://doi.org/10.3390/electronics12224643

Submission received: 6 October 2023 / Revised: 9 November 2023 / Accepted: 10 November 2023 / Published: 14 November 2023

(This article belongs to the Special Issue New Technology of Image & Video Processing)

Abstract

Previous studies have shown that the application of the M-coder in the H.264/AVC and H.265/HEVC video coding standards allows for highly parallel implementations without decreasing maximal frequencies. Although the primary limitation on throughput, originating from the range register update, can be eliminated, other limitations are associated with low register processing. Their negative impact is revealed at higher degrees of parallelism, leading to a gradual throughput saturation. This paper presents optimizations introduced to the generative hardware architecture to increase throughputs and hardware efficiencies. Firstly, it can process more than one bypass-mode subseries in one clock cycle. Secondly, aggregated contributions to the codestream are buffered before the low register update. Thirdly, the number of contributions used to update the low register in one clock cycle is decreased to save resources. Fourthly, the maximal one-clock-cycle renormalization shift of the low register is increased from 32 to 64 bit positions. As a result of these optimizations, the binary arithmetic coder, configured for series lengths of 27 and 2 symbols, increases the throughput from 18.37 to 37.42 symbols per clock cycle for high-quality H.265/HEVC compression. The logic consumption increases from 205.6k to 246.1k gates when synthesized on 90 nm TSMC technology. The design can operate at 570 MHz.

Keywords:

video encoder; binary arithmetic coding; CABAC; entropy coding; UHDTV; FPGA; VLSI; H.265/HEVC; H.264/AVC

1. Introduction

Context Adaptive Binary Arithmetic Coding (CABAC) is used in the H.265/HEVC [1] and H.264/AVC [2] standards as a key module of the entropy encoder. It provides bit rates close to the entropy limit, assuming that symbol probabilities are estimated accurately. The processing complexity of CABAC is high, which is of great importance when requirements on the throughput increase. In particular, since CABAC sequentially processes binary symbols (bins), it can become a bottleneck of video encoders for high bit rates (high qualities) and high-resolution videos (e.g., Ultra High Definition Television UHDTV).

The H.265/HEVC standard at level 6.2 High tier specifies a bit-rate limit of 800 Mbit/s for 8K resolution. This limit corresponds to a bin/symbol rate of about 1 Gbin/s. Such a bin rate can be achieved by architectures capable of processing a series of several bins/symbols in each clock cycle (e.g., 3–4). However, in practice, the required maximal throughput can be significantly higher. Firstly, the number of bins can vary depending on the complexity of coding tree units (CTUs) and frames. Secondly, some specialized applications (e.g., film production and high-quality video storage) can exceed limits on bit rates. Thirdly, Field-Programmable Gate Array (FPGA) implementations are usually clocked at lower frequencies than Application-Specific Integrated Circuits (ASICs). Fourthly, some video services need processing at speeds faster than real time to simultaneously generate different quality/resolution streams.

In [3], evaluations of the current state-of-the-art entropy coder architecture show that it becomes a bottleneck of the system for high-quality videos. Therefore, high-throughput CABAC implementations are needed. These can be realized by incorporating several slower parallel units assigned to separate video parts (e.g., wavefronts, frames, or slices). On the other hand, this approach makes the design more complex, involves system latencies, and requires external memories to buffer streams. Thus, single CABAC units are more useful.

High-performance Binary Arithmetic Coding (BAC) can be used in non-normative compression applications apart from standardized video coding. For example, vast amounts of data from physics experiments must be archived on disks, which drives the need for fast data compression.

The CABAC algorithm is challenging to pipeline/parallelize since each input bin can modify the range and low/offset registers, affecting the following modifications. High-speed updates of these variables are performed on separate pipeline stages since the range update affects, but does not depend on, the low update. The throughput can be increased if stages enable more than one bin/symbol to be processed in a clock cycle. For instance, in the M-coder used in H.264/AVC, an architecture was developed to code two bins in each clock cycle, as described in [4]. In [5,6], the throughput was increased by the parallel processing of context-coded (adaptive probabilities) and bypass-mode (probabilities equal to 0.5) bins. Some architectures have been optimized to efficiently encode one context-coded symbol per clock cycle by increasing hardware efficiency [7], processing continuity [8], and operation frequency [9]. Multi-symbol architectures allow for the highest throughputs [10,11,12,13,14,15,16]. They achieve bin/symbol parallelism through the cascade of combinational-logic units within pipeline stages that update the range and low registers. These cascaded units create critical paths that can negatively affect maximal operation frequencies. Consequently, performance tends to saturate when increasing the number of symbols coded in one clock cycle [12]. In [17,18], CABAC modeling is used to obtain high-throughput rate estimations in rate-distortion optimization. However, the approximation techniques used in these studies cannot be applied to bit-stream generation in BAC.

In our previous study [19], we proposed the generative multi-symbol architecture for the Binary Arithmetic Coder (BAC) to further increase the number of bins coded in one clock cycle without decreasing maximal frequencies. Although this led to higher throughputs compared to other works, its efficiency was decreased by limitations originating from the low update. In particular, the range processing had to be halted in some clock cycles to add multiple bypass-mode contributions to the low register (and codestream) or to perform upshifts by more than 32 bit positions during renormalizations.

In this study, the BAC architecture is optimized to balance throughputs of the range- and low-register stages. Firstly, more than one bypass-mode subseries can be coded in a clock cycle. Secondly, contributions to the codestream are buffered before the low register update, decreasing the occurrence of hold cycles in the range register processing. Thirdly, contributions are aggregated to increase buffer utilization. Fourthly, the occurrence of hold cycles is minimized by increasing the maximal one-clock-cycle renormalization shift of the low register from 32 to 64 bit positions. Moreover, the number of contributions computed to update the low register in one clock cycle is decreased to save resources. These optimized architectures have established a new state-of-the-art in high-throughput arithmetic coding.

The rest of the paper is organized as follows: Section 2 provides an overview of the BAC algorithm. The state-of-the-art architecture proposed in our previous work is described in Section 3. Section 4 presents the optimization methods. The design and implementation results of the architecture are described in Section 5 and Section 6, respectively. Finally, the study is concluded in Section 7.

2. Arithmetic Coding in Video Standards

The entropy coder serves as the last processing stage in video encoders. At first, syntax elements such as prediction modes, partition modes, quantized coefficients, etc., directed to the entropy coder are binarized according to dedicated coding schemes. Next, context models are assigned to some bins, along with the selection of corresponding probability models. Each probability model consists of two elements. The first one is the index paired with the probability of the least probable symbol (LPS). The second one is the indicator for the most probable symbol (MPS). The selected probability model is submitted into the binary arithmetic coder (BAC) and then updated. The MPS indicator XORed with the input bin provides a symbol distinguishing between LPS and MPS coding.

Figure 1 shows the BAC flowchart. The BAC has two crucial registers for range and offset/low variables. These registers model the subdivided interval after encoding successive input symbols, as illustrated in Figure 2. In the M-coder employed by H.264/AVC and H.265/HEVC, four variants of the probability estimate are determined from the index directed to the BAC through look-up tables (LUTs). These variants correspond to four subranges distinguished by two (6th and 7th) of nine bits of the range register. The selected LPS estimate (rLPS) is used in the interval subdivision, becoming the new range value (interval length) when coding LPS. The subtraction of the rLPS from the range register provides the MPS estimate (rMPS), which becomes the new range value if the MPS is coded. The rMPS is added to the low register for LPSs. When a bypass-mode symbol is coded, the interval is divided into two equal-length subintervals. The low register increases if the bypass-mode symbol is equal to 1.

The range and low registers are shifted up whenever the range register value falls below 256, signified by the most significant bit becoming 0. This renormalization procedure is repeated until the bit becomes 1. The output bit-stream is formed using bits removed from the most significant position of the low register during each renormalization. Following a bypass-mode symbol, renormalization restores the range value before the subdivision. BAC implementations can be simplified since the range register remains unchanged for such symbols. In such cases, the range register value is only used to compute contributions to the low register.
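
The interval subdivision and renormalization described above can be sketched in a few lines of Python. This is only an illustrative model, not the standard's reference procedure: the function names are hypothetical, the selected rLPS is passed in as a parameter instead of being read from the normative look-up tables, and the low register is an unbounded integer, so output-bit extraction and carry handling are not modeled.

```python
RANGE_MIN = 256  # renormalization threshold: 9-bit range with its MSB set


def variant_index(rng):
    """Select one of four rLPS variants from bits 6 and 7 of the 9-bit range."""
    return (rng >> 6) & 3


def renormalize(rng, low):
    """Shift range and low up until range >= 256; return the shift count too."""
    shift = 0
    while rng < RANGE_MIN:
        rng <<= 1
        low <<= 1
        shift += 1
    return rng, low, shift


def encode_context_bin(rng, low, r_lps, is_lps):
    """Code one context-coded symbol given the selected LPS estimate r_lps."""
    r_mps = rng - r_lps
    if is_lps:
        low += r_mps  # the LPS subinterval lies above the MPS subinterval
        rng = r_lps
    else:
        rng = r_mps
    return renormalize(rng, low)


def encode_bypass_bin(rng, low, bin_value):
    """Code one bypass-mode symbol; the range register stays unchanged."""
    low <<= 1  # halving the interval followed by a one-bit renormalization
    if bin_value:
        low += rng
    return rng, low, 1
```

For example, coding an LPS with a range of 400 and an rLPS of 100 adds the rMPS of 300 to the low register and renormalizes by two bit positions, which restores a range of 400.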

3. Generative Multi-Symbol Architecture

The generative multi-symbol architecture [19] allows a long series of symbols to be coded in each clock cycle. This architecture is generative, meaning that HDL-level parameters specify the maximal number of bins/symbols to be coded in one clock cycle. In particular, the maximal lengths of LPS- (long, NLPS) and MPS-leading (short, NMPS) series are described with the configuration and notation NLPS/NMPS. The division of symbol sequences into long and short series is depicted in Figure 3. Adaptive decreasing (dec) of series lengths makes the frequency of short series lower, increasing throughputs. Generally, the higher throughput increases the consumption of logic resources, and the scaling has no strict limits.

The block diagram of the architecture is shown in Figure 4. Unlike conventional architectures, the generative architecture introduces an additional subblock for range evaluation. The subblock finds range values at points/moments that finish one series and start the next series of symbols. To this end, the architecture takes advantage of the four-variant range processing that is possible when a series starts with LPS. For a given index/context assigned to the LPS, these starting points can assume only four range values, computed with LUTs as the renormalized rLPS. Therefore, the range is simultaneously evaluated for four starting points and symbols, following the first LPS in a series. This evaluation is performed in the pipeline to keep a high clock frequency, where each range unit can include registers at its output, as depicted in Figure 5. The assignment of symbols from successive series to range units and pipeline stages is shown in Figure 6. Since the evaluation starts from known range values, feedback from the last range stage to the first one is avoided. The feedback occurs at the last pipeline stage, where one of four paths is selected to obtain the new range value. Two bits of the range register obtained for the preceding series are used for this selection.
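
The idea of evaluating four candidate paths and selecting one late can be illustrated with a small software model. The sketch below is an assumption-laden illustration: `toy_r_lps` is a made-up stand-in for the standard's rLPS look-up table, probability-state adaptation is ignored, and the pipeline registers are not modeled. It only shows that starting from the four possible renormalized rLPS values and selecting with two bits of the preceding range gives the same result as sequential processing.

```python
def toy_r_lps(state, variant):
    """Hypothetical rLPS table (NOT the normative one); values stay below 256."""
    return 8 + 12 * (state % 8) + 30 * variant


def renorm(rng):
    """Shift the range up until its most significant (9th) bit is set."""
    while rng < 256:
        rng <<= 1
    return rng


def step(rng, state, is_lps):
    """Update the range for one context-coded symbol."""
    r_lps = toy_r_lps(state, (rng >> 6) & 3)
    return renorm(r_lps if is_lps else rng - r_lps)


def sequential(rng, series):
    """Reference model: process symbols one after another."""
    for state, is_lps in series:
        rng = step(rng, state, is_lps)
    return rng


def speculative(prev_rng, series):
    """Evaluate an LPS-leading series for all four variants, then select."""
    (state0, is_lps0), rest = series[0], series[1:]
    if not is_lps0:
        raise ValueError("four-variant evaluation needs an LPS-leading series")
    # Each path starts from a known value: the renormalized rLPS of a variant.
    paths = [sequential(renorm(toy_r_lps(state0, v)), rest) for v in range(4)]
    # Late selection with two bits of the range left by the preceding series.
    return paths[(prev_rng >> 6) & 3]
```

Because the four paths do not depend on `prev_rng`, they can be computed before the preceding series has finished, which is what removes the long feedback path in this model.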

If it is not possible to start a series with the LPS, the series beginning with MPS must be evaluated. In this case, the previous range value is unknown in advance. Thus, processing is performed in the last stage, which limits the length of the MPS-leading series. In particular, such short series can be constrained to just one symbol when the architecture is configured to the highest clock frequency. In this case, registers are inserted between all range units (as seen in Figure 5) along the path. If the design intends to operate at a lower frequency, two or three successive range units are assigned to one stage (see Figure 6d). These configurations allow for short series of two or three symbols, respectively.

Range values determined at the last evaluation stage are the starting points for each symbol series. This removes the need for feedback between the series, a feature found in conventional multi-symbol BAC implementations. Therefore, subsequent regular range processing is pipelined similarly to the range evaluation submodule. On the other hand, range units in the regular processing are more complex as they provide context-coded contributions to the low register. These contributions consist of increases (rMPSs) and renormalization shifts, which are assembled and transferred through the pipeline stages to the following subblocks. The increases are directed to the low processing, whereas the shifts are cumulated and directed to the bypass-symbol merging. The accumulation assumes that the shift number for each symbol is the sum of those for previous symbols within the same series.

The bypass-symbol merging aims to insert contributions of bypass-mode symbols among contributions of context-coded symbols. To this end, cumulated shifts (CSs) for context-coded symbols are increased by the lengths of preceding bypass-mode subseries. Simultaneously, the CS of preceding context-coded contributions (CS before bypass—CSBB) is selected for bypass-mode contributions. The architecture assumes that only one subseries of up to 16 bypass-mode symbols/bins can be merged in one clock cycle. The shift update is repeated if there are more bypass-mode subseries in a series of context-coded symbols/contributions. The preceding pipeline stages (range evaluation and regular processing) are kept on hold in such cases.
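
The shift bookkeeping in the merging step can be sketched as a running sum. The model below is one reading of the description, with a hypothetical data layout: a series is a list of `("ctx", shift)` and `("byp", length)` items, each CS is the sum of all shifts and bypass lengths that precede the item, and all bypass subseries are handled in one pass (the hardware described above takes one subseries per clock cycle).

```python
def merge_shifts(series):
    """Return (CS per context-coded item, CSBB per bypass subseries, total)."""
    cs, csbb, running = [], [], 0
    for kind, amount in series:
        if kind == "ctx":
            cs.append(running)    # shifts of all preceding symbols
        elif kind == "byp":
            csbb.append(running)  # cumulated shift before the bypass subseries
        else:
            raise ValueError(f"unknown item kind: {kind}")
        running += amount         # a shift number or a subseries length
    return cs, csbb, running


series = [("ctx", 2), ("ctx", 0), ("byp", 5), ("ctx", 1), ("byp", 3), ("ctx", 0)]
cs, csbb, total = merge_shifts(series)
print(cs, csbb, total)  # [0, 2, 7, 11] [2, 8] 11
```

In this example, the third context-coded item sees a CS of 7 because the five-bin bypass subseries in front of it adds five bit positions to the two accumulated earlier.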

Updated CSs for all contributions are forwarded to the low-processing submodule to downshift the corresponding increases (rMPSs), as shown in Figure 7 and Figure 8. The number of shifted bit positions is limited to 32 in one clock cycle to avoid critical paths in following operations. If larger shifts are required for a series, more clock cycles are taken, and hold states are introduced to preceding pipeline stages. After the downshifting, the contributions are summed in the adder tree to form the joint contribution. This is downshifted before its addition to the low register. This operation is controlled by the five least significant bits of the rate counter, which accumulates all shifts. If the rate exceeds a multiple of 32, four bytes are taken from the most significant part of the low register and written to the carry buffer. Simultaneously, the contents of the register are shifted up by 32 bit positions, and the previous value of the carry buffer is directed to the output register. If a carry is identified in the low register, the previous value of the buffer is incremented.
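
The reason why contributions can be aligned with their CSs and summed in an adder tree, instead of being added one by one, can be checked with a short model. This sketch makes several simplifying assumptions: only context-coded contributions are modeled as hypothetical `(increase, shift)` pairs, the low register is an unbounded integer, and contributions are upshifted relative to the final register position. The architecture instead downshifts them within fixed-width registers, which differs only by a common power-of-two factor in this integer model. The 32-position limit, the rate counter, and the carry buffer are not modeled.

```python
from itertools import accumulate


def sequential_low(low, contributions):
    """Add each increase, then renormalize, one symbol at a time."""
    for increase, shift in contributions:
        low = (low + increase) << shift
    return low


def parallel_low(low, contributions):
    """Align all increases with their cumulated shifts and sum them at once."""
    shifts = [shift for _, shift in contributions]
    total = sum(shifts)
    # Exclusive prefix sums: the shifts of all preceding symbols.
    cs = [0] + list(accumulate(shifts))[:-1]
    joint = sum(inc << (total - c) for (inc, _), c in zip(contributions, cs))
    return (low << total) + joint


contributions = [(300, 2), (0, 0), (150, 1), (0, 1)]
print(sequential_low(5, contributions), parallel_low(5, contributions))  # 5480 5480
```

Both functions return the same value because every increase is eventually shifted by its own shift number plus those of all following symbols, which equals the total shift minus its CS.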

4. Optimizations

In [3,18], the BAC architecture can be configured for series of up to 15 context-coded symbols. Support for longer series is achieved by extending the adder tree with more contributions to the low register (see Figure 7). This modification allows for higher throughputs, as summarized in Table 1 for H.265/HEVC Low Delay Configuration. The results were obtained for five video sequences (BasketballDrive, BQTerrace, Kimono, Traffic, PeopleOnTheStreet). One might expect the throughput to be proportional to the maximal length of long (LPS-leading) series. However, the gain is significantly smaller when the maximal length exceeds 15. The reason lies in increased frequencies of hold states, as shown in Table 1. Hold states are more frequent in longer series since there are more bypass-mode subseries and more total renormalization shifts greater than 32.

The following subsections describe modifications introduced to the architecture to maximize the throughput while keeping resource costs as low as possible. Since the throughput is related to the frequency of hold states, the successive modifications are reported/evaluated only in terms of the latter. Additionally, only three configurations have been selected to obtain concise summaries. A more detailed evaluation of the optimized architecture is included in Section 6.

4.1. Parallel Bypass-Mode Subseries

The first optimization introduced to the architecture is bypass subseries parallelism (BP), which is the ability to accept two or three bypass-mode subseries in one clock cycle. The input interface has been modified to support a configurable number of subseries. The main modifications are introduced to the bypass-symbol merging submodule, as shown in Figure 9. Each subseries contains processing elements, providing separate contributions. These contributions serve as inputs to the adder tree in the low-processing submodule. The number of the subseries is configured at the HDL level. Each subseries consists of up to 16 bins, and they directly follow the previous subseries or some context-coded symbols.

Table 2 shows the reduction in the frequency/percentage of hold states when processing one (BP1), two (BP2), or three (BP3) bypass-symbol subseries in one clock cycle. This parallelism significantly reduces the occurrence of hold states. On the other hand, the reduction is limited by other factors, as seen in the small improvement between BP2 and BP3 configurations, especially for lower QPs. The main limitation stems from the constraint on the total renormalization shift within one clock cycle.

4.2. Buffer for Contributions

The BAC architecture features variable symbol rates, which differ between the range and low processing stages. This difference introduces hold states and decreases the utilization of logic resources. To further reduce the negative impact of hold states, the optimized architecture incorporates a buffer for contributions corresponding to context-coded symbols. This modification imposes the buffering of range values selected for bypass-symbol subseries. These buffers must be implemented using registers, due to the variable length of series and subseries received from the range evaluation and regular processing. In the optimized architecture, the hold signal is activated when either of the two buffers is full, meaning the number of empty cells is smaller than the amount of new data.

The buffering changes the division into series, as the temporal symbol rates at the input and output are different and decoupled. Since CSs are related to series start points, the cumulation for new series must be carried out after reading from the buffer. Due to its complexity, this operation requires an additional pipeline stage at the buffer output.

Two buffer sizes with capacities of 31 (BU31) and 63 (BU63) context-coded contributions are evaluated for the BP2 configuration. The corresponding sizes of the range buffer are 15 and 31. The evaluation results are summarized in Table 3 (columns without the aggregation). The analysis also considers the case when no buffering is used (BU0). A larger buffer improves the symbol rate (reduces the number of hold states) to a certain extent. However, the symbol rate is lower for the capacity of 31 contributions and longer-series configurations when compared to the architecture without the buffer. This inefficiency originates from the separation of hold signals at the input and output of the buffer, which is required to avoid critical paths. As a side effect, this separation prevents writing into an almost full buffer even when simultaneous reading releases sufficient space for a new series. In such cases, the series is written in the next clock cycle.

4.3. Aggregated Contributions

Each symbol within a series can provide a non-empty contribution to the low register. Thus, the architecture must incorporate one adder and one shifter for each symbol (see Figure 7), regardless of whether MPSs have zero-valued increases (low register is not increased) and shift numbers of zero or one bit. This means that the utilization of hardware resources is inefficient. In particular, the buffer cells often keep zeros, and these empty contributions can be removed from the series without impacting the low register. Moreover, shift numbers assigned to successive MPSs can be aggregated and added to the one assigned to the preceding LPS. However, certain limits constrain aggregation between context-coded contributions, due to bypass-symbol subseries and MPS-leading series. In the first case, shift numbers preceding the subseries must be preserved to allow for the renormalization of the corresponding bypass-symbols contributions. In the second case, aggregation across series boundaries can be disabled to simplify the hardware data flow. Thus, series that start with MPS and/or include the MPS followed by the bypass-symbol subseries provide at least one contribution with zero increase.
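
A minimal sketch of the aggregation idea follows. The item layout `(kind, increase, shift)` and the merging rule are assumptions made for illustration: an MPS item (zero increase) is folded into the preceding context-coded item by adding its shift, while an MPS at the start of a series or directly after a bypass subseries is kept as a zero-increase contribution. The exact conditions used in the hardware may differ. For bypass items, `increase` stands for the range multiplied by the bin subseries and `shift` for the subseries length.

```python
def aggregate(series):
    """Fold zero-increase MPS items into the preceding context-coded item."""
    out = []
    for kind, increase, shift in series:
        prev_is_ctx = bool(out) and out[-1][0] != "BYP"
        if kind == "MPS" and prev_is_ctx:
            prev_kind, prev_increase, prev_shift = out[-1]
            out[-1] = (prev_kind, prev_increase, prev_shift + shift)
        else:
            out.append((kind, increase, shift))
    return out


def apply_to_low(low, series):
    """Reference low update on an unbounded integer (no output or carry)."""
    for kind, increase, shift in series:
        if kind == "BYP":
            low = (low << shift) + increase  # shift by the length, then add
        else:
            low = (low + increase) << shift  # add the rMPS, then renormalize
    return low


series = [("MPS", 0, 1), ("LPS", 300, 2), ("MPS", 0, 0), ("MPS", 0, 1),
          ("BYP", 900, 3), ("MPS", 0, 1), ("LPS", 280, 1)]
packed = aggregate(series)
print(len(series), len(packed))  # 7 5
```

In this example, seven items shrink to five, and applying either list to the same low value gives the same result, so the removed items carried no information beyond their shift numbers.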

The impact of aggregation on the symbol rate (the number of bins/symbols per clock cycle) is shown in Table 3 (under ‘aggregation for greater NL’). As seen, aggregation significantly reduces the frequency of hold states. The improvement is much more significant for a buffer size of 31, where the problem of simultaneous writing and reading becomes marginal. Moreover, the small difference between configurations with buffer sizes of 31 and 63 cells suggests that the first option (BU31) may be selected.

4.4. Shorter Contribution Series for Low Processing

The register buffer enables flexibility in specifying the series lengths supported by the input and output ports. Although aggregation decreases the number of contributions written to the buffer on average, the range processing must support rare series with maximal lengths (e.g., all LPSs and/or MPSs followed by bypass symbols). Therefore, the buffer’s input port should be ready to accept these series. On the other hand, the number of contributions included in series released from the buffer can be smaller, while still preserving a balanced throughput between the range and low submodules.

Shorter series at the low processing submodule reduce hardware resource usage, as reported in Section 6.1. On the other hand, shortening negatively affects BAC throughput. However, the extent of this impact depends on the mutual relationship between the maximal lengths of both buffer ports. The maximal series lengths (NLs) set on the output port and in the low processing are selected experimentally to achieve a slight deterioration in throughput and considerable resource savings. In particular, configurations with LPS-leading series lengths of 15, 21, and 27 have their NL reduced to 7, 11, and 15, respectively. The losses in symbol rate are shown in Table 3 when comparing results for longer (NL-15/21/27) and shorter (NL-7/11/15) series.

4.5. Wider Renormalization Subrange

The basic BAC architecture assumes that the total renormalization shift in one clock cycle cannot exceed 32 bit positions. Thus, the number of bit positions by which contributions can be upshifted is limited to 31. This limitation has a marginal impact on throughput in configurations with shorter LPS-leading series (smaller than 15). However, the higher CS values in configurations that support longer series relatively frequently exceed the limit of 32 bit positions, requiring more clock cycles to process all contributions in a series.

The problem is solved by extending the allowed renormalization subrange to 64 bit positions in one clock cycle. This modification entails higher utilization of hardware resources and longer signal paths. Firstly, CSs applied to contributions can range from 0 to 63. Secondly, upshifted contributions, the low register, and the following registers need more bits in their representation. Thirdly, the bit width of the output interface is increased to 64.

The impact of the wider renormalization subrange on the symbol rate is summarized in Table 4. In particular, the architecture is evaluated for two maximal renormalization subranges of 32 and 64 bits, denoted as R32 and R64, respectively. Additionally, the wider subrange is shown for two configurations of bypass-symbol parallelism. As can be seen, increasing the renormalization subrange from 32 to 64 bits leads to a significant reduction in the frequency/percentage of hold states. Moreover, the wider subrange allows for a higher gain between BP2 and BP3 configurations of the bypass-symbol parallelism, compared to the results in Table 2. Consequently, hold states become a small fraction of coding time, even for the configuration with long series including up to 27 symbols.

5. Architecture

The optimizations described in Section 4 are applied to the basic BAC architecture [19]. The modified architecture has a similar structure to that shown in Figure 4. However, there are differences due to the incorporation of buffers for context-coded contributions and range values. The main changes are introduced at the lower level. More specifically, the range evaluation submodule remains unchanged, whereas the remaining three submodules are optimized.

5.1. Range Processing

The description of the range evaluation submodule is skipped since it is the same as in the basic BAC architecture; details can be found in [19]. The regular range processing is done through a chain of units, each of which is assigned to successive symbols in the series. The architecture embeds NLPS units, so their number matches the maximal series length. Although the regular range processing preserves the configuration-dependent division of the chain into pipeline stages, the units of the chain are modified to support the aggregation of contributions to the low register.

The architecture of a single unit is shown in Figure 10. The unit is internally divided into three stages to minimize critical paths. In the first stage, the input index is mapped into four LPS probability estimate (rLPS) variants and their renormalized values. These values are used in the second stage to update the range variable. The updated range value is forwarded to the same stage of the following unit. Both stages are the same as in the basic architecture. The third stage involves collecting context-coded contributions (rMPSs, shifts, and bypass flags) and range values used to compute bypass-mode contributions. These are appended to the data transferred in the arrays between the range processing units. The data arrays are initialized with zeros at the input to the first unit. In consecutive units, the selective insertion gradually fills the arrays with non-zero values. Consequently, the units at the beginning of the chain have lower complexity than those at the end.

If aggregation is disabled, the selective insertion of contributions is fixed and depends on the unit’s position in the chain. Otherwise, insertion is controlled by the series length variable. It is incremented whenever a contribution in a given unit is appended to the array. The conditional appending is performed by a multiplexer, which discards contributions corresponding to most MPSs. Although such contributions have rMPSs equal to 0, their shift numbers can be 0 or 1. Therefore, the shift numbers for the discarded and preceding contributions must be added. This addition involves the shift chain initialized to 0 whenever LPS or bypass indicators are encountered in a series.

The shift numbers in the series of contributions should be cumulated to facilitate the simultaneous renormalization of increases added to the low register. Since the buffering of contributions changes the division into series, the cumulation must follow the reading. Therefore, the full cumulation performed in the basic architecture range units is not present in the modified ones. Although the aggregation involves similar operations in the shift chain as the full cumulation, there are two crucial differences. Firstly, shift numbers are cumulated only for contributions aggregated into one. Secondly, not all but only selected values taken from the shift chain are forwarded to the low processing. The selected values are the same as the aggregated ones.

5.2. Buffers

The modified architecture incorporates two buffers: one for context-coded contributions (main buffer) and one for range values used to compute bypass-symbol contributions. The block diagram of the buffers is shown in Figure 11. Data written to the buffers are received from the regular range processing submodule. The data read from the buffers are forwarded to the bypass-symbol merging submodule. The main goal of the buffering is to minimize hold states at write ports since they propagate to range processing stages and decrease the average throughput. Hold states occur when either of the two buffers is full, i.e., writing new data would cause an overflow. To avoid an overflow, the number of input items at the write port of a buffer is added to the current buffer fullness. Then, the result is compared to the maximal capacity (overflow element in Figure 11). If the addition results for the two buffers are not greater than the corresponding maximal capacities, the data are written, as indicated by the inactive buffer full signal. The active signal prevents writing even when reading simultaneously releases sufficient space. This suboptimality is the price of separating time-critical signals, i.e., buffer full and hold. On the other hand, this inefficiency is negligible, as concluded from Table 3 when considering buffer sizes of 31 and 63.

Series of contributions read from the main buffer are NL in length unless the number of valid items in the buffer is smaller. The range buffer is read at a variable rate, and the number of utilized range values depends on the LAST signal received from the bypass-symbol merging submodule. For both buffers, the number of cells released during the reading is subtracted from the buffer fullness register, whereas the number of input data items is added. Input data are upshifted by the number of positions resulting from the current buffer fullness. The upshifting adjusts data to empty cells in each buffer.
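
The write-side full check and the fullness update can be modeled cycle by cycle. The class below is a behavioral sketch with hypothetical names (`ContributionBuffer`, `cycle`); it models only the main buffer, uses a Python list in place of the register array and the upshifting logic, and always reads up to NL items per cycle.

```python
class ContributionBuffer:
    """Behavioral model of a register buffer with decoupled full/hold logic."""

    def __init__(self, capacity, read_len):
        self.capacity = capacity  # e.g., 31 (BU31) or 63 (BU63)
        self.read_len = read_len  # NL: maximal number of items read per cycle
        self.cells = []

    def cycle(self, incoming):
        """Advance one clock cycle; return (items_read, hold)."""
        # The full check uses the fullness before this cycle's read, so space
        # released by a simultaneous read is not taken into account.
        hold = len(self.cells) + len(incoming) > self.capacity
        read = self.cells[:self.read_len]
        self.cells = self.cells[self.read_len:]
        if not hold:
            self.cells.extend(incoming)  # new data land in the first empty cells
        return read, hold


buf = ContributionBuffer(capacity=31, read_len=7)
print(buf.cycle(list(range(27)))[1])  # False: 27 items fit into the empty buffer
print(buf.cycle(list(range(5)))[1])   # True: 27 + 5 > 31, although 7 cells are freed
print(buf.cycle(list(range(5)))[1])   # False: the same series is written one cycle later
```

The second cycle shows the suboptimality discussed above: the write is refused even though the simultaneous read would have released enough cells.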

5.3. Bypass-Symbol Merging

The merging submodule combines context-coded and bypass-mode contributions. In the modified architecture, the main changes relate to the parallel processing of bypass-symbol subseries as described in Section 4.1. However, shorter contribution series for low processing and the wider renormalization subrange also impact the size of data arrays. The submodule is depicted in Figure 9 and Figure 12 for general and detail views, respectively. In Figure 12, three parts of the submodule are distinguished. They are responsible for bypass-mode contributions, the cumulation of shift numbers, and the phase control. The first part includes the buffer for bypass-mode data consisting of 16-bin subseries, their lengths, and LAST flags indicating if the subseries are finished or continued in the following one. Input bypass-mode data are appended to the valid ones in the buffer using the demultiplexer. Before writing to the buffer register, such a concatenated array is downshifted using a multiplexer. This operation is controlled by the BU (Bypass Utilized) signal. Written bin subseries are multiplied by values taken from the range buffer to compute bypass-symbol increases to the low register. The number of multiplication units matches that of bypass-mode contributions the BAC can code in one clock cycle.

The most complex and time-critical part of the bypass-symbol merging submodule is the cumulation of shift numbers corresponding to successive symbols in each series. The input shift numbers for context-coded symbols are cumulated in a separate stage to avoid critical paths. At the same input stage, flags indicating the location of bypass subseries are mapped to numbers of preceding context-coded symbols (number before bypass in Figure 12). The numbers are used to select CSs for bypass-symbol contributions in the second stage. The third stage includes two paths for context-coded and bypass-mode contributions, where lengths of bypass-symbol subseries are selectively added to CSs of the following contributions. The result (arrays CS[NL:0] and CSBB[BP:0]) is directed to the low processing and the phase control part. Hold states are activated if there are more bypass-symbol subseries in the current series or the renormalization shift exceeds the allowable subrange (R32 or R64). When it happens, CSs kept in registers are updated with additional results related to the utilized bypass-symbol subseries. The update forms feedback loops with multiplexers driven by the BU signal.

The submodule computes the number of preceding bypass subseries for each context-coded symbol and converts it to the unary representation to facilitate the control. The simultaneous bitwise downshifting of unary variables allows for detection when the update of CSs for corresponding context-coded symbols is completed (RDY—Shift Ready). The downshifting by one bit corresponds to utilizing one bypass subseries with the active LAST flag. The LST signal indicates this condition.
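The unary bookkeeping can be mimicked in software as follows. This is an illustrative sketch, not the hardware description: the function names are hypothetical, each loop step is assumed to utilize exactly one bypass subseries, and the list of LAST flags is invented input data.

```python
def to_unary(n):
    """Unary representation of n as a bit mask with n ones."""
    return (1 << n) - 1


def shift_ready_trace(preceding_counts, last_flags):
    """Track the ready flag of each context-coded symbol.

    preceding_counts[i] is the number of bypass subseries preceding symbol i.
    All unary masks are downshifted by one bit whenever a subseries with an
    active LAST flag is utilized; a symbol is ready once its mask is zero.
    """
    masks = [to_unary(n) for n in preceding_counts]
    trace = [[m == 0 for m in masks]]
    for last in last_flags:
        if last:
            masks = [m >> 1 for m in masks]
        trace.append([m == 0 for m in masks])
    return trace


for ready in shift_ready_trace([0, 1, 1, 2], [True, True]):
    print(ready)
# [True, False, False, False]
# [True, True, True, False]
# [True, True, True, True]
```

The unary form makes the ready test a simple comparison with zero after a common one-bit shift, so no per-symbol subtraction is needed.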

The phase control part of the bypass-symbol merging submodule determines the scheduling for each series. In particular, it generates signals indicating hold states and the utilization of bypass subseries. The signals depend on CSs and the shift-phase register pointing to an active renormalization subrange. For the R64 configuration, 0, 1, 2, and 3 values correspond to [0, 64], [65, 128], [129, 192], and [193, 256] subranges, respectively. Most significant bits of CSs are compared to the phase. The equality indicates that the shifts fall in the assigned subrange, which enables forwarding associated contributions to the low processing. The shift-phase register is incremented if all such contributions have been forwarded and there are other contributions related to the next subrange. The hold signal remains active until all contributions in the current series have been utilized/forwarded.
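The scheduling by subranges can be sketched as follows. This is a simplified model under stated assumptions: the function names and the `width` parameter are hypothetical, subrange boundaries follow the intervals listed above for R64, one phase is assumed to take one clock cycle, and bypass-subseries utilization is left out.

```python
def subrange_index(cs, width=64):
    """Map a cumulated shift to its subrange: [0, 64] -> 0, [65, 128] -> 1, ..."""
    return 0 if cs == 0 else (cs - 1) // width


def schedule(cs_list, width=64):
    """Return (phase, forwarded CSs, hold) for each phase of one series."""
    last_phase = max(subrange_index(cs, width) for cs in cs_list)
    cycles = []
    for phase in range(last_phase + 1):
        forwarded = [cs for cs in cs_list
                     if subrange_index(cs, width) == phase]
        hold = phase < last_phase  # contributions of later subranges remain
        cycles.append((phase, forwarded, hold))
    return cycles


for cycle in schedule([3, 40, 64, 65, 130]):
    print(cycle)
# (0, [3, 40, 64], True)
# (1, [65], True)
# (2, [130], False)
```

Calling `schedule` with `width=32` on the same data yields five phases instead of three, which illustrates why the wider renormalization subrange reduces the number of hold states.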

5.4. Low Processing and Bit-Stream Generation

The low processing and codestream generation submodule follows the same data flow as shown in Figure 7. However, the shorter contribution series, the bypass series parallelism, and the wider renormalization subrange (R64) affect the design at the low level. In particular, the first optimization decreases the number of input contributions (rMPS and CS). Consequently, it involves fewer downshifters, adders, and associated registers. On the other hand, bypass subseries parallelism requires one (BP2) or two (BP3) more paths for contributions. The R64 configuration makes the design more complex. Firstly, downshifters have an additional multiplexer layer. Secondly, the bit widths of downshifted increases/contributions and buffers are doubled. In the case of the low register, the bit width is increased from 88 to 132 bits.

6. Implementation Results

The optimized BAC architecture is specified in VHDL and parameterized to change hardware configurations easily. Two constants specify the maximal lengths of the short and long symbol series processed in one clock cycle in the range processing. A third constant limits the length of the contribution series for the low processing. Each optimization described in Section 4 can be independently enabled or disabled at the HDL level, with one exception: shorter contribution series depend on the presence of the buffer. The design is verified against data produced by the HM (version 16.0) and JM (version 17.2) reference models [20,21] for all configurations.
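The parameterization can be pictured with a small configuration record. The sketch below is written in Python rather than VHDL, and every field name is hypothetical; only the dependency between the reduced contribution series length and the buffer is taken from the text, and the example values follow the 27/3, BU31, NL15, R64, BP3 configuration listed in Table 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacConfig:
    """Hypothetical container for the configuration knobs named in the text."""
    long_series: int             # maximal LPS-leading (long) series length
    short_series: int            # maximal MPS-leading (short) series length
    low_series: int              # maximal contribution series length (NL)
    bypass_parallelism: int = 1  # BP1, BP2 or BP3
    buffer_size: int = 0         # BU0 (no buffer), BU31 or BU63
    aggregation: bool = False
    renorm_subrange: int = 32    # R32 or R64

    def __post_init__(self):
        if self.bypass_parallelism not in (1, 2, 3):
            raise ValueError("bypass parallelism must be 1, 2 or 3")
        if self.renorm_subrange not in (32, 64):
            raise ValueError("renormalization subrange must be 32 or 64")
        # The one dependency stated in the text: a reduced NL needs the buffer.
        if self.low_series < self.long_series and self.buffer_size == 0:
            raise ValueError("shorter contribution series require the buffer")


cfg = BacConfig(long_series=27, short_series=3, low_series=15,
                bypass_parallelism=3, buffer_size=31,
                aggregation=True, renorm_subrange=64)
print(cfg)
```

Validating the dependency at construction time is one way to reject an inconsistent configuration early; in an HDL flow the same check might be expressed as an assertion on generics.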

6.1. Synthesis Results

The architecture is synthesized for selected ASIC and FPGA technologies. In the first case, the design is synthesized with Synopsys Design Compiler using TSMC 90 nm standard cell library under worst-case conditions. In the second case, the target device is Arria II GX, and Altera Quartus II software is used as a synthesis tool. Generally, the architecture can be easily synthesized for different ASIC and FPGA technologies since the HDL specification does not refer to technology-dependent cells. It is up to the synthesis tool to apply technology-level optimizations in addition to architectural ones described in previous sections.

The clock rate limit depends on the maximal length of MPS-leading (short) series and the number of bypass subseries the architecture can code in one clock cycle. The main limitation originates from the maximal length of short series. However, if the bypass subseries parallelism is set to three, the design with short series set to one symbol has a much lower frequency. In the case of FPGA devices, maximal frequencies are 200, 154, and 105 MHz for one-, two-, and three-symbol short series, respectively. The TSMC 90 nm technology allows for higher frequencies with 650, 570, and 400 MHz limits, respectively. The limitation is usually due to the number of time-critical range update units assigned to one pipeline stage of the evaluation and the regular processing submodules. However, updating CSs with bypass-symbol contributions introduces critical paths for configurations that reduce short series to one symbol. In particular, the ability to process two (BP2) and three (BP3) bypass-symbol subseries decreases the maximal frequencies to 620 and 570 MHz, respectively. More advanced technologies (e.g., 65 nm, 28 nm, Stratix 10, UltraScale+) allow for higher clock frequencies and performance. In practice, this requires more effort in placement and routing to balance clock paths and meet setup/hold time requirements.

Table 5 and Table 6 summarize the resource consumption for TSMC 90 nm and FPGA Arria II GX, respectively. Selected configurations are shown in the table headers. The FPGA implementation additionally utilizes one, two, or three DSP units to multiply bypass subseries and ranges. DSP units are much more efficient (in terms of critical paths and area) than general logic cells, and their use is decided by the synthesis tool.

Generally, longer series and more bypass subseries increase resource consumption. The relation between the amount of logic resources and the number of symbols in the series is nearly proportional. Resource consumption is decreased for configurations able to code more symbols in MPS-leading series as the design requires fewer pipeline stages (fewer registers). However, this ability negatively affects clock frequencies, as described above.

6.2. Throughput

Different BAC configurations are evaluated for the average throughput, measured as the number of symbols processed in one clock cycle. These configurations are determined by long/short series lengths and enabled optimizations. Results are summarized in Table 7, Table 8 and Table 9. They are obtained for five video sequences (same as in Section 4) and four quantization parameters (4, 12, 22, and 37). For each sequence, 30 frames are coded with the HM H.265/HEVC using the low-delay (LD) and all-intra (AI) configurations. Common Test Conditions [22] are applied for the remaining settings. Additionally, the design is evaluated for LD H.264/AVC. Evaluations summarized in Table 7 and Table 8 assume that throughputs are limited only by the BAC, i.e., the binarization and the context/probability modeling provide sufficient data at the BAC input.

The most important results are for small QPs (e.g., 4 or 12), where the number of symbols to code is the largest, and video quality is high. For these QPs, higher throughputs are achieved due to longer bypass-mode subseries and the more frequent appearance of LPSs. The latter facilitates the formation of longer input series. Generally, configurations capable of coding longer series achieve better results. Successive columns in Table 7 show the impact of introduced optimizations. As shown, they increase the throughput, and the improvement is more significant for configurations capable of coding longer series. However, buffering without aggregation is not beneficial, as discussed in Section 4.3.

The left part of Table 8 provides evaluation results for the H.265/HEVC AI Profile. The impact of hardware configurations and QPs on the throughput is generally similar to the LD Profile case (Table 7). On the other hand, the achieved throughputs are higher compared to the LD Profile. The differences are mainly due to the distribution of bypass-mode symbols.

Symbol rates for LD H.264/AVC are summarized in the last five columns of Table 8. In general, throughputs for H.265/HEVC are significantly higher than those for H.264/AVC due to the difference in symbol distributions. In H.264/AVC, long sequences of MPSs often force short series at the BAC input. Moreover, bypass subseries in H.264/AVC frequently consist of one symbol, and their number in a series tends to be higher. Such distributions introduce hold states more frequently. The buffering and the parallel processing of more than one bypass subseries (BP2 and BP3) significantly improve the throughput by reducing the frequency of hold states. On the other hand, the wider renormalization subrange (R64) has a small impact, as hold states caused by the scattering of bypass symbols divide series into parts with smaller CSs (less than 32).

The BAC architecture is evaluated along with the other modules of the entropy coder proposed in [3] for H.265/HEVC LD and AI Profiles. The achieved symbol rates are summarized in Table 9. Compared to the results shown in Table 7 and Table 8, they are significantly lower for larger QPs and configurations with longer LPS-leading series. The architecture of the entropy coder is balanced when the maximal series length is set to 15. The BAC optimizations for this length provide considerable improvements when the QP is lower. However, other modules that compose the entropy coder should be optimized to utilize the higher BAC throughputs for extended series lengths.

6.3. Comparison

Table 10 compares the optimized architecture with other architectures described in the literature. The fastest hardware configuration of the optimized architecture is considered. Compared to prior work, it achieves the highest symbol rate and throughput. For the high-quality H.265/HEVC compression, the advantage is greater since more bypass-mode symbols can be coded in each clock cycle. For a QP equal to 4, the 27/2 optimized architecture achieves a symbol rate of 37.42, which is 9.39 bins per clock cycle better compared to the case for a QP equal to 22. The design described in [11] shows no improvement for small QPs, as the symbol rate does not depend on the QP. The other works [12,13,14,15] do not report throughputs for small QPs. However, improvements are expected to be much smaller, as these other designs can code fewer bypass-mode symbols compared to the generative architectures. Moreover, in these designs, rates of context-coded symbols are close to their upper bounds for a QP equal to 22.
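The throughput figures in Table 10 follow from multiplying the symbol rate by the clock frequency. The short check below reproduces the values reported for the 27/2 configuration of this work; the function name is illustrative, and all numbers are copied from Table 7, Table 8 and Table 10.

```python
def throughput_mbins(symbol_rate, freq_mhz):
    """Throughput in Mbin/s from symbols per clock cycle and clock rate in MHz."""
    return symbol_rate * freq_mhz


FREQ_MHZ = 570                 # maximal frequency of the 27/2 configuration
hevc_qp22, hevc_qp4 = 28.03, 37.42
avc_qp22 = 14.83

print(round(throughput_mbins(hevc_qp22, FREQ_MHZ)))  # 15977
print(round(throughput_mbins(avc_qp22, FREQ_MHZ)))   # 8453
print(round(hevc_qp4 - hevc_qp22, 2))                # 9.39
```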

The power consumption of the generative architectures depends on the selected configuration. Table 11 includes results for two configurations. Power is estimated using Synopsys Design Compiler based on the switching activity of the simulated design (post-synthesis) for TSMC 65 nm and 90 nm, with the same clock frequency of 570 MHz. The design described in [16] is optimized for power and synthesized for CMOS 65 nm ST PDK. Power estimation results are obtained using Cadence RTL Compiler with tool-inferred stimuli. This design has lower power consumption and significantly lower throughputs compared to the generative architectures.

The basic [19] and optimized architectures have throughputs exceeding the limits of previous implementations. This gain comes at increased hardware costs caused by the higher number of coded symbols and deeper pipelining. On the other hand, the generative architectures allow for different configurations to adjust the required throughput and hardware costs. The significantly higher throughputs of the optimized architecture make it attractive for high-resolution and high-quality video compression, especially when the clock frequency is limited by the technology (e.g., FPGA).
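The trade-off between throughput and hardware cost can be quantified with a short calculation. The numbers below are taken from Table 5 and Table 7 for the 27/2 configuration at a QP equal to 4 (BU0 BP1 versus BU31 R64 BP3); the helper name is hypothetical.

```python
def relative_change(before, after):
    """Return the ratio after/before and the percentage increase."""
    ratio = after / before
    return ratio, (ratio - 1.0) * 100.0


rate_ratio, _ = relative_change(18.37, 37.42)     # symbols per clock cycle
_, gate_increase = relative_change(205.6, 246.1)  # kgates, TSMC 90 nm

print(round(rate_ratio, 2))     # 2.04
print(round(gate_increase, 1))  # 19.7
```

For this configuration, the symbol rate roughly doubles while the gate count grows by about one fifth.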

7. Conclusions

The optimized generative BAC architecture outperforms prior art. Optimization methods applied to the basic generative architecture allow for significant throughput improvements at relatively small hardware costs. Firstly, the methods reduce the frequencies of hold states generated by the bypass-symbol merging submodule. Secondly, the buffering decreases the impact of hold states on range processing. Thirdly, series lengths at the low processing are reduced to save resources. The optimizations can be selectively applied to the architecture, depending on the required throughputs and costs.

Although the proposed optimizations significantly increase throughputs, the other modules of the entropy coder also affect speed performance. Evaluations with the modules proposed in [3] reveal that their throughput should be significantly increased to balance the optimized BAC architecture. Their optimization will be the subject of research in future work.

Funding

This research received no external funding.

Data Availability Statement

Video data are available on request: ftp://ftp.ient.rwth-aachen.de/ctc/ (accessed on 1 July 2020).

Conflicts of Interest

The author declares no conflict of interest.

References

1. ISO/IEC 23008-2:2013; High Efficiency Video Coding (HEVC), ITU-T H.265 and ISO/IEC Standard 23008-2 (MPEG-H Part 2). ISO/IEC: Washington, DC, USA, 2013.
2. ISO/IEC 14496-10:2003; Advanced Video Coding (AVC), ITU-T H.264 and ISO/IEC Standard 14496-10 (MPEG-4 Part 10). ISO/IEC: Washington, DC, USA, 2005.
3. Pastuszak, G. Multisymbol Architecture of the Entropy Coder for H.265/HEVC Video Encoders. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 2573–2583.
4. Pastuszak, G. High-Efficient Architectures of the Context Adaptive Binary Arithmetic Coder for H.264/AVC. In Proceedings of the International Workshops on Systems, Signals and Image Processing (IWSSIP’04), Poznań, Poland, 13–15 September 2004; pp. 167–170.
5. Osorio, R.R.; Bruguera, J.D. A New Architecture for fast Arithmetic Coding in H.264 Advanced Video Coder. In Proceedings of the 2005 8th Euromicro Conference on Digital System Design (DSD’05), Porto, Portugal, 30 August 2005; pp. 298–305.
6. Osorio, R.R.; Bruguera, J.D. High-throughput architecture for H.264/AVC CABAC compression system. IEEE Trans. Circuits Syst. Video Technol. 2006, 16, 1376–1384.
7. Peng, B.; Ding, D.; Zhu, X.; Yu, L. A hardware CABAC encoder for HEVC. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Bangkok, Thailand, 25–28 May 2013; pp. 1372–1375.
8. Chen, J.-W.; Wu, L.-C.; Liu, P.-S.; Lin, Y.-L. A high-throughput fully hardwired CABAC encoder for QFHD H.264/AVC main profile video. IEEE Trans. Consum. Electron. 2010, 56, 2529–2536.
9. Tian, X.; Le, T.M.; Jiang, X.; Lian, Y. Full RDO-support power-aware CABAC encoder with efficient context access. IEEE Trans. Circuits Syst. Video Technol. 2009, 19, 1262–1273.
10. Fei, W.; Zhou, D.; Goto, S. A 1 Gbin/s CABAC encoder for H.264/AVC. In Proceedings of the European Signal Processing Conference, EUSIPCO 2011, Barcelona, Spain, 29 August–2 September 2011; pp. 1524–1528.
11. Tsai, C.-H.; Tang, C.-S.; Chen, L.-G. A flexible fully hardwired CABAC encoder for UHDTV H.264/AVC high profile video. IEEE Trans. Consum. Electron. 2012, 58, 1329–1337.
12. Zhou, D.; Zhou, J.; Fei, W.; Goto, S. Ultra-High-Throughput VLSI Architecture of H.265/HEVC CABAC Encoder for UHDTV Applications. IEEE Trans. Circuits Syst. Video Technol. 2015, 25, 497–507.
13. Chen, C.; Liu, K.; Chen, S. High-throughput Binary Arithmetic Encoder architecture for CABAC in H.265/HEVC. In Proceedings of the IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), Hangzhou, China, 25–28 October 2016.
14. Ramos, F.L.L.; Zatt, B.; Porto, M.; Bampi, S. High-Throughput Binary Arithmetic Encoder using Multiple-Bypass Bins Processing for HEVC CABAC. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018.
15. Li, W.; Yin, X.; Zeng, X.; Yu, X.; Wang, W.; Fan, Y. A VLSI Implement of CABAC Encoder for H.265/HEVC. In Proceedings of the IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), Qingdao, China, 31 October–3 November 2018.
16. Ramos, F.L.L.; Zatt, B.; Porto, M.; Bampi, S. Energy-Throughput Configurable Design for Video Processing Binary Arithmetic Encoder. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 1163–1177.
17. Zhang, Y.; Lu, C. A highly parallel hardware architecture of table-based CABAC bit rate estimator in an HEVC intra encoder. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 1544–1558.
18. Cai, Y.; Fan, Y.; Huang, L.; Zeng, X.; Yin, H.; Zeng, B. A Fast CABAC Hardware Design for Accelerating the Rate Estimation in HEVC. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 2385–2395.
19. Pastuszak, G. Generative Multi-Symbol Architecture of the Binary Arithmetic Coder for UHDTV Video Encoders. IEEE Trans. Circuits Syst.-I Regul. Pap. 2020, 67, 891–902.
20. HEVC Software Repository: HM-16.0 Reference Model. Available online: https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.0/ (accessed on 1 May 2018).
21. H.264/AVC Reference Software JM17.2. Available online: http://iphome.hhi.de/suehring/tml/download/ (accessed on 1 May 2018).
22. Bossen, F. Common test conditions and software reference configurations. In Proceedings of the Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Doc. JCTVC-J1100, Stockholm, Sweden, 11–20 July 2012.

Figure 1. Flowchart of the binary arithmetic coder used in H.264/AVC and H.265/HEVC.

Figure 2. Interval subdivision in the binary arithmetic coder used in H.264/AVC and H.265/HEVC.

Figure 3. Example of the division of the symbol sequence into MPS-leading (short) and LPS-leading (long) series with straightforward (a) and adaptive (b) approaches. M and L correspond to MPS and LPS, respectively. Upper descriptions assume NLPS = 15 and NMPS = 3, whereas the bottom ones assume NLPS = 9 and NMPS = 1.

Figure 4. Block diagram of the BAC architecture [19].

Figure 5. Evaluation of range values. Units distinguished in blue are used for the MPS coding.

Figure 6. Examples of symbol series and their assignment to units. (a,b) depict inputs to the range evaluation for 9/1 and 15/3 configurations, respectively. (c,d) depict the time assignment to units in the pipeline for 9/1 and 15/3 configurations, respectively. L and M correspond to LPS and MPS, respectively. Dots indicate the following series. Dashes indicate unused units.

Figure 7. Simplified diagram of low-register processing and codestream generation. Pipelining is not shown.

Figure 8. Example of code-stream generation from increases in three clock cycles. The horizontal dimension corresponds to bit positions in the code stream. Indices i, j, k, l, m, p, and r correspond to LPS positions. Dark areas indicate valid bits. The right parts show an extended view of the left ones.

Figure 9. General view of the bypass-symbol merging submodule with parallel coding of three bypass-mode subseries.

Figure 10. Range units for regular processing.

Figure 11. Detail view of buffers for contributions (main) and ranges.

Figure 12. Detailed view of the bypass-symbol merging submodule with parallel coding of two bypass-mode subseries.

Table 1. Average symbol rates and percentages of hold states for H.265/HEVC and the Low Delay configuration.

QP	Symbol Rate				Frequency of Hold States [%]
QP	9/1	15/3	21/3	27/3	9/1	15/3	21/3	27/3
4	6.91	13.81	17.15	18.42	12.21	27.71	40.99	53.22
12	5.91	12.50	15.95	17.04	6.32	18.88	32.23	46.26
22	5.28	11.27	14.11	15.10	5.51	19.23	33.58	47.46
37	5.24	10.17	11.69	12.01	7.44	26.13	42.20	54.85

Table 2. Percentages of hold states for three configurations of bypass subseries parallelism.

QP	15/3			21/3			27/3
QP	BP1	BP2	BP3	BP1	BP2	BP3	BP1	BP2	BP3
4	27.71	13.20	11.81	40.99	24.97	22.66	53.22	38.33	36.21
12	18.88	6.31	4.41	32.23	14.88	10.76	46.26	27.19	21.54
22	19.23	5.06	2.68	33.58	13.4	9.55	47.46	25.78	17.11
37	26.13	6.91	2.07	42.20	18.84	8.33	54.85	33.06	20.86

Table 3. Percentages of hold states for different buffering, aggregation, and low series length (NL) configurations.

QP	15/3 BP2							21/3 BP2							27/3 BP2
	BU0	BU31	BU63	Aggregation on				BU0	BU31	BU63	Aggregation on				BU0	BU31	BU63	Aggregation on
				NL-15		NL-7					NL-21		NL-11					NL-27		NL-15
				BU31	BU63	BU31	BU63				BU31	BU63	BU31	BU63				BU31	BU63	BU31	BU63
4	13.20	10.12	2.78	0.03	0.00	0.28	0.01	24.97	45.4	19.16	7.45	4.64	13.14	11.80	38.33	51.19	36.57	25.78	23.72	30.05	29.80
12	6.31	5.07	0.79	0.03	0.00	0.12	0.00	14.88	44.45	9.83	4.11	1.92	7.36	5.63	27.19	50.15	25.98	18.66	15.39	23.02	22.04
22	5.06	4.24	0.59	0.1	0.01	0.14	0.01	13.40	42.73	9.02	5.29	3.47	8.04	6.40	25.78	49.51	24.35	18.54	16.35	22.04	20.93
37	6.91	5.87	1.31	0.59	0.10	0.62	0.13	18.84	41.93	14.77	11.5	10.19	14.98	13.11	33.06	48.05	31.97	27.39	26.1	30.64	29.95

Table 4. Percentages of hold states for three configurations with different renormalization subranges and bypass subseries parallelism.

QP	15/3 BU31 NL7			21/3 BU31 NL11			27/3 BU31 NL15
	R32 BP2	R64		R32 BP2	R64		R32 BP2	R64
	R32 BP2	BP2	BP3	R32 BP2	BP2	BP3	R32 BP2	BP2	BP3
4	0.28	0.04	0.01	13.14	2.79	0.64	30.05	14.76	5.55
12	0.12	0.03	0.01	7.36	1.19	0.23	23.02	8.64	2.28
22	0.14	0.07	0.01	8.04	2.03	0.22	22.04	11.29	2.69
37	0.62	0.37	0.08	14.98	6.73	0.63	30.64	20.87	6.31

Table 5. Resource consumption [kgates] for TSMC 90 nm.

Long/Short Series	BU0			BU31
				BP2	Aggregation on
					BP2	NL Reduced
	BP1	BP2	BP3		BP2	R32 BP2	R32 BP3	R64 BP2	R64 BP3
15/1	116.8	128.1	137.1	124.6	142.6	124.6	130.6	133.6	140.7
15/2	101.5	109.4	124.4	118.2	122.1	108.1	118.1	118.1	127.0
15/3	90.1	95.6	102.6	103.5	108.0	95.9	99.93	101.6	106.1
21/1	181.8	194.2	205.4	204.5	214.3	191.7	198.5	204.1	212.4
21/2	149.1	160.7	168.3	170.2	178.0	158.7	171.2	172.7	181.4
21/3	134.1	141.1	148.6	148.9	156.1	138.2	143.1	145.9	151.2
27/1	259.6	274.4	289.0	284.9	299.2	269.8	279.6	284.9	293.4
27/2	205.6	220.7	227.8	231.3	241.8	215.3	230.3	233.3	246.1
27/3	184.7	193.0	202.5	202.3	213.6	188.1	193.4	196.8	203.6

Table 6. Resource consumption [ALUTs] for Arria II GX.

Long/Short Series	BU0			BU31
				BP2	Aggregation on
					BP2	NL Reduced
	BP1	BP2	BP3		BP2	R32 BP2	R32 BP3	R64 BP2	R64 BP3
15/1	13,453	14,372	15,646	16,122	16,547	14,535	15,197	15,406	16,052
15/3	9915	10,843	11,946	12,307	12,808	11,047	11,727	11,832	12,666
21/1	21,784	23,000	24,393	24,725	25,695	22,824	23,514	23,954	24,522
21/3	15,088	16,214	17,596	17,596	18,451	15,963	16,646	16,950	17,686
27/1	31,838	33,437	35,317	35,381	36,645	32,872	33,421	34,258	34,819
27/3	20,860	22,315	24,221	23,681	24,874	21,510	22,345	22,686	23,616

Table 7. Average symbol rates for H.265/HEVC LD Profile.

QP	Long/Short Series	BU0			BU31
					BP2	Aggregation on
						BP2	NL Reduced
		BP1	BP2	BP3		BP2	R32 BP2	R32 BP3	R64 BP2	R64 BP3
4	15/1	12.44	14.82	15.04	15.37	16.82	16.79	16.80	16.82	16.82
	15/2	13.43	16.26	16.53	16.90	18.76	18.71	18.73	18.75	18.76
	15/3	13.81	16.84	17.12	17.48	19.55	19.50	19.52	19.54	19.55
	21/1	16.55	20.95	21.59	15.43	25.87	24.44	25.09	26.99	27.48
	21/2	16.99	21.69	22.37	15.85	26.93	25.31	26.07	28.24	28.84
	21/3	17.15	21.96	22.66	15.99	27.26	25.55	26.35	28.68	29.35
	27/1	18.21	24.04	24.86	19.16	29.11	27.46	28.53	33.30	36.76
	27/2	18.37	24.33	25.17	19.35	29.42	27.73	28.83	33.74	37.42
	27/3	18.42	24.43	25.28	19.40	29.48	27.78	28.90	33.85	37.63
12	15/1	11.10	12.73	12.96	12.90	13.52	13.51	13.52	13.52	13.52
	15/2	12.11	14.06	14.34	14.26	15.02	15.01	15.02	15.02	15.03
	15/3	12.50	14.58	14.89	14.79	15.62	15.60	15.61	15.62	15.62
	21/1	15.26	19.06	19.95	12.61	21.44	20.75	21.46	22.06	22.27
	21/2	15.77	19.84	20.80	12.94	22.40	21.64	22.43	23.09	23.32
	21/3	15.95	20.12	21.10	13.06	22.74	21.94	22.77	23.46	23.70
	27/1	16.81	22.72	24.44	15.66	25.43	24.09	26.09	28.52	30.49
	27/2	16.98	23.03	24.90	15.80	25.78	24.40	26.47	28.98	31.03
	27/3	17.04	23.13	24.92	15.84	25.89	24.48	26.60	29.14	31.21
22	15/1	9.65	11.26	11.53	11.36	11.83	11.82	11.83	11.83	11.83
	15/2	10.80	12.78	13.11	12.90	13.48	13.47	13.49	13.48	13.49
	15/3	11.27	13.42	13.78	13.55	14.18	14.18	14.20	14.19	14.20
	21/1	13.11	17.01	18.13	11.43	18.57	18.03	19.09	19.22	19.58
	21/2	13.83	18.15	19.40	11.96	19.89	19.29	20.47	20.61	21.01
	21/3	14.11	18.59	19.89	12.15	20.40	19.77	21.00	21.15	21.56
	27/1	14.62	20.59	22.97	14.19	22.55	21.57	23.94	24.56	27.02
	27/2	14.97	21.25	23.85	14.48	23.33	22.29	24.81	25.78	28.03
	27/3	15.10	21.48	24.06	14.58	23.60	22.55	25.12	25.78	28.41
37	15/1	8.46	10.41	10.89	10.52	11.04	11.04	11.10	11.07	11.10
	15/2	9.66	12.18	12.82	12.32	13.01	13.02	13.08	13.04	13.08
	15/3	10.17	12.98	13.69	13.13	13.91	13.90	13.98	13.94	13.99
	21/1	10.25	13.97	15.60	10.33	15.08	14.54	16.15	15.83	16.83
	21/2	11.28	15.80	17.83	11.35	17.19	16.52	18.53	18.12	19.34
	21/3	11.69	16.59	18.61	11.76	18.10	17.36	19.59	19.13	20.46
	27/1	10.83	15.57	18.12	12.38	16.77	16.09	18.59	18.11	21.15
	27/2	11.68	17.26	20.36	13.45	18.71	17.87	20.93	20.35	24.07
	27/3	12.01	17.97	21.32	13.87	19.51	18.62	21.93	21.30	25.36

Table 8. Average symbol rates for H.265/HEVC AI and H.264/AVC LD.

QP	Long/Short Series	BU31, Aggregation on, NL Reduced
		H.265/HEVC AI				H.264/AVC LD
		BU0 R32 BP1	BU31 R32 BP2	BU31 R64 BP2	BU31 R64 BP3	BU0 R32 BP1	BU31 R32 BP2	BU31 R32 BP3	BU31 R64 BP2	BU31 R64 BP3
4	15/2	12.63	20.10	20.12	20.13	5.39	9.41	10.62	9.41	10.62
	15/3	13.23	21.68	21.71	21.71	5.65	10.09	11.59	10.13	11.59
	21/2	15.80	27.96	30.59	31.46	6.05	11.14	13.77	11.26	13.90
	21/3	16.13	28.68	31.70	32.85	6.23	11.56	14.64	11.72	14.79
	27/2	17.78	30.54	34.88	41.18	6.39	11.28	14.26	12.16	15.88
	27/3	17.96	30.68	35.22	41.89	6.51	11.31	14.76	12.39	16.55
12	15/2	13.00	18.16	18.20	18.21	5.35	9.67	11.39	9.68	11.39
	15/3	13.37	18.89	18.93	18.94	5.48	9.98	11.98	10.06	11.98
	21/2	16.29	24.43	27.42	28.14	5.89	10.99	14.10	11.19	14.35
	21/3	16.44	24.63	27.81	28.60	5.97	11.15	14.53	11.35	14.80
	27/2	17.54	26.52	31.48	35.94	6.13	10.72	14.04	11.81	15.76
	27/3	17.59	26.55	31.60	36.11	6.18	10.91	14.27	11.84	16.07
22	15/2	11.97	16.93	16.98	17.00	5.49	9.08	9.96	9.08	9.96
	15/3	12.32	17.63	17.69	17.71	5.83	9.97	11.14	10.01	11.14
	21/2	14.77	22.24	25.25	26.42	6.18	10.75	12.71	10.96	12.93
	21/3	14.90	22.47	25.59	26.85	6.47	11.48	13.95	11.73	14.23
	27/2	15.78	24.46	28.38	33.47	6.56	11.10	13.36	12.03	14.83
	27/3	15.83	24.55	28.57	33.66	6.80	11.61	14.36	12.64	16.08
37	15/2	9.72	14.13	14.17	14.19	4.75	5.42	5.49	5.42	5.49
	15/3	10.10	14.93	14.98	15.00	5.83	6.87	6.99	6.87	6.99
	21/2	11.49	17.91	20.16	22.05	5.78	6.88	7.04	6.90	7.06
	21/3	11.68	18.30	20.71	22.73	6.96	8.54	8.82	8.57	8.85
	27/2	12.03	19.35	21.87	27.71	6.65	8.20	8.48	8.38	8.65
	27/3	12.13	19.74	22.37	28.21	7.87	9.91	10.39	10.18	10.65

Table 9. Average symbol rates for the H.265/HEVC entropy coder.

QP	LD				AI
	BU0 BP1 15/2 [19]	BU31 R64 BP3			BU0 BP1 15/2 [19]	BU31 R64 BP3
	BU0 BP1 15/2 [19]	15/2	21/2	27/2	BU0 BP1 15/2 [19]	15/2	21/2	27/2
4	13.28	16.68	16.89	16.90	12.63	17.61	17.44	17.44
12	10.94	11.77	11.93	11.93	12.36	14.39	14.48	14.49
22	7.51	7.65	7.80	7.81	10.68	11.70	11.85	11.86
37	3.53	3.54	3.55	3.55	6.63	6.74	6.78	6.78

Table 10. Comparison of different BAC architectures.

	Tsai [11]	Zhou [12]	Li [15]	Ramos [14]	15/2, BU0, BP1, R32 [19]	27/2, BU31, BP3, R64 [This Work]
Standard	AVC	AVC/HEVC	HEVC	HEVC	AVC/HEVC	AVC/HEVC
Technology [nm]	130	90	65	65	90	90
Symbol rate QP = 22	5	4.36/4.38	4.63	4.94	5.48/10.8	14.83/28.03
max. frequency [MHz]	254	420	516	537	570	570
max. throughput [Mbin/s]	1270	1831/1839	2390	2653	3124/6156	8453/15,977
Gate count [k]	10.3	64.1 ¹	106.5 ¹	33	101.4	246.1

¹ CABAC without memory.

Table 11. Comparison of power consumption for H.265/HEVC LD; QP equal to 22.

	Ramos [16]		15/2, BU31, BP2, R32		27/2, BU31, BP3, R64
	Alt	ET-conf.2	15/2, BU31, BP2, R32		27/2, BU31, BP3, R64
Power [mW]	23.14	11.77	43.96	72.51	105.10	171.10
Technology [nm]	65	65	65	90	65	90
Symbol rate	4.31	2.99	13.47	13.47	28.03	28.03
throughput [Mbin/s]	2263	1516	7678	7678	15,977	15,977
Gate count [k]	20.76	21.22	91.22	122.1	206.8	246.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
