Area-Time Efficient Two-Dimensional Reconfigurable Integer DCT Architecture for HEVC

In this paper, we present area-time efficient reconfigurable architectures for the implementation of the integer discrete cosine transform (DCT), which support all the transform lengths used in High Efficiency Video Coding (HEVC). We propose three 1D reconfigurable architectures that can be configured to compute the DCT of any of the prescribed lengths: 4, 8, 16, and 32. It is shown that matrix multiplication schemes involving fewer adders can be used to derive parallel architectures for 1D integer DCTs of different lengths. A novel transposition buffer is designed for the proposed 2D DCT architecture, which offers double the throughput without increasing the size of the transposition buffer. We determine the optimal pipeline locations in the proposed design through precise estimation of the propagation delays and the critical path, so that the area-delay product is optimized and all the output samples are obtained in the same cycle in spite of the recursive nature of the structure. Implementation results show that the proposed 2D integer DCT architectures provide significantly higher throughput per unit area than the existing designs for HEVC.

Ahmed et al. [8] proposed a multiplier-less lifting-based approach applied to a sparse decomposition of the DCT matrices. Shen et al. [14] proposed another multiplier-less implementation using a multiple constant multiplication (MCM) approach for the integer DCT of lengths four and eight. For the DCT of lengths 16 and 32, however, they used conventional multipliers. Park et al. [9] used Chen's factorization of the DCT, where the butterfly operation of the DCT computation was implemented by a processing element (PE) with only shifters, adders, and multiplexers. Budagavi and Sze [10] proposed a unified hardware structure that can be used for forward, as well as inverse, DCTs. Efficient MCM-based architectures and transposition buffers for the integer DCT for HEVC were proposed in [12,13]. However, the transposition unit adds a significant complexity overhead in these designs.
As a major deviation from the earlier video coding standards, HEVC allows DCTs of four different lengths. The DCT unit for HEVC is therefore required to be configurable to support different transform lengths, N = 4, 8, 16, and 32. Based on this requirement, in this paper, we propose efficient reconfigurable structures of the DCT unit that can be used for integer DCTs of different lengths. Specifically, three reconfigurable DCT architectures based on a constant matrix multiplication (CMM) and the MCM are proposed. 2D integer DCT architectures having novel transposition buffers are also proposed. The innovations of the proposed architectures are summarized as follows.
• A hardware-oriented algorithm for integer DCT computation for HEVC is proposed.
• Three different flexible hardware architectures for the integer DCT are proposed, each with advantages in terms of area, delay, or power.
• A novel matrix-vector-product unit that involves fewer adders than the existing method [12,13] is proposed.
• A novel 2D integer DCT architecture having double the throughput and less latency (without increasing the size of the transposition buffer) is proposed.
• A novel low-cost pipeline strategy is proposed to reduce the critical path of the proposed 1D integer DCT.
• Comparisons with existing methods are provided in terms of various metrics such as gate counts, maximum usable frequency, throughput, latency, etc.
The rest of the paper is organized as follows. In the next section, we discuss the key features of the integer DCT for HEVC. We describe three proposed reconfigurable architectures of 1D integer DCT in Section 3. In Section 4, we discuss the implementation of 2D DCT. In Section 5, we compare the area and time complexities of the proposed designs with those of the existing designs. Section 6 presents the conclusions.

Key Features of Integer DCT for HEVC
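The N-point integer DCT of an input vector $x = [x(0), x(1), \ldots, x(N-1)]^T$ is given by:

$$y = C_N \cdot x \qquad (1)$$

where the integer DCT kernel $C_N$ is an N × N matrix of integers and $y = [y(0), y(1), \ldots, y(N-1)]^T$ is the output vector. The kernel matrices $C_N$ for N = 4, 8, 16, and 32 for HEVC were defined by JCT-VC [3]. The N-point integer DCT for HEVC can be computed by a partial butterfly approach using an (N/2)-point DCT and a product of an (N/2) × (N/2) matrix by an (N/2)-point vector as:

$$[y(0), y(2), \ldots, y(N-2)]^T = C_{N/2} \cdot a_{N/2} \qquad (2a)$$

$$[y(1), y(3), \ldots, y(N-1)]^T = S_{N/2} \cdot b_{N/2} \qquad (2b)$$

where $a_{N/2} = [a(0), a(1), \ldots, a(N/2-1)]^T$ and $b_{N/2} = [b(0), b(1), \ldots, b(N/2-1)]^T$ with:

$$a(i) = x(i) + x(N-1-i), \qquad b(i) = x(i) - x(N-1-i) \qquad (3)$$

for i = 0, 1, ..., N/2 − 1. $C_{N/2}$ is the (N/2)-point integer DCT kernel of size (N/2) × (N/2). $S_{N/2}$ is also an integer matrix of size (N/2) × (N/2), derived from the first N/2 columns of the odd-indexed rows of $C_N$, such that the (i, j)th entry of $S_{N/2}$ can be defined as:

$$s_{N/2}^{i,j} = c_N^{2i+1,j} \quad \text{for } 0 \le i, j \le N/2 - 1$$

where $c_N^{2i+1,j}$ is the (2i + 1, j)th entry of $C_N$. Note that all even DCT outputs are given by Equation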
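The partial butterfly decomposition of Equations (2a), (2b), and (3) can be checked numerically. The following is a minimal Python sketch, not the hardware algorithm of this paper: it uses the standard HEVC kernels for N = 4 and N = 8, and the function names (`dct_direct`, `dct_partial_butterfly`) and the 2-point base case are assumptions introduced only for illustration.

```python
# HEVC integer DCT kernels for N = 4 and N = 8.
C4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]
C8 = [
    [64, 64, 64, 64, 64, 64, 64, 64],
    [89, 75, 50, 18, -18, -50, -75, -89],
    [83, 36, -36, -83, -83, -36, 36, 83],
    [75, -18, -89, -50, 50, 89, 18, -75],
    [64, -64, -64, 64, 64, -64, -64, 64],
    [50, -89, 18, 75, -75, -18, 89, -50],
    [36, -83, 83, -36, -36, 83, -83, 36],
    [18, -50, 75, -89, 89, -75, 50, -18],
]
KERNELS = {4: C4, 8: C8}


def matvec(m, v):
    """Plain integer matrix-vector product."""
    return [sum(c * x for c, x in zip(row, v)) for row in m]


def dct_direct(x):
    """Equation (1): y = C_N * x."""
    return matvec(KERNELS[len(x)], x)


def dct_partial_butterfly(x):
    """Equations (2a), (2b), (3), applied recursively."""
    n = len(x)
    if n == 2:
        # Assumed base case: the even rows of C4 restricted to two columns.
        return [64 * (x[0] + x[1]), 64 * (x[0] - x[1])]
    half = n // 2
    # Equation (3): input butterflies.
    a = [x[i] + x[n - 1 - i] for i in range(half)]
    b = [x[i] - x[n - 1 - i] for i in range(half)]
    # S_{N/2}: first N/2 columns of the odd-indexed rows of C_N.
    c_n = KERNELS[n]
    s_half = [[c_n[2 * i + 1][j] for j in range(half)] for i in range(half)]
    y = [0] * n
    y[0::2] = dct_partial_butterfly(a)  # Equation (2a): even outputs
    y[1::2] = matvec(s_half, b)         # Equation (2b): odd outputs
    return y


if __name__ == "__main__":
    x = [10, -3, 7, 25, 0, 4, -12, 9]
    print(dct_direct(x) == dct_partial_butterfly(x))
```

The recursion mirrors the way a larger DCT structure reuses the (N/2)-point structure for its even outputs, while the odd outputs need only an (N/2) × (N/2) matrix-vector product.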

Proposed Reconfigurable Architectures for 1D Integer DCT
In this section, we propose three reconfigurable architectures for the computation of the 1D integer DCT of different lengths for HEVC.
3.1. Proposed 1D DCT Architecture-1
Figure 1 shows the proposed 1D DCT Architecture-1. The architecture in Figure 1a is for the integer DCTs of lengths N = 8, 16, and 32, whereas the architecture in Figure 1b is for the computation of the four-point integer DCT. Figure 1c describes the structure of the shift-add unit in Figure 1b. The 16-, 8-, or 4-point integer DCT can be computed by the 32-point DCT structure by setting the sel control signal of the MUX unit appropriately. The input adder unit (IAU) in Figure 1a computes a(i) and b(i) for i = 0, 1, ..., N/2 − 1 according to Equation (3). The MUX unit, which consists of N/2 2:1 MUXes, selects either a(i) or x(i) for i = 0, 1, ..., N/2 − 1, depending on whether it is used to compute Equation (2a) or the DCT of a lower size, respectively. The odd-indexed coefficients y(i) for i = 1, 3, ..., N − 1 are computed by the matrix-vector-product unit (MVPU) according to Equation (2b). The computation of Equation (2b) can be realized as a CMM problem [15–18]. We find that the algorithm of Boullis and Tisserand [18] is very efficient for the computation of the MVPU. We generated the MVPU algorithms for $S_4 \cdot b_4$, $S_8 \cdot b_8$, and $S_{16} \cdot b_{16}$ and list them in Table 1.
The complexity of the proposed 1D DCT Architecture-1 is compared with that of [12] in Table 2, since the DCT computation in [12] also uses a partial butterfly approach based on Equations (2a) and (2b). Table 1 of [12] shows the algorithm of the reference DCT architecture used for the comparison. Specifically, the IAUs of the two structures are the same, and the MVPU of the proposed structure corresponds to the MCM of [12]. The total numbers of adders listed in the seventh column of Table 2 are taken from Table 2 of [12]. The proposed Architecture-1 uses 14.6% fewer adders than [12] for N = 32. Note that pipeline registers are not used for the implementation of the MVPU.
The structure requires only input registers after the IAU to hold the incoming data.

Table 1. Computation of $S_{N/2} \cdot b_{N/2}$ in the matrix-vector-product unit ($b_i = b(i)$ and $y_{2i+1} = y(2i+1)$ for i = 0, 1, 2, ..., N/2 − 1).
$t_2 = 32t_{39} + 2t_{54} + 4t_{62} + t_{93} + 2t_{106} - 8t_{108} - 32b_{14}$;
$t_1 = 2t_{53} + 4t_{87} - 32t_{40} - 2t_{99} - 8t_{104} - 32b_9 - 64b_{10} - b_{14}$;
$y_1 = t_5 + 2t_{11} + t_{49} + t_{69} + 4t_{72} + t_{108} + 128b_0 + 32b_{12} - 8t_{24} - 2t_{24}$;
$y_3 = t_2 + 2t_{16} + 4t_{41} + t_{84} - 2t_{23} - 8t_{23} - t_{47} - t_{86} - 128b_{10}$;

Adding pipeline registers in the critical path is general practice in VLSI design. However, in a recursive design such as the proposed DCT, the locations of the pipeline registers need to be decided carefully. Furthermore, pipelining in the MVPU may significantly increase the silicon area. We determine the optimal pipeline locations through precise estimation of the propagation delays in the critical path, so that the area-delay product is optimized and all the output samples are obtained in the same cycle even in the recursive structure. Architecture-1 for the eight-point DCT involves four pipeline registers before the MVPU to reduce the propagation delay, as shown in Figure 1a. In order to obtain all eight DCT coefficients in the same cycle, four more registers are inserted after the first four adders in the four-point integer DCT unit, as shown in Figure 1b. Similarly, eight and 16 registers are located before the MVPU for the 16- and 32-point DCTs, respectively. Note that Architecture-1 involves two pipeline stages for any of the DCT sizes, N = 4, 8, 16, and 32. An array of AND gates is used to disable the IAU, pipeline registers, and MVPU during the computation of a DCT of a lower size in order to reduce the power consumption.
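To illustrate what a multiplier-less realization of Equation (2b) looks like, the following Python sketch computes $S_4 \cdot b_4$ for the eight-point DCT with shifts and additions only. It is a simple per-input MCM-style decomposition chosen for readability, not the CMM algorithm of Table 1, and the function names and the particular shift-add chain are assumptions introduced for this example.

```python
# S_4: first four columns of the odd-indexed rows of the 8-point HEVC kernel.
S4 = [
    [89, 75, 50, 18],
    [75, -18, -89, -50],
    [50, -89, 18, 75],
    [18, -50, 75, -89],
]


def shift_add_products(x):
    """Return 89x, 75x, 50x, and 18x using four additions and shifts.

    Illustrative chain (an assumption, not taken from Table 1):
      9x = x + 8x, 25x = 9x + 16x, 75x = 25x + 50x, 89x = 25x + 64x.
    """
    t9 = x + (x << 3)
    t25 = t9 + (x << 4)
    t75 = t25 + (t25 << 1)
    t89 = t25 + (x << 6)
    return {89: t89, 75: t75, 50: t25 << 1, 18: t9 << 1}


def odd_outputs_8pt(b):
    """Compute [y(1), y(3), y(5), y(7)] = S_4 * b_4 without multipliers."""
    products = [shift_add_products(v) for v in b]
    y = []
    for row in S4:
        acc = 0
        for j, coeff in enumerate(row):
            term = products[j][abs(coeff)]
            # The sign of the coefficient selects addition or subtraction.
            acc = acc + term if coeff > 0 else acc - term
        y.append(acc)
    return y


if __name__ == "__main__":
    b = [3, -7, 12, 5]
    reference = [sum(c * v for c, v in zip(row, b)) for row in S4]
    print(odd_outputs_8pt(b) == reference)
```

A CMM formulation goes further than this sketch by also sharing intermediate sums across different inputs, which is where the reduction in adder count reported for the MVPU comes from.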
The proposed Architecture-2 is an extension of Architecture-1, as shown in Figure 2, which incorporates coarse-grained reconfiguration with additional hardware units to maintain the same throughput rate for all the transform lengths. It uses an extra (N/2)-point architecture over the structure of Figure 1a 18,89,50], which is the second column of $S_4$. As shown in Figure 4a, the CSAU can be designed based on the SAU with several MUXes; it does not need any additional adder. Each CSAU is designed differently so that the last four MUXes have different combinations of coefficients corresponding to the desired configuration. Since the signs of the coefficients in each configuration may also differ, the OAU is implemented using add/sub units instead of adders or subtractors, as shown in Figure 4b. When the 16-point DCT structure is used for the computation of the four-point DCT, the intermediate results of Stage-2 in the three-stage adder tree in the OAU are directed to the output y(i), unlike in the computation of the 8- and 16-point DCTs. Therefore, MUX Unit-3, which involves an array of 2:1 MUXes, is used to select the desired outputs of Stage-2 or Stage-3. Similarly, a MUX Unit-3 consisting of 3:1 MUXes is used to select the desired outputs of Stages-2, 3, and 4 for the 32-point DCT structure. Since MUX Unit-3 is not required for the eight-point DCT structure, it is indicated by dashed lines in Figure 3.

Figure 5 shows the proposed architecture for the 2D integer DCT using N-point reconfigurable 1D integer DCTs. It consists of two sections corresponding to the two stages of the 2D DCT computation by row-column decomposition, based on the separable property of the 2D DCT. The input section of the proposed structure consists of two N-point 1D integer DCT units, which compute the 1D DCTs of all the N columns of the N × N input matrix.
The first N-point DCT unit (S1-A) computes the DCTs of the first N/2 columns of the N × N input matrix, while the second N-point DCT unit (S1-B) computes the DCTs of the last N/2 columns of the input matrix. A transposition unit consisting of four (N/2) × (N/2) buffers is employed to reorder the DCT coefficients of the different columns so that they are fed to the output section row-wise. The output section consists of two parallel 1D N-point DCT units.

In each cycle, two columns of the N × N input matrix x are fed to the input section of the proposed 2D DCT architecture. For example, if Column-0 and Column-(N/2) of the input matrix are fed to the DCT units S1-A and S1-B, respectively, in the current clock cycle, then Column-1 and Column-(N/2 + 1) are fed concurrently to S1-A and S1-B, respectively, in the next clock cycle. Hence, the entire N × N input matrix is fed to the proposed architecture in N/2 consecutive clock cycles. The first two rows of the 2D DCT are available from the output section after (N/2 + 2) clock cycles, where N/2 clock cycles are for the transposition and two clock cycles are for the input and output sections.

The proposed 2D DCT architecture can accept a new N × N input matrix every N/2 cycles, resulting in a throughput rate of 2N DCT coefficients per clock cycle. Therefore, the 2D DCT architecture using the 32-point Architecture-1 produces 8, 16, 32, and 64 coefficients per clock cycle for the 4-, 8-, 16-, and 32-point DCT computations, respectively. Furthermore, the 2D DCT architecture employing the 32-point Architecture-2 or -3 can produce 64 DCT coefficients per clock cycle irrespective of the transform size.

High-Throughput 2D Integer DCT Architecture
The conventional way to double the throughput is to double the hardware, that is, to use two sets of DCT units and transposition buffers. The proposed 2D DCT transposition technique, however, increases the throughput without increasing the transposition buffer size. The 32 × 32 transposition buffer occupies a 2.1 times larger area than the 32-point integer DCT unit; therefore, avoiding a second transposition buffer yields significant savings in the total silicon area.
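The storage and timing figures quoted for the two transposition schemes can be tabulated with a short Python sketch. The conventional figures (N samples per cycle and N transposition cycles) are inferred from the statement that the proposed scheme has double the throughput and half the transposition latency; the function and key names are assumptions introduced for this example.

```python
def conventional_scheme(n):
    """One N x N transposition buffer with N samples per cycle (inferred)."""
    return {
        "buffer_words": n * n,
        "samples_per_cycle": n,
        "transposition_cycles": n,
    }


def proposed_scheme(n):
    """Four (N/2) x (N/2) buffers with two columns accepted per cycle."""
    half = n // 2
    return {
        "buffer_words": 4 * half * half,
        "samples_per_cycle": 2 * n,
        "transposition_cycles": half,
        # N/2 cycles of transposition plus two cycles for the
        # input and output sections.
        "first_output_latency": half + 2,
    }


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, conventional_scheme(n), proposed_scheme(n))
```

For every transform length, the total number of stored words is identical in both schemes, which is why the throughput gain comes without additional buffer area.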

Implementation Results
The proposed 2D architectures using the 32-point 1D Architectures-1, -2, and -3 are coded in VHDL and synthesized by Synopsys Design Compiler using the TSMC 90 nm CMOS library. Table 3 lists the gate count, maximum usable frequency (MUF), samples/cycle, throughput (TPT), latency, power consumption, and energy per sample (EPS) of the proposed 2D architectures and the existing architectures [6,7,11–13]. The throughput per second per gate (TSG) (Ksamples/s/gate) is also listed in the last column of Table 3. Architecture-n in Table 3 denotes the 2D architecture based on conventional transposition with a 32 × 32 transposition buffer and two of the proposed 1D Architectures-n. Architecture-n † denotes the 2D architecture using four 1D Architectures-n and four (N/2) × (N/2) transposition buffers, as proposed in Section 4. The existing designs [6,7,13] and the proposed Architectures-1 and -1 † have different throughputs depending on the transform length. In Table 3, the TPT and TSG values of these designs were calculated from the maximum throughput, which occurs for the 32-point DCT computation, and these values are marked with *. Architectures-1 and -2 achieve an MUF of 400 MHz, whereas the MUF of Architecture-3 is found to be 380 MHz. Architecture-2 † offers the smallest EPS among the proposed and existing architectures that have a constant throughput. The throughput of Architecture-1 varies with the DCT size; Architectures-2 and -3, however, yield 32 samples in every cycle, resulting in 12.80 and 12.16 giga samples per second (GSPS), respectively, for all the DCT sizes. The 32-point Architecture-n † uses four 16 × 16 transposition buffers, which occupy the same area as the 32 × 32 transposition buffer used in the 32-point Architecture-n, but it has double the throughput, 64 samples per cycle, and half the transposition latency. Therefore, Architectures-2 † and -3 † achieve 25.60 and 24.33 GSPS, respectively, for all the transform sizes.
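The throughput figures follow directly from the samples per cycle and the MUF. The short Python sketch below reproduces the arithmetic with the rounded MUF values quoted in the text; the function name and the dictionary layout are assumptions introduced for illustration. With the rounded 380 MHz figure, the result for Architecture-3 † comes out as 24.32 GSPS, marginally below the quoted 24.33 GSPS, which presumably reflects an unrounded MUF.

```python
def throughput_gsps(samples_per_cycle, muf_mhz):
    """Throughput in giga samples per second from samples/cycle and MUF in MHz."""
    return samples_per_cycle * muf_mhz / 1000


# (samples per cycle, MUF in MHz) as quoted in the text.
DESIGNS = {
    "Architecture-2": (32, 400),
    "Architecture-3": (32, 380),
    "Architecture-2 (dagger)": (64, 400),
    "Architecture-3 (dagger)": (64, 380),
}

if __name__ == "__main__":
    for name, (samples, muf) in DESIGNS.items():
        print(f"{name}: {throughput_gsps(samples, muf):.2f} GSPS")
```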
The TSGs of the 2D architectures using the proposed Architecture-1 are the best for the computation of the 32-point DCT, but decrease for DCTs of lower sizes. The 2D architectures using 1D Architecture-3 offer better TSGs than those using 1D Architecture-2, but have a lower throughput. Architecture-n † provides 1.32 times better TSG than architecture-n on average for 32-point DCT computations of Architectures 1, 2, and 3. The proposed 2D architectures offer higher throughput, as well as lower latencies and, accordingly, higher TSGs than the existing 2D architectures.

Summary and Conclusions
In this paper, we propose area-time efficient architectures for 2D integer DCT to be used in HEVC. Three reconfigurable architectures that support the computation of integer DCT of different sizes are proposed. We propose a novel transposition buffer and a 2D DCT architecture, which provides double the throughput and half the transposition latency without increasing the size of the transposition buffer. Implementation results show that the proposed architectures provide more throughput per unit area than the existing integer DCT architectures for HEVC. The DCT is a basic transform that can be used for the implementation of other widely used transforms such as discrete Fourier transform (DFT), discrete Hartley transform (DHT), and discrete sine transform (DST).