Low-Complexity Loeffler DCT Approximations for Image and Video Coding

This paper introduced a matrix parametrization method based on the Loeffler discrete cosine transform (DCT) algorithm. As a result, a new class of eight-point DCT approximations was proposed, capable of unifying the mathematical formalism of several eight-point DCT approximations archived in the literature. Pareto-efficient DCT approximations are obtained through multicriteria optimization, where computational complexity, proximity, and coding performance are considered. Efficient approximations and their scaled 16- and 32-point versions are embedded into image and video encoders, including a JPEG-like codec and H.264/AVC and H.265/HEVC standards. Results are compared to the unmodified standard codecs. Efficient approximations are mapped and implemented on a Xilinx VLX240T FPGA and evaluated for area, speed, and power consumption.


Introduction
Discrete-time transforms have a major role in signal-processing theory and applications. In particular, tools such as the discrete Haar, Hadamard, and discrete Fourier transforms, and several discrete trigonometric transforms [1,16], have contributed to various image-processing techniques [15,40,52,60]. Among such transformations, the discrete cosine transform (DCT) of type II is widely regarded as a pivotal tool for image compression, coding, and analysis [16,27,56]. This is because the DCT closely approximates the Karhunen-Loève transform (KLT), which optimally decorrelates highly correlated stationary Markov-I signals [16].
Indeed, the recent literature reveals a significant number of works linked to DCT computation. Some noteworthy topics are: (i) cosine-sine decomposition to compute the eight-point DCT [57]; (ii) low-complexity-pruned eight-point DCT approximations for image encoding [22]; (iii) improved eight-point approximate DCT for image and video compression requiring only 14 additions [58]; (iv) HEVC multisize DCT hardware with constant throughput, supporting heterogeneous coding unities [26]; (v) approximation of feature pyramids in the DCT domain and its application to pedestrian detection [55]; (vi) performance analysis of DCT and discrete wavelet transform (DWT) audio watermarking based on singular value decomposition [42]; (vii) adaptive approximated DCT architectures for HEVC [53]; (viii) improved Canny edge detection algorithm based on DCT [75]; and (ix) DCT-inspired feature transform for image retrieval and reconstruction [68].
In fact, several current image- and video-coding schemes are based on the DCT [5], such as JPEG [67], MPEG-1 [62], H.264 [70], and HEVC [59]. In particular, the H.264 and HEVC codecs employ low-complexity discrete transforms based on the eight-point DCT. The eight-point DCT has also been applied to dedicated image-compression systems implemented in hardware.
Fast algorithms can minimize the number of arithmetic operations required for the DCT computation [16,56]. A number of fast algorithms have been proposed for the eight-point DCT [2,25,43]. The vast majority of DCT algorithms consist of the following factorization [16]:

C = P · M · A,

where A is the additive matrix that represents a set of butterfly operations, M is a multiplicative matrix, and P is a permutation matrix that simply rearranges the output components to natural order. Matrix A is often fixed [16] and given by the usual input butterfly structure:

[ 1  0  0  0  0  0  0  1 ]
[ 0  1  0  0  0  0  1  0 ]
[ 0  0  1  0  0  1  0  0 ]
[ 0  0  0  1  1  0  0  0 ]
[ 0  0  0  1 -1  0  0  0 ]
[ 0  0  1  0  0 -1  0  0 ]
[ 0  1  0  0  0  0 -1  0 ]
[ 1  0  0  0  0  0  0 -1 ]

Among the DCT fast algorithms, the Loeffler DCT achieves the theoretical lower bound of the multiplicative complexity of the 8-point DCT, which consists of 11 multiplications [47]. In this work, multiplications by irrational quantities are sought to be substituted with trivial multipliers, representable by simple bit-shifting operations (Sections 2.2 and 3.2). Thus, we expect that approximations based on the Loeffler DCT can be of low complexity. Therefore, the Loeffler DCT algorithm was selected as the starting point to devise new DCT approximations.
The Loeffler DCT employs a scaled DCT with the following transformation matrix: Such scaling eliminates one multiplication because √2 · c₄ = 1, where c₄ = cos(4π/16). Therefore, we can write the following expression: Matrix M carries all multiplications by irrational quantities required by the Loeffler fast algorithm. It can be further decomposed as: In order to achieve the minimum multiplicative complexity, the Loeffler fast algorithm uses fast rotations for the rotation blocks in matrices B and C [7,16]. Since each of the three fast rotations requires three multiplications, and there are two additional multiplications in matrix D, the Loeffler fast algorithm requires a total of 11 multiplications.
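Each rotation block can be realized with three multiplications instead of four by means of the classic fast-rotation identity; the following is a minimal sketch of that trick (function and variable names are ours, not from the paper):

```python
import numpy as np

def rotation_3mult(x, y, theta):
    # Plane rotation [X; Y] = [[c, -s], [s, c]] [x; y] computed with 3
    # multiplications instead of 4; the constants c, c + s, and s - c
    # can be precomputed once per angle.
    c, s = np.cos(theta), np.sin(theta)
    t = c * (x + y)          # multiplication 1
    X = t - (c + s) * y      # multiplication 2: equals c*x - s*y
    Y = t + (s - c) * x      # multiplication 3: equals s*x + c*y
    return X, Y
```

Three such rotations at 11 total multiplications is exactly the accounting in the paragraph above: 3 × 3 from the rotations plus 2 from matrix D.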
Hereafter, we adopt the following notation: where 0₄ is the 4 × 4 null matrix. Submatrices E_α and O_α compute the even- and odd-index DCT components, respectively.
Note that the expression in Matrix (18) implies that the inverse of the matrix M_α is a matrix with the same structure, whose coefficients are a function of the parameter vector α. For the matrix inversion to be well-defined, we must have: det(E_α) · det(O_α) ≠ 0. By explicitly computing det(E_α) and det(O_α), we obtain the following condition for matrix inversion:

Near Orthogonality
Some important and well-known DCT approximations are nonorthogonal [11,32]. Nevertheless, such transformations are nearly orthogonal [21,65]. Let A be a square matrix. Deviation from orthogonality can be quantified according to the deviation from diagonality [21] of A · A⊤, which is given by the following expression: where diag(·) returns a diagonal matrix with the diagonal elements of its argument and ‖·‖_F denotes the Frobenius norm [16]. Therefore, considering T_α, we obtain: Nonorthogonal transforms have been recognized as useful tools. The signed DCT (SDCT) is a particularly relevant DCT approximation [32] and its deviation from orthogonality is 0.20. We adopt such deviation as a reference value to discriminate nearly orthogonal matrices. Thus, for T_α, we obtain the following criterion for near orthogonality:
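As an illustration, the deviation measure can be computed as below. The exact normalization is assumed here (one common form in this literature: one minus the ratio of diagonal Frobenius energy to total Frobenius energy of A · A⊤), so treat the sketch as illustrative rather than the paper's definition:

```python
import numpy as np

def deviation_from_diagonality(A):
    # Assumed form: delta(A) = 1 - ||diag(A A^T)||_F^2 / ||A A^T||_F^2.
    # Zero iff A A^T is diagonal, i.e. iff the rows of A are orthogonal.
    M = A @ A.T
    D = np.diag(np.diag(M))
    return 1.0 - np.linalg.norm(D, 'fro') ** 2 / np.linalg.norm(M, 'fro') ** 2
```

With such a measure, the near-orthogonality criterion amounts to requiring the deviation of T_α not to exceed the SDCT reference value 0.20.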

Assessment Criteria
In this section, we describe the selected figures of merit for assessing the performance and complexity of a given DCT approximation. We selected the following performance metrics: (i) total error energy [20,65]; (ii) mean square error (MSE) [16,51]; (iii) unified coding gain [16,65]; and (iv) transform efficiency [16]. For computational complexity assessment, we adopted arithmetic operation counts as figures of merit.

Total Error Energy
Total error energy quantifies the error between matrices in terms of Euclidean distance. This measure is given by References [20,65]: ε(Ĉ) = π · ‖C − Ĉ‖²_F, where ‖·‖_F denotes the Frobenius norm.
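Assuming the Frobenius-norm form ε = π · ‖C − Ĉ‖²_F adopted in this literature, the measure can be sketched as follows (the DCT-matrix helper is ours):

```python
import numpy as np

def dct_matrix(N=8):
    # Exact orthonormal DCT-II matrix: C[k, n] = sqrt(2/N) cos(pi k (2n+1)/(2N)),
    # with the first row scaled down by sqrt(2).
    C = np.array([[np.cos(np.pi * k * (2 * n + 1) / (2 * N)) for n in range(N)]
                  for k in range(N)])
    C *= np.sqrt(2.0 / N)
    C[0, :] /= np.sqrt(2.0)
    return C

def total_error_energy(C_hat, C):
    # epsilon = pi * ||C - C_hat||_F^2 (assumed Frobenius-norm form)
    return np.pi * np.linalg.norm(C - C_hat, 'fro') ** 2
```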

Mean Square Error
The MSE of a given matrix approximation Ĉ_α is furnished by: MSE = (1/8) · trace[(C − Ĉ_α) · R_xx · (C − Ĉ_α)⊤], where R_xx represents the autocorrelation matrix of a Markov-I stationary process with correlation coefficient ρ, and trace(·) returns the sum of the main diagonal elements of its matrix argument. The (i, j)-th entry of R_xx is given by ρ^|i−j|, i, j = 0, 1, . . . , 7 [1,16]. The correlation coefficient is assumed to be equal to 0.95, which is representative of natural images [16].
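Under the stated AR(1) model, the MSE can be sketched as below; the 1/8 normalization is assumed from the standard form of this metric:

```python
import numpy as np

def mse(C_hat, C, rho=0.95, N=8):
    # MSE = (1/N) trace((C - C_hat) R_xx (C - C_hat)^T), where R_xx is the
    # AR(1) autocorrelation matrix with entries rho^|i-j|.
    idx = np.arange(N)
    Rxx = rho ** np.abs(idx[:, None] - idx[None, :])
    D = C - C_hat
    return np.trace(D @ Rxx @ D.T) / N
```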

Unified Transform Coding Gain
Unified transform coding gain provides a measure to quantify the compression capability of a given matrix [65]. It is a generalization of the usual transform coding gain as in Reference [16]. Let g_k and h_k be the k-th rows of Ĉ_α and C_α, respectively. Then, the unified transform coding gain is given by Reference [37]: C_g = 10 log₁₀ ∏_{k=1}^{8} (A_k · B_k)^{−1/8}, where A_k = sum[(h_k⊤ · h_k) ∘ R_xx], sum(·) returns the sum of the elements of its matrix argument, the operator ∘ denotes the element-wise matrix product, B_k = ‖g_k‖²₂, and ‖·‖₂ is the usual vector norm.
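A sketch of this computation follows, under the assumption that h_k is the k-th row of (Ĉ⁻¹)⊤ (the usual pairing for the unified coding gain; for orthonormal matrices it coincides with Ĉ itself). For the exact DCT at ρ = 0.95 this yields roughly 8.83 dB:

```python
import numpy as np

def unified_coding_gain(C_hat, rho=0.95):
    # C_g = 10 log10( prod_k (A_k B_k)^(-1/8) ), with
    #   A_k = sum( (h_k^T h_k) o R_xx ),  B_k = ||g_k||_2^2,
    # g_k the k-th row of C_hat and h_k the k-th row of (C_hat^{-1})^T
    # (assumed pairing).
    N = C_hat.shape[0]
    idx = np.arange(N)
    Rxx = rho ** np.abs(idx[:, None] - idx[None, :])
    H = np.linalg.inv(C_hat).T
    cg = 0.0
    for k in range(N):
        g, h = C_hat[k], H[k]
        A_k = np.sum(np.outer(h, h) * Rxx)   # variance of the k-th coefficient
        B_k = np.dot(g, g)
        cg += -10.0 / N * np.log10(A_k * B_k)
    return cg
```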

Transform Efficiency
Another measure for assessing coding performance is the transform efficiency [2]. Let R_XX = Ĉ_α · R_xx · Ĉ_α⊤ be the covariance matrix of the transformed signal X. The transform efficiency of Ĉ_α is given by [2]: η = 100 · (Σ_k |r_kk|) / (Σ_{i,j} |r_ij|), where r_ij denotes the (i, j)-th entry of R_XX.
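A direct computation of η from the definition above can be sketched as follows; for the exact DCT at ρ = 0.95, the literature reports a value of roughly 94%:

```python
import numpy as np

def transform_efficiency(C_hat, rho=0.95):
    # eta = 100 * sum_k |r_kk| / sum_{i,j} |r_ij|, with R_XX = C_hat R_xx C_hat^T
    N = C_hat.shape[0]
    idx = np.arange(N)
    Rxx = rho ** np.abs(idx[:, None] - idx[None, :])
    R = C_hat @ Rxx @ C_hat.T
    return 100.0 * np.sum(np.abs(np.diag(R))) / np.sum(np.abs(R))
```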

Computational Cost
Based on the Loeffler DCT factorization, a fast algorithm for T_α is obtained and its signal flow graph (SFG) is shown in Figure 1. Stages 1, 2, and 3 correspond to matrices A, M_α, and P, respectively. The computational cost of such an algorithm is closely linked to the selected parameter values α_k, k = 1, 2, . . . , 6. Because we aim at proposing multiplierless approximations, we restricted the parameter values α_k to the set P = {0, ±1/2, ±1, ±2}, i.e., α ∈ P⁶. The elements in P correspond to trivial multiplications that incur null multiplicative complexity. In fact, the elements in P represent only additions and minimal bit-shifting operations.
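The trivial multipliers in P can be realized with negation and single bit-shifts only; the following helper (names are ours, for illustration) makes the point explicit for integer operands:

```python
def trivial_mul(x, alpha):
    # Multiplication by alpha in {0, +-1/2, +-1, +-2} using only negation and
    # single bit-shifts (x assumed integer; alpha = +-1/2 uses an arithmetic
    # right shift, i.e. floor division by 2).
    table = {0: 0,
             1: x, -1: -x,
             2: x << 1, -2: -(x << 1),
             0.5: x >> 1, -0.5: -(x >> 1)}
    return table[alpha]
```

No element of P ever requires a hardware multiplier, which is exactly why α ∈ P⁶ yields null multiplicative complexity.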
The number of additions and bit-shifts can be evaluated by inspecting the discussed algorithm (Figure 1). Thus, we obtain the following expressions for the addition and bit-shifting counts, respectively: where 1_A(·) denotes the indicator function, i.e., 1_A(x) = 1 if x ∈ A and 0 otherwise.

Multicriteria Optimization and New Transforms
In this section, we introduce an optimization problem that aims at identifying optimal transformations derived from the proposed mapping (Matrix (13)). Considering the various performance and complexity metrics discussed in the previous section, we set up the following multicriteria optimization problem [3,24], subject to: (i) the existence of the inverse transformation, according to the condition established in Matrix (20); (ii) the entries of the inverse matrix being in P, to ensure both forward and inverse low-complexity transformations; and (iii) the property of orthogonality or near orthogonality, according to the criterion in Equation (24).
Being a multicriteria optimization problem, Problem (32) is based on the objective function set F = {ε(·), MSE(·), −C_g(·), −η(·), A(·), S(·)}. The problem in analysis is discrete and finite, since the objective functions take values on a finite set. However, the nonlinear, discrete nature of the problem renders it unsuitable for analytical methods. Therefore, we employed exhaustive search to solve it. The discussed multicriteria problem requires the identification of the Pareto-efficient solution set [24], which is given by:
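The Pareto-efficient set can be recovered with a plain dominance filter over the 7⁶ candidate parameter vectors; a generic sketch follows (the objective evaluation is a placeholder, not the paper's actual F):

```python
from itertools import product

def dominates(u, v):
    # u dominates v if u is no worse in every criterion (all minimized)
    # and strictly better in at least one.
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def pareto_efficient(candidates, objectives):
    # candidates: iterable of parameter vectors; objectives: maps a vector to a
    # tuple of criteria to be minimized. Returns the nondominated set.
    scored = [(c, objectives(c)) for c in candidates]
    return [c for c, f in scored
            if not any(dominates(g, f) for _, g in scored)]

# Exhaustive search over the 7^6 parameter vectors (the real objective set F
# would evaluate error energy, MSE, -Cg, -eta, A, and S for each candidate):
P = [0, 0.5, -0.5, 1, -1, 2, -2]
search_space = product(P, repeat=6)
```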

Efficient Solutions
The exhaustive search [24] returned six efficient parameter vectors, which are listed in Table 1. For ease of notation, we denote the low-complexity matrices and their associated approximations linked to the efficient solutions according to: Table 2 summarizes the performance metrics, arithmetic complexity, and orthonormality property of the obtained matrices Ĉ_i, i = 1, 2, . . . , 6. We included the DCT for reference as well. Note that all DCT approximations except Ĉ_3 are orthonormal.

Comparison
Several DCT approximations are encompassed by the proposed matrix formalism. Such transformations include: the SDCT [32], the approximation based on the round-off function proposed in Reference [19], and all the DCT approximations introduced in Reference [21]. For instance, the SDCT [32] is a particular transformation fully described by the proposed matrix mapping; it can be obtained by taking f(α_1), where α_1 = [1 1 1 1 1 1]⊤. Nevertheless, none of these approximations is part of the Pareto-efficient solution set induced by the discussed multicriteria optimization problem [24]. Therefore, we compare the obtained efficient solutions with a variety of state-of-the-art eight-point DCT approximations that cannot be described by the proposed Loeffler-based formalism. We selected the Walsh-Hadamard transform (WHT) and the Bouguezel-Ahmad-Swamy (BAS) series of approximations labeled BAS_1 [11], BAS_2 [10], BAS_3 [12], BAS_4 [13], BAS_5 [14] (for a = 1), BAS_6 [14] (for a = 0), BAS_7 [14] (for a = 1/2), and BAS_8 [15]. Table 3 shows the performance measures for these transforms. For completeness, we also show the unified coding gain and transform efficiency measures for the exact DCT [16].
Some approximations, such as the SDCT, were not explicitly included in our comparisons. Although they belong to the set of matrices generated by the Loeffler parametrization, they are not in the efficient solution set. Thus, we removed them from further analysis because they are not optimal solutions.
In order to compare all the above-mentioned transformations, we aimed at identifying the Pareto frontiers [24] in two-dimensional plots considering the performance figures of the obtained efficient solution as well as the WHT and BAS approximations. Thus, we devised scatter plots considering the arithmetic complexity and performance measures. The resulting plots are shown in Figure 2. Orthogonal transform approximations are marked with circles, and nonorthogonal approximations with cross signs. The dashed curves represent the Pareto frontier [24] for each selected pair of the measures.
Transformations located on the Pareto frontier are considered optimal, whereas points dominated by the frontier correspond to nonoptimal transformations. The bivariate plots in Figure 2a,b reveal that the obtained Loeffler-based DCT approximations are often situated in the optimality region prescribed by the Pareto frontier. The Loeffler approximations perform particularly well in terms of total error energy and MSE, which capture the matrix proximity to the exact DCT matrix in a Euclidean sense. Such approximations are particularly suitable for problems that require computational proximity to the exact transformation, as in the case of detection and estimation problems [38,39]. Regarding coding performance, Figure 2c,d shows that transformations Ĉ_1, Ĉ_3, Ĉ_6, and BAS_6 are situated on the Pareto frontier, being optimal in this sense. These approximations are adequate for data compression and decorrelation [16].

Image Compression
We implemented the JPEG-like compression experiment described in References [11,12,14,32] and submitted the standard Elaine image to processing at a high compression rate. For quantitative assessment, we adopted the structural similarity (SSIM) index [69] and the peak signal-to-noise ratio (PSNR) [5,27]. Figure 3 shows the reconstructed Elaine image according to the JPEG-like compression considering the following transformations: DCT, Ĉ_1, Ĉ_2, Ĉ_3, Ĉ_4, and BAS_6. We employed fixed-rate compression and retained only five coefficients, which led to a 92.1875% compression rate.
Despite their very low computational complexity, the approximations could furnish images with quality comparable to the results obtained from the exact DCT. In particular, approximations Ĉ_1 and Ĉ_2 offered a good trade-off, since they required only 14-16 additions and were capable of providing competitive image quality at smaller hardware and power requirements (Section 6). Figure 4 shows the PSNR and SSIM for different numbers of retained coefficients. Considering PSNR measurements, the difference between the measurements associated with BAS_6 and with the efficient approximations was less than 1 dB. Similar behavior was observed when SSIM measurements were considered.
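The fixed-rate experiment can be sketched as below: transform an 8×8 block, keep only the first r coefficients in a zigzag scan, and invert. The zigzag tie-breaking rule used here is one simple choice, not necessarily the paper's; with r = 5, the retained fraction gives the (64 − 5)/64 = 92.1875% compression rate quoted above:

```python
import numpy as np

def compress_block(block, T, r):
    # Forward transform, keep the first r coefficients in zigzag order,
    # then reconstruct with the inverse transform (fixed-rate scheme).
    zigzag = sorted(((i, j) for i in range(8) for j in range(8)),
                    key=lambda p: (p[0] + p[1],
                                   p[0] if (p[0] + p[1]) % 2 else p[1]))
    F = T @ block @ T.T
    mask = np.zeros((8, 8))
    for i, j in zigzag[:r]:
        mask[i, j] = 1.0
    Ti = np.linalg.inv(T)
    return Ti @ (F * mask) @ Ti.T
```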

Experiments with the H.264/AVC Standard
To assess the Loeffler-based approximations in the context of video coding, we embedded them in the x264 [72] software library for encoding video streams into the H.264/AVC standard [61]. The original eight-point transform employed in H.264/AVC is a DCT-based integer approximation given by Reference [28]:

[  8   8   8   8   8   8   8   8 ]
[ 12  10   6   3  -3  -6 -10 -12 ]
[  8   4  -4  -8  -8  -4   4   8 ]
[ 10  -3 -12  -6   6  12   3 -10 ]
[  8  -8  -8   8   8  -8  -8   8 ]
[  6 -12   3  10 -10  -3  12  -6 ]
[  4  -8   8  -4  -4   8  -8   4 ]
[  3  -6  10 -12  12 -10   6  -3 ]

The fast algorithm for the above transformation requires 32 additions and 14 bit-shifting operations [28].

Figure 5: Average rate-distortion curves of the modified H.264/AVC software for the selected 11 CIF videos from Reference [73], PSNR (dB) in terms of bitrate.
We encoded 11 common intermediate format (CIF) videos with 300 frames from a public video database [73] using the standard and modified codecs. In our simulation, we employed default settings, and the resulting video quality was assessed by means of the average PSNR of the chrominance and luminance channels considering all reconstructed frames. Psychovisual optimization was disabled in order to obtain valid PSNR values, and the quantization step was unaltered.
We computed the Bjøntegaard delta rate (BD-Rate) and the Bjøntegaard delta PSNR (BD-PSNR) [6] for the modified codecs compared to the native H.264 DCT approximation. BD-Rate and BD-PSNR were automatically furnished by the x264 [72] software library. A comprehensive review of Bjøntegaard metrics and their specific mathematical formulations can be found in Reference [31]. For that, we followed the procedure specified in References [6,31,66]. We adopted the same testing points as determined in Reference [9], with fixed quantization parameter (QP) in {22, 27, 32, 37}. Following References [31,66], we employed cubic spline interpolation between the testing points for better visualization. Figure 5 shows the resulting average rate-distortion curves for the selected videos, where H.264 represents the integer DCT used in the x264 software library shown in Matrix (34). We note the superior performance of the Ĉ_3 and Ĉ_4 approximations compared to the BAS_2 and BAS_6 transforms. This performance gap is more evident at small bitrates. As the bitrate increases, the quality performance of Ĉ_3, Ĉ_4, BAS_2, and BAS_6 approaches that of the native DCT implementation. Table 4 shows the BD-Rate and BD-PSNR measures for the 11 CIF video sequences selected from Reference [73].
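The BD-PSNR computation follows the usual Bjøntegaard recipe: fit a cubic polynomial to each PSNR-versus-log(rate) curve and average the gap between the two fits over the overlapping bitrate interval. A sketch with four rate-distortion points, matching the four QP values above (the sample values in the test are illustrative, not the paper's measurements):

```python
import numpy as np

def bd_psnr(rate_a, psnr_a, rate_b, psnr_b):
    # Bjoentegaard delta PSNR: cubic fit in PSNR vs log10(rate), then the
    # average vertical gap over the overlapping log-rate interval.
    la = np.log10(np.asarray(rate_a, dtype=float))
    lb = np.log10(np.asarray(rate_b, dtype=float))
    pa = np.polyfit(la, psnr_a, 3)
    pb = np.polyfit(lb, psnr_b, 3)
    lo, hi = max(la.min(), lb.min()), min(la.max(), lb.max())
    ia = np.polyval(np.polyint(pa), [lo, hi])   # definite integral of fit A
    ib = np.polyval(np.polyint(pb), [lo, hi])   # definite integral of fit B
    return ((ib[1] - ib[0]) - (ia[1] - ia[0])) / (hi - lo)
```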
Note that the Loeffler approximations Ĉ_1 and Ĉ_2, again, show better performance in terms of BD-PSNR measures than the competing transforms. We attribute the difference in quality between Ĉ_3 and Ĉ_4 in relation to Ĉ_1 and Ĉ_2 to the associated complexity discrepancy depicted in Table 2. The reduction in the number of additions and bit-shifting operations comes at the expense of performance.

Experiments with the H.265/HEVC Standard
We also embedded the Loeffler-based approximations into the HM-16.3 H.265/HEVC reference software [35]. The H.265/HEVC standard employs the following integer transform:

[ 64  64  64  64  64  64  64  64 ]
[ 89  75  50  18 -18 -50 -75 -89 ]
[ 83  36 -36 -83 -83 -36  36  83 ]
[ 75 -18 -89 -50  50  89  18 -75 ]
[ 64 -64 -64  64  64 -64 -64  64 ]
[ 50 -89  18  75 -75 -18  89 -50 ]
[ 36 -83  83 -36 -36  83 -83  36 ]
[ 18 -50  75 -89  89 -75  50 -18 ]

which demands 72 additions and 62 bit-shifting operations when considering the factorization method based on odd and even decomposition described in Reference [16]. Besides a particular eight-point integer transformation, the HEVC standard also employs four-, 16-, and 32-point transforms [59]. To obtain a modified version of the H.265/HEVC codec, we derived 16- and 32-point approximations based on the scalable recursive method proposed in Reference [36]. The eight-point Loeffler-based efficient solutions were employed as fundamental building blocks for scaling up. Since the four-point DCT matrix only possesses integer coefficients, it was employed unaltered. The additive complexity of the scaled 16- and 32-point approximations is 2 · A(α) + 16 and 4 · A(α) + 64, respectively [36]. For the bit-shifting count, we have 2 · S(α) and 4 · S(α), respectively.
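The recursive scaling step can be sketched as a butterfly front end feeding two copies of the N-point transform; this structural sketch (assumed form of the scalable method, with the final output permutation omitted) adds 2N additions per level, which matches the stated cost A(2N) = 2 · A(N) + 2N:

```python
import numpy as np

def scale_up(T):
    # Build a 2N-point transform from an N-point one: a butterfly stage
    # forms sums x_front + reverse(x_back) and differences
    # x_front - reverse(x_back), each fed to a copy of T. The butterfly
    # contributes 2N additions, so A(2N) = 2 A(N) + 2N.
    N = T.shape[0]
    J = np.fliplr(np.eye(N))                     # counter-identity (reversal)
    butterfly = np.block([[np.eye(N), J],
                          [np.eye(N), -J]])
    core = np.block([[T, np.zeros((N, N))],
                     [np.zeros((N, N)), T]])
    return core @ butterfly
```

With N = 8 this reproduces the additive counts above: 2 · A(α) + 16 for the 16-point version and, applying the step twice, 4 · A(α) + 64 for the 32-point version.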
In order to perform image-quality assessment, we encoded the first 100 frames of one standard sequence from each of the A to F video classes defined in the Common Test Conditions and Software Reference Configurations (CTCSRC) document [8]. The selected videos and their attributes are made explicit in the first two columns of Table 6. Classes A and B present high-resolution content for broadcasting and video-on-demand services; Classes C and D are lower-resolution contents for the same applications; Class E reflects typical examples of video-conferencing data; and Class F provides computer-generated image sequences [54]. All encoding parameters, including QP values, were set according to the CTCSRC document for the Main profile and the All-Intra (AI), Random Access (RA), Low Delay B (LD-B), and Low Delay P (LD-P) configurations. In the AI configuration, each frame of the input video sequence is independently coded. The RA and LD (B and P) coding modes differ mainly in the order of coding and outputting: in the former, the coding order and output order of the pictures (frames) may differ, whereas equal order is required for the latter (Reference [71], p. 94). The RA configuration relates to broadcasting and streaming use, whereas LD (B and P) represent conversational applications.
It is worth mentioning that Class A test sequences are not used for the LD-B and LD-P coding tests, whereas Class E videos are not included in the RA test cases because of the nature of the content they represent (Reference [71], p. 93).
As figures of merit, we computed the BD-Rate and BD-PSNR for the modified versions of the codec. Figures 7-10 depict the rate-distortion curves, and Table 6 presents the Bjøntegaard metrics for the considered sets of eight- to 32-point approximations for the AI, RA, LD-B, and LD-P modes. BD-Rate and BD-PSNR are also automatically furnished by the HM-16.3 H.265/HEVC reference software [35]. We employed cubic spline interpolation based on four testing points in Figures 7-10, as specified in References [31,66]. The curves denoted by HEVC represent the native integer approximations for the eight-, 16-, and 32-point DCT in the HM-16.3 H.265/HEVC software [35]. The four testing points were determined by the HEVC specification as in Reference [9], with fixed QP ∈ {22, 27, 32, 37}. One can note that there is no significant image-quality degradation when considering the approximate transforms. Furthermore, in all cases, Ĉ_4 performs better, with a degradation of no more than 0.52 dB.
The Loeffler-based approximations could present competitive image quality while possessing low computational complexity. Table 5 shows the arithmetic costs of the original and modified extended versions, and of the H.265/HEVC native transform. As a qualitative result, Figure 11 depicts the first frame of the 'BasketballDrive' video sequence resulting from the unmodified H.265/HEVC [35] and from the modified codec according to the discussed approximations and their scaled versions at QP = 32. We also display the related frames according to the BAS_2 and BAS_6 transforms.

FPGA Implementation
To compare the hardware-resource consumption of the discussed approximations, they were initially modeled and tested in Matlab Simulink and then physically realized on FPGA. The FPGA used was a Xilinx Virtex-6 XC6VLX240T installed on a Xilinx ML605 prototyping board. FPGA realization was tested with 100,000 random eight-point input test vectors using hardware cosimulation. Test vectors were generated from within the Matlab environment and routed to the physical FPGA device using a JTAG-based hardware cosimulation. Then, measured data from the FPGA was routed back to Matlab memory space.
We selected the BAS_6 approximation for comparison with the efficient approximations because its performance metrics lie on the Pareto frontier of the plots in Figure 2. The associated FPGA implementations were evaluated for hardware complexity and real-time performance using metrics such as configurable logic block (CLB) and flip-flop (FF) counts, critical path delay (T_cpd) in ns, and maximum operating frequency (F_max) in MHz. Values were obtained from the Xilinx FPGA synthesis and place-route tools by accessing the xflow.results report file. In addition, static (Q_p) and dynamic power (D_p, in mW/GHz) consumption were estimated using the Xilinx XPower Analyzer. We also report the area-time complexity (AT) and area-time-squared complexity (AT²). Circuit area (A) was estimated using the CLB count as a metric, and time was derived from T_cpd. Table 7 lists the FPGA hardware resource and power consumption for each algorithm.
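The derived figures of merit follow directly from the reported quantities; a small helper illustrates the arithmetic (the input values in the test are hypothetical, not the paper's measurements):

```python
def derived_metrics(clb, t_cpd_ns):
    # Area-time figures from the synthesis report: area A from the CLB count,
    # time from the critical path delay T_cpd; F_max is the reciprocal of
    # T_cpd (ns -> MHz conversion factor of 1e3).
    f_max_mhz = 1e3 / t_cpd_ns
    at = clb * t_cpd_ns          # area-time complexity
    at2 = clb * t_cpd_ns ** 2    # area-time-squared complexity
    return f_max_mhz, at, at2
```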
Considering the circuit complexity of the discussed approximations and BAS_6 [15], as measured from the CLB count in the FPGA synthesis report, it can be seen from Table 7 that T_1 is the smallest option in terms of circuit area. When considering maximum speed, matrix T_5 showed the best performance on the Virtex-6 XC6VLX240T device. Alternatively, if we consider the normalized dynamic power consumption, the best performance was again measured for T_1.

Conclusions and Final Remarks
In this paper, we introduced a mathematical framework for the design of eight-point DCT approximations. The Loeffler algorithm was parameterized and a class of matrices was derived. Matrices with good properties, such as low complexity, invertibility, orthogonality or near orthogonality, and closeness to the exact DCT, were identified by means of a multicriteria optimization problem aiming at Pareto efficiency. The DCT approximations in this class were assessed, and the optimal transformations were selected.
The obtained transforms were assessed in terms of computational complexity, proximity, and coding measures. We demonstrated that the proposed method is a unifying structure for several approximations scattered in the literature, including the well-known approximation by Lengwehasatit-Ortega [44]; they share the same matrix factorization scheme. Additionally, because all discussed approximations have a common matrix expansion, the SFGs of their associated fast algorithms are identical except for the parameter values. Thus, the resulting structure paves the way to performance-selectable transforms according to the choice of parameters. Extensions for larger block lengths can be achieved by considering scalable approaches such as the one suggested in Reference [36]. Alternatively, one could employ direct parameterization of the 16-point Loeffler DCT algorithm described in Reference [47].

Appendix A. Abbreviations
The following abbreviations are used in this manuscript: