1. Introduction
Discrete-time transforms play a major role in signal-processing theory and application. In particular, tools such as the discrete Haar, Hadamard, and discrete Fourier transforms, and several discrete trigonometrical transforms [1,2] have contributed to various image-processing techniques [3,4,5,6]. Among such transformations, the discrete cosine transform (DCT) of type II is widely regarded as a pivotal tool for image compression, coding, and analysis [2,7,8]. This is because the DCT closely approximates the Karhunen–Loève transform (KLT), which can optimally decorrelate highly correlated stationary Markov-I signals [2].
Indeed, the recent literature reveals a significant number of works linked to DCT computation. Some noteworthy topics are: (i) cosine–sine decomposition to compute the 8-point DCT [9]; (ii) low-complexity pruned 8-point DCT approximations for image encoding [10]; (iii) improved 8-point approximate DCT for image and video compression requiring only 14 additions [11]; (iv) HEVC multisize DCT hardware with constant throughput, supporting heterogeneous coding units [12]; (v) approximation of feature pyramids in the DCT domain and its application to pedestrian detection [13]; (vi) performance analysis of DCT and discrete wavelet transform (DWT) audio watermarking based on singular value decomposition [14]; (vii) adaptive approximated DCT architectures for HEVC [15]; (viii) improved Canny edge detection algorithm based on DCT [16]; and (ix) DCT-inspired feature transform for image retrieval and reconstruction [17]. In fact, several current image- and video-coding schemes are based on the DCT [18], such as JPEG [19], MPEG-1 [20], H.264 [21], and HEVC [22]. In particular, the H.264 and HEVC codecs employ low-complexity discrete transforms based on the 8-point DCT. The 8-point DCT has also been applied to dedicated image-compression systems implemented in large web servers with promising results [23]. As a consequence, several algorithms for the 8-point DCT have been proposed, such as: Lee DCT factorization [24], Arai DCT scheme [25], Feig–Winograd algorithm [26], and the Loeffler DCT algorithm [27]. Among these methods, the Loeffler DCT algorithm [27] has the distinction of achieving the theoretical lower bound for DCT multiplicative complexity [28,29].
Because the computational complexity lower bounds of the DCT have been achieved [28], the research community resorted to approximation techniques to further reduce the cost of DCT calculation. Although not capable of providing exact computation, approximate transforms can furnish very close computational results at significantly smaller computational cost. Early approximations for the DCT were introduced by Haweel [30]. Since then, several DCT approximations have been proposed. In Reference [31], Lengwehasatit and Ortega introduced a scalable approximate DCT that can be regarded as a benchmark approximation [5,31,32,33,34,35,36,37,38,39,40]. Aiming at image coding for data compression, a series of approximations has been proposed by Bouguezel–Ahmad–Swamy (BAS) [5,34,35,36,37,41]. Such approximations offer very low complexity and good coding performance [2,39].
The methods for deriving DCT approximations include: (i) application of simple functions, such as signum, rounding-off, truncation, ceiling, and floor, to approximate the elements of the exact DCT matrix [30,39]; (ii) scaling and rounding-off [2,32,39,42,43,44,45,46,47]; (iii) brute-force computation over a reduced search space [38,39]; (iv) inspection [33,34,35,41]; (v) single-variable matrix parametrization of existing approximations [37]; (vi) pruning techniques [48]; and (vii) derivations based on other low-complexity matrices [5]. The above-mentioned methods are capable of supplying only a single or very few approximations. In fact, a systematized approach for obtaining a large number of matrix approximations and a unifying scheme is lacking.
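As a rough illustration of methods (i) and (ii), the Python sketch below builds the exact orthonormal 8-point DCT-II matrix and derives two low-complexity integer matrices from it, one by applying the signum function entry-wise and one by scaling and rounding. The function names and the scaling factor `alpha = 2` are illustrative assumptions, and the normalization steps that turn such integer matrices into usable approximations are omitted.

```python
import math

def dct_matrix(n=8):
    """Exact orthonormal DCT-II matrix, returned as a list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows

def signum(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def sign_approximation(c):
    """Method (i): replace every entry of the exact matrix by its sign."""
    return [[signum(v) for v in row] for row in c]

def scaled_round_approximation(c, alpha=2.0):
    """Method (ii): scale the exact matrix by alpha (assumed value) and round."""
    return [[int(round(alpha * v)) for v in row] for row in c]

C = dct_matrix()
S = sign_approximation(C)          # entries in {-1, 1}
R = scaled_round_approximation(C)  # entries in {-1, 0, 1} for alpha = 2

for row in R:
    print(row)
```

Both integer matrices can be applied with additions only, which is the source of their low arithmetic cost; how close they stay to the exact DCT is a separate question that requires the assessment measures discussed later.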
The goal of this paper is three-fold. First, we aim at unifying the matrix formalism of several 8-point DCT approximations archived in the literature. To that end, we consider the 8-point Loeffler algorithm as a general structure equipped with a parametrization of the multiplicands. This approach allows the definition of a matrix subspace from which a large number of approximations can be derived. Second, we propose an optimization problem over the introduced matrix subspace in order to identify the best approximations according to several well-known figures of merit. This discrimination is important from the application point of view: it allows users to select the transform that best fits their application by balancing performance and complexity. The optimal approximations found are subjected to mathematical assessment and embedded into image- and video-encoding schemes, including the H.264/AVC and H.265/HEVC standards. Third, we introduce a hardware architecture based on the optimal approximations, realized in a field-programmable gate array (FPGA). Although a video or image codec comprises several subsystems, this work concentrates solely on the discrete transform subsystem.
The paper unfolds as follows. Section 2 introduces a novel DCT parametrization based on the Loeffler DCT algorithm; we provide the mathematical background and examine matrix properties such as invertibility, orthogonality, and orthonormalization. Section 3 reviews the criteria employed for identifying and assessing DCT approximations, such as proximity and coding measures, and computational complexity. In Section 4, we propose a multicriteria optimization problem aimed at deriving optimal approximations subject to Pareto efficiency; the obtained transforms are comprehensively assessed and compared with state-of-the-art competitors. Section 5 reports the results of embedding the obtained transforms into a JPEG-like encoder, as well as in the H.264/AVC and H.265/HEVC video standards. In Section 6, an FPGA hardware implementation of the optimal transformations is detailed, and the usual FPGA implementation metrics are reported. Section 7 presents our final remarks.
4. Multicriteria Optimization and New Transforms
In this section, we introduce an optimization problem that aims at identifying optimal transformations derived from the proposed mapping (Matrix (13)). Considering the various performance and complexity metrics discussed in the previous section, we set up the following multicriteria optimization problem [57,58]:
subject to:
- (i) the existence of an inverse transformation, according to the condition established in Matrix (20);
- (ii) the entries of the inverse matrix must be in ; to ensure both forward and inverse low-complexity transformations;
- (iii) the property of orthogonality or near-orthogonality according to the criterion in Equation (24).
The unified coding gain and the transform efficiency enter the problem in negative form to comply with the minimization requirement.
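Constraints (i) to (iii) can be checked mechanically for any candidate low-complexity matrix. The Python sketch below is illustrative only. The set of admissible inverse entries (`ALLOWED`) is a hypothetical placeholder, and the near-orthogonality measure (how far the Gram matrix of the candidate is from a diagonal matrix, compared against a tolerance) is an assumption standing in for the conditions of Matrix (20) and Equation (24). The 2×2 matrices are toy inputs, not 8-point transforms.

```python
from fractions import Fraction

def inverse(matrix):
    """Exact Gauss-Jordan inverse over the rationals; returns None if singular."""
    n = len(matrix)
    # Augment [T | I] with exact rational entries.
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            return None  # no pivot in this column: the matrix is singular
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def deviation_from_diagonality(matrix):
    """1 - ||diag(G)||^2 / ||G||^2 for G = T T^T (assumed measure).

    The value is zero exactly when the rows are mutually orthogonal.
    A nonzero input matrix is assumed.
    """
    n = len(matrix)
    gram = [[sum(a * b for a, b in zip(matrix[i], matrix[j])) for j in range(n)]
            for i in range(n)]
    total = sum(v * v for row in gram for v in row)
    diag = sum(gram[i][i] ** 2 for i in range(n))
    return 1 - Fraction(diag) / Fraction(total)

# Hypothetical set of admissible inverse entries (placeholder, not from the paper).
ALLOWED = {Fraction(k, 2) for k in (-4, -2, -1, 0, 1, 2, 4)}

def is_feasible(matrix, allowed=ALLOWED, tol=Fraction(0)):
    """Check invertibility, inverse entries, and (near-)orthogonality in turn."""
    inv = inverse(matrix)
    if inv is None:
        return False
    if any(v not in allowed for row in inv for v in row):
        return False
    return deviation_from_diagonality(matrix) <= tol

print(is_feasible([[1, 1], [1, -1]]))                        # orthogonal rows
print(is_feasible([[1, 1], [1, 1]]))                         # singular
print(is_feasible([[1, 1], [0, 1]], tol=Fraction(1, 10)))    # rows not orthogonal enough
```

Exact rational arithmetic is used so that membership of the inverse entries in a small set of dyadic values can be tested without floating-point tolerance.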
Being a multicriteria optimization problem, Problem (32) is based on objective function set
. The problem under analysis is discrete and finite, since the objective function can take only a finite number of values. However, the nonlinear, discrete nature of the problem renders it unsuitable for analytical methods. Therefore, we employed exhaustive search methods to solve it. The discussed multicriteria problem requires the identification of the Pareto-efficient solution set [57], which is given by:
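A minimal Python sketch of a Pareto-efficiency filter may help fix ideas. It assumes that every objective is to be minimized (so quantities to be maximized enter with a negative sign) and uses the usual dominance relation: one candidate dominates another when it is no worse in every objective and strictly better in at least one. The function names and the three candidate vectors are hypothetical illustrations, not values from this paper.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_efficient(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [name for name, f in candidates.items()
            if not any(dominates(g, f)
                       for other, g in candidates.items() if other != name)]

# Hypothetical objective vectors:
# (error energy, MSE, -coding gain, -efficiency, additions)
candidates = {
    "A": (1.0, 0.02, -8.0, -90.0, 24),
    "B": (2.0, 0.03, -7.5, -85.0, 24),  # worse than A everywhere except additions (tie)
    "C": (5.0, 0.05, -7.0, -80.0, 14),  # least accurate, but cheapest
}

print(pareto_efficient(candidates))
```

In this toy data set, candidate B is discarded because A matches or beats it in every objective, whereas C survives because its lower addition count is not matched by any other candidate. The pairwise comparison is quadratic in the number of candidates, which is acceptable for a finite search space enumerated exhaustively.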
4.1. Efficient Solutions
The exhaustive search [57] returned six efficient parameter vectors, which are listed in Table 1. For ease of notation, we denote the low-complexity matrices and their associated approximations linked to efficient solutions according to:
and
, respectively.
Table 2 summarizes the performance metrics, arithmetic complexity, and orthonormality property of obtained matrices
,
. We included the DCT for reference as well. Note that all DCT approximations except those by
are orthonormal.
4.2. Comparison
Several DCT approximations are encompassed by the proposed matrix formalism. Such transformations include: the SDCT [30], the approximation based on the round-off function proposed in Reference [38], and all the DCT approximations introduced in Reference [39]. For instance, the SDCT [30] is a particular transformation fully described by the proposed matrix mapping. In fact, the SDCT can be obtained by taking
, where
. Nevertheless, none of these approximations is part of the Pareto-efficient solution set induced by the discussed multicriteria optimization problem [57]. Therefore, we compare the obtained efficient solutions with a variety of state-of-the-art 8-point DCT approximations that cannot be described by the proposed Loeffler-based formalism. We selected the Walsh–Hadamard transform (WHT) and the Bouguezel–Ahmad–Swamy (BAS) series of approximations labeled
[
34],
[
35],
[
41],
[
36],
[
37] (for
),
[
37] (for
),
[
37] (for
), and
[
5].
Table 3 shows the performance measures for these transforms. For completeness, we also show the unified coding gain and the transform efficiency measures for the exact DCT [2].
Some approximations, such as the SDCT, were not explicitly included in our comparisons. Although they are in the set of matrices generated by the Loeffler parametrization, they are not in the efficient solution set. Thus, we excluded them from further analysis because they are not optimal solutions.
In order to compare all the above-mentioned transformations, we aimed at identifying the Pareto frontiers [57] in two-dimensional plots considering the performance figures of the obtained efficient solutions as well as the WHT and BAS approximations. Thus, we devised scatter plots considering the arithmetic complexity and performance measures. The resulting plots are shown in Figure 2. Orthogonal transform approximations are marked with circles, and nonorthogonal approximations with cross signs. The dashed curves represent the Pareto frontier [57] for each selected pair of measures. Transformations located on the Pareto frontier are considered optimal, whereas the points dominated by the frontier correspond to nonoptimal transformations. The bivariate plots in Figure 2a,b reveal that the obtained Loeffler-based DCT approximations are often situated at the optimality site prescribed by the Pareto frontier. The Loeffler approximations perform particularly well in terms of total error energy and the MSE, which capture the matrix proximity to the exact DCT matrix in a Euclidean sense. Such approximations are particularly suitable for problems that require computational proximity to the exact transformation, as in the case of detection and estimation problems [59,60]. Regarding coding performance, Figure 2c,d shows that transformations
,
,
, and
are situated on the Pareto frontier, being optimal in this sense. These approximations are adequate for data compression and decorrelation [2].
6. FPGA Implementation
To compare the hardware-resource consumption of the discussed approximations, they were initially modeled and tested in Matlab Simulink and then physically realized on FPGA. The FPGA used was a Xilinx Virtex-6 XC6VLX240T installed on a Xilinx ML605 prototyping board. FPGA realization was tested with 100,000 random 8-point input test vectors using hardware cosimulation. Test vectors were generated from within the Matlab environment and routed to the physical FPGA device using a JTAG-based hardware cosimulation. Then, measured data from the FPGA was routed back to Matlab memory space.
We selected the
approximation for comparison with the efficient approximations because its performance metrics lie on the Pareto frontier of the plots in Figure 2. The associated FPGA implementations were evaluated for hardware complexity and real-time performance using metrics such as configurable logic blocks (CLB) and flip-flop (FF) count, critical path delay (
) in ns, and maximum operating frequency (
) in MHz. Values were obtained from the Xilinx FPGA synthesis and place-and-route tools by accessing the xflow.results report file. In addition, static (
) and dynamic power (
in
) consumption were estimated using the Xilinx XPower Analyzer. We also reported area-time complexity (
) and area-time-squared complexity (
). Circuit area (
A) was estimated using the CLB count as a metric, and time was derived from
.
Table 7 lists the FPGA hardware resource and power consumption for each algorithm.
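The derived figures of merit follow from simple arithmetic on the synthesis report values. The Python sketch below assumes that area is the CLB count, that time is the critical path delay in ns, and that the maximum operating frequency is the reciprocal of that delay. The function name and the input numbers are hypothetical and are not taken from Table 7.

```python
def fpga_figures(clb_count, t_cpd_ns):
    """Derive frequency and area-time metrics from CLB count and critical path delay.

    Assumptions (not confirmed by the text): area A is the CLB count,
    time T is the critical path delay in ns, and F_max = 1 / T.
    """
    f_max_mhz = 1000.0 / t_cpd_ns            # 1 / ns expressed in MHz
    area_time = clb_count * t_cpd_ns         # A * T
    area_time_sq = clb_count * t_cpd_ns ** 2 # A * T^2
    return {"f_max_mhz": f_max_mhz, "AT": area_time, "AT2": area_time_sq}

# Hypothetical design: 100 CLBs with a 2.0 ns critical path.
print(fpga_figures(100, 2.0))
```

Because the area-time-squared product weights delay more heavily than area, it favors faster designs over smaller ones, so the two metrics can rank the same set of architectures differently.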
Considering the circuit complexity of the discussed approximations and
[5], as measured from the CLB count in the FPGA synthesis report, it can be seen from Table 7 that
is the smallest option in terms of circuit area. When considering maximum speed, matrix
showed the best performance on the Virtex-6 XC6VLX240T device. Alternatively, if we consider the normalized dynamic power consumption, the best performance was again measured from
.
7. Conclusions and Final Remarks
In this paper, we introduced a mathematical framework for the design of 8-point DCT approximations. The Loeffler algorithm was parameterized and a class of matrices was derived. Matrices with good properties, such as low complexity, invertibility, orthogonality or near-orthogonality, and closeness to the exact DCT, were selected according to a multicriteria optimization problem aiming at Pareto efficiency. The DCT approximations in this class were assessed, and the optimal transformations were identified.
The obtained transforms were assessed in terms of computational complexity, proximity, and coding measures. The derived efficient solutions constitute DCT approximations that exhibit good properties when compared to existing DCT approximations. At the same time, the approximations require extremely low computational cost: only additions and bit-shifting operations are required for their evaluation.
We demonstrated that the proposed method is a unifying structure for several approximations scattered in the literature, including the well-known approximation by Lengwehasatit–Ortega [31], all of which share the same matrix factorization scheme. Additionally, because all discussed approximations have a common matrix expansion, the signal-flow graphs (SFGs) of their associated fast algorithms are identical except for the parameter values. Thus, the resulting structure paves the way to performance-selectable transforms according to the choice of parameters.
Moreover, the approximations were assessed and compared in terms of image and video coding. For images, a JPEG-like encoding simulation was considered, and for videos, the approximations were embedded in the H.264/AVC and H.265/HEVC video-coding standards. The approximations exhibited good coding performance compared to exact DCT-based JPEG compression, and the obtained video frame quality was very close to the results shown by the H.264/AVC and H.265/HEVC standards.
Extensions for larger blocklengths can be achieved by considering scalable approaches such as the one suggested in Reference [71]. Alternatively, one could employ direct parameterization of the 16-point Loeffler DCT algorithm described in Reference [27].