1. Introduction
The discrete cosine transform (DCT) has long been widely used to address a variety of data processing tasks [1,2,3,4,5]. It has been applied, for example, to the computation of three-dimensional DCTs for high-definition television encoding [6] and to multicarrier modulators as an alternative to DFT-based systems for orthogonal frequency-division multiplexing and discrete multitone modulation [7]. Moreover, the DCT is used in the encoding and decoding of transformer neural networks [8], instance segmentation [9], image compression [10,11], and video coding [12,13,14]. It is also employed in real-time applications such as audio denoising [15,16], data filtering [17], and feature extraction [18].
The widespread use of the DCT stems from the theoretical result that, in terms of energy compaction, the conventional DCT closely approximates the optimal, signal-dependent Karhunen-Loève transform. This approximation holds when the underlying signal can be modeled as a first-order stationary Markov process. However, natural signals and images often exhibit complex structures and dynamics that do not strictly satisfy this first-order Markov assumption [19].
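This near-optimality can be illustrated numerically. The sketch below is an illustration, not taken from the paper; the AR(1) correlation coefficient rho = 0.95 and the size N = 8 are arbitrary choices. It builds the covariance matrix of a first-order Markov process and shows that the orthonormal DCT-II nearly diagonalizes it, i.e., almost all covariance energy is compacted onto the diagonal:

```python
import numpy as np

def dct2_matrix(N):
    # Orthonormal DCT-II: C[k, n] = s_k * sqrt(2/N) * cos(pi*(2n+1)*k/(2N))
    idx = np.arange(N)
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * idx[None, :] + 1) * idx[:, None] / (2 * N))
    C[0, :] /= np.sqrt(2.0)  # s_0 = 1/sqrt(2) makes the first row unit-norm
    return C

N, rho = 8, 0.95
# Covariance of a first-order stationary Markov (AR(1)) process: R[i, j] = rho^|i-j|
R = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
C = dct2_matrix(N)
T = C @ R @ C.T  # covariance of the transformed coefficients
off_fraction = (np.sum(T**2) - np.sum(np.diag(T)**2)) / np.sum(T**2)
print(off_fraction)  # small: the DCT-II nearly decorrelates the AR(1) source
```

The KLT would make `off_fraction` exactly zero; the small residual quantifies how close the signal-independent DCT-II comes for this source model.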
For this reason, the literature distinguishes eight types of the DCT, namely DCT types I-VIII [1,20,21]. In this classification, the conventional DCT corresponds to the type-II DCT (DCT-II). Recently, a set of DCTs and discrete sine transforms (DSTs) has been incorporated into the Versatile Video Coding (VVC) standard [3,5,13,22,23]. In particular, the Adaptive Multiple Transform (AMT) scheme was introduced to encode residual signals in inter-coded blocks. Depending on the coding mode, the encoder selects, for each block, a transform pair from a predefined pool of DCT/DST candidates [4,23]. The newly introduced transforms include the DCT-VIII and the type-VII DST (DST-VII) for multiple block sizes. Extended AMT proposals also considered the type-V DCT and type-I DST; however, these were not adopted due to negligible rate-distortion gains and higher computational complexity compared with DCT-II, DST-VII, and DCT-VIII. Consequently, for inter-predicted blocks, a fixed transform set consisting of DCT-II, DST-VII, and DCT-VIII is used. Since AMT evaluates two candidates along both horizontal and vertical directions, it supports 15 valid transform pairs per block, which significantly increases encoder complexity [4,23].
Consequently, numerous studies have focused on the development of fast algorithms for computing DCTs and DSTs. Such algorithms reduce the number of multiplications and additions compared with direct matrix-vector multiplication, whose arithmetic cost grows quadratically with the transform size. However, most existing papers concentrate on DCT-II implementations [20,24,25], while relatively few studies investigate the relationships among DST-VII, DCT-VIII, and other trigonometric transforms or propose low-complexity algorithms for these transforms [4,26,27,28].
This article focuses on the development of fast algorithms for the DCT-VIII. As discussed in the literature [3,5,13,22,23], the DCT-VIII has a wide range of applications in image and video coding; however, its computational cost remains a significant challenge. The following section briefly reviews existing low-complexity algorithms for the DCT-VIII.
1.1. State of the Art
Existing fast DCT-VIII algorithms have been developed either in the spectral domain, based on fast DFT/DCT/DST algorithms [4,26,27,28], or by exploiting repeated entries and structural properties of DCT-VIII matrices [12,29,30,31].
The first strategy was adopted in [4], where the DCT-VIII was decomposed into a preprocessing matrix, a DST-VII core, and a post-processing matrix by exploiting the linear relationship between the DCT-VIII and the DST-VII. Fast computation was further achieved by leveraging the relationship between the DST-VII and the DFT. To efficiently compute DFTs for the supported DCT-VIII block sizes, up to and including 32, prime-factor and Winograd algorithms were employed. The resulting integer DCT-VIII kernels were approximated using norm scaling and bit shifts to ensure compatibility with quantization at each stage of video coding. As a result, the proposed integer DCT-VIII algorithms significantly reduced overall computational complexity, with only minor losses in Bjøntegaard Delta Bitrate (BD-rate). In particular, the numbers of additions and multiplications were reduced by 38% and 80.3%, respectively.
Based on the relationship between the DCT-VIII and the DST-VII [1,26], fast DST-VII algorithms derived from DFT methods [28] can be applied to DCT-VIII computation. For instance, in [27], the DCT-VIII was expressed in terms of the DST-VII using simple operations such as permutation and sign inversion. This shows that existing DST-VII algorithms can be reused for the DCT-VIII, eliminating the need to develop separate fast algorithms for each transform. Accordingly, [27] employed a Winograd factorization of the DFT matrix to factorize the DCT-VIII matrices for the considered transform sizes.
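The permutation/sign-inversion relationship can be sketched numerically. Assuming the common orthonormal definitions of the DST-VII and DCT-VIII (the scaling convention used in the cited works may differ), reversing the input order of a DST-VII and alternating the signs of its outputs reproduces the DCT-VIII:

```python
import numpy as np

def dst7(N):
    # DST-VII: S[k, n] = sqrt(4/(2N+1)) * sin(pi*(2k+1)*(n+1)/(2N+1))
    k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.sqrt(4.0 / (2 * N + 1)) * np.sin(np.pi * (2 * k + 1) * (n + 1) / (2 * N + 1))

def dct8(N):
    # DCT-VIII: C[k, n] = sqrt(4/(2N+1)) * cos(pi*(2k+1)*(2n+1)/(2*(2N+1)))
    k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.sqrt(4.0 / (2 * N + 1)) * np.cos(np.pi * (2 * k + 1) * (2 * n + 1) / (4 * N + 2))

for N in (3, 4, 5, 8):
    S, C = dst7(N), dct8(N)
    signs = np.diag((-1.0) ** np.arange(N))       # alternate output signs
    # DCT-VIII = sign alternation of the DST-VII applied to the reversed input
    assert np.allclose(C, signs @ S[:, ::-1])
```

This is why a fast DST-VII routine can serve as a DCT-VIII routine at the cost of a reordering and some negations, neither of which requires arithmetic.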
The main advantage of the first strategy for fast DCT-VIII algorithms is increased codec flexibility, which is crucial for advanced video coding schemes such as AMT. The relationships between transforms enable the implementation of multiple odd-type transforms simultaneously, simplifying codec design.
However, the first strategy supports the development of fast DCT-VIII algorithms only for a limited set of transform sizes, up to 32. While reorderings and sign inversions are computationally inexpensive, the underlying base transform (DFT or DST-VII) still needs to be implemented efficiently. For integer transforms, additional issues may arise related to approximation accuracy and quantization.
The second strategy exploits repeated elements in the rows or columns of the transform matrix [12,29]. These repetitions allow multiple multiplications by coefficients of the same absolute value within a basis row to be combined into a single multiplication. For instance, [12,29] identified three structural patterns within the DST-VII matrix columns suitable for fast algorithm design. The patterns are non-overlapping, meaning that each column follows exactly one pattern:
1. A pattern consisting of several groups with a fixed number of elements, regardless of sign changes, where an identity exists among the sums within each group (e.g., the sum of two elements equals another element).
2. A pattern in which a subset of column elements is repeated across other columns through permutation.
3. A pattern in which a column contains only a single unique value, ignoring sign changes and zeros.
It has been shown that the proposed algorithms for N = 16, 32, and 64 in the state-of-the-art codec can provide average overall decoding time savings of 7% and 5% under All Intra and Random Access configurations, respectively.
In [30], the linearity of DCT-II, DST-VII, and DCT-VIII was exploited to reduce decoder complexity. Leveraging this property, the inverse transform can be accelerated by dividing a block into subblocks containing a single non-zero coefficient and summing their individual inverse transforms, rather than performing a full inverse transform on the entire block. The authors compared this linearity-based approach with the standard fast inverse transforms used in VVC (VTM-8.2) to determine when to switch methods. A precomputed threshold was applied: if the number of non-zero coefficients was below the threshold, the linearity-based method was used; otherwise, the conventional inverse transform was applied. Decoding time savings reached 4% in All-Intra and approximately 10% in Random Access compared to the standard VVC inverse transform.
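The linearity trick can be shown in a few lines. Assuming an orthonormal DCT-VIII (so the inverse transform is the transpose; the codec's integer kernels differ in scaling), the inverse of a sparse coefficient vector equals the sum of the per-coefficient inverses, so only the basis columns with non-zero coefficients need to be touched:

```python
import numpy as np

def dct8(N):
    # Orthonormal DCT-VIII matrix (common convention; scaling is an assumption)
    k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.sqrt(4.0 / (2 * N + 1)) * np.cos(np.pi * (2 * k + 1) * (2 * n + 1) / (4 * N + 2))

N = 8
C = dct8(N)
coeffs = np.zeros(N)
coeffs[[1, 4]] = [2.5, -1.0]      # sparse spectrum: only two non-zero coefficients

full = C.T @ coeffs               # full inverse transform, touches all N columns
# Linearity-based route: sum the inverses of single-coefficient subvectors
sparse = sum(coeffs[i] * C.T[:, i] for i in np.flatnonzero(coeffs))
assert np.allclose(full, sparse)
```

With two non-zero coefficients out of eight, the sparse route uses a quarter of the column products, which mirrors why the method pays off below a coefficient-count threshold.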
Thus, the arithmetic complexity of fast DCT-VIII algorithms based on the second strategy largely depends on the structure of the transform matrix. In particular, it is affected by the repetition and placement of the matrix entries. The relative performance gains vary with matrix size. For very large transforms, the advantage becomes less pronounced compared to direct matrix-vector multiplication.
1.2. The Main Contributions of the Paper
To overcome the limitations of existing strategies for designing fast DCT-VIII algorithms, we propose using the structural approach [32,33] to reduce the computational complexity of this transform. Over time, we have refined this approach and successfully applied it to develop efficient algorithms for discrete trigonometric transforms [34]. Unlike other methods, the structural approach does not rely on specific analytical properties of the transform. Instead, it depends solely on the structure of the transform matrix, particularly the repetition and arrangement of its elements. This makes the structural approach applicable to a wide variety of transform matrices.
According to the structural approach, the transform matrix is first preprocessed by permuting selected rows and columns and changing the signs of certain entries. Submatrices of the resulting matrix are then matched to templates defined in [32,33], which represent matrix patterns with known factorizations, such as butterfly-type or circular-convolution-type patterns. The factorizations of these submatrices are subsequently combined to produce a factorization of the original transform matrix. This factorization is then used to construct the corresponding fast transform algorithm, which is represented by a data flow graph.
Unlike the first strategy, which relies on relationships between DCT-VIII and other trigonometric transforms, the structural approach reduces the number of arithmetic operations required for the direct matrix-vector product. Moreover, it can be applied to matrices of any size, not just those that are powers of two.
Compared to the second strategy, the structural approach generalizes and formalizes the ideas presented in [12,29]. The first pattern corresponds to a butterfly-type module, the second pattern represents a circular convolution, and the third pattern corresponds to a fan-shaped fragment of a data flow graph representing the sum of several identical entries. By using the structural approach, it is possible to formalize the rules and patterns from [12,29,30,31] and to develop fast DCT-VIII algorithms for arbitrary input lengths N.
The novelty of this research lies in its departure from conventional approaches to DCT-VIII algorithm design. Existing methods have predominantly relied on spectral and algebraic properties of the transform matrix, such as representations based on the discrete Fourier transform or explicit algebraic relationships among matrix elements. Consequently, prior algorithm designs were driven by analytical expressions describing the spectral characteristics of the transform matrix.
In contrast, this paper proposes, for the first time, the construction of fast DCT-VIII algorithms based on structural pattern recognition. Specifically, structural patterns of submatrices within the DCT-VIII transform matrix are identified and exploited, following the framework introduced in the foundational studies [32,33]. This approach emphasizes the direct recognition and factorization of recurring structural configurations, rather than relying on spectral or algebraic interpretations of the transform.
The structural approach applied in this paper to the construction of fast DCT-VIII algorithms offers several practical benefits. The primary contributions of this research are as follows:
We have developed factorizations of DCT-VIII matrices into sparse and diagonal matrices for input lengths from 3 to 7. These factorizations were obtained using two techniques: comparison with the patterns from [32] and identification of circular-convolution structures [33].
The correctness of these factorizations has been verified both mathematically and through implementation in MATLAB R2023b.
Based on the DCT-VIII matrix factorizations, we have designed efficient algorithms for this transform using data flow graphs. A unique feature of these graphs is that each path from an input vertex to an output vertex contains only one multiplication, which reduces both processing time and resource usage.
The remainder of the paper is organized as follows. Section 2 presents the mathematical background of the DCT-VIII. Section 3 derives factorizations of DCT-VIII matrices and introduces fast DCT-VIII algorithms using data flow graphs. Section 4 provides a comparative analysis of the arithmetic operations required by the proposed algorithms versus direct matrix-vector multiplication. Section 5 compares the structure of the proposed algorithms with existing methods developed using the same strategy. Finally, Section 6 concludes the paper with a comprehensive summary of the research.
2. Preliminary Background
2.1. Obtaining a Matrix with the DCT-VIII Coefficients
The DCT-VIII that we used in our solutions can be calculated as follows [1]:

  y(k) = c * sum_{n=0}^{N-1} x(n) cos( pi (2k+1)(2n+1) / (2(2N+1)) ),  k = 0, 1, ..., N-1,

where c = sqrt(4/(2N+1)) is the normalization factor, x(n) denotes the input samples, and y(k) denotes the transform coefficients.

In matrix notation, the DCT-VIII can be represented as follows:

  y = C_N^VIII x,

where x and y are the input and output column vectors of length N, and C_N^VIII is the N x N transform matrix whose (k, n) entry is sqrt(4/(2N+1)) cos( pi (2k+1)(2n+1) / (2(2N+1)) ).
The notations employed in this work are summarized in Table 1.
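As a numerical companion to the definition above, the sketch below generates the DCT-VIII matrix under the orthonormal scaling sqrt(4/(2N+1)) (assumed here; other conventions differ only by a diagonal scaling) and verifies two properties used throughout the paper, orthogonality and symmetry:

```python
import numpy as np

def dct8_matrix(N):
    # Entry (k, n): sqrt(4/(2N+1)) * cos(pi*(2k+1)*(2n+1) / (2*(2N+1)))
    k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.sqrt(4.0 / (2 * N + 1)) * np.cos(
        np.pi * (2 * k + 1) * (2 * n + 1) / (2 * (2 * N + 1)))

for N in range(3, 8):  # the sizes considered in this paper
    C = dct8_matrix(N)
    assert np.allclose(C @ C.T, np.eye(N))  # orthonormal: the inverse is the transpose
    assert np.allclose(C, C.T)              # symmetric: (2k+1)(2n+1) is symmetric in k, n
```

Symmetry means the forward and inverse DCT-VIII share the same matrix, so a single fast algorithm serves both directions.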
2.2. The Method of Creating Data Flow Graphs
In this work, data flow graphs are used to represent the proposed algorithms. This subsection explains how these graphs are constructed and how to interpret them.
Input values are placed on the left side of the graph, while output values appear on the right. Solid lines indicate data flow, and dashed lines indicate data flow with a sign change. Rectangular blocks represent Hadamard matrices, and circles represent multipliers.
When a graph is drawn from a specific matrix expression, the order of the factors is reversed: the last (rightmost) matrix in the expression, which acts on the input first, is drawn on the left, while the first (leftmost) matrix is drawn on the right.
2.3. Common Matrices for Different Solutions
For clarity, we present below the matrices that may be involved in several of the solutions:
2.4. The Steps for Constructing Fast DCT-VIII Algorithms
Let us outline the procedure for constructing fast DCT-VIII algorithms based on the structural approach [32,33].
Step 1. Permutation of rows and/or columns of the initial transform matrix. At this stage, the appropriate permutation matrices are constructed.
Step 2. Extraction of identical matrix entries. The permuted matrix is represented as the sum of two matrices. The first matrix contains entries that are identical up to a sign change and is excluded from further processing to reduce the number of arithmetic operations. The second matrix contains the remaining entries of the permuted matrix.
Step 3. Identification and factorization of structural matrix patterns. Submatrices matching predefined structural patterns are extracted from the second matrix obtained in Step 2. Each extracted submatrix is factorized according to the corresponding pattern factorizations described in [32].
Step 4. Factorization of the initial transform matrix. If necessary, Steps 1–3 are recursively applied to the extracted submatrices. Using the resulting submatrix factorizations, the factorization of the original DCT-VIII matrix is constructed, including the permutation matrices obtained in Step 1.
Step 5. Enhancement of the resulting factorization. The final factorization of the transform matrix is refined to further reduce the number of addition operations.
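Step 2 can be illustrated in miniature. The toy sketch below is an illustration, not the paper's exact procedure: it splits a 4-point DCT-VIII matrix (orthonormal scaling assumed) into a part whose entry magnitudes repeat, and the remainder, showing how shared factors arise and why repeated entries can be served by a single multiplication:

```python
import numpy as np

def dct8_matrix(N):
    k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.sqrt(4.0 / (2 * N + 1)) * np.cos(
        np.pi * (2 * k + 1) * (2 * n + 1) / (2 * (2 * N + 1)))

N = 4
A = dct8_matrix(N)
mags = np.round(np.abs(A), 12)            # group entries by absolute value
unique = np.unique(mags)
print(f"{A.size} entries, {unique.size} distinct magnitudes")

# Step 2 in miniature: A = A1 + A2, where A1 collects entries whose magnitude
# repeats elsewhere in the matrix (candidates for shared multiplications)
repeated = [m for m in unique if np.count_nonzero(mags == m) > 1]
mask = np.isin(mags, repeated)
A1, A2 = np.where(mask, A, 0.0), np.where(mask, 0.0, A)
assert np.allclose(A, A1 + A2)
```

For this size, the 16 entries take only a handful of distinct magnitudes, which is precisely the redundancy the structural approach converts into pre-additions followed by one multiplication per distinct factor.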
4. Results
The obtained results were verified in two stages. First, the correctness of the DCT-VIII matrix factorizations presented in Section 3 was experimentally validated. Using MATLAB R2023b, the original DCT-VIII matrices were computed from expressions (3), (5), (7), (9), and (11). These matrices were then compared with the products of the corresponding factorized matrices derived from expressions (4), (6), (8), (10), and (12). For sizes N ranging from 3 to 7, the factorized matrices exactly matched the original matrices, thereby confirming the correctness of the proposed factorizations.
Second, the correctness of the software implementations of the proposed algorithms was verified, which in turn confirms the correctness of the constructed data flow graphs. The algorithms represented by the data flow graphs were implemented in software, and the corresponding implementations are presented as pseudocode in Appendix A.
For each proposed N-point DCT-VIII algorithm, 30 sequences of N random numbers were generated. Each sequence was applied as input to the corresponding algorithm, and the resulting output was compared with the result obtained from the direct matrix-vector multiplication for the same input. If the results coincided for all 30 test sequences, the algorithm was deemed correct. In this manner, the correctness of the pseudocode and the corresponding data flow graphs was confirmed for the developed DCT-VIII algorithms with N ranging from 3 to 7.
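The verification protocol above can be expressed as a small harness. The sketch below assumes the orthonormal DCT-VIII matrix as the reference; the candidate passed in is only a stand-in (the direct product itself), where a fast data-flow-graph implementation would be plugged in during actual testing:

```python
import numpy as np

def dct8_matrix(N):
    # Reference DCT-VIII matrix (orthonormal convention assumed)
    k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.sqrt(4.0 / (2 * N + 1)) * np.cos(
        np.pi * (2 * k + 1) * (2 * n + 1) / (2 * (2 * N + 1)))

def verify(candidate, N, trials=30, seed=42):
    """Feed random N-point vectors to a candidate DCT-VIII implementation and
    compare every output with direct matrix-vector multiplication."""
    rng = np.random.default_rng(seed)
    C = dct8_matrix(N)
    return all(np.allclose(candidate(x), C @ x)
               for x in rng.standard_normal((trials, N)))

# Stand-in candidate: the direct product itself; in practice the function
# implementing the data flow graph for each N would be tested here.
assert all(verify(lambda x, M=dct8_matrix(N): M @ x, N) for N in range(3, 8))
```

The seed is fixed only so that a failure is reproducible; any 30 random inputs serve the same purpose.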
Next, the arithmetic complexity of the proposed DCT-VIII algorithms was evaluated and compared with that of direct matrix-vector multiplication. The number of multiplications was determined by counting the vertices labeled with factors (circles) in the data flow graphs, while additions were estimated by counting vertices where two edges converge. Zero entries were not considered. Overall, the proposed DCT-VIII algorithms achieve a reduction of approximately 53% in the number of multiplications and an increase of about 21% in the number of additions for input sizes from 3 to 7.
Table 3 summarizes the number of arithmetic operations required by the proposed DCT-VIII algorithms. Percentage differences relative to the direct matrix-vector product are indicated in parentheses. A plus sign denotes an increase in the number of operations, whereas a minus sign indicates a reduction.
In Table 4, we present the number of arithmetic operations for DCT-VIII algorithms based on the fast Fourier transform (FFT). Detailed explanations of how the number of operations for these algorithms is calculated are provided in Appendix B.
Our analysis shows that, for N = 4, the proposed DCT-VIII algorithm based on the structural approach requires the same number of multiplications and additions as the DCT-VIII algorithm from [27], which is based on the relationship between the DST-VII and DCT-VIII. Moreover, the proposed solutions require significantly fewer arithmetic operations than the DCT-VIII algorithms based on the FFT [4]. To the best of our knowledge, no other algorithms are available in the literature for direct comparison. We also consider it inappropriate to compare the proposed fast DCT-VIII algorithms with related integer-transform algorithms due to their differing mathematical foundations.
5. Discussion
Let us analyze the obtained algorithms. In the literature, DCT-VIII algorithms are typically presented only for input sequence lengths that are powers of two. This restriction arises because fast algorithms for this transform were developed primarily for video coding. However, other applications of the DCT-VIII are also known. For example, in [17], it was used for fractional-pixel motion compensation in high-efficiency video coding, which requires fast algorithms for input sequences whose lengths are not powers of two.
We now examine the stages of the DCT-VIII matrix factorization based on the structural approach, which was used to construct the proposed algorithms (Table 5).
For N = 3, the DCT-VIII matrix is fully characterized by the circular-convolution structure. Applying a factorization of this structure enabled the factorization of the 3-point DCT-VIII matrix. Using the terminology of [12,29], only the second pattern listed in Section 1.1 was applied for N = 3. Specifically, this pattern involves a subset of column elements that is repeated in other columns with permutation.
For N = 4, two patterns were identified in the transform matrix. The first pattern is a circular convolution applied to a submatrix. In the second pattern, a column and a row of the original matrix contain the same values up to a sign. In the data flow graph, this pattern corresponds to a fan-shaped structure, where several edges leave a single node toward different outputs or enter a single node from multiple inputs. This structure matches the third pattern from Section 1.1 and allows a reduction in the number of multiplications by performing preliminary additions of the inputs, followed by a single multiplication.
For N = 5, a structure consisting of four submatrices was extracted from the DCT-VIII matrix, and each submatrix was subsequently factorized. This corresponds to the first pattern described in [12,29], namely a pattern consisting of several groups with a fixed number of elements, regardless of sign changes. For the entries of the DCT-VIII matrix not included in this structure, a direct multiplication of the input vector by the corresponding transform matrix entries was used. Thus, in this case, the structural approach was applied jointly with direct matrix-vector multiplication.
For N = 6, a two-level hierarchical structure was identified in the DCT-VIII matrix. The upper level consists of a decomposition into submatrices, while the lower level represents the structure of these submatrices in the form of circular convolutions. In this case, patterns 1 and 2 from Section 1.1 and from [12,29] are identified successively.
For N = 7, two patterns were identified in the matrix. Unlike the case of N = 6, the extracted structure is non-hierarchical. A submatrix was extracted from the original transform matrix and represented by four smaller submatrices. This corresponds to pattern 1 in the terminology of [12,29]. The remaining entries of the original DCT-VIII matrix for N = 7 correspond to pattern 3 from Section 1.1, as they include several groups with identical elements, regardless of sign changes.
Identifying such structures enabled a significant reduction in the number of multiplications required to implement the DCT-VIII for small N compared to direct matrix-vector multiplication. However, the number of additions increases due to the preliminary summation of matrix entries that are multiplied by the same factor.
Another limitation of the structural approach is that it is better suited for constructing fast algorithms for short data sequences. As the sequence length increases, identifying the structure of the transform matrices becomes more difficult. Moreover, as discussed earlier, the efficiency of the proposed fast algorithms strongly depends on the structural properties of these matrices.
6. Conclusions
The fast DCT-VIII algorithms presented in this paper were constructed using the structural approach described in [32,33]. This approach derives fast algorithms from the structural properties of transform matrices for various input lengths. In this context, the structure of a transform matrix refers to properties such as symmetries in certain submatrices, recursion of matrix patterns, factorization of matrix patterns, and algebraic relationships between matrix entries.
The presented fast DCT-VIII algorithms were developed for short input sequences, specifically for lengths from 3 to 7. The derivation of the fast DCT-VIII algorithm for N = 4 has already been addressed in several studies [27,31]. The proposed algorithms reduce computational complexity by significantly decreasing the number of multiplications compared to direct matrix-vector computation. On average, they achieve a reduction of approximately 53% in the number of multiplications and an increase of about 21% in the number of additions for input sizes from 3 to 7.
The algorithms are represented using data flow graphs. A key advantage of the proposed designs is that each input-output path in the graphs contains only a single multiplication. It is well known that when more than one multiplication lies on an input-output path, the operand word length doubles with each multiplication, introducing additional data-processing challenges. The proposed algorithms completely avoid this issue.
Implementing the constructed fast DCT-VIII algorithms ensures numerical stability, as the structural approach generally preserves the orthogonality of the transform and maintains good conditioning.
Additionally, the fast DCT-VIII algorithms for short input sequences can be reused as building blocks for other DCT and DST types due to cross-relations between transforms. For example, a DST-VII can be converted into a DCT-VIII through permutation and input reversal.
Although practical codecs commonly employ large transform sizes (e.g., 16, 32, and 64) [35,36], algorithms for such large transforms are typically constructed from smaller kernels using well-established nesting techniques [37,38]. Consequently, optimizing the computation of small-scale transforms remains critically important. Moreover, for this class of transforms, algorithmic computational complexity increases rapidly with transform size. To mitigate this issue, one can synthesize new large-scale orthogonal transforms from small-scale discrete transforms used as kernels. As a result, the local characteristics of the original transforms are preserved while achieving lower computational complexity than conventional large-transform implementations.
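One generic way to synthesize a larger orthogonal transform from small kernels is the Kronecker product, under which orthogonality is inherited from the factors. Whether this coincides with the specific nesting techniques of [37,38] is not claimed here; the sketch below is an illustration of the general principle, using the orthonormal DCT-VIII kernels as an example:

```python
import numpy as np

def dct8_matrix(N):
    # Orthonormal DCT-VIII kernel (scaling convention assumed)
    k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.sqrt(4.0 / (2 * N + 1)) * np.cos(
        np.pi * (2 * k + 1) * (2 * n + 1) / (2 * (2 * N + 1)))

A, B = dct8_matrix(3), dct8_matrix(4)   # two small orthonormal kernels
T = np.kron(A, B)                       # 12x12 transform built by nesting the kernels
assert T.shape == (12, 12)
assert np.allclose(T @ T.T, np.eye(12)) # orthogonality of the kernels is inherited
```

Because the large transform factorizes through its small kernels, any multiplication savings achieved for the 3- and 4-point DCT-VIII carry over directly to the synthesized 12-point transform.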