Article

Fast DCT-VIII Algorithms for Short-Length Input Sequences

by Mateusz Raciborski 1,*, Marina Polyakova 2 and Aleksandr Cariow 3
1
Faculty of Computer Science and Telecommunications, Maritime University of Szczecin, Waly Chrobrego 1-2, 70-500 Szczecin, Poland
2
Institute of Computer Systems, Odesa Polytechnic National University, Shevchenko Ave., 1, 65044 Odesa, Ukraine
3
Faculty of Computer Science and Information Technology, West Pomeranian University of Technology in Szczecin, Zolnierska 49, 71-210 Szczecin, Poland
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 207; https://doi.org/10.3390/electronics15010207
Submission received: 30 November 2025 / Revised: 26 December 2025 / Accepted: 30 December 2025 / Published: 1 January 2026

Abstract

Discrete cosine transforms (DCTs) are widely used in intelligent electronic systems for storing, processing, and transmitting data. Their popularity stems, on the one hand, from their unique properties and, on the other hand, from the availability of fast algorithms that minimize the computational and hardware complexity of their implementation. Until recently, the Type VIII DCT had been one of the least studied variants, with virtually no publications addressing fast algorithms for its implementation. However, this situation has changed, making the development of efficient implementation methods for this transform a timely and important research problem. In this paper, several algorithmic solutions for implementing the Type VIII DCT are proposed. A set of Type VIII DCT algorithms for small lengths N = 3, 4, 5, 6, 7 is presented. The effectiveness of the proposed solutions is due to the possibility of successful factorization of small-sized DCT-VIII matrices, leading to a reduction in the computational complexity of implementing transforms of this type. Compared with direct matrix–vector computation, the proposed algorithms achieve an approximate 53% reduction in the number of multiplications, at the cost of an increase of about 21% in the number of additions. This work continues a series of previously published studies aimed at creating a library of small-sized algorithms for discrete trigonometric transforms.

1. Introduction

The discrete cosine transform (DCT) has long been widely used to address a variety of data processing tasks [1,2,3,4,5]. It has been applied, for example, to the computation of three-dimensional DCTs for high-definition television encoding [6] and to multicarrier modulators as an alternative to DFT-based systems for orthogonal frequency-division multiplexing and discrete multitone modulation [7]. Moreover, the DCT is used in the encoding and decoding of transformer neural networks [8], instance segmentation [9], image compression [10,11], and video coding [12,13,14]. It is also employed in real-time applications such as audio denoising [15,16], data filtering [17], and feature extraction [18].
The widespread use of the DCT stems from the theoretical result that, in terms of energy compaction, the conventional DCT closely approximates the optimal, signal-dependent Karhunen-Loève transform. This approximation holds when the underlying signal can be modeled as a first-order stationary Markov process. However, natural signals and images often exhibit complex structures and dynamics that do not strictly satisfy this first-order Markov assumption [19].
For this reason, the literature distinguishes eight types of the DCT, namely DCT types I-VIII [1,20,21]. In this classification, the conventional DCT corresponds to the type-II DCT (DCT-II). Recently, a set of DCTs and discrete sine transforms (DSTs) has been incorporated into the Versatile Video Coding (VVC) standard [3,5,13,22,23]. In particular, the Adaptive Multiple Transform (AMT) scheme was introduced to encode residual signals in inter-coded blocks. Depending on the coding mode, the encoder selects, for each block, a transform pair from a predefined pool of DCT/DST candidates [4,23]. The newly introduced transforms include the DCT-VIII and the type-VII DST (DST-VII) for multiple block sizes. Extended AMT proposals also considered the type-V DCT and type-I DST; however, these were not adopted due to negligible rate-distortion gains and higher computational complexity compared with DCT-II, DST-VII, and DCT-VIII. Consequently, for inter-predicted blocks, a fixed transform set consisting of DCT-II, DST-VII, and DCT-VIII is used. Since AMT evaluates two candidates along both horizontal and vertical directions, it supports 15 valid transform pairs per block, which significantly increases encoder complexity [4,23].
Consequently, numerous studies have focused on the development of fast algorithms for computing DCTs and DSTs. Such algorithms reduce the number of multiplications and additions compared with the direct matrix-vector computation, which requires $O(N^2)$ arithmetic operations. However, most existing papers concentrate on DCT-II implementations [20,24,25], while relatively few studies investigate the relationships among DST-VII, DCT-VIII, and other trigonometric transforms or propose low-complexity algorithms for these transforms [4,26,27,28].
This article focuses on the development of fast algorithms for the DCT-VIII. As discussed in the literature [3,5,13,22,23], the DCT-VIII has a wide range of applications in image and video coding; however, its computational cost remains a significant challenge. The following section briefly reviews existing low-complexity algorithms for the DCT-VIII.

1.1. State of the Art of the Problem

Existing fast DCT-VIII algorithms have been developed either in the spectral domain, based on fast DFT/DCT/DST algorithms [4,26,27,28], or by exploiting repeated entries and structural properties of DCT-VIII matrices [12,29,30,31].
The first strategy was adopted in [4], where the DCT-VIII was decomposed into a preprocessing matrix, a DST-VII core, and a post-processing matrix by exploiting the linear relationship between the DCT-VIII and DST-VII. Fast computation was further achieved by leveraging the relationship between the DST-VII and the DFT. To efficiently compute DFTs for multiple DCT-VIII block sizes, prime-factor and Winograd algorithms were employed for N = 4 , 8 , 16 , and 32. The resulting integer DCT-VIII kernels were approximated using norm scaling and bit shifts to ensure compatibility with quantization at each stage of video coding. As a result, the proposed integer DCT-VIII algorithms significantly reduced overall computational complexity, with only minor losses in Bjøntegaard Delta Bitrate (BD-rate). In particular, the numbers of additions and multiplications were reduced by 38% and 80.3%, respectively.
Based on the relationship between the DCT-VIII and DST-VII [1,26], fast DST-VII algorithms derived from DFT methods [28] can be applied to DCT-VIII computation. For instance, in [27], the DCT-VIII was expressed in terms of DST-VII using simple operations such as permutation and sign inversion. This shows that existing DST-VII algorithms can be reused for the DCT-VIII, eliminating the need to develop separate fast algorithms for each transform. Accordingly, [27] employed Winograd factorization of the DFT matrix to factorize the DCT-VIII matrices for N = 4 and N = 8 .
The main advantage of the first strategy for fast DCT-VIII algorithms is increased codec flexibility, which is crucial for advanced video coding schemes such as AMT. The relationships between transforms enable the implementation of multiple odd-type transforms simultaneously, simplifying codec design.
However, the first strategy supports the development of fast DCT-VIII algorithms only for limited transform sizes, such as N = 4 , 8 , 16 , and 32. While reorderings and sign inversions are computationally inexpensive, the underlying base transform (DFT or DST-VII) still needs to be implemented efficiently. For integer transforms, additional issues may arise related to approximation accuracy and quantization.
The second strategy exploits repeated elements in the rows or columns of the transform matrix [12,29]. These repetitions allow multiple multiplications by coefficients of the same absolute value within a basis row to be combined into a single multiplication. For instance, [12,29] identified three structural patterns within the DST-VII matrix columns suitable for fast algorithm design. The patterns are non-overlapping, meaning that each column follows exactly one pattern:
  • A pattern consisting of several groups with a fixed number of elements, regardless of sign changes, where an identity exists among the sums within each group (e.g., the sum of two elements equals another element).
  • A pattern in which a subset of column elements is repeated across other columns through permutation.
  • A pattern in which a column contains only a single unique value, ignoring sign changes and zeros.
It has been shown that the proposed algorithms for N = 16, 32, and 64 in the state-of-the-art codec can provide average overall decoding time savings of 7% and 5% under All Intra and Random Access configurations, respectively.
In [30], the linearity of DCT-II, DST-VII, and DCT-VIII was exploited to reduce decoder complexity. Leveraging this property, the inverse transform can be accelerated by dividing a block into subblocks containing a single non-zero coefficient and summing their individual inverse transforms, rather than performing a full inverse transform on the entire block. The authors compared this linearity-based approach with the standard fast inverse transforms used in VVC (VTM-8.2) to determine when to switch methods. A precomputed threshold was applied: if the number of non-zero coefficients was below the threshold, the linearity-based method was used; otherwise, the conventional inverse transform was applied. Decoding time savings reached 4% in All-Intra and approximately 10% in Random Access compared to the standard VVC inverse transform.
Thus, the arithmetic complexity of fast DCT-VIII algorithms based on the second strategy largely depends on the structure of the transform matrix. In particular, it is affected by the repetition and placement of the matrix entries. The relative performance gains vary with matrix size. For very large transforms, the advantage becomes less pronounced compared to direct matrix-vector multiplication.

1.2. The Main Contributions of the Paper

To overcome the limitations of existing strategies for designing fast DCT-VIII algorithms, we propose using the structural approach [32,33] to reduce the computational complexity of this transform. Over time, we have refined this approach and successfully applied it to develop efficient algorithms for discrete trigonometric transforms [34]. Unlike other methods, the structural approach does not rely on specific analytical properties of the transform. Instead, it depends solely on the structure of the transform matrix, particularly the repetition and arrangement of its elements. This makes the structural approach applicable to a wide variety of transform matrices.
According to the structural approach, the transform matrix is first preprocessed by permuting selected rows and columns and changing the signs of certain entries. Submatrices of the resulting matrix are then matched to templates defined in [32,33], which represent matrix patterns with known factorizations, such as butterfly-type or circular-convolution-type patterns. The factorizations of these submatrices are subsequently combined to produce a factorization of the original transform matrix. This factorization is then used to construct the corresponding fast transform algorithm, which is represented by a data flow graph.
Unlike the first strategy, which relies on relationships between DCT-VIII and other trigonometric transforms, the structural approach reduces the number of arithmetic operations required for the direct matrix-vector product. Moreover, it can be applied to matrices of any size, not just those that are powers of two.
Compared to the second strategy, the structural approach generalizes and formalizes the ideas presented in [12,29]. The first pattern corresponds to a butterfly-type module, the second pattern represents a circular convolution, and the third pattern corresponds to a fan-shaped fragment of a data flow graph representing the sum of several identical entries. By using the structural approach, it is possible to formalize the rules and patterns from [12,29,30,31] and to develop fast DCT-VIII algorithms for arbitrary input lengths N.
The novelty of this research lies in its departure from conventional approaches to DCT-VIII algorithm design. Existing methods have predominantly relied on spectral and algebraic properties of the transform matrix, such as representations based on the discrete Fourier transform or explicit algebraic relationships among matrix elements. Consequently, prior algorithm designs were driven by analytical expressions describing the spectral characteristics of the transform matrix.
In contrast, this paper proposes, for the first time, the construction of fast DCT-VIII algorithms based on structural pattern recognition. Specifically, structural patterns of submatrices within the DCT-VIII transform matrix are identified and exploited, following the framework introduced in the foundational studies [32,33]. This approach emphasizes the direct recognition and factorization of recurring structural configurations, rather than relying on spectral or algebraic interpretations of the transform.
The structured approach applied in this paper for the construction of fast DCT-VIII algorithms offers the following practical benefits:
  • Using the matrix factorizations and data flow graphs for submatrix structures introduced in [32,33], it is easy to construct factorizations and data flow graphs for fast DCT-VIII algorithms with different values of N.
  • The structural patterns defined in [32,33] enable the construction of fast DCT-VIII algorithms applicable to arbitrary values of N, including small transform sizes, without restricting the transform matrix to lengths that are powers of two.
  • Subgraphs corresponding to the data flow graphs of individual DCT-VIII submatrices can be reused as building blocks when constructing the data flow graph of an N-point fast DCT-VIII algorithm.
The primary contributions of this research are as follows:
  • We have developed factorizations of DCT-VIII matrices into sparse and diagonal matrices for input lengths from 3 to 7. These factorizations were obtained using two techniques: comparison with the patterns from [32] and identification of circular-convolution structures [33].
  • The correctness of these factorizations has been verified both mathematically and through implementation in MATLAB R2023b.
  • Based on the DCT-VIII matrix factorizations, we have designed efficient algorithms for this transform using data flow graphs. A unique feature of these graphs is that each path from an input vertex to an output vertex contains only one multiplication, which reduces both processing time and resource usage.
The remainder of the paper is organized as follows. Section 2 presents the mathematical background of the DCT-VIII. Section 3 derives factorizations of DCT-VIII matrices and introduces fast DCT-VIII algorithms using data flow graphs. Section 4 provides a comparative analysis of the arithmetic operations required by the proposed algorithms versus direct matrix-vector multiplication. Section 5 compares the structure of the proposed algorithms with existing methods developed using the same strategy. Finally, Section 6 concludes the paper with a comprehensive summary of the research.

2. Preliminary Background

2.1. Obtaining a Matrix with the DCT-VIII Coefficients

The DCT-VIII that we used in our solutions can be calculated as follows [1]:
\[
y_k = \frac{2}{\sqrt{2N+1}} \sum_{n=0}^{N-1} x_n \cos\left(\frac{2\pi (k+0.5)(n+0.5)}{2N+1}\right),
\]
where
  • k = 0, 1, …, N − 1;
  • y_k are the DCT-VIII transform coefficients;
  • x_n are the input data; and
  • N is the number of signal samples.
In matrix notation, the DCT-VIII can be represented as follows:
\[
Y_{N\times 1} = C_N X_{N\times 1},
\]
where
\[
Y_{N\times 1} = [y_0, y_1, \ldots, y_{N-1}]^T, \quad X_{N\times 1} = [x_0, x_1, \ldots, x_{N-1}]^T,
\]
\[
c_{k,n} = \frac{2}{\sqrt{2N+1}} \cos\left(\frac{2\pi (k+0.5)(n+0.5)}{2N+1}\right), \quad k, n = 0, \ldots, N-1.
\]
The DCT-VIII in matrix notation is as follows:
\[
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{N-1} \end{bmatrix} =
\begin{bmatrix}
c_{0,0} & c_{0,1} & \cdots & c_{0,N-1} \\
c_{1,0} & c_{1,1} & \cdots & c_{1,N-1} \\
\vdots & \vdots & \ddots & \vdots \\
c_{N-1,0} & c_{N-1,1} & \cdots & c_{N-1,N-1}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{bmatrix}.
\]
The notations employed in this work are summarized in Table 1.
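As an illustration of Equations (1) and (2), the following minimal NumPy sketch builds the DCT-VIII matrix C_N directly from the cosine expression and applies it to an input vector. The function names are ours; the direct product serves only as the reference against which the fast algorithms of Section 3 are compared.

```python
import numpy as np

def dct8_matrix(N: int) -> np.ndarray:
    """DCT-VIII matrix: c[k, n] = 2/sqrt(2N+1) * cos((k+0.5)(n+0.5) * 2*pi/(2N+1))."""
    k = np.arange(N).reshape(-1, 1)   # row (output) index
    n = np.arange(N).reshape(1, -1)   # column (input) index
    return 2.0 / np.sqrt(2 * N + 1) * np.cos((k + 0.5) * (n + 0.5) * 2.0 * np.pi / (2 * N + 1))

def dct8_direct(x: np.ndarray) -> np.ndarray:
    """Direct matrix-vector evaluation Y = C_N X."""
    return dct8_matrix(len(x)) @ x

if __name__ == "__main__":
    print(np.round(dct8_matrix(3), 4))   # first row: [0.737, 0.591, 0.328]
```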

2.2. The Method of Creating Data Flow Graphs

In this work, we use a method for drawing data flow graphs. In this subsection, we explain how these graphs are constructed and how to interpret them for clarity.
Values of the input vertices x_n are placed on the left side of the graph, while the output values y_n appear on the right. Solid lines indicate data flow, and dashed lines indicate data flow with a sign change. Rectangular blocks labeled $H_2$ represent Hadamard matrices, and circles represent multipliers.
When drawing a graph from a specific expression, the process is reversed. The last matrix in the expression is drawn first from the left, while the first matrix is drawn first from the right.

2.3. Common Matrices for Different Solutions

For clarity, we present below the matrices that may be involved in several of the solutions:
T 2 × 3 ( 3 ) = 0 1 1 1 0 1 , T 2 × 3 ( 4 ) = 1 0 1 0 1 1 , T 3 × 2 ( 3 ) = 1 0 0 1 1 1 ,
\[
P_3(\pi_3^{(0)}) = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}, \quad
H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},
\]
W 3 ( 0 ) = 1 1 1 1 1 1 1 , A 4 × 3 = 1 1 1 1 1 ,
A 3 × 4 = 1 1 1 1 1 , W 3 ( 1 ) = 1 1 1 1 1 1 1 .

2.4. The Steps for Constructing Fast DCT-VIII Algorithms

Let us outline the procedure for constructing fast DCT-VIII algorithms based on the structural approach [32,33].
Step 1. Permutation of rows and/or columns of the initial transform matrix. At this stage, the appropriate permutation matrices are constructed.
Step 2. Extraction of identical matrix entries. The permuted matrix is represented as the sum of two matrices. The first matrix contains entries that are identical up to a sign change and is excluded from further processing to reduce the number of arithmetic operations. The second matrix contains the remaining entries of the permuted matrix.
Step 3. Identification and factorization of structural matrix patterns. Submatrices matching predefined structural patterns are extracted from the second matrix obtained in Step 2. Each extracted submatrix is factorized according to the corresponding pattern factorizations described in [32].
Step 4. Factorization of the initial transform matrix. If necessary, Steps 1–3 are recursively applied to the extracted submatrices. Using the resulting submatrix factorizations, the factorization of the original DCT-VIII matrix is constructed, including the permutation matrices obtained in Step 1.
Step 5. Enhancement of the resulting factorization. The final factorization of the transform matrix is refined to further reduce the number of addition operations.
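As a small illustration of Step 3, the sketch below factorizes the 2 × 2 structural pattern [[a, b], [c, a]] (one of the templates from [32], also listed in Section 2.3) with three multiplications instead of four. The particular diagonal values are our own choice for demonstration and are not necessarily the ones used later in the paper.

```python
import numpy as np

# One valid 3-multiplication factorization of the pattern [[a, b], [c, a]]:
# M = T_{2x3} * D_3 * T_{3x2}, with 0/1 addition matrices matching the
# T_{3x2}^(3) and T_{2x3}^(3) templates of Section 2.3.
T3x2 = np.array([[1, 0], [0, 1], [1, 1]])
T2x3 = np.array([[0, 1, 1], [1, 0, 1]])

def pattern_factorization(a: float, b: float, c: float) -> np.ndarray:
    D3 = np.diag([c - a, b - a, a])   # three multipliers (an illustrative choice)
    return T2x3 @ D3 @ T3x2           # reproduces [[a, b], [c, a]]

a, b, c = 0.7, -0.3, 0.2
assert np.allclose(pattern_factorization(a, b, c), np.array([[a, b], [c, a]]))
```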

3. The Fast DCT-VIII Algorithms

3.1. Fast Algorithm for the 3-Point DCT-VIII

Below, we present the matrix-vector form of the calculation of the DCT-VIII for N = 3:
\[
Y_{3\times 1} = C_3 X_{3\times 1},
\]
where
\[
Y_{3\times 1} = [y_0, y_1, y_2]^T, \quad X_{3\times 1} = [x_0, x_1, x_2]^T,
\]
\[
C_3 = \begin{bmatrix} a_3 & b_3 & c_3 \\ b_3 & -c_3 & -a_3 \\ c_3 & -a_3 & b_3 \end{bmatrix}, \quad
a_3 = 0.7370, \; b_3 = 0.5910, \; c_3 = 0.3280.
\]
Now, we need to define the π 3 ( 0 ) permutation as below:
\[
\pi_3^{(0)} = \begin{pmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{pmatrix}.
\]
We then apply π 3 ( 0 ) to the matrix C 3 in columns. Additionally, we multiply the third row and third column by −1:
\[
C_3^P = P_3^S C_3 P_3^S P_3(\pi_3^{(0)}), \quad \text{where} \quad P_3^S = \operatorname{diag}(1, 1, -1).
\]
The matrix C 3 P has the following structure:
\[
C_3^P = \begin{bmatrix} -c_3 & b_3 & a_3 \\ a_3 & -c_3 & b_3 \\ b_3 & a_3 & -c_3 \end{bmatrix}.
\]
This structure corresponds to a circular convolution [33]. We use this property and derive the final expression of the DCT-VIII algorithm for N = 3:
\[
Y_{3\times 1} = P_3^S W_3^{(1)} A_{3\times 4} D_4 A_{4\times 3} W_3^{(0)} P_3 X_{3\times 1},
\]
where
\[
P_3 = \left(P_3^S P_3(\pi_3^{(0)})\right)^T, \quad
D_4 = \operatorname{diag}\left(s_0^{(3)}, s_1^{(3)}, s_2^{(3)}, s_3^{(3)}\right), \quad
s_0^{(3)} = \frac{-c_3 + a_3 + b_3}{3},
\]
\[
s_1^{(3)} = -c_3 - b_3, \quad s_2^{(3)} = a_3 - b_3, \quad s_3^{(3)} = \frac{-c_3 + a_3 - 2 b_3}{3}.
\]
Figure 1 presents the data flow graph of the fast DCT-VIII algorithm for N = 3. Direct matrix-vector product requires 9 multiplications and 6 additions. Our solution requires 4 multiplications and 11 additions. The pseudocode for the DCT-VIII for N = 3 is provided in Table A1.
Figure 2 illustrates the data flow graph of the DCT-VIII for an input size of 3, including labels for the individual nodes. Table 2 provides the corresponding calculations, enabling step-by-step verification of the correctness of the derivation.
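The preprocessing of this subsection can be checked numerically. The following sketch (assuming the column-reversal reading of π_3^{(0)}) reproduces C_3^P from C_3 and confirms that it is a circulant matrix, i.e., that each row is a right rotation of the previous one.

```python
import numpy as np

# Check that the sign changes and column reversal of Section 3.1 turn C_3 into a circulant.
k = np.arange(3)[:, None]; n = np.arange(3)[None, :]
C3 = 2 / np.sqrt(7) * np.cos((k + 0.5) * (n + 0.5) * 2 * np.pi / 7)

P3S = np.diag([1.0, 1.0, -1.0])   # sign change of the third row and third column
P3pi = np.eye(3)[:, ::-1]         # permutation pi_3^(0): column reversal
C3P = P3S @ C3 @ P3S @ P3pi

assert np.allclose(C3P[1], np.roll(C3P[0], 1))
assert np.allclose(C3P[2], np.roll(C3P[1], 1))
print(np.round(C3P, 4))           # circulant built from (-c_3, b_3, a_3)
```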

3.2. Fast Algorithm for the 4-Point DCT-VIII

Below, we present the matrix-vector form of the calculation of the DCT-VIII for N = 4:
\[
Y_{4\times 1} = C_4 X_{4\times 1},
\]
where
\[
Y_{4\times 1} = [y_0, y_1, y_2, y_3]^T, \quad X_{4\times 1} = [x_0, x_1, x_2, x_3]^T,
\]
\[
C_4 = \begin{bmatrix}
a_4 & b_4 & c_4 & d_4 \\
b_4 & 0 & -b_4 & -b_4 \\
c_4 & -b_4 & -d_4 & a_4 \\
d_4 & -b_4 & a_4 & -c_4
\end{bmatrix}, \quad
a_4 = 0.6565, \; b_4 = 0.5774, \; c_4 = 0.4285, \; d_4 = 0.2280.
\]
Now, we need to define the π 4 ( 0 ) permutation as below:
\[
\pi_4^{(0)} = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 4 & 3 & 2 \end{pmatrix}.
\]
We then apply π 4 ( 0 ) to the matrix C 4 in both rows and columns:
\[
C_4^P = P_4(\pi_4^{(0)}) C_4 P_4(\pi_4^{(0)}),
\]
where
\[
P_4(\pi_4^{(0)}) = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0
\end{bmatrix}.
\]
After this operation, the matrix C_4^P has the following structure:
\[
C_4^P = \begin{bmatrix}
a_4 & d_4 & c_4 & b_4 \\
d_4 & -c_4 & a_4 & -b_4 \\
c_4 & a_4 & -d_4 & -b_4 \\
b_4 & -b_4 & -b_4 & 0
\end{bmatrix}.
\]
Let us write the matrix C 4 P as the sum of two matrices:
\[
C_4^P = C_4^A + C_4^B,
\]
where
\[
C_4^A = \begin{bmatrix}
a_4 & d_4 & c_4 & 0 \\
d_4 & -c_4 & a_4 & 0 \\
c_4 & a_4 & -d_4 & 0 \\
0 & 0 & 0 & 0
\end{bmatrix}, \quad
C_4^B = \begin{bmatrix}
0 & 0 & 0 & b_4 \\
0 & 0 & 0 & -b_4 \\
0 & 0 & 0 & -b_4 \\
b_4 & -b_4 & -b_4 & 0
\end{bmatrix}.
\]
The matrix C_4^A, after removing the fourth row and the fourth column, takes the form:
\[
C_3 = \begin{bmatrix}
a_4 & d_4 & c_4 \\
d_4 & -c_4 & a_4 \\
c_4 & a_4 & -d_4
\end{bmatrix}.
\]
We now need to apply the π 3 ( 0 ) permutation to the matrix C 3 and multiply the first row and the second and third columns by −1. These operations are described below:
\[
C_3^P = P_3(\pi_3^{(0)}) P_3^R C_3 P_3^C,
\]
where
\[
P_3^C = \operatorname{diag}(1, -1, -1), \quad P_3^R = \operatorname{diag}(-1, 1, 1).
\]
The matrix C_3^P has the following structure:
\[
C_3^P = \begin{bmatrix}
c_4 & -a_4 & d_4 \\
d_4 & c_4 & -a_4 \\
-a_4 & d_4 & c_4
\end{bmatrix}.
\]
This structure corresponds to a circular convolution [33]. Below, we present the final expression for the four-point DCT-VIII based on the factorization of the matrix C_4^B:
\[
Y_{4\times 1} = P_4(\pi_4^{(0)}) W_{4\times 5}^{(1)} P_5^R W_{5\times 4}^{(1)} W_{4\times 5}^{(0)} D_5 W_{5\times 4}^{(0)} W_4 P_4^C P_4(\pi_4^{(0)}) X_{4\times 1},
\]
where
P 4 C = P 3 C 1 , W 4 = 1 1 1 1 1 1 1 1 , W 5 × 4 ( 0 ) = 1 1 1 1 1 1 ,
\[
D_5 = \operatorname{diag}\left(s_0^{(4)}, s_1^{(4)}, \ldots, s_4^{(4)}\right), \quad
s_0^{(4)} = c_4 + a_4, \quad s_1^{(4)} = d_4 + a_4, \quad s_2^{(4)} = \frac{c_4 + d_4 + 2 a_4}{3},
\]
s 3 ( 4 ) = s 4 ( 4 ) = b 4 , W 4 × 5 ( 0 ) = 1 1 1 1 1 1 , W 5 × 4 ( 1 ) = 1 1 1 1 1 1 ,
P 5 R = P 3 ( π 3 ( 0 ) ) P 3 R ( T ) I 2 , W 4 × 5 ( 1 ) = 1 1 1 1 1 1 1 .
Figure 3 presents the data flow graph of the fast DCT-VIII algorithm for N = 4. Direct matrix-vector product requires 15 multiplications and 11 additions. Our solution requires 5 multiplications and 11 additions. The first value $(c_4 + d_4 - a_4)/3$ of the circular convolution matrix equals 0, so it is not specified in Equation (6) or in the data flow graph. The pseudocode for the DCT-VIII for N = 4 is provided in Table A2.
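The vanishing multiplier mentioned above follows from the exact identity a_4 = c_4 + d_4 (equivalently, cos 10° = cos 50° + cos 70°), which the short check below confirms numerically; the variable names mirror the coefficients of C_4.

```python
import numpy as np

# a_4 = c_4 + d_4, so the multiplier (c_4 + d_4 - a_4)/3 of the circular convolution is zero.
a4, c4, d4 = (2 / np.sqrt(9)) * np.cos(np.array([1, 5, 7]) * np.pi / 18)
print(round(a4, 4), round(c4, 4), round(d4, 4))   # 0.6565 0.4285 0.228
assert abs(c4 + d4 - a4) < 1e-12
```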

3.3. Fast Algorithm for the 5-Point DCT-VIII

Below, we present the matrix-vector form of the calculation of the DCT-VIII for N = 5:
\[
Y_{5\times 1} = C_5 X_{5\times 1},
\]
where
\[
Y_{5\times 1} = [y_0, y_1, y_2, y_3, y_4]^T, \quad X_{5\times 1} = [x_0, x_1, x_2, x_3, x_4]^T,
\]
\[
C_5 = \begin{bmatrix}
a_5 & b_5 & c_5 & d_5 & e_5 \\
b_5 & e_5 & -d_5 & -a_5 & -c_5 \\
c_5 & -d_5 & -b_5 & e_5 & a_5 \\
d_5 & -a_5 & e_5 & c_5 & -b_5 \\
e_5 & -c_5 & a_5 & -b_5 & d_5
\end{bmatrix},
\]
\[
a_5 = 0.5969, \; b_5 = 0.5485, \; c_5 = 0.4557, \; d_5 = 0.3260, \; e_5 = 0.1699.
\]
Now, we need to define the π 5 ( 0 ) permutation as below:
\[
\pi_5^{(0)} = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 5 & 2 & 4 & 3 & 1 \end{pmatrix}.
\]
Let us apply the π 5 ( 0 ) permutation to the matrix C 5 and multiply the third row and the third column by −1. These operations are described below:
\[
C_5^P = P_5^S P_5(\pi_5^{(0)}) C_5 P_5^S,
\]
where
\[
P_5^S = \operatorname{diag}(1, 1, -1, 1, 1), \quad
P_5(\pi_5^{(0)}) = \begin{bmatrix}
0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0
\end{bmatrix}.
\]
After these operations, the matrix C_5^P looks as follows:
\[
C_5^P = \begin{bmatrix}
e_5 & -c_5 & -a_5 & -b_5 & d_5 \\
b_5 & e_5 & d_5 & -a_5 & -c_5 \\
-d_5 & a_5 & e_5 & -c_5 & b_5 \\
c_5 & -d_5 & b_5 & e_5 & a_5 \\
a_5 & b_5 & -c_5 & d_5 & e_5
\end{bmatrix}.
\]
Now, we split the matrix C 5 P into two matrices, which we consider separately, and then combine the factorizations of these two matrices:
\[
C_5^P = C_5^A + C_5^B,
\]
where
\[
C_5^A = \begin{bmatrix}
e_5 & -c_5 & -a_5 & -b_5 & 0 \\
b_5 & e_5 & d_5 & -a_5 & 0 \\
-d_5 & a_5 & e_5 & -c_5 & 0 \\
c_5 & -d_5 & b_5 & e_5 & 0 \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}, \quad
C_5^B = \begin{bmatrix}
0 & 0 & 0 & 0 & d_5 \\
0 & 0 & 0 & 0 & -c_5 \\
0 & 0 & 0 & 0 & b_5 \\
0 & 0 & 0 & 0 & a_5 \\
a_5 & b_5 & -c_5 & d_5 & e_5
\end{bmatrix}.
\]
If we omit the zero rows and columns from the matrix C_5^A, we get the following 4-by-4 matrix:
\[
C_4 = \begin{bmatrix}
e_5 & -c_5 & -a_5 & -b_5 \\
b_5 & e_5 & d_5 & -a_5 \\
-d_5 & a_5 & e_5 & -c_5 \\
c_5 & -d_5 & b_5 & e_5
\end{bmatrix}.
\]
The matrix C 4 has the following structure:
\[
C_4 = \begin{bmatrix} A_2 & B_2 \\ C_2 & A_2 \end{bmatrix},
\]
where
\[
A_2 = \begin{bmatrix} e_5 & -c_5 \\ b_5 & e_5 \end{bmatrix}, \quad
B_2 = \begin{bmatrix} -a_5 & -b_5 \\ d_5 & -a_5 \end{bmatrix}, \quad
C_2 = \begin{bmatrix} -d_5 & a_5 \\ c_5 & -d_5 \end{bmatrix}.
\]
Then, matrix C_4 can be described by the following expression:
\[
C_4 = T_{4\times 6} D_6 T_{6\times 4},
\]
where
\[
T_{6\times 4} = T_{3\times 2}^{(3)} \otimes I_2, \quad D_6 = E_2 \oplus F_2 \oplus A_2, \quad T_{4\times 6} = T_{2\times 3}^{(3)} \otimes I_2,
\]
\[
E_2 = C_2 - A_2 = \begin{bmatrix} -d_5 - e_5 & a_5 + c_5 \\ c_5 - b_5 & -d_5 - e_5 \end{bmatrix} =
\begin{bmatrix} -0.4959 & 1.0526 \\ -0.0928 & -0.4959 \end{bmatrix},
\]
\[
F_2 = B_2 - A_2 = \begin{bmatrix} -a_5 - e_5 & -b_5 + c_5 \\ d_5 - b_5 & -a_5 - e_5 \end{bmatrix} =
\begin{bmatrix} -0.7668 & -0.0928 \\ -0.2225 & -0.7668 \end{bmatrix}.
\]
Next, we consider three matrices: E_2, F_2, and A_2. They are similar to the pattern $\begin{bmatrix} a & b \\ c & a \end{bmatrix}$ from [32]. As a result, we obtain the following factorization of the C_4 matrix:
\[
C_4 = T_{4\times 6} T_{6\times 9} D_9 T_{9\times 6} T_{6\times 4},
\]
where
\[
T_{9\times 6} = T_{3\times 2}^{(3)} \oplus T_{3\times 2}^{(3)} \oplus T_{3\times 2}^{(3)}, \quad
D_9 = \operatorname{diag}\left(s_0^{(5)}, s_1^{(5)}, \ldots, s_8^{(5)}\right),
\]
\[
s_0^{(5)} = c_5 - b_5 + d_5 + e_5, \quad s_1^{(5)} = a_5 + c_5 + d_5 + e_5, \quad s_2^{(5)} = -d_5 - e_5,
\]
\[
s_3^{(5)} = d_5 - b_5 + a_5 + e_5, \quad s_4^{(5)} = -b_5 + c_5 + a_5 + e_5, \quad s_5^{(5)} = -a_5 - e_5,
\]
\[
s_6^{(5)} = b_5 - e_5, \quad s_7^{(5)} = -c_5 - e_5, \quad s_8^{(5)} = e_5, \quad
T_{6\times 9} = T_{2\times 3}^{(3)} \oplus T_{2\times 3}^{(3)} \oplus T_{2\times 3}^{(3)}.
\]
We can now formulate the final matrix-vector procedure for computing DCT-VIII with reduced computational complexity for N = 5, remembering to include the matrix C 5 B :
\[
Y_{5\times 1} = P_5 W_{5\times 13} T_{13\times 15} T_{15\times 18} D_{18} T_{18\times 15} T_{15\times 13} W_{13\times 5} P_5^S X_{5\times 1},
\]
where
W 13 × 5 = 1 1 1 1 1 1 1 1 1 1 1 1 1 , T 15 × 13 = T 6 × 4 I 9 , T 18 × 15 = T 9 × 6 I 9 , D 18 = diag s 0 ( 5 ) , s 1 ( 5 ) , , s 17 ( 5 ) , s 9 ( 5 ) = e 5 , s 10 ( 5 ) = s 17 ( 5 ) = a 5 , s 11 ( 5 ) = s 16 ( 5 ) = b 5 , s 12 ( 5 ) = s 15 ( 5 ) = c 5 , s 13 ( 5 ) = s 14 ( 5 ) = d 5 , T 15 × 18 = T 6 × 9 I 9 , T 13 × 15 = T 4 × 6 I 9 , P 5 = P 5 S P 5 ( π 5 ( 0 ) ) T ,
W 5 × 13 = 1 1 1 1 1 1 1 1 1 1 1 1 1 .
Figure 4 presents the data flow graph of the fast DCT-VIII algorithm for N = 5. Direct matrix-vector product requires 25 multiplications and 20 additions. Our solution requires 18 multiplications and 23 additions. The pseudocode for the DCT-VIII for N = 5 is provided in Table A3.
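As a quick numerical cross-check of this subsection, the sketch below recomputes the 5-point coefficients from Equation (1) and reproduces (up to sign) the combined values 1.0526, 0.4959, 0.0928, 0.7668, and 0.2225 that appear as entries of E_2 and F_2.

```python
import numpy as np

# 5-point DCT-VIII coefficients and the combinations appearing in E_2 and F_2.
a5, b5, c5, d5, e5 = (2 / np.sqrt(11)) * np.cos(np.array([1, 3, 5, 7, 9]) * np.pi / 22)
print(np.round([a5, b5, c5, d5, e5], 4))          # 0.5969 0.5485 0.4557 0.326 0.1699
print(np.round([a5 + c5, d5 + e5, b5 - c5], 4))   # 1.0526 0.4959 0.0928
print(np.round([a5 + e5, b5 - d5], 4))            # 0.7668 0.2225
```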

3.4. Fast Algorithm for the 6-Point DCT-VIII

Below, we present the matrix-vector form of the calculation of the DCT-VIII for N = 6:
\[
Y_{6\times 1} = C_6 X_{6\times 1},
\]
where
\[
Y_{6\times 1} = [y_0, y_1, y_2, y_3, y_4, y_5]^T, \quad X_{6\times 1} = [x_0, x_1, x_2, x_3, x_4, x_5]^T,
\]
\[
C_6 = \begin{bmatrix}
a_6 & b_6 & c_6 & d_6 & e_6 & f_6 \\
b_6 & e_6 & -f_6 & -c_6 & -a_6 & -d_6 \\
c_6 & -f_6 & -a_6 & -e_6 & d_6 & b_6 \\
d_6 & -c_6 & -e_6 & b_6 & f_6 & -a_6 \\
e_6 & -a_6 & d_6 & f_6 & -b_6 & c_6 \\
f_6 & -d_6 & b_6 & -a_6 & c_6 & -e_6
\end{bmatrix},
\]
\[
a_6 = 0.5507, \; b_6 = 0.5187, \; c_6 = 0.4565, \; d_6 = 0.3678, \; e_6 = 0.2578, \; f_6 = 0.1327.
\]
Let us alter the sign of the third column, and the sixth and the second rows. Next, we need to define the π 6 ( 0 ) and π 6 ( 1 ) permutations as below:
\[
\pi_6^{(0)} = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 \\ 1 & 2 & 4 & 3 & 6 & 5 \end{pmatrix}, \quad
\pi_6^{(1)} = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 \\ 3 & 5 & 6 & 1 & 4 & 2 \end{pmatrix}.
\]
We then apply π 6 ( 0 ) to the matrix C 6 in columns and permutation π 6 ( 1 ) in rows:
C 6 P = P 6 ( π 6 ( 1 ) ) C 6 P 6 ( π 6 ( 0 ) ) ,
where
P 6 ( π 6 ( 0 ) ) = 1 1 1 1 1 1 , P 6 ( π 6 ( 1 ) ) = 1 1 1 1 1 1 .
After these operations, the matrix C 6 P looks as follows:
C 6 P = d 6 c 6 b 6 e 6 a 6 f 6 f 6 d 6 a 6 b 6 e 6 c 6 a 6 b 6 d 6 c 6 f 6 e 6 e 6 a 6 f 6 d 6 c 6 b 6 b 6 e 6 c 6 f 6 d 6 a 6 c 6 f 6 e 6 a 6 b 6 d 6 = A 3 B 3 B 3 A 3 ,
where
A 3 = d 6 c 6 b 6 f 6 d 6 a 6 a 6 b 6 d 6 , B 3 = e 6 a 6 f 6 b 6 e 6 c 6 c 6 f 6 e 6 .
The above structure allows us to reduce the number of multiplication operations, so we derive the expression for the first stage of the DCT-VIII factorization for N = 6:
Y 6 × 1 = P 6 ( 1 ) T 6 × 9 D 9 T 9 × 6 P 6 ( 0 ) X 6 × 1 ,
where
\[
P_6^{(0)} = P_6(\pi_6^{(0)})^T, \quad P_6^{(1)} = P_6(\pi_6^{(1)})^T, \quad
T_{9\times 6} = T_{3\times 2}^{(3)} \otimes I_3, \quad T_{6\times 9} = T_{2\times 3}^{(4)} \otimes I_3,
\]
\[
D_9 = E_3 \oplus F_3 \oplus B_3, \quad E_3 = A_3 - B_3, \quad F_3 = A_3 + B_3,
\]
E 3 = d 6 e 6 c 6 + a 6 b 6 f 6 f 6 b 6 d 6 e 6 a 6 + c 6 a 6 + c 6 b 6 f 6 d 6 e 6 = 0.1101 0.0941 0.3859 0.6514 0.1101 1.0072 1.0072 0.3859 0.1101 ,
F 3 = d 6 e 6 c 6 + a 6 b 6 f 6 f 6 b 6 d 6 e 6 a 6 + c 6 a 6 + c 6 b 6 f 6 d 6 e 6 = 0.6256 1.0072 0.6514 0.3859 0.6256 0.0941 0.0941 0.6514 0.6256 .
Now, we split the matrix E 3 into two matrices, which we consider separately, and then combine the factorizations of these two matrices:
E 3 = E 3 A + E 3 B ,
where
E 3 A = d 6 e 6 a 6 + c 6 b 6 f 6 b 6 f 6 d 6 e 6 a 6 + c 6 a 6 + c 6 b 6 f 6 d 6 e 6 , E 3 B = 0 2 c 6 0 2 b 6 0 0 0 0 0 .
The matrix E 3 A has the structure of a circular convolution matrix, so we present the expression for factorization of the matrix E 3 :
E 3 = W 3 × 5 W 5 1 W 5 × 6 D 6 E 3 W 6 × 5 W 5 0 W 5 × 3 ,
where
W 5 × 3 = 1 1 1 1 1 , W 5 0 = W 3 ( 0 ) I 2 , W 6 × 5 = A 4 × 3 I 2 ,
D 6 E 3 = diag s 0 ( 6 ) , s 1 ( 6 ) , , s 5 ( 6 ) , s 0 ( 6 ) = d 6 e 6 + b 6 f 6 + a 6 + c 6 3 ,
s 1 ( 6 ) = d 6 e 6 a 6 c 6 , s 2 ( 6 ) = b 6 f 6 a 6 c 6 , s 3 ( 6 ) = d 6 e 6 + b 6 f 6 2 a 6 + c 6 3 ,
s 4 ( 6 ) = 2 b 6 , s 5 ( 6 ) = 2 c 6 , W 5 × 6 = A 3 × 4 I 2 , W 5 1 = W 3 ( 1 ) I 2 ,
W 3 × 5 = 1 1 1 1 1 .
Next, we split the matrix F 3 into two matrices, which we consider separately, and then combine the factorizations of these two matrices:
F 3 = F 3 A + F 3 B ,
where
F 3 A = d 6 e 6 c 6 a 6 b 6 f 6 b 6 f 6 d 6 e 6 c 6 a 6 c 6 a 6 b 6 f 6 d 6 e 6 , F 3 B = 0 2 a 6 0 2 f 6 0 0 0 0 0 .
The matrix F 3 A has the structure of a circular convolution matrix, so we present the expression for factorization of the matrix F 3 :
F 3 = W 3 × 5 W 5 1 W 5 × 6 D 6 F 3 W 6 × 5 W 5 0 W 5 × 3 ,
where
D 6 F 3 = diag s 6 ( 6 ) , s 7 ( 6 ) , , s 11 ( 6 ) , s 6 ( 6 ) = d 6 e 6 b 6 f 6 + c 6 a 6 3 ,
s 7 ( 6 ) = d 6 e 6 c 6 + a 6 , s 8 ( 6 ) = b 6 f 6 c 6 + a 6 ,
s 9 ( 6 ) = d 6 e 6 b 6 f 6 2 c 6 a 6 3 , s 10 ( 6 ) = 2 f 6 , s 11 ( 6 ) = 2 a 6 .
Further, we split the matrix B 3 into two matrices, which we consider separately, and then combine the factorizations of these two matrices:
B 3 = B 3 A + B 3 B ,
where
B 3 A = e 6 c 6 f 6 f 6 e 6 c 6 c 6 f 6 e 6 , B 3 B = 0 a 6 c 6 0 f 6 b 6 0 0 0 0 0 .
The matrix B 3 A has the structure of a circular convolution matrix, so we present the expression for factorization of the matrix B 3 :
B 3 = W 3 × 5 W 5 1 W 5 × 6 D 6 B 3 W 6 × 5 W 5 0 W 5 × 3 ,
where
D 6 B 3 = diag s 12 ( 6 ) , s 13 ( 6 ) , , s 17 ( 6 ) , s 12 ( 6 ) = e 6 + f 6 c 6 3 , s 13 ( 6 ) = e 6 + c 6 , s 14 ( 6 ) = f 6 + c 6 ,
s 15 ( 6 ) = e 6 + f 6 + 2 c 6 3 , s 16 ( 6 ) = f 6 b 6 , s 17 ( 6 ) = a 6 c 6 .
We can now formulate the final matrix-vector procedure for computing DCT-VIII with reduced computational complexity for N = 6:
\[
Y_{6\times 1} = P_6^{(1)} T_{6\times 9} W_{9\times 15} W_{15}^{(1)} W_{15\times 18} D_{18} W_{18\times 15} W_{15}^{(0)} W_{15\times 9} T_{9\times 6} P_6^{(0)} X_{6\times 1},
\]
where
\[
W_{15\times 9} = W_{5\times 3} \oplus W_{5\times 3} \oplus W_{5\times 3}, \quad
W_{15}^{(0)} = W_5^{(0)} \oplus W_5^{(0)} \oplus W_5^{(0)},
\]
\[
W_{18\times 15} = W_{6\times 5} \oplus W_{6\times 5} \oplus W_{6\times 5}, \quad
D_{18} = D_6^{E_3} \oplus D_6^{F_3} \oplus D_6^{B_3},
\]
\[
W_{15\times 18} = W_{5\times 6} \oplus W_{5\times 6} \oplus W_{5\times 6}, \quad
W_{15}^{(1)} = W_5^{(1)} \oplus W_5^{(1)} \oplus W_5^{(1)},
\]
\[
W_{9\times 15} = W_{3\times 5} \oplus W_{3\times 5} \oplus W_{3\times 5}.
\]
Figure 5 presents the data flow graph of the fast DCT-VIII algorithm for N = 6. Direct matrix-vector product requires 36 multiplications and 30 additions. Our solution requires 18 multiplications and 48 additions. The pseudocode for the DCT-VIII for N = 6 is provided in Table A4.
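The first stage of the N = 6 factorization rests on a generic block identity: a matrix of the form [[A, B], [−B, A]] (up to the row and column sign changes described above) can be applied with three block products built from E = A − B, F = A + B, and B, cf. E_3 and F_3. The sketch below demonstrates this scheme on random blocks; it is one consistent reading of the structure used in this subsection and is shown with generic data rather than the DCT-VIII blocks themselves.

```python
import numpy as np

# Three block products for a matrix of the form [[A, B], [-B, A]],
# using E = A - B and F = A + B; generic random blocks for illustration.
rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
x0, x1 = rng.standard_normal(3), rng.standard_normal(3)

m0 = (A - B) @ x0          # E x0
m1 = (A + B) @ x1          # F x1
m2 = B @ (x0 + x1)         # shared middle product
y = np.concatenate([m0 + m2, m1 - m2])

M = np.block([[A, B], [-B, A]])
assert np.allclose(y, M @ np.concatenate([x0, x1]))
```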

3.5. Fast Algorithm for the 7-Point DCT-VIII

Below, we present the matrix-vector form of the calculation of the DCT-VIII for N = 7:
\[
Y_{7\times 1} = C_7 X_{7\times 1},
\]
where
\[
Y_{7\times 1} = [y_0, y_1, y_2, y_3, y_4, y_5, y_6]^T, \quad X_{7\times 1} = [x_0, x_1, x_2, x_3, x_4, x_5, x_6]^T,
\]
\[
C_7 = \begin{bmatrix}
a_7 & b_7 & c_7 & d_7 & e_7 & f_7 & g_7 \\
b_7 & e_7 & 0 & -e_7 & -b_7 & -b_7 & -e_7 \\
c_7 & 0 & -c_7 & -c_7 & 0 & c_7 & c_7 \\
d_7 & -e_7 & -c_7 & f_7 & b_7 & -g_7 & -a_7 \\
e_7 & -b_7 & 0 & b_7 & -e_7 & -e_7 & b_7 \\
f_7 & -b_7 & c_7 & -g_7 & -e_7 & a_7 & -d_7 \\
g_7 & -e_7 & c_7 & -a_7 & b_7 & -d_7 & f_7
\end{bmatrix},
\]
\[
a_7 = 0.5136, \; b_7 = 0.4911, \; c_7 = 0.4472, \; d_7 = 0.3838, \; e_7 = 0.3035, \; f_7 = 0.2100, \; g_7 = 0.1074.
\]
Let us change the signs of the entries in the third column and the third row of the matrix C 7 . Next, we need to define the π 7 ( 0 ) and π 7 ( 1 ) permutations as below:
\[
\pi_7^{(0)} = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 1 & 6 & 7 & 3 & 5 & 2 & 4 \end{pmatrix}, \quad
\pi_7^{(1)} = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 2 & 6 & 7 & 3 & 5 & 1 & 4 \end{pmatrix}.
\]
We then apply the permutation π 7 ( 0 ) to the matrix C 7 in columns and the permutation π 7 ( 1 ) in rows:
C 7 P = P 7 ( π 7 ( 1 ) ) C 7 P 7 ( π 7 ( 0 ) ) T ,
where
P 7 ( π 7 ( 0 ) ) = 1 1 1 1 1 1 1 , P 7 ( π 7 ( 1 ) ) = 1 1 1 1 1 1 1 .
After these operations, the matrix C 7 P looks as follows:
C 7 P = f 7 a 7 g 7 d 7 e 7 b 7 c 7 a 7 f 7 d 7 g 7 e 7 b 7 c 7 d 7 g 7 f 7 a 7 b 7 e 7 c 7 g 7 d 7 a 7 f 7 b 7 e 7 c 7 e 7 e 7 b 7 b 7 e 7 b 7 0 b 7 b 7 e 7 e 7 b 7 e 7 0 c 7 c 7 c 7 c 7 0 0 c 7 .
Now, we split the matrix C 7 P into two matrices, which we consider separately, and then combine:
C 7 P = C 7 A + C 7 B ,
where
C 7 A = f 7 a 7 g 7 d 7 0 0 0 a 7 f 7 d 7 g 7 0 0 0 d 7 g 7 f 7 a 7 0 0 0 g 7 d 7 a 7 f 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ,
C 7 B = 0 0 0 0 e 7 b 7 c 7 0 0 0 0 e 7 b 7 c 7 0 0 0 0 b 7 e 7 c 7 0 0 0 0 b 7 e 7 c 7 e 7 e 7 b 7 b 7 e 7 b 7 0 b 7 b 7 e 7 e 7 b 7 e 7 0 c 7 c 7 c 7 c 7 0 0 c 7 .
If we omit the zero rows and columns from the matrix C 7 A , we obtain the following matrix of the fourth order:
C 4 = f 7 a 7 g 7 d 7 a 7 f 7 d 7 g 7 d 7 g 7 f 7 a 7 g 7 d 7 a 7 f 7 .
The matrix C 4 has the following structure:
C 4 = A 2 B 2 C 2 A 2 ,
where
A 2 = f 7 a 7 a 7 f 7 , B 2 = g 7 d 7 d 7 g 7 , C 2 = d 7 g 7 g 7 d 7 .
Matrix C 4 after the first stage of factorization can be described by the following expression:
C 4 = T 4 × 6 D 6 ( 0 ) T 6 × 4 ,
where
T 6 × 4 = T 3 × 2 ( 3 ) I 2 , D 6 ( 0 ) = E 2 F 2 A 2 , T 4 × 6 = T 2 × 3 ( 3 ) I 2 ,
E 2 = C 2 A 2 = d 7 f 7 g 7 a 7 g 7 a 7 d 7 f 7 = 0.5938 0.4062 0.4062 0.5938 ,
F 2 = B 2 A 2 = g 7 f 7 d 7 a 7 d 7 a 7 g 7 f 7 = 0.1027 0.8973 0.8973 0.1027 .
The matrices E_2, F_2, and A_2 are similar to the pattern $\begin{bmatrix} a & b \\ b & a \end{bmatrix}$. Then, we have the following expression after the second stage of factorization of the C_4 matrix:
C 4 = T 4 × 6 W 6 D 6 ( 1 ) W 6 T 6 × 4 ,
where
W 6 = H 2 H 2 H 2 , D 6 ( 1 ) = diag s 0 ( 7 ) , s 1 ( 7 ) , , s 5 ( 7 ) ,
s 0 ( 7 ) = d 7 f 7 + g 7 a 7 2 , s 1 ( 7 ) = d 7 f 7 g 7 + a 7 2 , s 2 ( 7 ) = g 7 f 7 d 7 a 7 2 ,
s 3 ( 7 ) = g 7 f 7 + d 7 + a 7 2 , s 4 ( 7 ) = f 7 + a 7 2 , s 5 ( 7 ) = f 7 a 7 2 .
The final DCT-VIII expression for N = 7 including the matrix C 7 B is as follows:
Y 7 × 1 = P 7 ( π 7 ( 1 ) ) T W 7 × 13 W 13 × 16 W 16 D 16 W 16 × 9 W 9 W 9 × 7 P 7 ( π 7 ( 0 ) ) X 7 × 1 ,
where
W 9 × 7 = 1 1 1 1 1 1 1 1 1 1 1 , W 9 = W 6 I 3 ,
W 16 × 9 = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 , W 16 = W 6 I 10 , D 16 = diag s 0 ( 7 ) , s 1 ( 7 ) , , s 15 ( 7 ) , s 6 ( 7 ) = s 7 ( 7 ) = s 10 ( 7 ) = s 11 ( 7 ) = e 7 , s 8 ( 7 ) = s 9 ( 7 ) = s 12 ( 7 ) = s 13 ( 7 ) = b 7 , s 14 ( 7 ) = s 15 ( 7 ) = c 7 ,
W 13 × 16 = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ,
W 7 × 13 = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 .
Figure 6 presents the data flow graph of the fast DCT-VIII algorithm for N = 7. Direct matrix-vector product requires 45 multiplications and 38 additions. Our solution requires 16 multiplications and 34 additions. The pseudocode for the DCT-VIII for N = 7 is provided in Table A5.
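The second factorization stage for N = 7 relies on the symmetric pattern [[a, b], [b, a]], which is diagonalized by the 2 × 2 Hadamard matrix H_2 with multipliers (a + b)/2 and (a − b)/2; the sketch below verifies this identity for arbitrary values (the values of a and b are placeholders).

```python
import numpy as np

# The pattern [[a, b], [b, a]] equals H_2 * diag((a+b)/2, (a-b)/2) * H_2.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
a, b = 0.6, -0.25
M = H2 @ np.diag([(a + b) / 2, (a - b) / 2]) @ H2
assert np.allclose(M, [[a, b], [b, a]])
```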

4. Results

The obtained results were verified in two stages. First, the correctness of the DCT-VIII matrix factorizations presented in Section 3 was experimentally validated. Using MATLAB R2023b, the original DCT-VIII matrices were computed from expressions (3), (5), (7), (9), and (11). These matrices were then compared with the products of the corresponding factorized matrices derived from expressions (4), (6), (8), (10), and (12). For size N ranging from 3 to 7, the factorized matrices exactly matched the original matrices, thereby confirming the correctness of the proposed algorithms.
Second, the correctness of the software implementations of the proposed algorithms was verified, which in turn confirms the correctness of the constructed data flow graphs. The algorithms represented by the data flow graphs were implemented in software, and the corresponding implementations are presented as pseudocode in Appendix A.
For each proposed N-point DCT-VIII algorithm, 30 sequences of N random numbers were generated. Each sequence was applied as input to the corresponding algorithm, and the resulting output was compared with the result obtained from the direct matrix-vector multiplication for the same input. If the results coincided for all 30 test sequences, the algorithm was deemed correct. In this manner, the correctness of the pseudocode and the corresponding data flow graphs was confirmed for the developed DCT-VIII algorithms with N ranging from 3 to 7.
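The verification protocol just described can be summarized by the following sketch (function names and tolerance are ours): a candidate fast implementation is compared against the direct matrix-vector product on 30 random input sequences.

```python
import numpy as np

def dct8_matrix(N):
    """Reference DCT-VIII matrix from Equation (2)."""
    k, n = np.arange(N)[:, None], np.arange(N)[None, :]
    return 2 / np.sqrt(2 * N + 1) * np.cos((k + 0.5) * (n + 0.5) * 2 * np.pi / (2 * N + 1))

def verify(fast_transform, N, trials=30, tol=1e-12):
    """Compare a fast N-point implementation against the direct product on random inputs."""
    C, rng = dct8_matrix(N), np.random.default_rng(42)
    return all(np.allclose(fast_transform(x), C @ x, atol=tol)
               for x in rng.standard_normal((trials, N)))

# usage: verify(fast_dct8_5point, 5)   # fast_dct8_5point is a placeholder implementation
```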
Next, the arithmetic complexity of the proposed DCT-VIII algorithms was evaluated and compared with that of direct matrix-vector multiplication. The number of multiplications was determined by counting the vertices labeled with factors (circles) in the data flow graphs, while additions were estimated by counting vertices where two edges converge. Zero entries were not considered. Overall, the proposed DCT-VIII algorithms achieve a reduction of approximately 53% in the number of multiplications and an increase of about 21% in the number of additions for input sizes from 3 to 7.
Table 3 summarizes the number of arithmetic operations required by the proposed DCT-VIII algorithms. Percentage differences relative to the direct matrix-vector product are indicated in parentheses. A plus sign denotes an increase in the number of operations, whereas a minus sign indicates a reduction.
In Table 4, we present the number of arithmetic operations for DCT-VIII algorithms based on the fast Fourier transform (FFT). Detailed explanations of how the number of operations for these algorithms is calculated are provided in Appendix B.
Our analysis shows that, for N = 4, the proposed DCT-VIII algorithm based on the structural approach requires the same number of multiplications and additions as the DCT-VIII algorithm from [27], which is based on the relationship between the DST-VII and DCT-VIII. Moreover, the proposed solutions require significantly fewer arithmetic operations than the DCT-VIII algorithms based on the FFT [4]. To the best of our knowledge, no other algorithms are available in the literature for direct comparison. We also consider it inappropriate to compare the proposed fast DCT-VIII algorithms with related integer-transform algorithms due to their differing mathematical foundations.

5. Discussion

Let us analyze the obtained algorithms. In the literature, DCT-VIII algorithms are typically presented only for input sequence lengths that are powers of two. This restriction arises because fast algorithms for this transform were developed primarily for video coding. However, other applications of DCT-VIII are also known. For example, in [17], it was used for fractional-pixel motion compensation in high-efficiency video coding, which requires fast algorithms for input sequences whose lengths are not powers of two.
We now examine the stages of the DCT-VIII matrix factorization based on the structural approach, which was used to construct the proposed algorithms (Table 5).
For N = 3 , the DCT-VIII matrix is fully characterized by the circular-convolution structure. Applying a factorization of this structure enabled the factorization of the 3-point DCT-VIII matrix. Using the terminology of [12,29], only the second pattern listed in Section 1.1 was applied for N = 3 . Specifically, this pattern involves a subset of column elements that is repeated in other columns with permutation.
For N = 4 , two patterns were identified in the transform matrix. The first pattern is a circular convolution applied to a 3 × 3 submatrix. In the second pattern, the column and row of the original matrix contain the same values up to a sign. In the data flow graph, this pattern corresponds to a fan-shaped structure, where several edges leave a single node toward different outputs or enter a single node from multiple inputs. This structure matches the third pattern from Section 1.1 and allows a reduction in the number of multiplications by performing preliminary additions of the inputs, followed by a single multiplication.
For N = 5 , a structure consisting of four 2 × 2 submatrices was extracted from the DCT-VIII matrix, and each submatrix was subsequently factorized. This corresponds to the first pattern described in [12,29], specifically a pattern consisting of several groups with a fixed number of elements, regardless of sign changes. For the entries of the DCT-VIII matrix not included in this structure, a direct multiplication of the input vector by the corresponding transform matrix entries was used. Thus, in this case, the structural approach was applied jointly with direct matrix-vector multiplication.
For N = 6 , a two-level hierarchical structure was identified in the DCT-VIII matrix. The upper level consists of a decomposition into 3 × 3 submatrices, while the lower level represents the structure of these submatrices in the form of circular convolutions. In this case, patterns 1 and 2 from Section 1.1 and from [12,29] are successively identified.
For N = 7 , two patterns were identified in the matrix. Unlike the case of N = 6 , the extracted structure is non-hierarchical. A 4 × 4 matrix was extracted from the original transform matrix and represented by four 2 × 2 submatrices. This corresponds to pattern 1 according to the terminology of [12,29]. The remaining entries of the original DCT-VIII matrix for N = 7 correspond to pattern 3 from Section 1.1, as they include several groups with identical elements, regardless of sign changes.
Identifying such structures enabled a significant reduction in the number of multiplications required to implement the DCT-VIII for small N compared to direct matrix-vector multiplication. However, the number of additions increases due to the preliminary summation of matrix entries that are multiplied by the same factor.
Another limitation of the structural approach is that it is better suited for constructing fast algorithms for short data sequences. As the sequence length increases, identifying the structure of the transform matrices becomes more difficult. Moreover, as discussed earlier, the efficiency of the proposed fast algorithms strongly depends on the structural properties of these matrices.

6. Conclusions

The fast DCT-VIII algorithms presented in this paper were constructed using the structural approach described in [32,33]. This approach derives fast algorithms from the structural properties of transform matrices for various input lengths. In this context, the structure of a transform matrix refers to properties such as symmetries in certain submatrices, recursion of matrix patterns, factorization of matrix patterns, and algebraic relationships between matrix entries.
The presented fast DCT-VIII algorithms were developed for short input sequences, specifically for lengths from 3 to 7. The derivation of the fast DCT-VIII algorithm for N = 8 has already been addressed in several studies [27,31]. The proposed algorithms reduce computational complexity by significantly decreasing the number of multiplications compared to direct matrix-vector computation. On average, they achieve a reduction of approximately 53% in the number of multiplications and an increase of about 21% in the number of additions for input sizes from 3 to 7.
The algorithms are represented using data flow graphs. A key advantage of the proposed designs is that the input-output path in each graph contains only a single multiplication. It is well known that when more than one multiplication lies on the input-output path, the operand format doubles with each multiplication, introducing additional data-processing challenges. The proposed algorithms completely avoid this issue.
Implementing the constructed fast DCT-VIII algorithms ensures numerical stability, as the structural approach generally preserves the orthogonality of the transform and maintains good conditioning.
Additionally, the fast DCT-VIII algorithms for short input sequences can be reused as building blocks for other DCT and DST types due to cross-relations between transforms. For example, a DST-VII can be converted into a DCT-VIII through permutation and input reversal.
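The cross-relation mentioned here can be made explicit. With the normalization 2/√(2N+1) used throughout this paper and the DST-VII written as s_{k,n} = 2/√(2N+1) · sin((2k+1)(n+1)π/(2N+1)) (one common convention; other sources transpose the roles of k and n), the DCT-VIII matrix equals diag((−1)^k) times the DST-VII matrix with its columns reversed. The sketch below checks this numerically.

```python
import numpy as np

def dct8(N):
    k, n = np.arange(N)[:, None], np.arange(N)[None, :]
    return 2 / np.sqrt(2 * N + 1) * np.cos((2 * k + 1) * (2 * n + 1) * np.pi / (2 * (2 * N + 1)))

def dst7(N):
    k, n = np.arange(N)[:, None], np.arange(N)[None, :]
    return 2 / np.sqrt(2 * N + 1) * np.sin((2 * k + 1) * (n + 1) * np.pi / (2 * N + 1))

# DCT-VIII = diag((-1)^k) * DST-VII * J, where J reverses the input order.
for N in (3, 4, 5, 6, 7):
    lhs = dct8(N)
    rhs = np.diag((-1.0) ** np.arange(N)) @ dst7(N)[:, ::-1]
    assert np.allclose(lhs, rhs)
```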
Although practical codecs commonly employ large transform sizes (e.g., 16, 32, and 64) [35,36], algorithms for such large transforms are typically constructed from smaller kernels using well-established nesting techniques [37,38]. Consequently, optimizing the computation of small-scale transforms remains critically important. Moreover, for this class of transforms, algorithmic computational complexity increases rapidly with transform size. To mitigate this issue, one can synthesize new large-scale orthogonal transforms from small-scale discrete transforms used as kernels. As a result, the local characteristics of the original transforms are preserved while achieving lower computational complexity than conventional large-transform implementations.

Author Contributions

Conceptualization, A.C.; methodology, A.C., M.P. and M.R.; software, M.R.; validation, M.R. and M.P.; formal analysis, A.C., M.P. and M.R.; investigation, M.R., M.P. and A.C.; writing—original draft preparation, M.P. and M.R.; writing—review and editing, M.R. and M.P.; supervision, A.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The MATLAB (version R2023b Update 4 (23.2.0.2428915), 64-bit (win64)) programming code implementing the developed algorithms for the DCT-VIII is available at [39] (accessed on 29 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DCTdiscrete cosine transform
DCT-VIIIdiscrete cosine transform of type eight
DFTdiscrete Fourier transform
DSTdiscrete sine transform
DST-VIIdiscrete sine transform of type seven
VVCVersatile Video Coding
AMTAdaptive Multiple Transform
BD-rateBjøntegaard Delta Bitrate
FFTfast Fourier transform

Appendix A

The pseudocode for the three-point DCT-VIII algorithm is shown in Table A1. The inputs of the pseudocode are x 0 , x 1 , and x 2 . The scaling factors are s 0 ( 3 ) , s 1 ( 3 ) , s 2 ( 3 ) , and s 3 ( 3 ) . The variables were reused; additional variables are p 0 , p 1 , and p 2 . The outputs of the pseudocode are y 0 , y 1 , and y 2 .
Table A1. The pseudocode for the developed fast 3-point DCT-VIII algorithm with variable reuse.
Step 1: y_0 = x_2 + x_1 + x_0;  y_1 = x_2 − x_0;  y_2 = x_1 − x_0;  p_0 = y_1 + y_2
Step 2: y_1 = y_1 · s_1^(3);  y_2 = y_2 · s_2^(3);  p_1 = p_0 · s_3^(3)
Step 3: p_2 = y_0 · s_0^(3);  y_1 = y_1 − p_1;  y_2 = y_2 − p_1
Step 4: y_0 = p_2 + y_1;  y_1 = p_2 − y_1 − y_2;  y_2 = p_2 + y_2
The pseudocode for the four-point DCT-VIII algorithm is shown in Table A2. The inputs of the pseudocode are x 0 , x 1 , x 2 , and x 3 . The scaling factors are s 0 ( 4 ) , s 1 ( 4 ) , s 2 ( 4 ) , s 3 ( 4 ) , and s 4 ( 4 ) . The variables are reused; the additional variable is p. The outputs of the pseudocode are x 2 , x 0 , x 1 , and x 3 .
Table A2. The pseudocode for the developed fast 4-point DCT-VIII algorithm with variable reuse.
Step 1Step 2Step 3
p = x 0 + x 2 p = p s 0 ( 4 ) x 2 = x 2 x 3
x 0 = x 0 x 3 x 2 s 4 ( 4 ) x 2 = x 2 s 1 ( 4 ) x 3 = x 2 p x 1
x 2 = x 2 x 3 x 1 = x 1 s 3 ( 4 ) x 2 = x 1 x 2
x 3 = p + x 2 s 2 ( 4 ) p = p x 3 x 1 = p x 1
The pseudocode for the five-point DCT-VIII algorithm is shown in Table A3. The inputs of the pseudocode are x 0 , x 1 , x 2 , x 3 , and x 4 . The scaling factors are s 0 ( 5 ) , s 1 ( 5 ) , …, s 17 ( 5 ) . The outputs of the pseudocode are y 0 , y 1 , y 2 , y 3 , and y 4 . Variables p 0 , p 1 , and p 2 are additional.
Table A3. The pseudocode for the developed fast 5-point DCT-VIII algorithm with variable reuse.
Step 1Step 2Step 3
p 0 = x 0 x 2 y 3 = x 1 s 1 ( 5 ) + y 4 y 4 = y 4 + y 0 + x 4 s 14 ( 5 )
p 1 = x 1 + x 3 y 2 = x 0 s 0 ( 5 ) + y 4 y 1 = y 1 + p 2 + x 4 s 15 ( 5 )
y 4 = x 0 + x 1 s 2 ( 5 ) y 4 = x 3 s 4 ( 5 ) + y 0 y 3 = y 3 + y 0 + x 4 s 16 ( 5 )
y 0 = x 3 x 2 s 5 ( 5 ) y 1 = x 2 s 3 ( 5 ) + y 0 y 2 = y 2 + p 2 + x 4 s 17 ( 5 )
p 2 = p 0 + p 1 s 8 ( 5 ) y 0 = p 1 s 7 ( 5 ) + p 2 y 0 = x 4 s 9 ( 5 ) + x 0 s 10 ( 5 ) + x 1 s 11 ( 5 ) x 2 s 12 ( 5 ) + x 3 s 13 ( 5 )
p 2 = p 0 s 6 ( 5 ) + p 2
In Table A4, we present the pseudocode of the proposed 6-point DCT-VIII algorithm. The inputs of the pseudocode are x 0 , x 1 , x 2 , x 3 , x 4 , and x 5 . The scaling factors are s 0 ( 6 ) , s 1 ( 6 ) , …, s 17 ( 6 ) . We reuse variables. So, the outputs of the pseudocode are y 0 , y 1 , y 2 , y 3 , y 4 , and x 0 . Variables p 0 , p 1 , …, p 8 are additional.
Table A4. The pseudocode for the developed fast 6-point DCT-VIII algorithm with variable reuse.
Step 1Step 2Step 3
x 0 = x 0 x 2 y 1 = x 2 s 10 ( 6 ) p 2 = y 0 + y 2 + p 5 s 12 ( 6 )
y 2 = x 1 + x 5 y 4 = x 5 s 11 ( 6 ) p 1 = y 0 p 5
p 5 = x 3 + x 4 p 6 = x 2 + x 5 + x 4 s 6 ( 6 ) p 4 = y 2 p 5
y 3 = x 1 s 5 ( 6 ) p 7 = x 2 x 4 p 5 = p 1 + p 4 s 15 ( 6 )
p 0 = x 0 + x 1 + x 3 s 0 ( 6 ) p 8 = x 5 x 4 p 1 = p 1 s 13 ( 6 )
x 1 = x 1 x 3 x 4 = p 7 + p 8 s 9 ( 6 ) p 4 = p 4 s 14 ( 6 )
x 3 = x 0 x 3 p 7 = p 7 s 7 ( 6 )
p 3 = x 3 + x 1 s 3 ( 6 ) p 8 = p 8 s 8 ( 6 )
x 3 = x 3 s 1 ( 6 )
x 1 = x 1 s 2 ( 6 )
Step 4Step 5Step 6
x 3 = x 3 p 3 p 0 = p 6 + p 7 p 4 = y 0 s 16 ( 6 ) + p 7
x 1 = x 1 p 3 x 3 = p 6 p 7 p 8 p 1 = y 2 s 17 ( 6 ) + p 6
x 2 = p 0 + x 3 x 1 = p 6 + p 8 x 0 = x 0 s 4 ( 6 ) + x 5 + p 4
x 5 = p 0 x 3 x 1 p 1 = p 1 p 5 y 3 = y 3 + x 2 + p 1
p 3 = p 0 + x 1 p 4 = p 4 p 5 y 1 = y 1 + x 3 + p 4
p 7 = p 7 x 4 p 6 = p 2 + p 1 y 4 = y 4 + p 0 + p 1
p 8 = p 8 x 4 p 7 = p 2 p 1 p 4 y 0 = p 8 + p 3
p 8 = p 2 + p 4 y 2 = p 8 x 1
In Table A5, the pseudocode of the proposed 7-point DCT-VIII algorithm is presented. The inputs of the pseudocode are x 0 , x 1 , x 2 , x 3 , x 4 , x 5 , and x 6 . The scaling factors are s 0 ( 7 ) , s 1 ( 7 ) , …, s 15 ( 7 ) . We reuse variables. So, the outputs of the pseudocode are x 0 , x 1 , x 2 , x 3 , x 4 , x 5 , and x 6 . Variables p 0 , p 1 , …, p 15 are additional.
Table A5. The pseudocode for the developed fast 7-point DCT-VIII algorithm with variable reuse.
Step 1Step 2
p 7 = x 0 x 5 ,     p 9 = x 3 + x 6 p 0 = x 0 + x 5 s 0 ( 7 ) ,     p 1 = p 7 s 1 ( 7 )
p 11 = p 9 s 11 ( 7 ) p 2 = x 3 + x 6 s 2 ( 7 ) ,     p 3 = p 9 s 3 ( 7 )
p 13 = p 7 x 4 s 13 ( 7 ) p 4 = p 6 s 4 ( 7 ) ,     p 6 = x 4 s 6 ( 7 )
p 6 = x 0 x 3 ,     p 8 = x 5 + x 6 p 7 = p 7 s 7 ( 7 ) ,     p 8 = x 4 s 8 ( 7 )
p 5 = p 6 p 8 s 5 ( 7 ) p 9 = p 9 s 9 ( 7 ) ,     p 10 = x 1 s 10 ( 7 )
p 6 = p 6 + p 8 ,     p 15 = p 6 s 15 ( 7 ) p 12 = x 1 s 12 ( 7 ) ,     p 14 = x 2 s 14 ( 7 )
Step 3Step 4Step 5
x 0 = p 0 + p 1 p 0 = x 3 + x 4 x 5 = p 0 p 6 + p 14
x 5 = p 0 p 1 p 1 = x 6 + x 1 x 0 = p 1 + p 6 + p 14
x 3 = p 2 + p 3 p 2 = x 0 + x 4 x 3 = p 2 p 8 + p 14
x 6 = p 2 p 3 p 3 = x 5 + x 1 x 6 = p 3 + p 8 + p 14
x 4 = p 4 + p 5 p 6 = p 6 + p 12 x 4 = p 6 + p 7 + p 9
x 1 = p 4 p 5 p 8 = p 8 p 10 x 1 = p 10 + p 11 + p 13
x 2 = p 14 + p 15

Appendix B

To compare the proposed and existing fast DCT-VIII algorithms, we evaluate the number of arithmetic operations required to implement the DCT-VIII using the approach proposed in [4]. In [4], it is shown that the DCT-VIII matrix C N can be represented as the product of the pre-processing block-diagonal matrix A N , the DST-VII matrix S N , and the post-processing diagonal matrix D N :
\[
C_N = D_N S_N A_N.
\]
All these matrices are of order N. In turn, the DST-VII matrix S N of order N is factorized as:
\[
S_N = \frac{1}{2} R_{N\times(2N+1)} \operatorname{Im}\{F_{2N+1}\} Q_{(2N+1)\times N} P_N,
\]
where $P_N$ is a permutation matrix, $\operatorname{Im}\{F_{2N+1}\}$ denotes the imaginary part of the DFT matrix of length $2N+1$, $Q_{(2N+1)\times N}$ is an expansion matrix consisting of a zero row, an $N \times N$ identity matrix, and an $N \times N$ order-reversal matrix, and $R_{N\times(2N+1)}$ is a matrix that collects the even-indexed outputs in reverse order.
Taking into account expression (A2), the N-point DST-VII requires the same number of multiplications and additions as the (2N+1)-point FFT. For the latter transform, a prime-factor or Winograd algorithm can be used, for example.
Moreover, since the matrix $A_N$ in expression (A1) is sparse and includes only 0 and ±1 entries, multiplying it by an N-point input vector requires N − 1 additions. The post-processing of the product $S_N A_N$ by the matrix $D_N$ in expression (A1) requires an additional N multiplications.
As a result, the N-point DCT-VIII requires the same number of arithmetic operations as the (2N+1)-point FFT plus N multiplications and (N − 1) additions (see Table A6).
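The composition rule above translates directly into operation counts; a tiny sketch (with the FFT costs taken from Table A6) is given below.

```python
# DCT-VIII cost via a (2N+1)-point FFT: FFT multiplications + N, FFT additions + (N - 1).
def dct8_via_fft_cost(N, fft_mults, fft_adds):
    return fft_mults + N, fft_adds + (N - 1)

print(dct8_via_fft_cost(3, 8, 36))    # (11, 38) -- Winograd 7-point FFT
print(dct8_via_fft_cost(4, 10, 36))   # (14, 39) -- 9-point FFT from [42]
```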
For the FFT calculation, we employed Winograd Fourier transform algorithms, which minimize the number of multiplications at the expense of a higher number of additions [40,41], as well as Majorkowska-Mech and Cariow algorithms [42] and generalized split-radix FFT algorithm [20]. In [41], the arithmetic complexity of the DFT for complex-valued inputs is reported. To estimate the approximate DFT cost for real-valued input data, the arithmetic operation count for the complex-input DFT was halved.
Table A6. The results of calculating the number of arithmetic operations for the DCT-VIII algorithms based on FFT.

N (for DCT-VIII) | 2N+1 (for FFT) | Winograd FFT [41] Mults. | Adds. | FFT algorithm [42] Mults. | Adds. | DCT-VIII based on FFT [4] Mults. | Adds.
3  | 7  | 8  | 36 | 8    | 30   | 11 | 38/32
4  | 9  | 10 | 42 | 10   | 36   | 14 | 45/39
5  | 11 | 21 | 84 | -    | -    | 26 | 88/-
6  | 13 | 21 | 94 | -    | -    | 27 | 99/-
7  | 15 | -  | -  | 15 * | 84 * | 22 | -/90

* Generalized split-radix FFT algorithm from [20].

References

  1. Britanak, V.; Yip, P.C.; Rao, K.R. Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer Approximations; Academic: Amsterdam, The Netherlands; Boston, MA, USA, 2007. [Google Scholar]
  2. Garrido, M.J.; Pescador, F.; Chavarrías, M.; Lobo, P.J.; Sanz, C.; Paz, P. An FPGA-based architecture for the versatile video coding multiple transform selection core. IEEE Access 2020, 8, 81887–81903. [Google Scholar] [CrossRef]
  3. Saldanha, M.; Sanchez, G.; Marcon, C.; Agostini, L. Configurable fast block partitioning for VVC intra coding using light gradient boosting machine. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 3947–3960. [Google Scholar] [CrossRef]
  4. Park, W.; Lee, B.; Kim, M. Fast computation of integer DCT-V, DCT-VIII, and DST-VII for video coding. IEEE Trans. Image Process. 2019, 28, 5839–5851. [Google Scholar] [CrossRef]
  5. Fan, Y.; Katto, J.; Sun, H.; Zeng, X.; Zeng, Y. A minimal adder-oriented 1D DST-VII/DCT-VIII hardware implementation for VVC standard. In Proceedings of the 2019 32nd IEEE International System-On-Chip Conference (SOCC), Singapore, 19–22 September 2019; pp. 176–180. [Google Scholar] [CrossRef]
  6. Aggoun, A. Three-dimensional DCT/IDCT architecture. Int. J. Adv. Eng. Technol. 2013, 6, 648–668. Available online: https://www.ijaet.org/media/10I14-IJAET0514347_v6_iss2_648to658.pdf (accessed on 23 December 2025).
  7. Domínguez-Jiménez, M.E.; Sansigre, G.; Amo-López, P.; Cruz-Roldán, F. DCT type-III for multicarrier modulation. In Proceedings of the 2011 European Signal Processing Conference (EUSIPCO), Barcelona, Spain, 29 August–2 September 2011; pp. 1593–1597. Available online: https://www.eurasip.org/Proceedings/Eusipco/Eusipco2011/papers/1569416785.pdf (accessed on 23 December 2025).
  8. Asriani, E.; Muchtadi-Alamsyah, I.; Purwarianti, A. Real block-circulant matrices and DCT-DST algorithm for transformer neural network. Front. Appl. Math. Stat. 2023, 9, 1260187. [Google Scholar] [CrossRef]
  9. Shen, X.; Yang, J.; Wei, C.; Deng, B.; Huang, J.; Hua, X.; Cheng, X.; Liang, K. DCT-Mask: Discrete cosine transform mask representation for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtually, 19–25 June 2021; pp. 8720–8729. [Google Scholar] [CrossRef]
  10. Abramova, V.; Lukin, V.; Abramov, S.; Kryvenko, S.; Lech, P.; Okarma, K. A fast and accurate prediction of distortions in DCT-based lossy image compression. Electronics 2023, 12, 2347. [Google Scholar] [CrossRef]
  11. Li, F.; Lukin, V.; Ieremeiev, O.; Okarma, K. Quality control for the BPG lossy compression of three-channel remote sensing images. Remote Sens. 2022, 14, 1824. [Google Scholar] [CrossRef]
  12. Zhang, Z.; Zhao, X.; Li, X.; Li, L.; Luo, Y.; Liu, S.; Li, Z. Fast DST-7/DCT-8 with dual implementation support for versatile video coding. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 355–371. [Google Scholar] [CrossRef]
  13. Imen, W.; Fatma, B.; Amna, M.; Masmoudi, N. DCT-II transform hardware-based acceleration for VVC standard. In Proceedings of the 2021 IEEE International Conference on Design & Test of Integrated Micro & Nano-Systems (DTS), Sfax, Tunisia, 23–25 June 2021; pp. 1–5. [Google Scholar] [CrossRef]
  14. Hnativ, L.O. Discrete cosine-sine type VII transform and fast integer transforms for intra prediction of images and video coding. Cybern. Syst. Anal. 2021, 57, 827–835. [Google Scholar] [CrossRef]
  15. Abramov, S.K.; Abramova, V.V.; Lukin, V.V.; Egiazarian, K. Prediction of signal denoising efficiency for DCT-based filter. Telecommun. Radio Eng. 2019, 78, 1129–1142. [Google Scholar] [CrossRef]
  16. Brajovic, M.; Stankovic, I.; Dakovic, M.; Stankovic, L. Audio signal denoising based on Laplacian filter and sparse signal reconstruction. In Proceedings of the 2022 26th International Conference on Information Technology (IT), Zabljak, Montenegro, 16–20 February 2022; pp. 1–4. [Google Scholar] [CrossRef]
  17. Kim, M.; Lee, Y.-L. Discrete sine transform-based interpolation filter for video compression. Symmetry 2017, 9, 257. [Google Scholar] [CrossRef]
  18. Ghosh, A.; Chellappa, R. Deep feature extraction in the DCT domain. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 3536–3541. [Google Scholar] [CrossRef]
  19. Saxena, A.; Fernandes, F.C. DCT/DST-based transform coding for intra prediction in image/video coding. IEEE Trans. Image Process. 2013, 22, 3974–3981. [Google Scholar] [CrossRef]
  20. Bi, G.; Zeng, Y. Transforms and Fast Algorithms for Signal Analysis and Representations; Birkhäuser: Boston, MA, USA, 2004. [Google Scholar]
  21. Korohoda, P.; Dabrowski, A. Generalized convolution for extraction of image features in the primary domain. Mach. Graph. Vis. 2008, 17, 279–297. [Google Scholar]
  22. Choi, K. A study on fast and low-complexity algorithms for Versatile Video Coding. Sensors 2022, 22, 8990. [Google Scholar] [CrossRef]
  23. Zeng, Y.; Sun, H.; Katto, J.; Fan, Y. Approximated reconfigurable transform architecture for VVC. In Proceedings of the 2021 IEEE International Symposium on Circuits and Systems (ISCAS), Daegu, Republic of Korea, 22–28 May 2021; pp. 1–5. [Google Scholar] [CrossRef]
  24. Cariow, A.; Makowska, M.; Strzelec, P. Small-size FDCT/IDCT algorithms with reduced multiplicative complexity. Radioelectron. Commun. Syst. 2019, 62, 559–576. [Google Scholar] [CrossRef]
  25. Chiper, D.F.; Cracan, A. An efficient algorithm and architecture for the VLSI implementation of integer DCT that allows an efficient incorporation of the hardware security with a low overhead. Appl. Sci. 2023, 13, 6927. [Google Scholar] [CrossRef]
  26. Zhechev, B. The discrete cosine transform DCT-4 and DCT-8. In Proceedings of the 4th International Conference on Computer Systems and Technologies: E-Learning, Rousse, Bulgaria, 19–20 June 2003; pp. 260–265. [Google Scholar] [CrossRef]
  27. Masera, M.; Martina, M.; Masera, G. Odd type DCT/DST for video coding: Relationships and low-complexity implementations. In Proceedings of the 2017 IEEE International Workshop on Signal Processing Systems (SiPS), Lorient, France, 3–5 October 2017; pp. 1–6. [Google Scholar] [CrossRef]
  28. Chivukula, R.K.; Reznik, Y.A. Fast computing of discrete cosine and sine transforms of types VI and VII. In Proceedings of the SPIE 8135, Applications of Digital Image Processing XXXIV, San Diego, CA, USA, 20–25 September 2011; pp. 1–10. [Google Scholar] [CrossRef]
  29. Zhang, Z.; Zhao, X.; Li, X.; Li, Z.; Liu, S. Fast adaptive multiple transform for versatile video coding. In Proceedings of the 2019 Data Compression Conference (DCC), Snowbird, UT, USA, 26–29 March 2019; pp. 63–72. [Google Scholar] [CrossRef]
  30. Song, H.; Lee, Y.-L. Inverse transform using linearity for video coding. Electronics 2022, 11, 760. [Google Scholar] [CrossRef]
  31. Luts, V.K. Fast integer cosine transform of order-8 for high-speed video coding. In Proceedings of the SPIE 13137, Applications of Digital Image Processing XLVII, San Diego, CA, USA, 18–23 August 2024; pp. 398–409. [Google Scholar] [CrossRef]
  32. Cariow, A. Strategies for the synthesis of fast algorithms for the computation of the matrix-vector product. J. Signal Process. Theory Appl. 2014, 3, 1–19. [Google Scholar] [CrossRef]
  33. Cariow, A.; Paplinski, J. Algorithmic structures for realizing short-length circular convolutions with reduced complexity. Electronics 2021, 10, 2800. [Google Scholar] [CrossRef]
  34. Raciborski, M.; Cariow, A.; Bandach, J. The development of fast DST-I algorithms for short-length input sequences. Electronics 2024, 13, 5056. [Google Scholar] [CrossRef]
  35. Camargo, C.; Silveira, B.; Correa, G. Comprehensive Analysis of the Multiple Transform Selection Tool in Versatile Video Coding. J. Integr. Circuits Syst. 2025, 20, 1–11. [Google Scholar] [CrossRef]
  36. Vigneash, L.; Azath, H.; Nair, L.R.; Subramaniam, K. LC-VVC: Design of Low-Cost Versatile Video Coding Standard Based on Optimized Motion Estimation and Transform Blocks. Circuits Syst. Signal Process. 2025, 44, 9154–9179. [Google Scholar] [CrossRef]
  37. Shirvaikar, M.; Grecos, C. A comparative study of the AV1, HEVC, and VVC video codecs. In Proceedings of SPIE 13458, Real-Time Image Processing and Deep Learning 2025, Orlando, FL, USA, 13–17 April 2025. [Google Scholar] [CrossRef]
  38. Lone, M.R. A high-throughput unified transform architecture for Versatile Video Coding. Cluster Comput. 2025, 28, 298. [Google Scholar] [CrossRef]
  39. Raciborski, M. The Development of Software for Fast DCT-VIII Algorithms for Short-Length Input Sequences; Version: V1; RepOD: Warszawa, Poland, 2025. [Google Scholar] [CrossRef]
  40. Blahut, R.E. Fast Algorithms for Signal Processing; Cambridge University Press: New York, NY, USA, 2010. [Google Scholar]
  41. Sidney Burrus, C. Fast Fourier Transforms (Burrus); LibreTexts: Davis, CA, USA, 2025; Available online: https://eng.libretexts.org/Bookshelves/Electrical_Engineering/Signal_Processing_and_Modeling/Fast_Fourier_Transforms_%28Burrus%29/06%3A_Winograd%27s_Short_DFT_Algorithms/6.02%3A_Winograd_Fourier_Transform_Algorithm_%28WFTA%29 (accessed on 23 December 2025).
  42. Majorkowska-Mech, D.; Cariow, A. Some FFT Algorithms for Small-Length Real-Valued Sequences. Appl. Sci. 2022, 12, 4700. [Google Scholar] [CrossRef]
Figure 1. The data flow graph of the proposed solution for the DCT-VIII for N = 3.
Figure 2. The data flow graph of the DCT-VIII for an input size of 3, including labels for the individual nodes.
Figure 3. The data flow graph of the proposed solution for the DCT-VIII for N = 4.
Figure 4. The data flow graph of the proposed solution for the DCT-VIII for N = 5.
Figure 5. The data flow graph of the proposed solution for the DCT-VIII for N = 6.
Figure 6. The data flow graph of the proposed solution for the DCT-VIII for N = 7.
Table 1. The notations used in this research.

| Notation | Meaning |
|---|---|
| ⊕ | direct sum of two matrices |
| ⊗ | Kronecker product of two matrices |
| H_2 | 2 × 2 Hadamard matrix |
| I_N | identity matrix of order N |
| P_N | permutation matrix |
| s_m^(N) | multiplier |
| D_N | diagonal matrix of size N × N |
| W_{M×N} and W_N | M × N and N × N matrices describing pre-additions and post-additions, respectively |

An empty cell in a matrix indicates a zero entry.
Table 2. Step-by-step verification of the computations associated with selected nodes in the DCT-VIII data flow graph for N = 3.

| Step 1 | Step 2 | Step 3 | Step 4 |
|---|---|---|---|
| a_0 = x_2 + x_1 + x_0 | c_0 = a_1 s_1^(3) | d_0 = a_0 s_0^(3) | e_0 = d_0 + d_1 |
| a_1 = x_2 − x_0 | c_1 = a_2 s_2^(3) | d_1 = c_0 − c_2 | e_1 = d_0 − d_1 − d_2 |
| a_2 = x_1 − x_0 | c_2 = b_0 s_3^(3) | d_2 = c_1 − c_2 | e_2 = d_0 + d_2 |
| b_0 = a_1 + a_2 | | | |
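For readers who prefer code to a data flow graph, the following sketch transcribes the four steps of Table 2 directly. The function name and the representation of the multipliers s_0^(3), …, s_3^(3) as a parameter tuple are ours for illustration; their numerical values, which are fixed by the factorization derived above, are deliberately not reproduced here.

```python
def dct8_3point_flow(x, s):
    """Transcription of the four-step data flow in Table 2 (N = 3).

    x = (x0, x1, x2) is the input vector; s = (s0, s1, s2, s3) holds the
    multipliers s_m^(3), treated here as externally supplied parameters.
    """
    x0, x1, x2 = x
    s0, s1, s2, s3 = s

    # Step 1: pre-additions (5 additions/subtractions)
    a0 = x2 + x1 + x0
    a1 = x2 - x0
    a2 = x1 - x0
    b0 = a1 + a2

    # Step 2: three of the four multiplications
    c0 = a1 * s1
    c1 = a2 * s2
    c2 = b0 * s3

    # Step 3: the remaining multiplication and two subtractions
    d0 = a0 * s0
    d1 = c0 - c2
    d2 = c1 - c2

    # Step 4: post-additions (4 additions/subtractions)
    e0 = d0 + d1
    e1 = d0 - d1 - d2
    e2 = d0 + d2
    return e0, e1, e2
```

Counting the operations in this sketch gives 4 multiplications and 11 additions, in agreement with the N = 3 row of Table 3.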
Table 3. Comparison of the direct method with the proposed solutions.

| N | Direct Method: Multiplications | Direct Method: Additions | Proposed Algorithms: Multiplications | Proposed Algorithms: Additions |
|---|---|---|---|---|
| 3 | 9 | 6 | 4 (−55%) | 11 (+83%) |
| 4 | 15 | 11 | 5 (−66%) | 11 (0%) |
| 5 | 25 | 20 | 18 (−28%) | 23 (+15%) |
| 6 | 36 | 30 | 18 (−50%) | 48 (+60%) |
| 7 | 45 | 38 | 16 (−64%) | 34 (−10%) |
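The direct-method columns of Table 3 can be reproduced by counting the non-zero entries of the DCT-VIII matrix: each non-zero entry costs one multiplication, and each output sample costs one addition fewer than the number of non-zero entries in its row. The sketch below assumes the usual DCT-VIII kernel cos(π(2k + 1)(2m + 1)/(4N + 2)) up to a constant scale factor; the zero pattern, and hence the counts, do not depend on the normalization.

```python
import math

def direct_dct8_cost(n, tol=1e-9):
    """Operation count for a direct matrix-vector DCT-VIII of length n,
    assuming the kernel C[k][m] ~ cos(pi*(2k+1)*(2m+1)/(4n+2)).
    Zero entries cost nothing; every other entry costs one multiplication."""
    mults, adds = 0, 0
    for k in range(n):
        nonzero = sum(
            1 for m in range(n)
            if abs(math.cos(math.pi * (2 * k + 1) * (2 * m + 1) / (4 * n + 2))) > tol
        )
        mults += nonzero
        adds += max(nonzero - 1, 0)
    return mults, adds

for n in range(3, 8):
    print(n, direct_dct8_cost(n))
# Expected: (9, 6), (15, 11), (25, 20), (36, 30), (45, 38), matching Table 3.
```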
Table 4. Comparison of the proposed solutions with the DCT-VIII algorithms from [4,27].

| N | Algorithms Based on FFT [4]: Mults. | Adds. | Proposed Algorithms: Mults. | Adds. | Masera et al. [27]: Mults. | Adds. |
|---|---|---|---|---|---|---|
| 3 | 11 | 32 | 4 | 11 | - | - |
| 4 | 14 | 39 | 5 | 11 | 5 | 11 |
| 5 | 26 | 88 | 18 | 23 | - | - |
| 6 | 27 | 99 | 18 | 48 | - | - |
| 7 | 22 | 90 | 16 | 34 | - | - |
Table 5. The structural patterns of the proposed and existing fast DCT-VIII algorithms.

| N | Stage 1: Proposed DCT-VIII Algorithms | Stage 1: Existing DCT-VIII Algorithms [12,29] | Stage 2: Proposed DCT-VIII Algorithms | Stage 2: Existing DCT-VIII Algorithms [12,29] |
|---|---|---|---|---|
| 3 | Circular convolution | Pattern 2 | - | - |
| 4 | Circular convolution | Pattern 2 | Fan-shaped structure | Pattern 3 |
| 5 | Four 2 × 2 submatrices | Pattern 1 | Direct multiplication of the submatrix by the input vector | - |
| 6 | Four 3 × 3 submatrices | Pattern 1 | Circular convolution | Pattern 2 |
| 7 | Fan-shaped structure | Pattern 3 | Four 2 × 2 submatrices | Pattern 1 |