Article

Fast Algorithms for Short-Length Type VI Discrete Cosine Transform

by Valentyna Kitsela ¹, Marina Polyakova ²,* and Aleksandr Cariow ¹,*
¹ Faculty of Computer Science and Information Technology, West Pomeranian University of Technology in Szczecin, Żołnierska 49, 71-210 Szczecin, Poland
² Institute of Computer Systems, Odesa Polytechnic National University, Shevchenko Ave., 1, 65044 Odesa, Ukraine
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(3), 699; https://doi.org/10.3390/electronics15030699
Submission received: 2 January 2026 / Revised: 27 January 2026 / Accepted: 3 February 2026 / Published: 5 February 2026
(This article belongs to the Section Circuit and Signal Processing)

Abstract

In this paper, new fast algorithms for computing the discrete cosine transform type VI (DCT-VI) are proposed, with a special emphasis on short input sequences of three to eight samples. Fast algorithms for small discrete trigonometric transforms are used directly for efficient processing of small data sets and also serve as fundamental building blocks for constructing algorithms for larger trigonometric transforms. By exploiting the intrinsic structural properties of the DCT-VI matrices of different sizes, the proposed methods significantly reduce arithmetic complexity compared to the conventional matrix–vector multiplication approach. The paper presents a detailed mathematical formulation of the algorithms, supported by data-flow graphs that illustrate the computational structure and facilitate the precise estimation of arithmetic operations. Optimized pseudocode implementations incorporating variable reuse are also introduced to facilitate practical realization in software environments. Performance analysis demonstrates a substantial reduction in the number of multiplications (up to 66%) and a slight decrease in additions (approximately 9%) for input sizes ranging from three to eight, thereby improving the execution speed of the considered transform. The proposed algorithms are well-suited for applications in video coding, data compression, and digital signal processing, where computational efficiency is critical.

1. Introduction

Raw, uncompressed video generates enormous data volumes. One minute of high-definition footage can require several gigabytes, making uncompressed video impractical for streaming, sharing, or archiving without compression. Video coding reduces file sizes by eliminating redundancies through lossy or lossless techniques, while preserving visual quality and enabling efficient bandwidth use to avoid buffering [1,2].
In recent approaches to image and video coding, the classical transform coding [2,3,4,5] and artificial neural networks were considered [6,7,8]. The use of artificial neural networks in video coding is limited due to the high cost of their training process. Codecs using neural networks are trained on samples from the coded image and operate as backward predictors. Their computational complexity is prohibitive, while the performance is still not as good as that of the best “classical” approaches [9]. Transform coding has been widely integrated into video coding due to its efficient spatial decorrelation capability [10]. Cooperating with other coding tools [11,12], it significantly reduces spatial correlations and achieves high compression rates while maintaining minimal quality loss. An example of such cooperation is the development of the Versatile Video Coding (VVC) standard, which has been motivated by the need to exceed the coding performance offered by High Efficiency Video Coding (HEVC) [13,14]. To investigate advanced coding tools, the Joint Exploration Model (JEM) was introduced, which demonstrated approximately a 30% gain in coding efficiency relative to HEVC [15]. These improvements arise from enhancements across the entire coding chain, with the transform module playing a central role. One of the key innovations evaluated within JEM is extending the set of transform types by incorporating discrete cosine transforms (DCTs) of types V and VIII and discrete sine transforms (DSTs) of types I and VII, in addition to the conventional type II DCT (DCT-II) and type VII DST used in HEVC [16].
JEM’s performance advantages come at the cost of significantly higher computational complexity. Compared to the HEVC reference model, JEM increases the complexity of both the encoder and decoder by up to seven times, especially in inter-coding configurations. This rise in computational cost poses a problem for the deployment of VVC, particularly in real-time applications on embedded platforms constrained by limited computational and memory resources [15]. While hardware accelerators can improve performance, they are themselves constrained by memory and power budgets, which further emphasizes the need for efficient algorithmic design.
Because the type V DCT (DCT-V) is linearly related to the type VI DCT (DCT-VI), this study focuses on the DCT-VI and computationally efficient algorithms for its implementation. Below, we consider ways to design fast DCT-VI algorithms reported in the literature.
In this paper, we adopt the definitions of trigonometric transforms given in [17]. In contrast, in [18,19,20,21], the DCTs and DSTs of types VI and VII are defined in the reverse order.

1.1. Related Papers

According to the literature, eight types of DCT and DST are recognized [17]. Among them, the DCT-II, type II DST, type IV DCT, and type IV DST are the most widely used in image, video, speech, and audio processing [4,5]. The DCT and DST of types I–IV form the group of even sinusoidal transforms, whereas the much less familiar types V–VIII constitute the group of odd sinusoidal transforms.
Although numerous fast algorithms exist for even sinusoidal transforms, only a limited number of studies have focused on the fast computation of odd sinusoidal transforms, such as DCTs and DSTs of types V–VIII. Even sinusoidal transforms are associated with even-length fast Fourier transforms (FFTs) and can be efficiently computed using the split-radix algorithms [22,23,24,25]. In contrast, odd sinusoidal transforms correspond to odd-length FFTs.
Fast DCT-VI algorithms have been developed either by using the FFT [15,18,19] or by exploiting the relationship between the DCT-VI and the DCT-II [20,21]. Following the FFT-based strategy, Chivukula and Reznik [19] proposed fast algorithms for DCT/DST types VI and VII by exploiting their relationship with the discrete Fourier transform (DFT). They showed that an N-point DST of types VI and VII corresponds to a (2N + 1)-point DFT, enabling the use of the Winograd FFT algorithm [26,27], which is particularly efficient for DFTs of prime or prime-power lengths. Based on this approach, they developed fast algorithms for 4- and 8-point DST of types VI and VII, which map to 9- and 17-point DFTs, respectively, achieving a reduction in the number of multiplications compared to direct matrix computation.
In [15], Park, Lee, and Kim represent the N-point DCT-V using the equality between the DCT-VI and the (2N − 1)-point DFT, which enables fast computation of both the forward and inverse DCT-V. The linear relationship between the DCT-V and the DCT-VI is then exploited to further accelerate DCT-V processing in video coding. The (2N − 1)-point DFT can be efficiently evaluated using the Winograd FFT for prime-length factorizations and the prime factor FFT for composite lengths formed from relatively prime factors [22]. Additional computational savings are achieved by exploiting input symmetries within the FFT.
Several studies [20,21,28] have explored relationships between the DCT-VI and other types of DCTs or DSTs. In [20,28], Reznik showed that the (2N + 1)-point DCT-II matrix can be decomposed into an (N + 1)-point DCT-VI and an N-point type VII DST. Since the odd-length DCT-II can be treated as a real-valued DFT of the same length, the author showed how to employ DFT factorization known from the literature [22] to derive reduced-complexity algorithms for the (N + 1)-point DCT-VI and the N-point type VII DST.
In [21], Masera, Martina, and Masera demonstrated reduced-complexity factorizations and their corresponding data-flow graphs for the DCT-V and DCT-VI of lengths N = 4 and 8. To this end, the relationships between the DFT, DCT-II, DCT-VI, and type VII DST were exploited. As a result, a new relationship between the DCT-VI and the DCT-V was established. Hardware architectures based on the proposed reduced-complexity factorizations were implemented on an FPGA, demonstrating lower complexity than a direct implementation based on matrix–vector multiplication.
Based on the brief review above of existing fast algorithms for the DCT-VI, we identified the limitations of these approaches and outlined directions for reducing the computational complexity of the DCT-VI.

1.2. The Main Contributions of the Paper

The above review reveals several limitations of existing methods for constructing fast DCT-VI algorithms. Most algorithms described in the literature are designed exclusively for input sequences whose lengths are powers of two. Although the known algorithms require fewer multiplications than direct matrix–vector multiplication, their computational cost remains significant. Moreover, these algorithms are primarily intended for large transform sizes and yield a significant reduction in arithmetic complexity only in such cases.
To address these limitations, we propose fast algorithms for the DCT-VI based on the structural approach introduced in [29,30]. The effectiveness of this approach stems from its ability to identify and exploit the block structures of the matrices to be decomposed. In contrast to the fast transformation scheme presented in [14,31], the structural approach employs a richer set of matrix templates, which allows a more effective decomposition of the transform matrix. In particular, submatrices with repeated entries can be factorized using the templates described in [29]. Furthermore, cyclic convolution blocks can be expressed as products of sparse matrices, yielding significant computational savings [32]. An additional advantage of the structural approach is that it naturally represents the resulting algorithms as data-flow graphs in which each input–output path contains only a single multiplication. This property reduces computational complexity by eliminating redundant operations and facilitates efficient organization of computations.
The primary contribution of this study is the development of reduced-complexity DCT-VI algorithms for small data sequence lengths, in particular for lengths ranging from three to eight. A set of these algorithms is obtained by successfully factoring the original DCT-VI matrices into sparse matrix products. The resulting data-flow graphs are well-suited for hardware implementation, while the corresponding pseudocodes enable efficient software realization. Compared to direct matrix–vector multiplication, the proposed algorithms exhibit lower computational complexity and, when combined with other techniques, apply to image, video, and audio coding, as well as to a broad range of signal and data processing applications.
The remainder of the paper is organized as follows. Section 2 presents the necessary mathematical background and notations. Section 3 introduces the proposed fast DCT-VI algorithms. Section 4 and Section 5 analyze their computational complexity. Section 6 concludes the paper. Appendix A provides the pseudocode of the proposed algorithms.

2. Preliminary Remarks

The DCT-VI is predominantly utilized in digital signal processing domains, including image and video coding. Some signal processing libraries and tools, such as LabVIEW, support DCT-VI as part of a general transform function set.
This transform is mathematically expressed as follows:
$$y_k = \sum_{n=0}^{N-1} \sigma_{kn}\, x_n \cos\frac{\pi n (2k+1)}{2N-1}, \quad k = 0, 1, \ldots, N-1, \tag{1}$$
where $y_k$ denotes the output signal after the application of the DCT-VI, $x_n$ is an input signal, and $\sigma_{kn}$ represents a normalization constant:
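$$\sigma_{kn} = \begin{cases} \dfrac{1}{\sqrt{2N-1}}, & k = N-1,\ n = 0; \\[2mm] \dfrac{\sqrt{2}}{\sqrt{2N-1}}, & k = N-1,\ n = 1, \ldots, N-1, \ \text{or}\ k = 0, \ldots, N-2,\ n = 0; \\[2mm] \dfrac{2}{\sqrt{2N-1}}, & \text{otherwise}. \end{cases}$$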
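Furthermore, the DCT-VI can be represented by the matrix–vector product:
$$Y_{N\times 1} = C_N X_{N\times 1}, \tag{2}$$
where indices n and k range from 0 to N − 1, and $C_N$ is the transform matrix, with row index k and column index n:
$$C_N = \begin{bmatrix} c_{0,0} & c_{0,1} & \cdots & c_{0,N-1} \\ c_{1,0} & c_{1,1} & \cdots & c_{1,N-1} \\ \vdots & \vdots & \ddots & \vdots \\ c_{N-1,0} & c_{N-1,1} & \cdots & c_{N-1,N-1} \end{bmatrix}, \quad c_{kn} = \sigma_{kn}\cos\frac{\pi n (2k+1)}{2N-1},$$
$$Y_{N\times 1} = [y_0, y_1, \ldots, y_{N-1}]^T, \quad X_{N\times 1} = [x_0, x_1, \ldots, x_{N-1}]^T.$$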
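As a quick numerical companion to definitions (1) and (2), the following minimal sketch builds the DCT-VI matrix in plain Python. The function names `dct6_matrix` and `dct6_direct` are illustrative and not taken from the paper; the scaling rule (first column and last row each scaled by 1/√2 relative to 2/√(2N − 1)) is an assumption inferred from the explicit 3-point matrix given in Section 3.1.

```python
import math

def dct6_matrix(N):
    """Return the N x N DCT-VI matrix (rows k, columns n) as nested lists.

    Assumed scaling: 2/sqrt(2N-1), divided by sqrt(2) in the first column
    (n = 0) and again by sqrt(2) in the last row (k = N-1).
    """
    M = 2 * N - 1
    C = []
    for k in range(N):
        row = []
        for n in range(N):
            sigma = 2.0 / math.sqrt(M)
            if n == 0:
                sigma /= math.sqrt(2.0)
            if k == N - 1:
                sigma /= math.sqrt(2.0)
            row.append(sigma * math.cos(math.pi * n * (2 * k + 1) / M))
        C.append(row)
    return C

def dct6_direct(x):
    """Direct matrix-vector product: N*N multiplications, N*(N-1) additions."""
    C = dct6_matrix(len(x))
    return [sum(c * v for c, v in zip(row, x)) for row in C]

C3 = dct6_matrix(3)
for row in C3:
    print(["%.4f" % c for c in row])
```

The direct product serves as the reference against which any fast algorithm can be checked numerically.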
In this research, we use the following notations:
  • $W_{M\times N}$ and $W_N$ are, respectively, M × N and N × N matrices describing pre-additions and post-additions;
  • $D_N$ is a diagonal matrix of order N, whose elements represent the calculated values of the cosines;
  • $I_N$ is an identity matrix of order N;
  • $P_N$ is a permutation matrix;
  • $H_2$ is the 2 × 2 Hadamard matrix;
  • $s_m^{(N)}$ is a cosine-value coefficient;
  • ⊕ is the direct sum of two matrices;
  • ⊗ is the Kronecker product of two matrices;
  • An empty cell in a matrix means it contains zero.
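The direct sum and the Kronecker product are used throughout the factorizations, so a small sketch of both operations may help when checking matrix sizes. The helper names `direct_sum` and `kron` are illustrative only; matrices are represented as nested lists.

```python
def direct_sum(A, B):
    """Block-diagonal matrix with A in the upper-left and B in the lower-right."""
    cols_a, cols_b = len(A[0]), len(B[0])
    top = [list(row) + [0] * cols_b for row in A]
    bottom = [[0] * cols_a + list(row) for row in B]
    return top + bottom

def kron(A, B):
    """Kronecker product: every entry a of A is replaced by the block a*B."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

H2 = [[1, 1], [1, -1]]          # 2 x 2 Hadamard matrix
I2 = [[1, 0], [0, 1]]           # identity of order 2

W = direct_sum([[1]], H2)       # 1 (+) H2, a 3 x 3 matrix
K = kron(I2, H2)                # I2 (x) H2, a 4 x 4 matrix
print(W)
print(K)
```

A direct sum of a p × q and an r × s matrix has size (p + r) × (q + s), whereas their Kronecker product has size pr × qs.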

3. Fast Algorithms for the Short-Length DCT-VI

3.1. The Three-Point DCT-VI Algorithm

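We represent the 3-point DCT-VI as follows:
$$Y_{3\times 1} = C_3 X_{3\times 1}, \tag{3}$$
where $Y_{3\times 1} = [y_0, y_1, y_2]^T$, $X_{3\times 1} = [x_0, x_1, x_2]^T$, and
$$C_3 = \begin{bmatrix} a_3 & b_3 & c_3 \\ a_3 & -c_3 & -b_3 \\ d_3 & -a_3 & a_3 \end{bmatrix}$$
with $a_3 = \sqrt{2/5} \approx 0.6325$, $b_3 = \frac{2}{\sqrt{5}}\cos\frac{\pi}{5} \approx 0.7236$, $c_3 = \frac{2}{\sqrt{5}}\cos\frac{2\pi}{5} \approx 0.2764$, and $d_3 = \frac{1}{\sqrt{5}} \approx 0.4472$.
Here and throughout the paper, numerical values of cosine-based constants, for example in the expression $a_3 = \sqrt{2/5} \approx 0.6325$, are approximations.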
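To factorize the matrix $C_3$, we first invert the signs of the elements in the second row and then decompose the resulting matrix $C_3^{(a)}$ into two components:
$$C_3^{(a)} = C_3^{(b)} + C_3^{(c)}, \tag{4}$$
where
$$C_3^{(b)} = \begin{bmatrix} a_3 & 0 & 0 \\ -a_3 & 0 & 0 \\ d_3 & -a_3 & a_3 \end{bmatrix} \quad \text{and} \quad C_3^{(c)} = \begin{bmatrix} 0 & b_3 & c_3 \\ 0 & c_3 & b_3 \\ 0 & 0 & 0 \end{bmatrix}.$$
Up to the sign of the second output, the expression (3) can be represented as
$$Y_{3\times 1} = C_3^{(b)} X_{3\times 1} + C_3^{(c)} X_{3\times 1}. \tag{5}$$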
The nonzero entries of the matrix $C_3^{(b)}$ lie in its first column and third row, and the identical entries differ only in sign. This property enables a reduction in the computational complexity of the 3-point DCT-VI, reducing the number of arithmetic operations without requiring further matrix transformations. In the matrix $C_3^{(c)}$, the row and column consisting exclusively of zero entries are removed. By applying the template $\begin{bmatrix} a & b \\ b & a \end{bmatrix}$ from [29], we obtain the following factorization of the resulting matrix $C_2^{(c)} = \begin{bmatrix} b_3 & c_3 \\ c_3 & b_3 \end{bmatrix}$:
$$C_2^{(c)} = H_2\, \mathrm{diag}\!\left(\frac{b_3 + c_3}{2},\ \frac{b_3 - c_3}{2}\right) H_2. \tag{6}$$
Let us denote $s_0^{(3)} = d_3$, $s_1^{(3)} = s_3^{(3)} = a_3$, and $s_2^{(3)} = (b_3 - c_3)/2$. We also note that $(b_3 + c_3)/2 = 1/2$. Then, based on factorization (6), we obtain the data-flow graph for multiplying the inputs by the matrix $C_2^{(c)}$, as shown in Figure 1.
The inputs of this graph are $x_1$ and $x_2$, and the outputs are $y_0$ and $y_1$, because the matrix $C_2^{(c)}$ is obtained from the 2nd and 3rd columns and the 1st and 2nd rows of the matrix $C_3^{(a)}$.
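Factorization (6) can be checked numerically. The sketch below multiplies out the three factors and compares the result with the 2 × 2 block built from $b_3$ and $c_3$; the helper `matmul` and the variable names are illustrative, not part of the paper.

```python
import math

r = 2.0 / math.sqrt(5.0)
b3 = r * math.cos(math.pi / 5.0)
c3 = r * math.cos(2.0 * math.pi / 5.0)

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

H2 = [[1, 1], [1, -1]]
D2 = [[(b3 + c3) / 2.0, 0.0],
      [0.0, (b3 - c3) / 2.0]]

C2c = matmul(matmul(H2, D2), H2)   # H2 * diag(...) * H2
half = (b3 + c3) / 2.0             # first diagonal entry

print(C2c)
print(half)
```

Because the first diagonal entry equals 1/2, that branch of the data-flow graph needs only a halving (a shift in fixed-point arithmetic) rather than a general multiplication.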
Next, we multiply the matrix $C_3^{(b)}$ by the input vector $X_{3\times 1} = [x_0, x_1, x_2]^T$, yielding the resulting vector $[a_3 x_0,\ -a_3 x_0,\ d_3 x_0 - a_3 x_1 + a_3 x_2]^T$. The corresponding data-flow graph is presented in Figure 2.
By merging the data-flow graphs in Figure 1 and Figure 2, and inverting the sign of the second row, we obtain the data-flow graph for the 3-point DCT-VI algorithm, shown in Figure 3.
However, some of the additions required by this algorithm may be redundant. To reduce the number of additions, we construct a factorization of the 3-point DCT-VI matrix corresponding to the data-flow graph in Figure 3. This process is illustrated in Figure 4. For each subgraph at a different hierarchical level of the data-flow graph in Figure 3, we construct the corresponding adjacency matrix, for example, for the subgraphs marked by green and blue rectangles.
The adjacency matrix of a graph in this paper is defined as an r × q matrix whose entries belong to {0, 1, −1}, where r and q denote the numbers of output and input vertices, respectively. The (i, j)-th entry of this matrix is equal to 1 if the j-th input vertex is connected to the i-th output vertex by an edge. An edge between the vertices of the graph may also be weighted by −1. A zero entry indicates that the corresponding edge is absent.
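The matrix
$$W_{5\times 3} = \begin{bmatrix} 1 & & \\ 1 & & \\ & 1 & \\ & & 1 \\ & & -1 \end{bmatrix}$$
corresponds to the green rectangle, and the matrix
$$W_{3\times 4} = \begin{bmatrix} 1 & 1 & 1 & \\ -1 & 1 & -1 & \\ & & & 1 \end{bmatrix}$$
corresponds to the blue rectangle.
As a result, the 3-point DCT-VI matrix is factorized as follows:
$$Y_{3\times 1} = P_3 W_{3\times 4} W_{4\times 5} D_5 W_{5\times 3} W_3^{(0)} X_{3\times 1}, \tag{7}$$
where
$$W_{4\times 5} = \begin{bmatrix} & 1 & & & \\ & & 1 & & \\ & & & 1 & \\ 1 & & & & 1 \end{bmatrix}$$
is defined as shown in Figure 4, $D_5 = \mathrm{diag}\left(s_0^{(3)}, s_1^{(3)}, 1/2, s_2^{(3)}, s_3^{(3)}\right)$, $P_3 = \mathrm{diag}(1, -1, 1)$, and $W_3^{(0)} = 1 \oplus H_2$.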
Based on factorization (7), we identify redundant additions using the adjacency matrices of the subgraphs defined in each hierarchical level of the data-flow graph in Figure 4. For example, in the matrix
$$W_{3\times 4} = \begin{bmatrix} 1 & 1 & 1 & \\ -1 & 1 & -1 & \\ & & & 1 \end{bmatrix}$$
the pairs of entries (1, 1) and (1, 3) as well as (2, 1) and (2, 3), which lie in the same columns, are repeated in the first and second rows up to a sign change. Therefore, the same pair of inputs is added twice, and one of these additions is redundant and can be removed.
To achieve this, we first add these two inputs and then multiply the result by the matrix $W_3^{(1)} = H_2 \oplus 1$. By replacing the matrices $P_3$, $W_{3\times 4}$, and $W_{4\times 5}$ with the matrix
$$W_{3\times 5} = \begin{bmatrix} & 1 & & 1 & \\ & & 1 & & \\ 1 & & & & 1 \end{bmatrix},$$
we obtain the following factorization of the 3-point DCT-VI matrix:
$$Y_{3\times 1} = W_3^{(1)} W_{3\times 5} D_5 W_{5\times 3} W_3^{(0)} X_{3\times 1}. \tag{8}$$
Based on the DCT-VI matrix factorization in (8), we present the data-flow graph for the 3-point DCT-VI algorithm in Figure 5. This data-flow graph does not include the redundant additions. By applying the proposed algorithm, the number of multiplications is reduced from 9 to 4, while the number of additions remains unchanged, and a single shift operation is introduced.
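The operation counts stated above (4 multiplications, 6 additions, and one halving) can be illustrated with a short sketch. This is one arrangement consistent with factorization (8), not the authors' pseudocode from Appendix A; the function names are illustrative, and the sign of the last product is absorbed into the final subtraction.

```python
import math

R = 2.0 / math.sqrt(5.0)
A3 = math.sqrt(2.0 / 5.0)
B3 = R * math.cos(math.pi / 5.0)
C3 = R * math.cos(2.0 * math.pi / 5.0)
D3 = 1.0 / math.sqrt(5.0)
S2 = (B3 - C3) / 2.0

def dct6_3_direct(x):
    """Reference: rows of the 3-point DCT-VI matrix applied directly."""
    x0, x1, x2 = x
    return [A3 * x0 + B3 * x1 + C3 * x2,
            A3 * x0 - C3 * x1 - B3 * x2,
            D3 * x0 - A3 * x1 + A3 * x2]

def dct6_3_fast(x):
    """Sketch of the fast scheme: 4 multiplications, 1 halving, 6 additions."""
    x0, x1, x2 = x
    u = x1 + x2            # addition 1 (butterfly, 1 (+) H2)
    v = x1 - x2            # addition 2
    m0 = D3 * x0           # multiplication 1
    m1 = A3 * x0           # multiplication 2
    m2 = 0.5 * u           # halving, since (b3 + c3)/2 = 1/2
    m3 = S2 * v            # multiplication 3
    m4 = A3 * v            # multiplication 4
    t = m1 + m3            # addition 3 (the shared sum)
    return [t + m2,        # addition 4 -> y0
            t - m2,        # addition 5 -> y1
            m0 - m4]       # addition 6 -> y2

x = [1.0, -2.0, 3.5]
print(dct6_3_direct(x))
print(dct6_3_fast(x))
```

The shared sum `t` is the addition that would otherwise be computed twice in the graph of Figure 3.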

3.2. Data-Flow Graph Construction

In Section 3, we present algorithms for short-length DCT-VI using data-flow graphs. To construct a data-flow graph, the factorization of the short-length DCT-VI matrix into sparse matrices is first obtained. This factorization consists of a diagonal matrix containing scaling factors and several sparse matrices whose elements belong to the set {0, 1, −1}.
The data-flow graph of a short-length DCT-VI algorithm has a hierarchical structure. At the first level, located on the left side of the graph, vertices corresponding to the algorithm inputs $x_n$, $n = 0, 1, \ldots, N-1$, are placed. Then, considering the factorization from right to left, each subsequent hierarchical level is constructed to correspond to the current sparse matrix in the DCT-VI matrix factorization. Each sparse matrix of the DCT-VI factorization serves as the adjacency matrix for the corresponding hierarchical level of the data-flow graph.
An edge drawn with a solid line corresponds to the value 1 in the adjacency matrix, whereas an edge drawn with a dashed line corresponds to the value −1. For example, in Figure 3, within the green rectangle, two edges extend from a single vertex, corresponding to the (4, 3) and (5, 3) entries of the matrix $W_{5\times 3}$. Since the (4, 3) entry equals 1 and the (5, 3) entry equals −1, the corresponding edges are marked with solid and dashed lines, respectively.
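Because each sparse factor acts as the adjacency matrix of one level of the graph, the number of additions at a level can be read off directly: an output vertex with m incoming edges costs m − 1 additions. The sketch below applies this rule to the factors of the 3-point algorithm. The matrices are one sign-consistent reading of factorization (8), assumed here for illustration, and the helper names are hypothetical.

```python
import math

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def additions(W):
    """Additions needed by one graph level whose adjacency matrix is W."""
    return sum(max(sum(1 for e in row if e != 0) - 1, 0) for row in W)

# Levels of the 3-point graph, listed from input side to output side.
W3_0 = [[1, 0, 0], [0, 1, 1], [0, 1, -1]]                 # 1 (+) H2
W5x3 = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, -1]]
W3x5 = [[0, 1, 0, 1, 0], [0, 0, 1, 0, 0], [1, 0, 0, 0, 1]]
W3_1 = [[1, 1, 0], [1, -1, 0], [0, 0, 1]]                 # H2 (+) 1

stages = [W3_0, W5x3, W3x5, W3_1]
total_additions = sum(additions(W) for W in stages)

# Scaling factors on the diagonal: d3, a3, 1/2, (b3 - c3)/2, a3.
r = 2.0 / math.sqrt(5.0)
a3, d3 = math.sqrt(2.0 / 5.0), 1.0 / math.sqrt(5.0)
b3, c3 = r * math.cos(math.pi / 5.0), r * math.cos(2.0 * math.pi / 5.0)
s = [d3, a3, 0.5, (b3 - c3) / 2.0, a3]
D5 = [[s[i] if i == j else 0.0 for j in range(5)] for i in range(5)]

# Multiply the factors from right to left to recover the transform matrix.
product = matmul(W3_1, matmul(W3x5, matmul(D5, matmul(W5x3, W3_0))))
print(total_additions)
print(product)
```

Multiplying the factors back together is a convenient way to confirm that a graph drawn from them computes the intended transform.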

3.3. The 4-Point DCT-VI Algorithm

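The four-point DCT-VI is defined as follows:
$$Y_{4\times 1} = C_4 X_{4\times 1}, \tag{9}$$
where $Y_{4\times 1} = [y_0, y_1, y_2, y_3]^T$, $X_{4\times 1} = [x_0, x_1, x_2, x_3]^T$, and
$$C_4 = \begin{bmatrix} a_4 & b_4 & c_4 & d_4 \\ a_4 & d_4 & -b_4 & -c_4 \\ a_4 & -c_4 & -d_4 & b_4 \\ e_4 & -a_4 & a_4 & -a_4 \end{bmatrix}$$
with $a_4 = \sqrt{2/7} \approx 0.5345$, $b_4 = \frac{2}{\sqrt{7}}\cos\frac{\pi}{7} \approx 0.6811$, $c_4 = \frac{2}{\sqrt{7}}\cos\frac{2\pi}{7} \approx 0.4713$, $d_4 = \frac{2}{\sqrt{7}}\cos\frac{3\pi}{7} \approx 0.1682$, and $e_4 = \frac{1}{\sqrt{7}} \approx 0.3780$.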
Let us consider the idea of developing the 4-point DCT-VI algorithm. First, we derive the factorization of the 4-point DCT-VI matrix; then, the data-flow graph is constructed, and the pseudocode is designed. To factorize the 4-point DCT-VI matrix, we decompose the original matrix into two matrices and factorize each one separately using the 3-point cyclic convolution pattern and a fan-like pattern that adds the same value to different outputs. Finally, the repeated additions are eliminated.
To implement this idea, we change the sign of the third column of the matrix $C_4$. The resulting matrix $C_4^{(a)}$ is decomposed into two submatrices:
$$C_4^{(a)} = C_4^{(b)} + C_4^{(c)}, \tag{10}$$
where
$$C_4^{(b)} = \begin{bmatrix} a_4 & & & \\ a_4 & & & \\ a_4 & & & \\ e_4 & -a_4 & -a_4 & -a_4 \end{bmatrix} \quad \text{and} \quad C_4^{(c)} = \begin{bmatrix} & b_4 & -c_4 & d_4 \\ & d_4 & b_4 & -c_4 \\ & -c_4 & d_4 & b_4 \\ & & & \end{bmatrix}.$$
In the first column and fourth row of the matrix $C_4^{(b)}$, several entries are identical except for their signs. This property allows a reduction in the number of arithmetic operations without requiring additional transforms for this matrix. In the matrix $C_4^{(c)}$, we remove the row and column consisting entirely of zeros.
The resulting matrix conforms to the cyclic convolution pattern
$$H_3 = \begin{bmatrix} h_1 & h_0 & h_2 \\ h_2 & h_1 & h_0 \\ h_0 & h_2 & h_1 \end{bmatrix}$$
with parameters $h_0 = -c_4$, $h_1 = b_4$, and $h_2 = d_4$. The factorization of this pattern is described in [30] as follows:
$$H_3 = A_{3\times 4}\, \mathrm{diag}\left(s_2^{(4)}, s_3^{(4)}, s_4^{(4)}, s_5^{(4)}\right) A_{4\times 3}, \tag{11}$$
where $s_2^{(4)} = (-c_4 + b_4 + d_4)/3$, $s_3^{(4)} = (2b_4 + c_4 - d_4)/3$, $s_4^{(4)} = (b_4 + 2c_4 + d_4)/3$, $s_5^{(4)} = (-c_4 + b_4 - 2d_4)/3$, and
$$A_{3\times 4} = \begin{bmatrix} 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & -1 \\ 1 & -1 & -1 & 0 \end{bmatrix}, \quad A_{4\times 3} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & -1 \\ 1 & -1 & 0 \\ 1 & 0 & -1 \end{bmatrix}.$$
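The cyclic convolution pattern of expression (11) can be exercised on its own. The sketch below is a four-multiplication evaluation of the product $H_3 z$ for arbitrary $h_0, h_1, h_2$, compared against the direct product. The sign placement in the pre- and post-addition stages is one consistent choice assumed for illustration (other equivalent sign conventions exist), and the function names are hypothetical.

```python
def cyclic3_direct(h, z):
    """Direct product with H3 = [[h1,h0,h2],[h2,h1,h0],[h0,h2,h1]]."""
    h0, h1, h2 = h
    z0, z1, z2 = z
    return [h1 * z0 + h0 * z1 + h2 * z2,
            h2 * z0 + h1 * z1 + h0 * z2,
            h0 * z0 + h2 * z1 + h1 * z2]

def cyclic3_fast(h, z):
    """Four-multiplication scheme: pre-additions, scaling, post-additions."""
    h0, h1, h2 = h
    z0, z1, z2 = z
    # Scaling factors (precomputed offline in a real implementation).
    s = [(h0 + h1 + h2) / 3.0,
         (2.0 * h1 - h0 - h2) / 3.0,
         (h1 - 2.0 * h0 + h2) / 3.0,
         (h0 + h1 - 2.0 * h2) / 3.0]
    # Pre-additions (5 additions).
    t = [z0 + z1 + z2, z1 - z2, z0 - z1, z0 - z2]
    # Four multiplications.
    p = [si * ti for si, ti in zip(s, t)]
    # Post-additions (6 additions).
    return [p[0] + p[2] + p[3],
            p[0] + p[1] - p[3],
            p[0] - p[1] - p[2]]

h = (1.0, 2.0, 3.0)
z = (0.5, -1.0, 4.0)
print(cyclic3_direct(h, z))
print(cyclic3_fast(h, z))
```

When $h_0 + h_1 + h_2 = 0$, the first product vanishes and only three multiplications remain, which is the situation exploited later for the 5-point transform.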
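From expression (11) we derive the factorization of the four-point DCT-VI matrix as shown below:
$$Y_{4\times 1} = W_{4\times 5} W_{5\times 7} D_7 W_{7\times 5} W_{5\times 4} P_4 X_{4\times 1}, \tag{12}$$
where
$$W_{4\times 5} = \begin{bmatrix} 1 & & 1 & 1 & \\ 1 & 1 & & -1 & \\ 1 & -1 & -1 & & \\ & & & & 1 \end{bmatrix}, \quad P_4 = \begin{bmatrix} 1 & & & \\ & 1 & & \\ & & -1 & \\ & & & 1 \end{bmatrix}, \quad W_{5\times 4} = \begin{bmatrix} 1 & & & \\ & 1 & 1 & 1 \\ & & 1 & -1 \\ & 1 & -1 & \\ & 1 & & -1 \end{bmatrix},$$
$$W_{5\times 7} = \begin{bmatrix} & 1 & 1 & & & & \\ & & & 1 & & & \\ & & & & 1 & & \\ & & & & & 1 & \\ 1 & & & & & & -1 \end{bmatrix}, \quad W_{7\times 5} = \begin{bmatrix} 1 & & & & \\ 1 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & 1 & \\ & & & & 1 \\ & 1 & & & \end{bmatrix},$$
$$D_7 = \mathrm{diag}\left(s_0^{(4)}, s_1^{(4)}, s_2^{(4)}, s_3^{(4)}, s_4^{(4)}, s_5^{(4)}, s_6^{(4)}\right), \quad s_0^{(4)} = e_4, \quad s_1^{(4)} = s_6^{(4)} = a_4.$$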
Figure 6 illustrates the data-flow graph of the four-point DCT-VI algorithm. Compared to the direct matrix–vector product, this approach reduces the number of multiplications from 16 to 7, while increasing the number of additions from 12 to 13.
The obtained data-flow graph has a regular structure; that is, a graph organization in which the same connectivity patterns and operation types are systematically repeated. In the graph shown in Figure 6, similar modules are presented on the left and right sides of the scaling-factor line. However, some differences also exist between these modules.
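The stated operation counts for the 4-point transform (7 multiplications and 13 additions) can be reproduced with the following sketch. It is an illustrative arrangement based on the decomposition into a fan-like part and a 3-point cyclic convolution, not the authors' Appendix A pseudocode; the function names and the DCT-VI scaling rule in `dct6_matrix` are assumptions, and sign changes of inputs are treated as free.

```python
import math

def dct6_matrix(N):
    """Reference DCT-VI matrix (assumed scaling; rows k, columns n)."""
    M = 2 * N - 1
    C = []
    for k in range(N):
        row = []
        for n in range(N):
            sigma = 2.0 / math.sqrt(M)
            if n == 0:
                sigma /= math.sqrt(2.0)
            if k == N - 1:
                sigma /= math.sqrt(2.0)
            row.append(sigma * math.cos(math.pi * n * (2 * k + 1) / M))
        C.append(row)
    return C

R = 2.0 / math.sqrt(7.0)
A4 = math.sqrt(2.0 / 7.0)
B4 = R * math.cos(math.pi / 7.0)
C4 = R * math.cos(2.0 * math.pi / 7.0)
D4 = R * math.cos(3.0 * math.pi / 7.0)
E4 = 1.0 / math.sqrt(7.0)
# Cyclic-convolution scaling factors with h0 = -c4, h1 = b4, h2 = d4.
S2 = (-C4 + B4 + D4) / 3.0
S3 = (2.0 * B4 + C4 - D4) / 3.0
S4 = (B4 + 2.0 * C4 + D4) / 3.0
S5 = (-C4 + B4 - 2.0 * D4) / 3.0

def dct6_4_fast(x):
    x0, x1, x2, x3 = x
    z0, z1, z2 = x1, -x2, x3          # sign change of the third input
    t0 = z0 + z1 + z2                 # 2 additions
    t1 = z1 - z2                      # 1 addition
    t2 = z0 - z1                      # 1 addition
    t3 = z0 - z2                      # 1 addition
    p0 = E4 * x0                      # 7 multiplications follow
    p1 = A4 * x0
    p2 = S2 * t0
    p3 = S3 * t1
    p4 = S4 * t2
    p5 = S5 * t3
    p6 = A4 * t0
    u = p1 + p2                       # 1 addition shared by y0, y1, y2
    return [u + p4 + p5,              # 2 additions
            u + p3 - p5,              # 2 additions
            u - p3 - p4,              # 2 additions
            p0 - p6]                  # 1 addition

x = [0.7, -1.2, 2.5, 0.3]
reference = [sum(c * v for c, v in zip(row, x)) for row in dct6_matrix(4)]
print(reference)
print(dct6_4_fast(x))
```

Adding $a_4 x_0$ to the common convolution term once, before the post-additions, is what keeps the count at 13 additions.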

3.4. Algorithm for the 5-Point DCT-VI

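We express the five-point DCT-VI as a matrix–vector product:
$$Y_{5\times 1} = C_5 X_{5\times 1}, \tag{13}$$
where $Y_{5\times 1} = [y_0, y_1, y_2, y_3, y_4]^T$, $X_{5\times 1} = [x_0, x_1, x_2, x_3, x_4]^T$, and
$$C_5 = \begin{bmatrix} a_5 & b_5 & c_5 & d_5 & e_5 \\ a_5 & d_5 & -d_5 & -f_5 & -d_5 \\ a_5 & -e_5 & -b_5 & d_5 & c_5 \\ a_5 & -c_5 & e_5 & d_5 & -b_5 \\ d_5 & -a_5 & a_5 & -a_5 & a_5 \end{bmatrix}.$$
The constants are defined as $a_5 = \sqrt{2/9} \approx 0.4714$, $b_5 = \frac{2}{\sqrt{9}}\cos\frac{\pi}{9} \approx 0.6265$, $c_5 = \frac{2}{\sqrt{9}}\cos\frac{2\pi}{9} \approx 0.5107$, $d_5 = \frac{2}{\sqrt{9}}\cos\frac{\pi}{3} \approx 0.3333$, $e_5 = \frac{2}{\sqrt{9}}\cos\frac{4\pi}{9} \approx 0.1158$, and $f_5 = \frac{2}{\sqrt{9}} \approx 0.6667$. It should be noted that $f_5 = 1 - d_5$.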
The idea of developing the 5-point DCT-VI algorithm is the same as that used for the 4-point algorithm. However, the matrix with identical entries has a more complex structure. The final reduction in the number of additions is not provided because no repeated additions were found.
Next, we multiply the second column of the matrix $C_5$ by −1 and split the resulting matrix $C_5^{(a)}$ into two submatrices:
$$C_5^{(a)} = C_5^{(b)} + C_5^{(c)}, \tag{14}$$
where
$$C_5^{(b)} = \begin{bmatrix} a_5 & & & d_5 & \\ a_5 & -d_5 & -d_5 & d_5 - 1 & -d_5 \\ a_5 & & & d_5 & \\ a_5 & & & d_5 & \\ d_5 & a_5 & a_5 & -a_5 & a_5 \end{bmatrix} \quad \text{and} \quad C_5^{(c)} = \begin{bmatrix} & -b_5 & c_5 & & e_5 \\ & & & & \\ & e_5 & -b_5 & & c_5 \\ & c_5 & e_5 & & -b_5 \\ & & & & \end{bmatrix}.$$
Let us multiply the matrix $C_5^{(b)}$ by the vector $[x_0, x_1, x_2, x_3, x_4]^T$. We obtain the output vector $[a_5 x_0 + d_5 x_3,\ a_5 x_0 - d_5(x_1 + x_2 - x_3 + x_4) - x_3,\ a_5 x_0 + d_5 x_3,\ a_5 x_0 + d_5 x_3,\ d_5 x_0 + a_5(x_1 + x_2 - x_3 + x_4)]^T$. The data-flow subgraph for calculating the entries of this output vector is shown by the green rectangle in Figure 7. By construction, the adjacency matrix at each hierarchical level of this subgraph is included in the factorization of $C_5^{(b)}$.
Then, the matrix $C_5^{(b)}$ is factorized taking into account the similar entries:
$$C_5^{(b)} = A_{5\times 6} A_6\, \mathrm{diag}\left(s_0^{(5)}, s_1^{(5)}, s_2^{(5)}, 1, s_3^{(5)}, s_4^{(5)}\right) A_{6\times 5} A_5, \tag{15}$$
where
A 5 × 6 = 1 1 1 1 1 1 1 1 ,   A 5 = 1 1 1 1 1 ,  
A 6 × 5   = 1 1 1 1 1 1 1 ,   A 6 = 1 1 1 1 1 1 1 1 ,  
and $s_0^{(5)} = a_5$, $s_1^{(5)} = d_5$, $s_2^{(5)} = d_5$, $s_3^{(5)} = d_5$, and $s_4^{(5)} = a_5$.
Next, the matrix $C_5^{(c)}$ is factorized. From $C_5^{(c)}$, we remove the rows and columns containing only zero entries. The resulting matrix
$$C_3^{(d)} = \begin{bmatrix} -b_5 & c_5 & e_5 \\ e_5 & -b_5 & c_5 \\ c_5 & e_5 & -b_5 \end{bmatrix}$$
corresponds to the cyclic convolution template $H_3 = \begin{bmatrix} h_1 & h_0 & h_2 \\ h_2 & h_1 & h_0 \\ h_0 & h_2 & h_1 \end{bmatrix}$, where $h_0 = c_5$, $h_1 = -b_5$, and $h_2 = e_5$. Using Equation (11), the matrix is expressed as
$$C_3^{(d)} = A_{3\times 4}\, \mathrm{diag}\left(\frac{h_0 + h_1 + h_2}{3},\ \frac{2h_1 - h_0 - h_2}{3},\ \frac{h_1 - 2h_0 + h_2}{3},\ \frac{h_0 + h_1 - 2h_2}{3}\right) A_{4\times 3}. \tag{16}$$
The first element of the diagonal matrix in expression (16) vanishes: $(h_0 + h_1 + h_2)/3 = (c_5 - b_5 + e_5)/3 \approx (0.5107 - 0.6265 + 0.1158)/3 = 0$. Therefore, Equation (16) can be reformulated as
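The vanishing of this first diagonal element is exact rather than a rounding coincidence. A short derivation, using the sum-to-product identity $\cos A + \cos B = 2\cos\frac{A+B}{2}\cos\frac{A-B}{2}$ and $2/\sqrt{9} = 2/3$, is sketched below:

```latex
\begin{aligned}
c_5 - b_5 + e_5
  &= \tfrac{2}{3}\left[\cos\tfrac{2\pi}{9} + \cos\tfrac{4\pi}{9} - \cos\tfrac{\pi}{9}\right] \\
  &= \tfrac{2}{3}\left[2\cos\tfrac{\pi}{3}\,\cos\tfrac{\pi}{9} - \cos\tfrac{\pi}{9}\right] \\
  &= \tfrac{2}{3}\left[\cos\tfrac{\pi}{9} - \cos\tfrac{\pi}{9}\right] = 0 .
\end{aligned}
```

Since $2\cos\frac{\pi}{3} = 1$, the corresponding multiplication can be dropped from the cyclic convolution block without any loss of accuracy.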
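$$C_3^{(d)} = A_3^{(1)}\, \mathrm{diag}\left(s_5^{(5)}, s_6^{(5)}, s_7^{(5)}\right) A_3^{(0)}, \tag{17}$$
where
$$A_3^{(1)} = \begin{bmatrix} 1 & 0 & 1 \\ -1 & 1 & 0 \\ 0 & -1 & -1 \end{bmatrix}, \quad A_3^{(0)} = \begin{bmatrix} 1 & -1 & 0 \\ 1 & 0 & -1 \\ 0 & 1 & -1 \end{bmatrix},$$
and $s_5^{(5)} = (-2b_5 - e_5 - c_5)/3$, $s_6^{(5)} = (-b_5 - 2c_5 + e_5)/3$, $s_7^{(5)} = (c_5 - b_5 - 2e_5)/3$.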
Using expressions (14), (15), and (17), the factorization of the five-point DCT-VI matrix is derived:
$$Y_{5\times 1} = W_{5\times 9} W_9 D_9 W_{9\times 8} W_{8\times 5} X_{5\times 1}, \tag{18}$$
where $D_9 = \mathrm{diag}\left(s_0^{(5)}, s_1^{(5)}, s_2^{(5)}, 1, s_3^{(5)}, s_4^{(5)}, s_5^{(5)}, s_6^{(5)}, s_7^{(5)}\right)$, $W_9 = A_6 \oplus A_3^{(1)}$, $W_{9\times 8} = A_{6\times 5} \oplus A_3^{(0)}$,
W 5 × 9 = 1 1 1 1 1 1 1 1 1 1 1 ,   W 8 × 5 = 1 1 1 1 1 1 1 1 .
In Figure 7, we present the data-flow graph of the five-point DCT-VI algorithm, which reduces the number of multiplications from 25 to 8 compared to the direct matrix–vector multiplication. The number of additions is also reduced from 20 to 16 (see Figure 7).
In this data-flow graph, two modules are explicitly identified. The module in the upper part of the graph (green rectangle) corresponds to the factorization of the matrix C 5 ( b ) with repeated entries. The resulting subgraph has a highly irregular structure due to the sparsity of the matrix C 5 ( b ) . The module in the lower part of the graph (blue rectangle) is related to the cyclic convolution factorization of the matrix C 3 ( d ) and is also irregular.
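The two modules described above can be mimicked in a short sketch that reaches the stated counts of 8 multiplications and 16 additions. This is an illustrative arrangement, assuming the decomposition into a repeated-entry part and a three-multiplication cyclic convolution; it is not the authors' Appendix A pseudocode, the grouping of operations may differ from Figure 7, and the function names and the scaling rule in `dct6_matrix` are assumptions.

```python
import math

def dct6_matrix(N):
    """Reference DCT-VI matrix (assumed scaling; rows k, columns n)."""
    M = 2 * N - 1
    C = []
    for k in range(N):
        row = []
        for n in range(N):
            sigma = 2.0 / math.sqrt(M)
            if n == 0:
                sigma /= math.sqrt(2.0)
            if k == N - 1:
                sigma /= math.sqrt(2.0)
            row.append(sigma * math.cos(math.pi * n * (2 * k + 1) / M))
        C.append(row)
    return C

A5 = math.sqrt(2.0) / 3.0
B5 = (2.0 / 3.0) * math.cos(math.pi / 9.0)
C5 = (2.0 / 3.0) * math.cos(2.0 * math.pi / 9.0)
D5 = 1.0 / 3.0
E5 = (2.0 / 3.0) * math.cos(4.0 * math.pi / 9.0)
# Cyclic-convolution factors for h0 = c5, h1 = -b5, h2 = e5 (h0 + h1 + h2 = 0).
S5 = (-2.0 * B5 - E5 - C5) / 3.0
S6 = (-B5 - 2.0 * C5 + E5) / 3.0
S7 = (C5 - B5 - 2.0 * E5) / 3.0

def dct6_5_fast(x):
    x0, x1, x2, x3, x4 = x
    z0, z1, z2 = -x1, x2, x4          # sign change of the second input
    s = z0 + z1 + z2 - x3             # 3 additions
    t1 = z0 - z1                      # 3 additions for the convolution inputs
    t2 = z0 - z2
    t3 = z1 - z2
    m0 = A5 * x0                      # 5 multiplications (repeated-entry part)
    m1 = D5 * x3
    m2 = D5 * s
    m3 = D5 * x0
    m4 = A5 * s
    q1 = S5 * t1                      # 3 multiplications (cyclic convolution)
    q2 = S6 * t2
    q3 = S7 * t3
    r = m0 + m1                       # 1 addition shared by y0, y2, y3
    y0 = r + (q1 + q3)                # 2 additions
    y2 = r + (q2 - q1)                # 2 additions
    y3 = r - (q2 + q3)                # 2 additions
    y1 = m0 - m2 - x3                 # 2 additions
    y4 = m3 + m4                      # 1 addition
    return [y0, y1, y2, y3, y4]

x = [1.0, 2.0, -0.5, 3.0, 0.25]
reference = [sum(c * v for c, v in zip(row, x)) for row in dct6_matrix(5)]
print(reference)
print(dct6_5_fast(x))
```

The term `- x3` in `y1` uses the relation $f_5 = 1 - d_5$, which is why one diagonal entry equals 1 and costs no multiplication.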

3.5. Algorithm for 6-Point DCT-VI

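Let us derive an algorithm for the six-point DCT-VI by expressing this transform as
$$Y_{6\times 1} = C_6 X_{6\times 1}, \tag{19}$$
where $Y_{6\times 1} = [y_0, y_1, y_2, y_3, y_4, y_5]^T$, $X_{6\times 1} = [x_0, x_1, x_2, x_3, x_4, x_5]^T$, and $C_6$ is the DCT-VI transform matrix defined as
$$C_6 = \begin{bmatrix} a_6 & b_6 & c_6 & d_6 & e_6 & f_6 \\ a_6 & d_6 & -f_6 & -c_6 & -b_6 & -e_6 \\ a_6 & f_6 & -b_6 & -e_6 & c_6 & d_6 \\ a_6 & -e_6 & -d_6 & b_6 & -f_6 & -c_6 \\ a_6 & -c_6 & e_6 & f_6 & -d_6 & b_6 \\ g_6 & -a_6 & a_6 & -a_6 & a_6 & -a_6 \end{bmatrix}$$
with $a_6 = \sqrt{2/11} \approx 0.4264$, $b_6 = \frac{2}{\sqrt{11}}\cos\frac{\pi}{11} \approx 0.5786$, $c_6 = \frac{2}{\sqrt{11}}\cos\frac{2\pi}{11} \approx 0.5073$, $d_6 = \frac{2}{\sqrt{11}}\cos\frac{3\pi}{11} \approx 0.3949$, $e_6 = \frac{2}{\sqrt{11}}\cos\frac{4\pi}{11} \approx 0.2505$, $f_6 = \frac{2}{\sqrt{11}}\cos\frac{5\pi}{11} \approx 0.0858$, and $g_6 = \frac{1}{\sqrt{11}} \approx 0.3015$.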
The idea of developing the 6-point DCT-VI algorithm is the same as that used for the 4-point DCT-VI algorithm. However, in this case, we first apply a permutation of the matrix columns and rows so that both the 5-point cyclic convolution pattern and the fan-like pattern, where the same value is added to multiple outputs, can be utilized. Finally, repeated additions are eliminated by replacing the fan-like structure with an adder tree. This technique is discussed in detail in this subsection.
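First, we introduce the permutations
$$\pi_1 = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 \\ 1 & 5 & 3 & 2 & 6 & 4 \end{pmatrix} \quad \text{and} \quad \pi_2 = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 \\ 1 & 5 & 4 & 2 & 3 & 6 \end{pmatrix}.$$
The order of the columns and rows of the matrix $C_6$ is changed using $\pi_1$ and $\pi_2$, respectively. In addition, the signs of the second and third columns of the permuted matrix are inverted. The corresponding transformation matrices are
$$P_6^{(0)} = \begin{bmatrix} 1 & & & & & \\ & & & & -1 & \\ & & -1 & & & \\ & 1 & & & & \\ & & & & & 1 \\ & & & 1 & & \end{bmatrix} \quad \text{and} \quad P_6^{(1)} = \begin{bmatrix} 1 & & & & & \\ & & & 1 & & \\ & & & & 1 & \\ & & 1 & & & \\ & 1 & & & & \\ & & & & & 1 \end{bmatrix}.$$
After applying these operations, we obtain the matrix
$$C_6^{(a)} = \begin{bmatrix} a_6 & -e_6 & -c_6 & b_6 & f_6 & d_6 \\ a_6 & d_6 & -e_6 & -c_6 & b_6 & f_6 \\ a_6 & f_6 & d_6 & -e_6 & -c_6 & b_6 \\ a_6 & b_6 & f_6 & d_6 & -e_6 & -c_6 \\ a_6 & -c_6 & b_6 & f_6 & d_6 & -e_6 \\ g_6 & -a_6 & -a_6 & -a_6 & -a_6 & -a_6 \end{bmatrix},$$
which is decomposed into the sum of two submatrices:
$$C_6^{(a)} = C_6^{(b)} + C_6^{(c)},$$
where
$$C_6^{(b)} = \begin{bmatrix} a_6 & & & & & \\ a_6 & & & & & \\ a_6 & & & & & \\ a_6 & & & & & \\ a_6 & & & & & \\ g_6 & -a_6 & -a_6 & -a_6 & -a_6 & -a_6 \end{bmatrix} \quad \text{and} \quad C_6^{(c)} = \begin{bmatrix} & -e_6 & -c_6 & b_6 & f_6 & d_6 \\ & d_6 & -e_6 & -c_6 & b_6 & f_6 \\ & f_6 & d_6 & -e_6 & -c_6 & b_6 \\ & b_6 & f_6 & d_6 & -e_6 & -c_6 \\ & -c_6 & b_6 & f_6 & d_6 & -e_6 \\ & & & & & \end{bmatrix}.$$
Next, we remove the zero row and column from the matrix $C_6^{(c)}$. The resulting matrix is given by
$$C_5^{(d)} = \begin{bmatrix} -e_6 & -c_6 & b_6 & f_6 & d_6 \\ d_6 & -e_6 & -c_6 & b_6 & f_6 \\ f_6 & d_6 & -e_6 & -c_6 & b_6 \\ b_6 & f_6 & d_6 & -e_6 & -c_6 \\ -c_6 & b_6 & f_6 & d_6 & -e_6 \end{bmatrix},$$
which is the circular convolution matrix for N = 5:
$$H_5 = \begin{bmatrix} h_0 & h_4 & h_3 & h_2 & h_1 \\ h_1 & h_0 & h_4 & h_3 & h_2 \\ h_2 & h_1 & h_0 & h_4 & h_3 \\ h_3 & h_2 & h_1 & h_0 & h_4 \\ h_4 & h_3 & h_2 & h_1 & h_0 \end{bmatrix}.$$
Here $h_0 = -e_6$, $h_1 = d_6$, $h_2 = f_6$, $h_3 = b_6$, and $h_4 = -c_6$.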
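The effect of the permutations can be checked numerically: after reordering rows and columns and inverting the signs of the second and third columns, the upper-right 5 × 5 block should be circulant. The sketch below performs this check. The function name `dct6_matrix`, its scaling rule, and the convention that the new row i (column j) is taken from the old row $\pi_2(i)$ (column $\pi_1(j)$) are assumptions made for illustration.

```python
import math

def dct6_matrix(N):
    """Reference DCT-VI matrix (assumed scaling; rows k, columns n)."""
    M = 2 * N - 1
    C = []
    for k in range(N):
        row = []
        for n in range(N):
            sigma = 2.0 / math.sqrt(M)
            if n == 0:
                sigma /= math.sqrt(2.0)
            if k == N - 1:
                sigma /= math.sqrt(2.0)
            row.append(sigma * math.cos(math.pi * n * (2 * k + 1) / M))
        C.append(row)
    return C

C6 = dct6_matrix(6)
pi1 = [1, 5, 3, 2, 6, 4]        # column permutation (1-based, as in the text)
pi2 = [1, 5, 4, 2, 3, 6]        # row permutation (1-based)
signs = [1, -1, -1, 1, 1, 1]    # sign inversion of the 2nd and 3rd columns

# Permuted and sign-adjusted matrix.
C6a = [[signs[j] * C6[pi2[i] - 1][pi1[j] - 1] for j in range(6)]
       for i in range(6)]

# Upper-right 5 x 5 block: rows 1..5, columns 2..6.
block = [row[1:] for row in C6a[:5]]

# A circulant matrix satisfies block[i][j] == block[(i+1) % 5][(j+1) % 5].
is_circulant = all(
    abs(block[i][j] - block[(i + 1) % 5][(j + 1) % 5]) < 1e-12
    for i in range(5) for j in range(5)
)
print(is_circulant)
print(["%.4f" % v for v in block[0]])
```

The first column of the block lists $h_0, h_1, \ldots, h_4$, so the same check also gives the numerical values of the convolution kernel.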
For this matrix, the following factorization is obtained based on [30]:
$$C_5^{(d)} = W_5 W_{5\times 7} W_{7\times 10} D_{10} W_{10\times 7} W_{7\times 5} W_5 P_5, \tag{22}$$
where D 10   =   diag ( s 2 6 ,   s 3 6 ,   s 4 6 ,   s 5 6 ,   s 6 6 ,   s 7 6 ,   s 8 6 ,   s 9 6 ,   s 10 6 ,   s 11 ( 6 ) ) , s 2 6   =   ( e 6 + d 6 + f 6 + b 6 c 6 ) / 5 ,   s 3 6   = e 6 + f 6 b 6 + c 6 ,   s 4 6   = e 6 c 6 b 6 d 6 ,   s 5 6   = e 6 + b 6 ,   s 6 6   = e 6 + d 6 f 6 + c 6 ,   s 7 6   = e 6 + b 6 f 6 d 6 ,   s 8 6   = e 6 + f 6 ,   s 9 6   = e 6 c 6 ,   s 10 6   = e 6 + d 6 ,   s 11 6   = e 6   s 0 6 ,   W 5 × 7   =   1 T 2 × 3 I 2 , T 2 × 3   = 1 0 1 0 1 1 , W 7 × 10   =   1 T 2 × 3 T 2 × 3 T 2 × 3 , W 10 × 7   =   1 T 3 × 2 T 3 × 2 T 3 × 2 , T 3 × 2   = 1 0 0 1 1 1 , W 7 × 5   =   1 T 3 × 2 I 2 , W 5   =   1 1 1 1 1 1 1 1 1 1 1 1 1 , and P 5   =   1 1 1 1 1 .
To factorize the initial 6-point DCT-VI matrix, we add to the factorization (22) matrices that take into account the entries of the matrix C 6 ( b ) :
W 7 × 6   =   1 1 1 1 1 1 1 1 1 1 1 and   W 6 × 8   =   1 1 1 1 1 1 1 1 1 1 1 1 .
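Then, the factorization (22) is transformed into the following factorization of the initial 6-point DCT-VI matrix:
$$Y_{6\times 1} = P_6^{(1)} W_{6\times 8} W_8 W_{8\times 10} W_{10\times 13} D_{13} W_{13\times 9} W_{9\times 7} W_7 W_{7\times 6} P_6^{(3)} P_6^{(0)} X_{6\times 1}, \tag{23}$$
where
$D_{13} = s_0^{(6)} \oplus s_1^{(6)} \oplus D_{10} \oplus s_{12}^{(6)}$, $s_0^{(6)} = g_6$, $s_1^{(6)} = a_6$, $s_{12}^{(6)} = a_6$, $W_7 = 1 \oplus W_5 \oplus 1$, $W_{8\times 10} = I_3 \oplus (T_{2\times 3} \otimes I_2) \oplus 1$, $W_{9\times 7} = I_2 \oplus (T_{3\times 2} \otimes I_2) \oplus 1$, $W_{13\times 9} = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \oplus T_{3\times 2} \oplus T_{3\times 2} \oplus T_{3\times 2} \oplus 1$, $W_{10\times 13} = I_3 \oplus T_{2\times 3} \oplus T_{2\times 3} \oplus T_{2\times 3} \oplus 1$, $W_8 = I_2 \oplus W_5 \oplus 1$, and $P_6^{(3)} = 1 \oplus P_5$.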
The data-flow graph corresponding to the factorization (23) is shown in Figure 8. In this data-flow graph, the additions in the fan-like structures marked by the green and blue rectangles are repeated and, therefore, redundant. We remove the left fan-like structure in the green rectangle and the right fan-like structure in the blue rectangle. After this, the final factorization is obtained.
Based on factorization (22), the matrices $W_6 = 1 \oplus W_5$, $W_{6\times 8} = I_2 \oplus (T_{2\times 3} \otimes I_2)$, and $W_{8\times 6} = I_2 \oplus (T_{3\times 2} \otimes I_2)$ are introduced. Subsequently, the factorization of the six-point DCT-VI matrix can be expressed as follows:
$$Y_{6\times 1} = P_6^{(2)} W_6 W_{6\times 8} W_{8\times 13} D_{13} W_{13\times 8} W_{8\times 6} W_6 P_6^{(3)} P_6^{(0)} X_{6\times 1},$$
where
P 6 ( 2 ) = P 6 ( 1 ) 1 1 1 1 1 1 = 1 1 1 1 1 1 ,  
P 6 ( 0 ) = 1 1 1 1 1 1 ,   W 13 × 8   = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ,  
W 8 × 13 = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Figure 9 illustrates the six-point DCT-VI algorithm. Compared to the direct matrix–vector implementation, the number of multiplications is reduced from 36 to 13, while the number of additions increases from 30 to 33.
The data-flow graph presented in Figure 9 includes two permutation modules that rearrange the order of data samples without changing their values. These modules are placed near the input and output vertices of the graph. The graph also contains repeated computational modules: the subgraphs consisting of five butterfly modules are located at the levels adjacent to the permutation modules, and further repeated modules appear on both the left and right sides of the scaling-factor line. With a few exceptions, the graph can be divided into stages whose topology is symmetrical relative to the scaling-factor line.

3.6. Algorithm for Seven-Point DCT-VI

Let us construct the algorithm for the 7-point DCT-VI. This transform can be represented as follows:
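$$Y_{7\times 1} = C_7 X_{7\times 1},$$
where $Y_{7\times 1} = [y_0, y_1, y_2, y_3, y_4, y_5, y_6]^T$, $X_{7\times 1} = [x_0, x_1, x_2, x_3, x_4, x_5, x_6]^T$, and
$$C_7 = \begin{bmatrix}
a_7 & b_7 & c_7 & d_7 & e_7 & f_7 & g_7 \\
a_7 & d_7 & g_7 & -e_7 & -b_7 & -c_7 & -f_7 \\
a_7 & f_7 & -d_7 & -c_7 & g_7 & b_7 & e_7 \\
a_7 & -g_7 & -b_7 & f_7 & c_7 & -e_7 & -d_7 \\
a_7 & -e_7 & -f_7 & b_7 & -d_7 & -g_7 & c_7 \\
a_7 & -c_7 & e_7 & -g_7 & -f_7 & d_7 & -b_7 \\
h_7 & -a_7 & a_7 & -a_7 & a_7 & -a_7 & a_7
\end{bmatrix}$$
with $a_7 = \sqrt{2/13} \approx 0.3922$, $b_7 = \frac{2}{\sqrt{13}} \cos\frac{\pi}{13} \approx 0.5386$, $c_7 = \frac{2}{\sqrt{13}} \cos\frac{2\pi}{13} \approx 0.4912$, $d_7 = \frac{2}{\sqrt{13}} \cos\frac{3\pi}{13} \approx 0.4152$, $e_7 = \frac{2}{\sqrt{13}} \cos\frac{4\pi}{13} \approx 0.3151$, $f_7 = \frac{2}{\sqrt{13}} \cos\frac{5\pi}{13} \approx 0.1967$, $g_7 = \frac{2}{\sqrt{13}} \cos\frac{6\pi}{13} \approx 0.0669$, and $h_7 = \frac{1}{\sqrt{13}} \approx 0.2774$.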
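The coefficient values listed above can be reproduced numerically. The following is a minimal sketch, assuming the entry in row $r$ and column $c$ is $\frac{2}{\sqrt{2N-1}}\cos\frac{(2r+1)c\pi}{2N-1}$, with an extra factor $1/\sqrt{2}$ in the first column and in the last row. This layout is inferred from the printed matrix and constants rather than quoted from the defining equation, and the helper name `dct6_matrix` is hypothetical.

```python
import math

def dct6_matrix(n):
    """Build the n-point matrix in the layout of the printed C_7 (hypothetical helper)."""
    m = 2 * n - 1
    scale = 2.0 / math.sqrt(m)
    matrix = []
    for r in range(n):
        row = []
        for c in range(n):
            value = scale * math.cos((2 * r + 1) * c * math.pi / m)
            if c == 0:       # assumed: first column carries an extra 1/sqrt(2)
                value /= math.sqrt(2.0)
            if r == n - 1:   # assumed: last row carries an extra 1/sqrt(2)
                value /= math.sqrt(2.0)
            row.append(value)
        matrix.append(row)
    return matrix

C7 = dct6_matrix(7)
for row in C7:
    # One matrix row per line, four decimals as in the text
    print(" ".join(f"{v:8.4f}" for v in row))
```

The first printed row should correspond to $a_7, b_7, \ldots, g_7$, and the signs of the remaining rows follow directly from the cosine arguments.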
The development of the 7-point DCT-VI algorithm follows the same approach as that used for the 6-point DCT-VI algorithm. Specifically, we define the permutations,
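$$\pi_3 = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 1 & 4 & 2 & 3 & 5 & 6 & 7 \end{pmatrix} \quad \text{and} \quad \pi_4 = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 1 & 2 & 3 & 5 & 6 & 4 & 7 \end{pmatrix},$$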
and apply π 3 to reorder the rows of the matrix C 7 and π 4 to reorder its columns. In addition, the signs of the third, fourth, and seventh columns are changed. The corresponding permutation matrices are as follows:
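$$P_7^{(1)} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \quad \text{and} \quad P_7^{(0)} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -1
\end{bmatrix}.$$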
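Then, the resulting matrix
$$C_7^{(a)} = \begin{bmatrix}
a_7 & b_7 & -c_7 & -e_7 & f_7 & d_7 & -g_7 \\
a_7 & -g_7 & b_7 & -c_7 & -e_7 & f_7 & d_7 \\
a_7 & d_7 & -g_7 & b_7 & -c_7 & -e_7 & f_7 \\
a_7 & f_7 & d_7 & -g_7 & b_7 & -c_7 & -e_7 \\
a_7 & -e_7 & f_7 & d_7 & -g_7 & b_7 & -c_7 \\
a_7 & -c_7 & -e_7 & f_7 & d_7 & -g_7 & b_7 \\
h_7 & -a_7 & -a_7 & -a_7 & -a_7 & -a_7 & -a_7
\end{bmatrix}$$
is decomposed into the sum of two matrices:
$$C_7^{(a)} = C_7^{(b)} + C_7^{(c)},$$
where
$$C_7^{(b)} = \begin{bmatrix}
a_7 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_7 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_7 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_7 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_7 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_7 & 0 & 0 & 0 & 0 & 0 & 0 \\
h_7 & -a_7 & -a_7 & -a_7 & -a_7 & -a_7 & -a_7
\end{bmatrix}, \quad C_7^{(c)} = \begin{bmatrix}
0 & b_7 & -c_7 & -e_7 & f_7 & d_7 & -g_7 \\
0 & -g_7 & b_7 & -c_7 & -e_7 & f_7 & d_7 \\
0 & d_7 & -g_7 & b_7 & -c_7 & -e_7 & f_7 \\
0 & f_7 & d_7 & -g_7 & b_7 & -c_7 & -e_7 \\
0 & -e_7 & f_7 & d_7 & -g_7 & b_7 & -c_7 \\
0 & -c_7 & -e_7 & f_7 & d_7 & -g_7 & b_7 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}.$$
Next, we remove the rows and columns containing only zero entries from $C_7^{(c)}$ to obtain the matrix
$$C_6^{(d)} = \begin{bmatrix}
b_7 & -c_7 & -e_7 & f_7 & d_7 & -g_7 \\
-g_7 & b_7 & -c_7 & -e_7 & f_7 & d_7 \\
d_7 & -g_7 & b_7 & -c_7 & -e_7 & f_7 \\
f_7 & d_7 & -g_7 & b_7 & -c_7 & -e_7 \\
-e_7 & f_7 & d_7 & -g_7 & b_7 & -c_7 \\
-c_7 & -e_7 & f_7 & d_7 & -g_7 & b_7
\end{bmatrix}.$$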
The matrix C 6 ( d ) is a circular convolution matrix, which can be presented as [30]
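$$H_6 = \begin{bmatrix}
h_0 & h_5 & h_4 & h_3 & h_2 & h_1 \\
h_1 & h_0 & h_5 & h_4 & h_3 & h_2 \\
h_2 & h_1 & h_0 & h_5 & h_4 & h_3 \\
h_3 & h_2 & h_1 & h_0 & h_5 & h_4 \\
h_4 & h_3 & h_2 & h_1 & h_0 & h_5 \\
h_5 & h_4 & h_3 & h_2 & h_1 & h_0
\end{bmatrix},$$
where $h_0 = b_7$, $h_1 = -g_7$, $h_2 = d_7$, $h_3 = f_7$, $h_4 = -e_7$, and $h_5 = -c_7$.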
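The structure described above can be checked numerically. The sketch below rebuilds the 7-point matrix from the cosine kernel (the layout is an assumption inferred from the printed constants), applies $\pi_3$ to the rows and $\pi_4$ to the columns, flips the signs of the third, fourth, and seventh columns of the permuted matrix, and tests whether the inner $6 \times 6$ block is circulant. The helper names are hypothetical, and reading the sign changes as acting on the permuted columns is also an assumption.

```python
import math

def dct6_matrix(n):
    """Assumed layout: cos((2r+1)c*pi/(2n-1)), scaled in first column and last row."""
    m = 2 * n - 1
    def entry(r, c):
        v = 2.0 / math.sqrt(m) * math.cos((2 * r + 1) * c * math.pi / m)
        if c == 0:
            v /= math.sqrt(2.0)
        if r == n - 1:
            v /= math.sqrt(2.0)
        return v
    return [[entry(r, c) for c in range(n)] for r in range(n)]

def permute_and_flip(C, row_perm, col_perm, flipped_cols):
    """Reorder rows/columns (1-based permutations) and negate the listed new columns."""
    n = len(C)
    out = []
    for i in range(n):
        row = []
        for j in range(n):
            v = C[row_perm[i] - 1][col_perm[j] - 1]
            row.append(-v if (j + 1) in flipped_cols else v)
        out.append(row)
    return out

def is_circulant(block, tol=1e-12):
    """A matrix is circulant when entry (i, j) equals entry (i+1, j+1) with wrap-around."""
    n = len(block)
    return all(abs(block[i][j] - block[(i + 1) % n][(j + 1) % n]) < tol
               for i in range(n) for j in range(n))

C7 = dct6_matrix(7)
C7a = permute_and_flip(C7, [1, 4, 2, 3, 5, 6, 7], [1, 2, 3, 5, 6, 4, 7], {3, 4, 7})
block = [row[1:] for row in C7a[:6]]   # rows 1-6, columns 2-7
h = [row[0] for row in block]          # first column gives h_0 ... h_5
print(is_circulant(block))
print([round(v, 4) for v in h])
```

Under these assumptions, the border of the permuted matrix consists of a constant first column and a last row that repeats the same magnitude with opposite sign, which is what allows the two to share scaling factors.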
Using the factorization of the matrix $H_6$ from [30], the following expression is obtained:
$$H_6 = W_6^{(a)} W_6^{(1)} W_6 W_{6\times 8} \times \operatorname{diag}\left(s_2^{(7)}, s_3^{(7)}, s_4^{(7)}, s_5^{(7)}, s_6^{(7)}, s_7^{(7)}, s_8^{(7)}, s_9^{(7)}\right) W_{8\times 6} W_6 W_6^{(0)} W_6^{(a)},$$
where W 6 × 8   =   W 3 × 4 W 3 × 4 , W 8 × 6   =   W 4 × 3 W 4 × 3 ,   W 6 ( a )   =   H 2 I 3 , W 6   = 1 1 1 1 1 0 1 0 1 1 1 1 1 1 0 1 0 1 , W 3 × 4   = 1 1 1 1 1 , W 4 × 3   = 1 1 1 1 1 , W 6 ( 0 )   =   1 1 1 1 1 1 , W 6 ( 1 )   =   I 4 ( 1 ) 1 , s 2 7   =   ( b 7 g 7 + d 7 + f 7 e 7 c 7 ) / 6 , s 3 7   =   ( d 7 c 7 b 7 f 7 ) / 2 , s 4 7   =   ( g 7 e 7 b 7 f 7 ) / 2 , s 5 7   =   ( 2 b 7 + g 7 d 7 + 2 f 7 + e 7 + c 7 ) / 6 , s 6 7   =   ( b 7 + g 7 + d 7 f 7 e 7 + c 7 ) / 6 , s 7 7   =   ( d 7 + c 7 b 7 + f 7 ) / 2 , s 8 7   =   ( e 7 + g 7 b 7 + f 7 ) / 2 , s 9 7   =   ( 2 b 7 g 7 d 7 2 f 7 + e 7 c 7 ) / 6 .
To reduce the number of arithmetic operations for the 7-point DCT-VI, we also consider the entries in the first column and the last row of the matrix $C_7^{(b)}$, which differ only in sign. Exploiting this property leads to the following factorization of the matrix $C_7$ for the 7-point DCT-VI:
$$Y_{7\times 1} = P_7^{(1)} W_{7\times 8} W_8^{(1)} W_8 W_{8\times 11} D_{11} W_{11\times 7} W_7^{(0)} W_7 P_7^{(0)} X_{7\times 1},$$
where D 11   =   diag ( s 0 7 ,   s 1 7 ,   s 2 7 ,   s 3 7 ,   s 4 7 ,   s 5 7 ,   s 6 7 ,   s 7 ( 7 ) ,   s 8 7 ,   s 9 ( 7 ) s 10 7 ) , s 0 7   =   a 7 , s 1 7   =   h 7 , s 10 7   = a 7 , W 8 × 11   =   1 1 1 T 2 × 3 1 T 2 × 3 1 , W 7   =   1 W 6 ( a ) , W 8   =   1 W 6 1 , W 8 ( 1 )   =   1 W 6 ( 1 ) W 6 1 , W 7 ( 0 )   =   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 , W 7 × 8   = 1 1 1 1 1 1 1 1 , W 16 × 7   = 1 1 1 1 1 1 1 1 1 1 1 1 1 .
Consequently, a fast algorithm for the 7-point DCT-VI has been developed, as illustrated in Figure 10 with a data-flow graph. This algorithm reduces the number of multiplications from 49 to 11, while the number of additions is slightly reduced from 42 to 36.
The data-flow graph shown in Figure 10 includes permutation modules at the input and output that perform index reordering of data samples, enabling regular computational structures in fast transform implementations without introducing additional arithmetic operations. As a result, the graph also contains repeated modules on both the left and right sides of the scaling-factor line, in particular butterfly modules near the input and output permutations. To add the same value to all outputs, a fan-like structure is included before the output permutation.

3.7. Algorithm for Eight-Point DCT-VI

To design the algorithm for the eight-point DCT-VI, the transform can be expressed as follows:
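$$Y_{8\times 1} = C_8 X_{8\times 1},$$
where $Y_{8\times 1} = [y_0, y_1, y_2, y_3, y_4, y_5, y_6, y_7]^T$, $X_{8\times 1} = [x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7]^T$, and
$$C_8 = \begin{bmatrix}
a_8 & b_8 & c_8 & d_8 & e_8 & f_8 & g_8 & h_8 \\
a_8 & d_8 & g_8 & -g_8 & -d_8 & -p_8 & -d_8 & -g_8 \\
a_8 & f_8 & -f_8 & -p_8 & -f_8 & f_8 & p_8 & f_8 \\
a_8 & h_8 & -b_8 & -g_8 & c_8 & f_8 & -d_8 & -e_8 \\
a_8 & -g_8 & -d_8 & d_8 & g_8 & -p_8 & g_8 & d_8 \\
a_8 & -e_8 & -h_8 & d_8 & -b_8 & f_8 & g_8 & -c_8 \\
a_8 & -c_8 & e_8 & -g_8 & -h_8 & f_8 & -d_8 & b_8 \\
f_8 & -a_8 & a_8 & -a_8 & a_8 & -a_8 & a_8 & -a_8
\end{bmatrix}$$
with $a_8 = \sqrt{2/15} \approx 0.3651$, $b_8 = \frac{2}{\sqrt{15}} \cos\frac{\pi}{15} \approx 0.5051$, $c_8 = \frac{2}{\sqrt{15}} \cos\frac{2\pi}{15} \approx 0.4718$, $d_8 = \frac{2}{\sqrt{15}} \cos\frac{\pi}{5} \approx 0.4178$, $e_8 = \frac{2}{\sqrt{15}} \cos\frac{4\pi}{15} \approx 0.3455$, $f_8 = \frac{2}{\sqrt{15}} \cos\frac{\pi}{3} \approx 0.2582$, $g_8 = \frac{2}{\sqrt{15}} \cos\frac{2\pi}{5} \approx 0.1596$, $h_8 = \frac{2}{\sqrt{15}} \cos\frac{7\pi}{15} \approx 0.0540$, and $p_8 = \frac{2}{\sqrt{15}} \approx 0.5164$.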
To design the 8-point DCT-VI algorithm, we first permute the rows and columns of the initial DCT-VI matrix and invert the signs of certain rows or columns. As a result, we obtain a matrix in which submatrices correspond to patterns identified in [29,30]. These patterns are then extracted, and their factorizations, as presented in [29,30], are applied. Finally, the factorizations and data-flow graphs of the individual submatrices are merged to form the factorization and data-flow graph of the original 8-point DCT-VI matrix.
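To implement this approach, we reorder the columns and rows of the matrix $C_8$ using the permutations
$$\pi_5 = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ 1 & 6 & 4 & 7 & 2 & 3 & 5 & 8 \end{pmatrix} \quad \text{and} \quad \pi_6 = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ 1 & 4 & 6 & 7 & 2 & 5 & 3 & 8 \end{pmatrix},$$
respectively.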
Then, the signs of the sixth and seventh columns of C 8 are inverted. The corresponding permutation matrices P 8 ( 0 ) and P 8 ( 1 ) are expressed as follows:
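$$P_8^{(0)} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \quad \text{and} \quad P_8^{(1)} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}.$$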
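Next, the resulting matrix $C_8^{(a)}$ is expressed as the sum of two matrices:
$$C_8^{(a)} = C_8^{(b)} + C_8^{(c)},$$
where
$$C_8^{(a)} = \begin{bmatrix}
a_8 & f_8 & d_8 & g_8 & b_8 & -c_8 & -e_8 & h_8 \\
a_8 & f_8 & -g_8 & -d_8 & h_8 & b_8 & -c_8 & -e_8 \\
a_8 & f_8 & d_8 & g_8 & -e_8 & h_8 & b_8 & -c_8 \\
a_8 & f_8 & -g_8 & -d_8 & -c_8 & -e_8 & h_8 & b_8 \\
a_8 & -p_8 & -g_8 & -d_8 & d_8 & -g_8 & d_8 & -g_8 \\
a_8 & -p_8 & d_8 & g_8 & -g_8 & d_8 & -g_8 & d_8 \\
a_8 & f_8 & -p_8 & p_8 & f_8 & f_8 & f_8 & f_8 \\
f_8 & -a_8 & -a_8 & a_8 & -a_8 & -a_8 & -a_8 & -a_8
\end{bmatrix},$$
$$C_8^{(b)} = \begin{bmatrix}
a_8 & f_8 & d_8 & g_8 & 0 & 0 & 0 & 0 \\
a_8 & f_8 & -g_8 & -d_8 & 0 & 0 & 0 & 0 \\
a_8 & f_8 & d_8 & g_8 & 0 & 0 & 0 & 0 \\
a_8 & f_8 & -g_8 & -d_8 & 0 & 0 & 0 & 0 \\
a_8 & -p_8 & -g_8 & -d_8 & d_8 & -g_8 & d_8 & -g_8 \\
a_8 & -p_8 & d_8 & g_8 & -g_8 & d_8 & -g_8 & d_8 \\
a_8 & f_8 & -p_8 & p_8 & f_8 & f_8 & f_8 & f_8 \\
f_8 & -a_8 & -a_8 & a_8 & -a_8 & -a_8 & -a_8 & -a_8
\end{bmatrix},$$
$$C_8^{(c)} = \begin{bmatrix}
0 & 0 & 0 & 0 & b_8 & -c_8 & -e_8 & h_8 \\
0 & 0 & 0 & 0 & h_8 & b_8 & -c_8 & -e_8 \\
0 & 0 & 0 & 0 & -e_8 & h_8 & b_8 & -c_8 \\
0 & 0 & 0 & 0 & -c_8 & -e_8 & h_8 & b_8 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}.$$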
The submatrix
$$C_4^{(d)} = \begin{bmatrix}
b_8 & -c_8 & -e_8 & h_8 \\
h_8 & b_8 & -c_8 & -e_8 \\
-e_8 & h_8 & b_8 & -c_8 \\
-c_8 & -e_8 & h_8 & b_8
\end{bmatrix}$$
of the matrix $C_8^{(c)}$ is a circular convolution matrix [30] for $N = 4$, which can be represented as
$$H_4 = \begin{bmatrix}
h_0 & h_3 & h_2 & h_1 \\
h_1 & h_0 & h_3 & h_2 \\
h_2 & h_1 & h_0 & h_3 \\
h_3 & h_2 & h_1 & h_0
\end{bmatrix}$$
with entries $h_0 = b_8$, $h_1 = h_8$, $h_2 = -e_8$, and $h_3 = -c_8$.
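The same kind of numerical check can be applied to the 8-point case. The sketch below assumes that $\pi_5$ acts on the columns, $\pi_6$ acts on the rows, and the sign inversion applies to the sixth and seventh columns of the permuted matrix. It then extracts rows 1 to 4 and columns 5 to 8 and tests whether that block is circulant. The matrix layout and all helper names are assumptions made for illustration.

```python
import math

def dct6_matrix(n):
    """Assumed layout: cos((2r+1)c*pi/(2n-1)), scaled in first column and last row."""
    m = 2 * n - 1
    def entry(r, c):
        v = 2.0 / math.sqrt(m) * math.cos((2 * r + 1) * c * math.pi / m)
        if c == 0:
            v /= math.sqrt(2.0)
        if r == n - 1:
            v /= math.sqrt(2.0)
        return v
    return [[entry(r, c) for c in range(n)] for r in range(n)]

def is_circulant(block, tol=1e-12):
    """Entry (i, j) must equal entry (i+1, j+1) with wrap-around."""
    n = len(block)
    return all(abs(block[i][j] - block[(i + 1) % n][(j + 1) % n]) < tol
               for i in range(n) for j in range(n))

row_perm = [1, 4, 6, 7, 2, 5, 3, 8]   # pi_6, 1-based
col_perm = [1, 6, 4, 7, 2, 3, 5, 8]   # pi_5, 1-based
flipped = {6, 7}                      # columns of the permuted matrix to negate

C8 = dct6_matrix(8)
C8a = [[(-1.0 if (j + 1) in flipped else 1.0) * C8[row_perm[i] - 1][col_perm[j] - 1]
        for j in range(8)] for i in range(8)]

block4 = [row[4:8] for row in C8a[:4]]   # candidate circular convolution block
h4 = [row[0] for row in block4]          # h_0 ... h_3
print(is_circulant(block4))
print([round(v, 4) for v in h4])
```

If the assumptions hold, the first column of the block lists $h_0, \ldots, h_3$ in the order used by the circular convolution matrix $H_4$.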
Using the entries of H 4 , we define the vector A 5 × 1 of scaling factors to factorize the circular convolution matrix C 4 ( d ) :
$$A_{5\times 1} = \tfrac{1}{4} \times \operatorname{diag}(1, 1, 2, 2, 2) \times A_{5\times 4}^{(0)} A_4 \times [h_0, h_1, h_2, h_3]^T,$$
where A 5 × 4 ( 0 )   =   H 2 1 1 1 1 1 and A 4   =   H 2 I 2 .
Then, the matrix C 4 ( d ) is factorized as follows:
$$C_4^{(d)} = A_4 A_{4\times 5} \times \operatorname{diag}(A_{5\times 1}) \times A_{5\times 4} P_4 A_4,$$
where
A 5 × 1 = ( s 9 8 ,     s 10 8 ,     s 11 8 ,   s 12 8 ,   s 13 8 )
with
s 9 8 = ( b 8 + h 8 e 8 c 8 ) / 4 ,   s 10 8 = ( b 8 h 8 e 8 c 8 ) / 4 , s 11 8 = ( b 8 h 8 e 8 c 8 ) / 2 ,   s 12 8 = ( b 8 + h 8 e 8 c 8 ) / 2 ,   s 13 8 = ( b 8 + e 8 ) / 2 .
The matrices A 5 × 4 and A 4 × 5 are constructed as A 5 × 4   =   H 2 T 3 × 2 and A 4 × 5   =   H 2 T 2 × 3 , respectively, and P 4 is
P 4 = 1 1 1 1 .
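Further, the submatrix
$$B_6 = \begin{bmatrix}
d_8 & g_8 & b_8 & -c_8 & -e_8 & h_8 \\
-g_8 & -d_8 & h_8 & b_8 & -c_8 & -e_8 \\
d_8 & g_8 & -e_8 & h_8 & b_8 & -c_8 \\
-g_8 & -d_8 & -c_8 & -e_8 & h_8 & b_8 \\
-g_8 & -d_8 & d_8 & -g_8 & d_8 & -g_8 \\
d_8 & g_8 & -g_8 & d_8 & -g_8 & d_8
\end{bmatrix}$$
of the matrix $C_8^{(a)}$ is factorized by decomposing its $2 \times 2$ submatrices. It can be observed that the submatrix $A_2 = \begin{bmatrix} d_8 & g_8 \\ -g_8 & -d_8 \end{bmatrix}$ exhibits structural similarity to the template $\begin{bmatrix} a & b \\ -b & -a \end{bmatrix}$. The submatrix $B_2 = \begin{bmatrix} d_8 & -g_8 \\ -g_8 & d_8 \end{bmatrix}$ is similar to the template $\begin{bmatrix} a & -b \\ -b & a \end{bmatrix}$. Then, the submatrices $A_2$ and $B_2$ are decomposed as
$$A_2 = \bar{I}_2 H_2 \operatorname{diag}\left(\frac{d_8 + g_8}{2}, \frac{d_8 - g_8}{2}\right) H_2, \quad B_2 = H_2 \operatorname{diag}\left(\frac{d_8 - g_8}{2}, \frac{d_8 + g_8}{2}\right) H_2,$$
where $\bar{I}_2 = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$. Using expressions (32) and (33), the factorization of the matrix $B_6$ is obtained: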
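The two $2 \times 2$ decompositions can be verified with a few lines of arithmetic. The sketch below assumes $H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$ and $\bar{I}_2 = \operatorname{diag}(1, -1)$, and takes the minus signs in $A_2$ and $B_2$ to be the ones implied by the cosine kernel. The helper names `matmul` and `diag2` are hypothetical.

```python
import math

def matmul(A, B):
    """Plain matrix product for small list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def diag2(a, b):
    """2x2 diagonal matrix."""
    return [[a, 0.0], [0.0, b]]

scale = 2.0 / math.sqrt(15.0)
d8 = scale * math.cos(math.pi / 5)
g8 = scale * math.cos(2 * math.pi / 5)

H2 = [[1.0, 1.0], [1.0, -1.0]]
I2_bar = [[1.0, 0.0], [0.0, -1.0]]

# A2 = I2_bar * H2 * diag((d+g)/2, (d-g)/2) * H2
A2 = matmul(I2_bar, matmul(H2, matmul(diag2((d8 + g8) / 2, (d8 - g8) / 2), H2)))
# B2 = H2 * diag((d-g)/2, (d+g)/2) * H2
B2 = matmul(H2, matmul(diag2((d8 - g8) / 2, (d8 + g8) / 2), H2))

print(A2)   # expected pattern: [[d, g], [-g, -d]]
print(B2)   # expected pattern: [[d, -g], [-g, d]]
```

Each decomposition needs only two multiplications (the two diagonal entries) instead of the four required by a direct $2 \times 2$ product, at the cost of extra additions in the $H_2$ stages.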
$$B_6 = W_{6\times 8} W_8^{(1)} W_{8\times 9} \times \operatorname{diag}\left(s_5^{(8)}, s_6^{(8)}, \ldots, s_{13}^{(8)}\right) \times W_{9\times 8} W_8^{(0)} W_{8\times 6},$$
where
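$$s_5^{(8)} = (d_8 + g_8)/2, \quad s_6^{(8)} = (d_8 - g_8)/2, \quad s_7^{(8)} = (d_8 - g_8)/2, \quad s_8^{(8)} = (d_8 + g_8)/2,$$
$$W_8^{(1)} = \bar{I}_2 H_2 \oplus H_2 \oplus A_4, \quad W_8^{(0)} = H_2 \oplus H_2 \oplus P_4 A_4, \quad W_{9\times 8} = I_4 \oplus H_2 \oplus T_{3\times 2}, \quad W_{8\times 9} = I_4 \oplus H_2 \oplus T_{2\times 3},$$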
W 6 × 8 = 1 1 1 1 1 1 1 1 1 1 1 1 ,   W 8 × 6 = 1 1 1 1 1 1 1 1 1 1 .
As a result, the matrix of the eight-point DCT-VI is factorized as
$$Y_{8\times 1} = P_8^{(1)} W_{8\times 16} W_{16\times 14} W_{14\times 16} D_{16} W_{16\times 8} W_8 W_8^{(2)} P_8^{(0)} X_{8\times 1},$$
where
D 16 = diag ( s 0 ( 8 ) ,   s 1 ( 8 ) ,   ,   s 15 ( 8 ) ) ,   s 0 ( 8 ) = f 8 ,   s 1 ( 8 ) = a 8 ,   s 2 ( 8 ) = f 8 , s 3 ( 8 ) = p 8 ,   s 4 ( 8 ) = p 8 ,   s 14 ( 8 ) = a 8 ,   s 15 ( 8 ) = f 8 ,  
W 8 ( 2 ) =   I 4 1 1 1 1 1 1 1 1 ,   W 8 = I 2 H 2 H 2 I 2 ,  
W 16 × 14 = 1 1 1 1 1 1 1 1 1 1 H 2 ( H 2 I 2 ) I 2 ,  
W 14 × 16 = 1 1 1 1 1 1 I ¯ 2 H 2 I 2 H 2 T 2 × 3 I 2 ,  
W 8 × 16   = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ,  
W 16 × 8 = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 .
Figure 11 shows the data-flow graph of the 8-point DCT-VI algorithm based on factorization (35), reducing the number of additions from 56 to 38 and multiplications from 64 to 16.
The data-flow graph shown in Figure 11 includes permutation modules at the input and output that perform index reordering of data samples, enabling regular computational structures in fast transform implementations without introducing additional arithmetic operations. As a result, the graph contains butterfly modules on both the left and right sides of the scaling-factor line. The repeated entries allow efficient computation using adder trees and a circular convolution structure module, similar to the 4- to 7-point DCT-VI algorithm cases.

4. Results

In this section, we experimentally confirm the correctness of the DCT-VI matrix factorizations presented in Section 3. Using MATLAB R2023b, the original DCT-VI matrices were generated according to (3), (9), (13), (19), (25) and (29) and were subsequently compared with the products of the corresponding factorized matrices obtained from (8), (12), (18), (24), (28) and (35). For initial matrix sizes from three to eight, the factorized matrices were found to coincide exactly with the original matrices, thereby confirming the correctness of the proposed algorithms.
The arithmetic complexity of the proposed DCT-VI algorithms was then evaluated and compared with that of direct matrix–vector multiplication. The number of multiplications was determined by counting the vertices labeled with multiplicative factors (circles) in the associated data-flow graphs, while the number of additions was estimated by counting the vertices at which two edges merge. Contributions corresponding to multiplications by powers of two were accounted for accordingly.
Table 1 summarizes the number of multiplications and additions for the proposed DCT-VI algorithms. The first column lists the transform size N. Each row of Table 1 reports the results for the corresponding N-point DCT-VI algorithm. The second and third columns present the number of additions and multiplications required for the direct matrix–vector implementation of the DCT-VI. The fourth and fifth columns report the numbers of additions and multiplications required by the proposed algorithms, with the percentage differences relative to direct matrix–vector multiplication indicated in parentheses. A plus sign denotes an increase in the number of operations, whereas a minus sign indicates a reduction.
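The percentage differences described above follow from simple arithmetic. As a minimal sketch, the code below assumes that direct matrix–vector multiplication costs $N^2$ multiplications and $N(N-1)$ additions, which matches the direct counts quoted in Section 3 for $N = 6, 7, 8$. Only the operation counts stated in the text for those three sizes are used; the values for $N = 3, 4, 5$ are not reproduced here. The function names are hypothetical.

```python
def direct_counts(n):
    """Multiplications and additions of a dense n-by-n matrix-vector product."""
    return n * n, n * (n - 1)

def percent_change(new, old):
    """Signed relative difference in percent; negative means fewer operations."""
    return 100.0 * (new - old) / old

# (multiplications, additions) of the proposed algorithms as quoted in Section 3
proposed = {6: (13, 33), 7: (11, 36), 8: (16, 38)}

for n, (mults, adds) in sorted(proposed.items()):
    direct_mults, direct_adds = direct_counts(n)
    print(f"N={n}: multiplications {direct_mults} -> {mults} "
          f"({percent_change(mults, direct_mults):+.1f}%), "
          f"additions {direct_adds} -> {adds} "
          f"({percent_change(adds, direct_adds):+.1f}%)")
```

A positive percentage corresponds to the plus sign used in the table (more operations than the direct method), and a negative percentage to the minus sign.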
Moreover, the N-point DCT-VI can be realized as the real part of a (2N − 1)-point DFT [15]. The corresponding arithmetic operation counts for these implementations are reported in the sixth and seventh columns of Table 1. For N = 8, the generalized split-radix DFT algorithm [22] is used. For N = 3, 4, and 5, DFT algorithms from [27] are considered, whereas for N = 6 and 7, Winograd DFT algorithms [26,33] are employed.
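One way to see why a $(2N-1)$-point DFT suffices is sketched below. It relies on the identity $\cos\frac{(2k+1)n\pi}{M} = (-1)^n \cos\frac{2\pi n q_k}{M}$ with $M = 2N-1$ and $q_k = (2k+1)N \bmod M$, which holds because $2N = M + 1$. This mapping is derived here for illustration and is not necessarily the one used in [15]. The sketch covers only the unnormalized cosine kernel, omits the $1/\sqrt{2}$ weights and the overall scale, and uses a naive DFT sum instead of a fast algorithm. All function names are hypothetical.

```python
import cmath
import math

def kernel_direct(x, k):
    """Unnormalized cosine sum: sum_n x[n] * cos((2k+1) n pi / (2N-1))."""
    m = 2 * len(x) - 1
    return sum(x[n] * math.cos((2 * k + 1) * n * math.pi / m) for n in range(len(x)))

def kernel_via_dft(x, k):
    """Same sum taken as the real part of one bin of a (2N-1)-point DFT."""
    n_points = len(x)
    m = 2 * n_points - 1
    # Alternate the signs of the input and zero-pad to length 2N-1
    padded = [(-1) ** n * x[n] for n in range(n_points)] + [0.0] * (m - n_points)
    q = ((2 * k + 1) * n_points) % m          # DFT bin that carries output k
    bin_value = sum(padded[n] * cmath.exp(-2j * math.pi * n * q / m) for n in range(m))
    return bin_value.real

x = [1.0, -2.0, 0.5, 3.0]
for k in range(len(x)):
    print(k, round(kernel_direct(x, k), 6), round(kernel_via_dft(x, k), 6))
```

The two columns printed for each $k$ should agree up to rounding. The zero-padding and the complex arithmetic indicate where the extra operations of a DFT-based realization come from.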
Table 2 presents a comparison of the numbers of additions and multiplications required by other existing DCT-VI algorithms, together with the corresponding percentage differences. The first column of Table 2 lists the authors of the existing algorithms, while the second column provides the corresponding reference and year of publication. In [21], four-point and eight-point DCT-VI algorithms were developed; the numbers of additions and multiplications required for these algorithms are reported in the third, fourth, seventh, and eighth columns. In [28], a five-point DCT-VI algorithm was proposed, and the corresponding numbers of arithmetic operations are listed in the fifth and sixth columns of Table 2. In parentheses, the additional number of multiplications required by these existing algorithms is indicated, since the normalization constant is not taken into account in [21,28].
In Table 3, we present the number of multiplications and additions required by existing algorithms for computing short-length DCT-II and DCT-VIII transforms. The fast DCT-II algorithms considered are reported in [22]. For N = 4 and N = 8, the number of arithmetic operations is evaluated for radix-2 algorithms, whereas for N = 5 and N = 7, it is evaluated for radix-q algorithms.
The fast DCT-VIII algorithms considered are presented in [21,34]. For N = 3, 4, 5, 6, and 7, the number of arithmetic operations is evaluated for algorithms based on the structural approach [34]. For N = 8, the number of arithmetic operations is evaluated for the DFT-based algorithm [21]. As can be seen from Table 1 and Table 3, the number of arithmetic operations required by the proposed DCT-VI algorithms is similar to that of the existing DCT-II and DCT-VIII algorithms.

5. Discussion

The analysis of the results has shown the following.
First, compared to direct matrix–vector multiplication, the proposed DCT-VI algorithms substantially reduce the number of multiplications and slightly reduce the number of additions. Averaged over input sizes from three to eight, the number of multiplications is reduced by nearly 66%, while the number of additions decreases by approximately 9%.
Second, the developed fast DCT-VI algorithms were compared with the DFT-based algorithms reported in [15]. In this case, a significant reduction in the number of additions is achieved (more than twofold). In addition, we considered the DCT-VI algorithms proposed in [21,28]. In those studies, the normalization constants were not taken into account during the construction of the fast algorithms. Therefore, the corresponding number of multiplications is added in parentheses to equalize the experimental results. As a result, the computational complexity of some existing algorithms and the proposed ones becomes approximately the same.
Table 4 presents the memory requirements of the proposed DCT-VI algorithms. The number of required memory cells was determined based on the pseudocode implementations described in Appendix A. On average, the proposed algorithms require approximately 40% more memory than the direct matrix–vector product approach, when evaluated over input sizes ranging from three to eight.
It is worth noting that memory consumption is highly dependent on implementation-specific factors, including the target platform, programming technique, and the developer’s expertise. Unlike arithmetic complexity, which constitutes an objective and implementation-independent performance metric, memory usage may vary substantially across different execution environments. The proposed DCT-VI algorithms are primarily intended for software implementations, where memory requirements, execution latency, and computational resources can differ considerably depending on system architecture and optimization strategies. In practice, memory cells may be shared or reused across multiple stages of computation, and the algorithms may be executed in sequential, parallel, or hybrid sequential–parallel modes, each of which influences overall latency and memory utilization. Consequently, the assessment of memory efficiency is inherently subjective. In contrast, arithmetic complexity remains the most reliable and consistent criterion for evaluating the efficiency of the proposed algorithms.

6. Conclusions

This paper presents novel fast algorithms for computing DCT-VI, with a particular focus on short-length input sequences (ranging from three to eight samples). A detailed, step-by-step description of each computational stage is provided, including intermediate outputs, thereby ensuring transparency and reproducibility of the proposed algorithms. The proposed algorithms achieve a substantial reduction in the number of multiplications required for transform computation when compared with the direct matrix–vector products. A comparative complexity analysis demonstrates that, for input sequence lengths ranging from three to eight, the proposed fast algorithms reduce the number of multiplications by approximately 66% on average, while achieving an average reduction of nearly 9% in the number of additions relative to direct computation methods.
To further support practical implementation, data-flow graphs are introduced to illustrate the space–time structure of the computational processes. These graphs not only clarify the flow of operations but also enable accurate evaluation of arithmetic complexity in terms of multiplications and additions. Notably, each path from input to output in the proposed data-flow graphs contains only a single multiplication operation, which represents a significant advantage over alternative designs where multiple sequential multiplications may occur along the same path. This characteristic enables faster execution and facilitates efficient hardware implementation.
In addition, optimized pseudocode implementations incorporating variable reuse are developed to reduce memory requirements in software-based implementations [35]. The resulting algorithms are readily applicable to a wide range of signal processing tasks, including video and image coding.

Author Contributions

Conceptualization, A.C.; methodology, A.C. and V.K.; software, M.P.; validation, V.K. and M.P.; formal analysis, A.C., M.P. and V.K.; investigation, V.K., M.P. and A.C.; writing—original draft preparation, M.P. and A.C.; writing—review and editing, A.C. and M.P.; supervision, A.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

In the appendix, we present the pseudocode for the proposed fast DCT-VI algorithms. These pseudocodes are used in Section 4 to evaluate the number of memory cells required by the algorithms introduced in Section 3. To minimize the memory requirements of the implementations, variables are reused where possible. As a result, the variables that must be retained as outputs, along with their required ordering, are explicitly specified. Thus, in Table A1, we show the pseudocode of the proposed three-point DCT-VI algorithm. The inputs of the pseudocode are x 0 ,   x 1 , and x 2 . The scaling factors are s 0 ( 3 ) , s 1 ( 3 ) , and s 2 ( 3 ) . We reuse variables and, as a result, the outputs of the pseudocode are x 0 ,   x 1 , and x 2 . Variables p 0 and p 1 are additional.
Table A1. The pseudocode for the constructed fast DCT-VI algorithm for N = 3 with variable reuse.
Step 1Step 2
p 0   =   ( x 1 + x 2 ) / 2 , p 1   =   x 1 x 2 , x 1 = x 0 p 0 ,
x 2   =   x 0 s 0 ( 3 ) p 1 s 1 ( 3 ) , x 0   =   x 0 s 1 ( 3 ) + p 1 s 2 ( 3 ) ; x 0   =   x 0 + p 0 .
We present the pseudocode for the four-point DCT-VI algorithm in Table A2. The inputs of the pseudocode are x 0 , x 1 , x 2 , and x 3 . Also, the scaling factors s 0 ( 4 ) , s 1 ( 4 ) , s 2 ( 4 ) , s 3 ( 4 ) , s 4 ( 4 ) , and s 5 ( 4 ) are used to create the pseudocode because s 6 ( 4 ) = s 1 ( 4 ) . The variables were reused; additional variables are p 1 , p 2 , and p 3 . The outputs of the pseudocode are x 0 , x 1 , x 2 , and x 3 .
Table A2. The pseudocode for the developed fast 4-point DCT-VI algorithm with variable reuse.
Step 1Step 2
p 0   =   x 0 s 0 ( 4 ) ,     p 1   =   x 0 s 1 ( 4 ) ,
p 2   =   x 1 x 2 + x 3 ,   p 3   =     ( x 2 x 3 ) s 3 ( 4 ) ,
p 4   =     ( x 1 + x 2 ) s 4 ( 4 ) , p 5   =     ( x 1 x 3 ) s 5 ( 4 ) ;
x 3   =   p 2 s 1 ( 4 ) , x 0   =   p 2 s 2 ( 4 ) ,   x 3   =   p 0 x 3 , x 0   =   p 1 + x 0 , x 1   =   p 5 + p 3 + x 0 ,   x 2   =     p 4 p 3 + x 0 ,   x 0   =     p 4 + p 5 + x 0 ;
We present the pseudocode for the designed five-point DCT-VI algorithm in Table A3. The inputs of the pseudocode are x 0 , x 1 , x 2 , x 3 , and x 4 . We use the scaling factors s 0 ( 5 ) , s 1 ( 5 ) , s 5 ( 5 ) , s 6 ( 5 ) , and s 7 ( 5 ) because s 2 ( 5 ) = s 1 ( 5 ) , s 3 ( 5 ) = s 1 ( 5 ) , and s 4 ( 5 ) = s 0 ( 5 ) . The outputs of the pseudocode are x 0 , x 1 , x 2 , x 3 , and x 4 . Variables p 0 , p 5 , p 6 , and p 7 are additional.
Table A3. The pseudocode for the designed 5-point DCT-VI algorithm with variable reuse.
Step 1Step 2
p 0   = x 1 + x 2 + x 4 , x 4   =   x 0 s 1 ( 5 ) + ( p 0 x 3 ) s 0 ( 5 ) , x 0   =   x 3 s 1 ( 5 ) + x 0 s 0 ( 5 ) ,
p 5   =   ( x 1 x 2 ) s 5 ( 5 ) , p 6   =   ( x 4 + x 1 ) s 6 ( 5 ) , x 1   =   x 0 p 0 s 1 ( 5 ) x 3 , x 2   =   x 0 p 5 p 6 ,
p 7   =   ( x 4 x 2 ) s 7 ( 5 ) ; x 3   =   x 0 + p 6 + p 7 , x 0   =   x 0 + p 5 p 7 .
In Table A4, the pseudocode of the proposed six-point DCT-VI algorithm is presented. The inputs of the pseudocode are x 0 ,   x 1 ,   x 2 ,   x 3 ,   x 4 , and x 5 . The scaling factors are s 0 ( 6 ) , s 1 ( 6 ) , …,   s 11 ( 6 ) , because   s 12   ( 6 ) = s 10 ( 6 ) . We reuse variables. So, the outputs of the pseudocode are x 0 ,   x 3 ,   x 4 ,   x 2 ,   x 1 , and p 0 . Variables p 1 , p 2 , …, p 5 are additional.
Table A4. The pseudocode for the developed 6-point DCT-VI algorithm with variable reuse.
Step 1Step 2Step 3Step 4Step 5
p 4   =     x 1 , p 5   = x 2 ,
x 1   =     x 4 + x 3 + x 5 + p 4 + p 5 ,   x 2   =   x 4 x 3 ,   x 3   =   x 4 x 5 ,
x 5   =   x 4 p 5 ,   x 4   =   x 4 p 4 ;
p 1   =   x 2 + x 4 ,
p 2   =   x 3 + x 5 ,
p 3   =   ( x 2 + x 3 ) s 5 ( 6 ) ,
p 4   =   ( x 4 + x 5 ) s 8 ( 6 ) ,
p 5   =   ( p 1 + p 2 ) s 11 ( 6 ) ,
p 0   =   x 0 s 0 ( 6 )   x 1 s 1 ( 6 ) ;
x 0   = x 0 s 1 ( 6 ) +   x 1 s 2 ( 6 ) ,
x 1   =   x 2 s 3 ( 6 ) + p 3 ,
x 2   =   x 3 s 4 ( 6 ) + p 3 ,
x 3   =   x 4 s 6 ( 6 ) + p 4 ,
x 4   =   x 5 s 7 ( 6 ) + p 4 ,
x 5   =   p 1 s 9 ( 6 ) + p 5 ;
p 1   =   p 2 s 10 ( 6 ) + p 5 ,
  p 2   =   x 1 + x 5 ,  
  p 3   =   x 2 + p 1 ,  
  p 4   =   x 3 + x 5 ,
  p 5   =   x 4 + p 1 ;
x 1   = x 0 p 2 ,
  x 2 = x 0 p 3 ,
  x 3 = x 0 p 4 ,
  x 4 = x 0 p 5 ,
  x 0 = x 0 + p 2 + p 3 + p 4 + p 5 .
In Table A5, the pseudocode of the constructed seven-point DCT-VI algorithm is shown. The inputs of the pseudocode are x 0 , x 1 , x 2 , x 3 , x 4 , x 5 , and x 6 . We use only scaling factors s 0 ( 7 ) , s 1 ( 7 ) , …, s 8 ( 7 ) , and s 9 ( 7 ) because s 10 ( 7 ) = s 0 ( 7 ) . We reuse variables. So, the outputs of the pseudocode are x 1 , x 3 , x 4 , x 2 , x 5 , x 6 , and p 0 . Variables p 1 , p 2 , …, p 6 are additional.
Table A5. The pseudocode for the designed 7-point DCT-VI algorithm with variable reuse.
Step 1Step 2Step 3Step 4Step 5
p 1   =   x 1 + x 5 ,
p 2   =   x 3 x 2 ,
p 3   =   x 4 x 6 ,
p 4   =   x 1 x 5 ,
p 5   =   x 3 x 2 ,
p 6   =   x 4 + x 6 ;
x 1   =   p 1 + p 2 + p 3 ,
x 2   =   p 1 p 3 ,
x 3   =   p 1 p 2 ,
p 1 = ( x 2 + x 3 ) s 5 ( 7 ) ,
x 2 = x 2 s 3 ( 7 ) ,
x 3 = x 3 s 4 ( 7 ) ,
x 4 = ( p 4 p 5 + p 6 ) s 6 ( 7 ) ,
x 5 = p 4 p 6 ,
x 6 = p 4 + p 5 ,
p 3 = ( x 5 + x 6 ) s 9 ( 7 )
x 5 = x 5 s 7 ( 7 ) ,
x 6 = x 6 s 8 ( 7 ) ;
p 0 =   x 1 s 0 ( 7 ) ,
x 1 = x 1 s 2 ( 7 ) ,
x 2 = p 1 + x 2 ,
x 3 = p 1 + x 3 ,
x 5 = p 3 + x 5 ,
x 6 = p 3 + x 6 ;
p 1 = x 1 + x 2 + x 3 ,
p 2 = x 1 x 2 ,
p 3 = x 1 x 3 ,
p 4 = x 4 + x 5 + x 6 ,
p 5 = x 5 x 4 ,
p 6 = x 4 x 6 ;
p 0 = x 0 s 1 ( 7 ) + p 0 ,
x 0 =   x 0 s 0 ( 7 ) ,
x 1 = x 0 + p 1 ,
x 2 = x 0 + p 2 ,
x 3 = x 0 + p 3 ,
x 4 = x 1 p 4 ,
x 5 = x 2 p 5 ,
x 6 = x 3 p 6 ,
x 1 = x 1 + p 4 ,
x 2 = x 2 + p 5 ,
x 3 = x 3 + p 6 .
In Table A6, the pseudocode of the constructed eight-point DCT-VI algorithm is presented. The inputs of the pseudocode are x 0 ,   x 1 ,   x 2 ,   x 3 ,   x 4 ,   x 5 ,   x 6 , and x 7 . We use scaling factors s 0 ( 8 ) , s 1 ( 8 ) , s 3 ( 8 ) , s 5 ( 8 ) , s 6 ( 8 ) ,   s 9 ( 8 ) , s 10 ( 8 ) , s 11 ( 8 ) , s 12 ( 8 ) , and s 13 ( 8 ) because s 2 ( 8 )   =   s 0 ( 8 ) , s 4 ( 8 )   =   s 3 ( 8 ) , s 7 ( 8 )   =   s 6 ( 8 ) , s 8 ( 8 )   =   s 5 ( 8 ) , s 14 ( 8 )   =   s 1 ( 8 ) , and s 15 ( 8 )   = s 0 ( 8 ) . We reuse variables. So, the outputs of the pseudocode are x 0 ,   x 1 ,   x 2 ,   x 3 ,   x 4 ,   x 5 ,   x 6 , and x 7 . Variables p 0 , …, p 8 , p 12 , p 13 , p 14 , p 15 are additional.
Table A6. The pseudocode for the constructed 8-point DCT-VI algorithm with variable reuse.
Step 1Step 2Step 3Step 4Step 5Step 6
  p 0 = x 0 s 0 ( 8 ) ,
  p 1 = x 0 s 1 ( 8 ) ,
  p 2 = x 5 s 0 ( 8 ) ,
  p 3 = x 5 s 3 ( 8 ) ,
  p 5 = s 5 ( 8 ) ( x 3 + x 6 ) ,
x 3 = x 3 x 6 ,
  p 6 = x 3 s 6 ( 8 ) ,
  p 4 = x 3 s 3 ( 8 ) ;
  p 12 = x 1 x 4 ,
  p 13 = x 7 x 2 ,
  p 14 = x 7 x 2 ,
  p 15 = x 1 + x 4 ,
  p 7 =   p 12 + p 13 ,
  p 8 =   p 12 p 13 ;
x 1 =   p 7 s 9 ( 8 ) ,
x 2 =   p 8 s 10 ( 8 ) ,
x 4 =   p 8 s 5 ( 8 ) ,
x 6 =   p 14 s 11 ( 8 ) ,
x 7 =   p 15 s 12 ( 8 ) ,
  p 13 = s 13 ( 8 ) (   p 14 +   p 15 ) ,
  p 14 = s 1 ( 8 ) ( x 5 + x 3 + p 7 ) ;
  p 15 = s 0 ( 8 ) p 7 ,
p 7 = s 6 ( 8 ) p 7 ,
  p 2 = p 1 + p 2 ,
  p 3 = p 1 + p 3 ,
  p 1 = p 5 + p 6 ,
  p 6 = p 6 p 5 ,
  p 5 = x 1 + x 2 ;
x 2 = x 1 x 2 ,
x 6 =   p 13 + x 6 ,
x 7 =   p 13 + x 7 ,
x 1 =   p 7 + x 4 ,
x 4 =   p 7 x 4 ,
x 0 =   p 5 + x 6 ,
x 3 = x 2 + x 7 ,
x 5 =   p 5 x 6 ,
x 6 = x 2 x 7 ,
  x 7 =   p 0 + p 14 ;
  p 0 =   p 2 + p 1 ,
  p 5 =   p 2 + p 6 ,
  x 0 =   p 0 + x 0 ,
  x 6 =   p 5 + x 6 ,
  x 3 =   p 5 + x 3 ,
  x 5 =   p 0 + x 5 ,
  x 1 =   p 3 +   p 6 + x 1 ,
  x 4 =   p 3 +   p 1 + x 4 ,
  x 2 =   p 4 +   p 2 + p 15 .

References

  1. Richardson, I.E. Coding Video: A Practical Guide to HEVC and Beyond; John Wiley & Sons Ltd.: Chichester, UK, 2024.
  2. Choi, K. A study on fast and low-complexity algorithms for Versatile Video Coding. Sensors 2022, 22, 8990.
  3. Zeng, Y.; Sun, H.; Katto, J.; Fan, Y. Approximated reconfigurable transform architecture for VVC. In 2021 IEEE International Symposium on Circuits and Systems (ISCAS); IEEE: New York, NY, USA, 2021; pp. 1–5.
  4. Werda, I.; Belghith, F.; Maraoui, A.; Masmoudi, N. DCT-II transform hardware-based acceleration for VVC standard. In 2021 IEEE International Conference on Design & Test of Integrated Micro & Nano-Systems (DTS); IEEE: New York, NY, USA, 2021; pp. 1–5.
  5. Abramova, V.; Lukin, V.; Abramov, S.; Kryvenko, S.; Lech, P.; Okarma, K. A fast and accurate prediction of distortions in DCT-based lossy image compression. Electronics 2023, 12, 2347.
  6. Li, H.; Wei, G.; Wang, T.; Bui, T.O.; Zeng, Q.; Wang, R. Reducing video coding complexity based on CNN-CBAM in HEVC. Appl. Sci. 2023, 13, 10135.
  7. Wang, X. Strategies for enhancing deep video encoding efficiency using the convolutional neural network in a hyperautomation mechanism. Sci. Rep. 2025, 15, 1079.
  8. Huo, S.; Liu, H.; Gu, J.; Jin, D.; Lei, M.; Huang, B. Deep network-based adaptive quantization for practical video coding. IEEE Trans. Circuits Syst. Video Technol. 2025; early access.
  9. Das, T.; Liang, X.; Choi, K. Versatile Video coding-post processing feature fusion: A post-processing convolutional neural network with progressive feature fusion for efficient video enhancement. Appl. Sci. 2024, 14, 8276.
  10. Zieliński, T.P. Digital Signal Processing—From Theory to Applications, 2nd ed.; WKL: Warszawa, Poland, 2005.
  11. Zhao, X.; Kim, S.-H.; Zhao, Y.; Egilmez, H.E.; Koo, M.; Liu, S.; Lainema, J.; Karczewicz, M. Transform coding in the VVC standard. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3878–3890.
  12. Kolodziejski, W.; Domanski, R.; Agostini, L. FastGW: A machine learning-based early skip for the AV1 global warped motion compensation. IEEE Trans. Circuits Syst. I Reg. Pap. 2025, 72, 977–988.
  13. Bross, B.; Wang, Y.-K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G.J.; Ohm, J.-R. Overview of the Versatile Video Coding (VVC) Standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764.
  14. Zhang, Z.; Zhao, X.; Li, X.; Li, L.; Luo, Y.; Liu, S.; Li, Z. Fast DST-7/DCT-8 with dual implementation support for versatile video coding. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 355–368.
  15. Park, W.; Lee, B.; Kim, M. Fast computation of integer DCT-V, DCT-VIII, and DST-VII for video coding. IEEE Trans. Image Process. 2019, 28, 5839–5851.
  16. Alshina, E.; Sullivan, G.J.; Ohm, J.-R.; Boyce, J.; Chen, J. Algorithm description of joint exploration test model 4. In Proceedings of the JVET-D1001, Joint Video Exploration Team (JVET) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, 4th Meeting, Chengdu, China, 15–21 October 2016.
  17. Britanak, V.; Yip, P.C.; Rao, K.R. Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer Approximations; Elsevier/Academic Press: Amsterdam, The Netherlands, 2007.
  18. Murty, M.N.; Panda, H. Mapping between discrete cosine transform of Type-VI/VII and discrete Fourier transform. Int. J. Eng. Res. Appl. 2016, 6, 60–62.
  19. Chivukula, R.K.; Reznik, Y.A. Fast computing of discrete cosine and sine transforms of types VI and VII. In SPIE 8135, Applications of Digital Image Processing XXXIV; SPIE: Bellingham, WA, USA, 2011; pp. 1–10.
  20. Reznik, Y.A. Relationship between DCT-II, DCT-VI, and DST-VII transforms. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013.
  21. Masera, M.; Martina, M.; Masera, G. Odd type DCT/DST for video coding: Relationships and low-complexity implementations. In 2017 IEEE International Workshop on Signal Processing Systems (SiPS); IEEE: New York, NY, USA, 2017; pp. 1–6.
  22. Bi, G.; Zeng, Y. Transforms and Fast Algorithms for Signal Analysis and Representations, 1st ed.; Birkhäuser: Boston, MA, USA, 2004. [Google Scholar]
  23. Johnson, S.G.; Frigo, M. A modified split-radix FFT with fewer arithmetic operations. IEEE Trans. Signal Process. 2007, 55, 111–119. [Google Scholar] [CrossRef]
  24. Stasiński, R. Split multiple radix FFT. In 2022 30th European Signal Processing Conference (EUSIPCO); IEEE: New York, NY, USA, 2022; pp. 2251–2255. [Google Scholar] [CrossRef]
  25. Stasiński, R. Fast discrete Fourier transform algorithms requiring less than O(N log N) multiplications. arXiv 2023, arXiv:2303.02647. [Google Scholar]
  26. Winograd, S. On computing the discrete Fourier transform. Math. Comput. 1978, 32, 175–199. [Google Scholar] [CrossRef]
  27. Majorkowska-Mech, D.; Cariow, A. Some FFT algorithms for small-length real-valued sequences. Appl. Sci. 2022, 12, 4700. [Google Scholar] [CrossRef]
  28. Saxena, A.; Fernandes, F.C.; Reznik, Y.A. Fast transforms for intra-prediction-based image and video coding. In Proceedings of the 2013 Data Compression Conference (DCC), Snowbird, UT, USA, 20–22 March 2013. [Google Scholar]
  29. Cariow, A. Strategies for the synthesis of fast algorithms for the computation of the matrix-vector product. J. Signal Process. Theory Appl. 2014, 3, 1–19. [Google Scholar] [CrossRef]
  30. Cariow, A.; Papliński, J. Algorithmic structures for realizing short-length circular convolutions with reduced complexity. Electronics 2021, 10, 2800. [Google Scholar] [CrossRef]
  31. Zhang, Z.; Zhao, X.; Li, X.; Li, Z.; Liu, S. Fast adaptive multiple transform for versatile video coding. In 2019 Data Compression Conference (DCC); IEEE: New York, NY, USA, 2019; pp. 63–72. [Google Scholar] [CrossRef]
  32. Polyakova, M.; Witenberg, A.; Cariow, A. The design of fast type-V discrete cosine transform algorithms for short-length input sequences. Electronics 2024, 13, 4165. [Google Scholar] [CrossRef]
  33. Sidney Burrus, C. Fast Fourier Transforms (Burrus); LibreTexts: Davis, CA, USA, 2025; Available online: https://eng.libretexts.org/Bookshelves/Electrical_Engineering/Signal_Processing_and_Modeling/Fast_Fourier_Transforms_%28Burrus%29/06%3A_Winograd%27s_Short_DFT_Algorithms/6.02%3A_Winograd_Fourier_Transform_Algorithm_%28WFTA%29 (accessed on 2 February 2026).
  34. Raciborski, M.; Polyakova, M.; Cariow, A. Fast DCT-VIII algorithms for short-length input sequences. Electronics 2026, 15, 207. [Google Scholar] [CrossRef]
  35. Im, S.-K.; Pearmain, A.J. Unequal error protection with the H.264 flexible macroblock ordering. In Visual Communications and Image Processing 2005; SPIE: Beijing, China, 2005; Volume 5960, p. 596032. [Google Scholar] [CrossRef]
Figure 1. The data-flow graph for multiplying the inputs by the matrix $C_2^{(c)}$.
Figure 2. The data-flow graph for multiplying the inputs by the matrix $C_3^{(b)}$.
Figure 3. The data-flow graph for the 3-point DCT-VI algorithm.
Figure 4. Construction of adjacency matrices based on the data-flow graph.
Figure 5. The data-flow graph of the algorithm for the 3-point DCT-VI.
Figure 6. The data-flow graph of the four-point DCT-VI algorithm.
Figure 7. Data-flow graph of the algorithm for the five-point DCT-VI.
Figure 8. The data-flow graph of the algorithm for the 6-point DCT-VI.
Figure 9. The data-flow graph of the algorithm for the 6-point DCT-VI.
Figure 10. The data-flow graph implementing the 7-point DCT-VI.
Figure 11. The data-flow graph for the computation of the 8-point DCT-VI.
Table 1. The number of additions and multiplications for the matrix–vector product, the proposed fast algorithms, and the existing DFT-based fast algorithms for the DCT-VI. Percentages are relative to the matrix–vector product.

| N | Matrix–vector product: Adds. | Matrix–vector product: Mults. | Proposed DCT-VI algorithms: Adds. | Proposed DCT-VI algorithms: Mults. | DFT-based DCT-VI algorithms: Adds. | DFT-based DCT-VI algorithms: Mults. |
|---|---|---|---|---|---|---|
| 3 | 6 | 9 | 6 (0%) | 4 (−56%) | 13 | 5 |
| 4 | 12 | 16 | 13 (+8%) | 7 (−56%) | 30 | 8 |
| 5 | 20 | 25 | 16 (−20%) | 8 (−68%) | 36 | 10 |
| 6 | 30 | 36 | 33 (+10%) | 13 (−64%) | 84 | 21 |
| 7 | 42 | 49 | 36 (−14%) | 11 (−77%) | 94 | 21 |
| 8 | 56 | 64 | 38 (−32%) | 16 (−75%) | 84 | 15 |
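The matrix–vector baseline in Table 1 matches the cost of a dense N × N matrix–vector product, namely N² multiplications and N(N − 1) additions, and the percentages are relative changes against that baseline. The following Python sketch reproduces this arithmetic from the operation counts listed for the proposed algorithms; the function and variable names are illustrative assumptions and do not come from the paper.

```python
# Operation counts of the proposed algorithms, copied from Table 1:
# N -> (additions, multiplications)
PROPOSED = {3: (6, 4), 4: (13, 7), 5: (16, 8),
            6: (33, 13), 7: (36, 11), 8: (38, 16)}


def direct_cost(n):
    """Return (additions, multiplications) of a dense n x n matrix-vector product."""
    return n * (n - 1), n * n


def percent_change(new, old):
    """Relative change of `new` against `old`, in percent."""
    return 100.0 * (new - old) / old


def compare(n):
    """Return the percent change in (additions, multiplications) for size n."""
    base_adds, base_mults = direct_cost(n)
    adds, mults = PROPOSED[n]
    return percent_change(adds, base_adds), percent_change(mults, base_mults)


if __name__ == "__main__":
    for n in sorted(PROPOSED):
        d_adds, d_mults = compare(n)
        print(f"N={n}: additions {d_adds:+.1f}%, multiplications {d_mults:+.1f}%")
```

Negative values indicate a saving over the direct product; positive values indicate that the fast algorithm trades extra additions for fewer multiplications.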
Table 2. The number of operations for the existing algorithms.

| Algorithm | Reference, year of publication | N = 4: Mults. | N = 4: Adds. | N = 5: Mults. | N = 5: Adds. | N = 8: Mults. | N = 8: Adds. |
|---|---|---|---|---|---|---|---|
| Masera, Martina, Masera | [21], 2017 | 4 (+4) | 13 | – | – | 16 (+8) | 36 |
| Saxena, Fernandes, Reznik | [28], 2013 | 3 (+5) | 15 | – | – | – | – |
| Proposed algorithm | – | 7 | 13 | 8 | 16 | 16 | 38 |
Table 3. The number of multiplications and additions for the existing short-length DCT-II and DCT-VIII algorithms.

| N | DCT-II: Mults. | DCT-II: Adds. | DCT-VIII: Mults. | DCT-VIII: Adds. |
|---|---|---|---|---|
| 3 | – | – | 4 | 11 |
| 4 | 4 | 9 | 5 | 11 |
| 5 | 4 | 13 | 18 | 23 |
| 6 | – | – | 18 | 48 |
| 7 | 18 | 24 | 16 | 34 |
| 8 | 12 | 29 | 21 | 77 |
Table 4. The number of memory cells for the matrix–vector product and the proposed fast algorithms for DCT-VI.

| N | Matrix–vector product | Proposed DCT-VI algorithms |
|---|---|---|
| 3 | 6 | 8 (+33%) |
| 4 | 10 | 13 (+30%) |
| 5 | 11 | 15 (+36%) |
| 6 | 18 | 24 (+33%) |
| 7 | 17 | 23 (+35%) |
| 8 | 18 | 31 (+72%) |