Electronics
  • Article
  • Open Access

22 October 2025

A Novel DST-IV Efficient Parallel Implementation with Low Arithmetic Complexity

D.F. Chiper and D.M. Dobrea

  1. Faculty of Electronics, Telecommunications and Information Technology, “Gheorghe Asachi” Technical University of Iaşi, 700506 Iaşi, Romania
  2. Technical Sciences Academy of Romania—ASTR, 700050 Iaşi, Romania
  3. Academy of Romanian Scientists—AOSR, 030167 Bucharest, Romania
  4. Institute of Computer Science, Romanian Academy Iași Branch, 700481 Iași, Romania

Abstract

The discrete sine transform (DST) has numerous applications across various fields, including signal processing, image compression and coding, adaptive digital filtering, mathematics (such as partial differential equations or numerical solutions of differential equations), image reconstruction, and classification, among others. The primary disadvantage of the DST class of algorithms (DST-I, DST-II, DST-III, and DST-IV) is their substantial computational complexity (O(N log N)) during implementation. This paper proposes an innovative decomposition and real-time implementation of the DST-IV. The decomposition allows the algorithm to execute in four or eight sections operating concurrently. These algorithms, encompassing 4 and 8 sections, are developed primarily through a matrix factorization technique that decomposes the DST-IV matrices. Consequently, the computational complexity and execution time of the developed algorithms are markedly reduced compared to the traditional implementation of DST-IV, resulting in significant time efficiency. The performance analysis conducted on three distinct Graphics Processing Unit (GPU) architectures indicates that a substantial speedup can be achieved: an average speedup ranging from 22.42 to 65.25 was observed, depending on the GPU architecture employed and the DST-IV implementation (with 4 or 8 sections).

1. Introduction

The discrete sine transform (DST) was first introduced in 1979 within the paper [1]. Since the publication of the original paper [1], the DST has undergone numerous implementations, extensions, improvements, and analyses of its original algorithm. The DST primarily finds applications in spectral analysis [2], time-frequency analysis [3], audio coding [4], image compression and coding [5], adaptive digital filtering [6,7], interpolation [8,9], image reconstruction [10], and classification [11,12,13,14].
The main problem with this transform, and with its close relative, the discrete cosine transform (DCT), is computational cost. These algorithms are computationally intensive and expensive to implement; to be usable in real-time applications, they must be restructured and reimplemented.
As previously mentioned in references [2,3,4,5,6,7,8,9,10,11,12,13,14], the DST-IV transform has a broad range of applications across various fields of science and technology, but its computational costs pose a significant challenge. In this paper, we introduce a new implementation method that can perform faster and requires fewer memory access cycles. As a result, all these applications [2,3,4,5,6,7,8,9,10,11,12,13,14] will operate with much lower latency, leaving more computing power available for other tasks and reducing power consumption when executing the same functions.
This paper introduces an innovative DST-IV algorithm characterized by a unique computational structure comprising either 4 or 8 short segments designed for parallel processing. Each segment exhibits remarkably low arithmetic complexity, owing to the specific implementation method employed and the utilization of precomputed coefficients. Ultimately, the algorithm is proficiently implemented on a system supported by a Graphics Processing Unit (GPU), achieving a reduced overall computational cost. The novel DST-IV algorithm was developed and experimentally validated on a GPU system; however, it is also adaptable for deployment across various parallel architectures, including those based on VLSI technology or other multi-core systems, such as Deep Learning Processing Units (DPU), Neural Processing Units (NPU), Tensor Processing Units (TPU), or processors equipped with multiple Central Processing Units (CPUs).
Various approaches exist to accelerate the execution of algorithms. Many of these methods rely on exploiting specific characteristics of the hardware they operate on, thereby enabling and facilitating certain implementations (e.g., systolic architecture [15], recursive implementations [16], pipeline architecture, utilization of shared memory, shared arithmetic units [17]). The new algorithm introduced in this paper aligns with this previously discussed category, as it leverages the parallelism inherent in different architectures. However, its fundamental concept involves the mathematical factorization of the classical DST-IV algorithm, aimed at reducing both the number of mathematical operations and memory accesses.
The research presented in this paper makes the following contributions to the field:
  • We have developed a novel algorithm for DST-IV that can be executed with high efficiency in parallel computing environments.
  • We employ specific computational architectures that can be efficiently reorganized through sub-expression sharing to achieve reduced arithmetic complexity, particularly by minimizing the number of multiplications.
  • In the DST-IV algorithm, we have further minimized the number of multiplications by replacing even the multiplications involving ½ and ¼, which are employed in the novel DCT-IV algorithm proposed previously in another research paper [18], with additions and subtractions. This optimization is based on the fact that internal matrices comprise only 1, −1, and 0.
  • We provide experimental validation through extensive empirical results and profiling (via NVIDIA Nsight Compute), confirming improvements in execution time, memory usage, and throughput.
  • We combine mathematical efficiency with hardware-level optimization for real-time signal and image processing.
The remainder of the paper is structured as follows: Section 2 surveys related research conducted by other scholars on various DST implementations that aim to enhance multiple objectives, including increasing speed, reducing silicon area, decreasing latency, lowering working memory consumption, and improving throughput. Section 3 presents the novel mathematical decomposition of the DST-IV algorithm, which will be implemented, tested, and analyzed in Section 6. The subsequent section, Section 4, provides a comprehensive description and comparative analysis of the Graphics Processing Units (GPUs) used in the algorithm testing. The materials and methodologies used in examining the novel algorithm are detailed in Section 5. In Section 7, the findings are discussed within the context of prior research. The concluding section provides the overall summary of the paper.

3. Proposed DST-IV Algorithm for a Parallel Implementation

Given the wide range of applicability of the DCT-IV and DST-IV transforms and the interest shown by the academic community, this paper presents a novel implementation of the DST-IV algorithm. The basic form of the DST-IV transform is appropriately reformulated to obtain, besides a parallel decomposition, an efficient reduced-complexity implementation derived through an appropriate subexpression sharing technique.
For a real input sequence $\{x(i) : i = 0, 1, \ldots, N-1\}$, the type-IV DST (DST-IV) is defined by the following equation [1]:

$$Y(k) = \sqrt{\frac{2}{N}} \sum_{i=0}^{N-1} x(i)\, \sin\left[(2i+1)(2k+1)\,\alpha\right] \tag{1}$$

where $k = 0, 1, \ldots, N-1$ and where:

$$\alpha = \frac{\pi}{4N} \tag{2}$$
To simplify the presentation, we remove the constant coefficient $\sqrt{2/N}$ from the DST-IV equation and reapply this multiplication at the end of the algorithm.
Equations (1) and (2) represent the classical definition of the DST-IV. They can be implemented in parallel, but not very efficiently, mainly because the arithmetic complexity can be reduced significantly, as in our proposed algorithm.
In the following, we propose a new parallel algorithm that breaks the computation into 4 or 8 computational structures that can be computed in parallel and that, thanks to the sub-expression sharing technique, also has a low arithmetic complexity, as shown below.
To obtain an efficient parallel algorithm, we reformulate Equation (1) using auxiliary input and output sequences whose elements have been properly reordered.
In the following, we consider the transform length to be the prime number N = 17.
We have used an auxiliary output sequence $\{T(k) : k = 1, 2, \ldots, N-1\}$ and the following auxiliary input sequences:

$$x_p(N-1) = x(N-1) \tag{3}$$

$$x_p(i) = (-1)^i\, x(i) - x_p(i+1) \tag{4}$$

for $i = N-2, \ldots, 1, 0$, and

$$x_a(i) = x_p(i)\,\cos(2i\alpha) \quad \text{for } i = 0, 1, \ldots, N-1 \tag{5}$$
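A minimal C sketch of this preprocessing stage, under the reconstruction above, is given below; the function and array names are ours, and cos_tab[] is assumed to hold the precomputed factors $\cos(2i\alpha)$ of relation (5):

    #define N 17

    /* Build x_p by the backward recursion (3)-(4), then x_a by pointwise
       scaling with the precomputed cosine factors of relation (5). */
    void dst4_preprocess(const double x[N], const double cos_tab[N], double xa[N])
    {
        double xp[N];
        xp[N - 1] = x[N - 1];                              /* relation (3) */
        for (int i = N - 2; i >= 0; i--)                   /* i = N-2, ..., 1, 0 */
            xp[i] = ((i & 1) ? -x[i] : x[i]) - xp[i + 1];  /* relation (4) */
        for (int i = 0; i < N; i++)
            xa[i] = xp[i] * cos_tab[i];                    /* cos_tab[i] = cos(2*i*alpha) */
    }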
For a compact expression of the following relations, we introduce the matrices:
$$A = \begin{bmatrix}
1 & 1 & 0 & 0 & 1 & 1 \\
1 & 1 & 0 & 0 & 1 & 1 \\
1 & 1 & 0 & 0 & 1 & 1 \\
1 & 1 & 0 & 0 & 1 & 1
\end{bmatrix} \tag{6}$$
$$C = \begin{bmatrix}
1 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 1
\end{bmatrix} \tag{7}$$
Using the above auxiliary input and output sequences and the sub-expression sharing technique, we obtained the following equations. They can be computed efficiently in parallel and have a reduced arithmetic complexity, allowing a GPU implementation that is significantly faster than other existing algorithms, including the classical one.
Thus, we have the following equations:
$$\begin{bmatrix} T(6) \\ T(14) \\ T(10) \\ T(12) \end{bmatrix}
= A \times \operatorname{diag}\!\left(C \times
\underbrace{\begin{bmatrix}
x_a(8) - x_a(9) + x_a(2) - x_a(15) \\
x_a(1) + x_a(16) + x_a(4) - x_a(13) \\
x_a(8) - x_a(9) + x_a(2) - x_a(15) \\
x_a(8) - x_a(9) - x_a(2) + x_a(15) \\
x_a(1) + x_a(16) - x_a(4) + x_a(13) \\
x_a(8) + x_a(9) + x_a(2) - x_a(15)
\end{bmatrix}}_{X_{A1}}\right) \times c_2
\;-\; A \times \operatorname{diag}\!\left(C \times
\underbrace{\begin{bmatrix}
x_a(7) - x_a(10) - x_a(6) + x_a(11) \\
x_a(3) - x_a(14) + x_a(5) - x_a(12) \\
x_a(7) - x_a(10) - x_a(6) + x_a(11) \\
x_a(7) - x_a(10) + x_a(6) - x_a(11) \\
x_a(3) - x_a(14) - x_a(5) + x_a(12) \\
x_a(7) + x_a(10) - x_a(6) + x_a(11)
\end{bmatrix}}_{X_{A2}}\right) \times c_1 \tag{8}$$

where the precomputed coefficient vectors $c_1$ and $c_2$ are:

$$c_1 = \begin{bmatrix}
2\cos(16\alpha) + 2\cos(4\alpha) \\
2\cos(32\alpha) + 2\cos(16\alpha) + 2\cos(8\alpha) + 2\cos(4\alpha) \\
2\cos(32\alpha) + 2\cos(8\alpha) \\
2\cos(16\alpha) - 2\cos(4\alpha) \\
2\cos(32\alpha) + 2\cos(16\alpha) - 2\cos(8\alpha) - 2\cos(4\alpha) \\
2\cos(32\alpha) - 2\cos(8\alpha)
\end{bmatrix},
\qquad
c_2 = \begin{bmatrix}
2\cos(28\alpha) + 2\cos(24\alpha) \\
2\cos(12\alpha) + 2\cos(28\alpha) + 2\cos(20\alpha) + 2\cos(24\alpha) \\
2\cos(12\alpha) + 2\cos(20\alpha) \\
2\cos(28\alpha) - 2\cos(24\alpha) \\
2\cos(12\alpha) + 2\cos(28\alpha) - 2\cos(20\alpha) - 2\cos(24\alpha) \\
2\cos(12\alpha) - 2\cos(20\alpha)
\end{bmatrix}$$
Furthermore, the decomposition can be continued with:
$$\begin{bmatrix} T(16) \\ T(8) \\ T(4) \\ T(2) \end{bmatrix}
= A \times \operatorname{diag}\!\left(C \times X_{A1}\right) \times B \times c_1
\;-\; A \times \operatorname{diag}\!\left(C \times
\underbrace{\begin{bmatrix}
x_a(3) - x_a(14) + x_a(5) - x_a(12) \\
x_a(6) + x_a(11) + x_a(7) - x_a(10) \\
x_a(3) - x_a(14) + x_a(5) - x_a(12) \\
x_a(3) - x_a(14) - x_a(5) + x_a(12) \\
x_a(6) + x_a(11) - x_a(7) + x_a(10) \\
x_a(3) + x_a(14) + x_a(5) - x_a(12)
\end{bmatrix}}_{X_{A3}}\right) \times B \times c_2 \tag{9}$$
and:
$$\begin{bmatrix} T(11) \\ T(3) \\ T(7) \\ T(5) \end{bmatrix}
= A \times \operatorname{diag}\!\left(C \times
\underbrace{\begin{bmatrix}
x_a(8) + x_a(9) + x_a(2) + x_a(15) \\
x_a(1) + x_a(16) + x_a(4) + x_a(13) \\
x_a(8) + x_a(9) + x_a(2) + x_a(15) \\
x_a(8) + x_a(9) - x_a(2) - x_a(15) \\
x_a(1) + x_a(16) - x_a(4) - x_a(13) \\
x_a(8) - x_a(9) + x_a(2) + x_a(15)
\end{bmatrix}}_{X_{A4}}\right) \times B \times c_2
\;+\; A \times \operatorname{diag}\!\left(C \times
\underbrace{\begin{bmatrix}
x_a(6) + x_a(11) + x_a(7) + x_a(10) \\
x_a(3) + x_a(14) + x_a(5) + x_a(12) \\
x_a(6) + x_a(11) + x_a(7) + x_a(10) \\
x_a(6) - x_a(11) + x_a(7) + x_a(10) \\
x_a(3) + x_a(14) - x_a(5) - x_a(12) \\
x_a(6) + x_a(11) - x_a(7) - x_a(10)
\end{bmatrix}}_{X_{A5}}\right) \times B \times c_1 \tag{10}$$
$$\begin{bmatrix} T(1) \\ T(9) \\ T(13) \\ T(15) \end{bmatrix}
= A \times \operatorname{diag}\!\left(C \times X_{A4}\right) \times B \times c_1
\;+\; A \times \operatorname{diag}\!\left(C \times
\underbrace{\begin{bmatrix}
x_a(3) + x_a(14) + x_a(5) + x_a(12) \\
x_a(6) + x_a(11) + x_a(7) + x_a(10) \\
x_a(3) + x_a(14) + x_a(5) + x_a(12) \\
x_a(3) + x_a(14) - x_a(5) - x_a(12) \\
x_a(6) + x_a(11) - x_a(7) - x_a(10) \\
x_a(3) - x_a(14) + x_a(5) + x_a(12)
\end{bmatrix}}_{X_{A6}}\right) \times B \times c_2 \tag{11}$$
In Equations (8)–(11), we used the notation $\operatorname{diag}(a_0, a_1) = \begin{bmatrix} a_0 & 0 \\ 0 & a_1 \end{bmatrix}$, extended analogously to vectors with more elements.
To obtain the output values, the following coefficients must first be computed.
$$T_a(0) = 2 \sum_{i=1}^{N-1} (-1)^i\, x_a(i) \tag{12}$$

$$T_a(i) = T(i) - T_a(i-1) \tag{13}$$

for $i = 1, \ldots, N-1$.
In the end, the DST-IV values are computed based on:
$$\begin{bmatrix} Y(6) \\ Y(14) \\ Y(10) \\ Y(12) \end{bmatrix}
= \begin{bmatrix}
(x_a(0) + T_a(6))\,\sin(13\alpha) \\
(x_a(0) + T_a(14))\,\sin(29\alpha) \\
(x_a(0) + T_a(10))\,\sin(21\alpha) \\
(x_a(0) + T_a(12))\,\sin(25\alpha)
\end{bmatrix} \tag{14}$$

$$\begin{bmatrix} Y(16) \\ Y(8) \\ Y(4) \\ Y(2) \end{bmatrix}
= \begin{bmatrix}
(x_a(0) + T_a(16))\,\sin(33\alpha) \\
(x_a(0) + T_a(8))\,\sin(17\alpha) \\
(x_a(0) + T_a(4))\,\sin(9\alpha) \\
(x_a(0) + T_a(2))\,\sin(5\alpha)
\end{bmatrix} \tag{15}$$

$$\begin{bmatrix} Y(11) \\ Y(3) \\ Y(7) \\ Y(5) \end{bmatrix}
= \begin{bmatrix}
(x_a(0) + T_a(11))\,\sin(23\alpha) \\
(x_a(0) + T_a(3))\,\sin(7\alpha) \\
(x_a(0) + T_a(7))\,\sin(15\alpha) \\
(x_a(0) + T_a(5))\,\sin(11\alpha)
\end{bmatrix} \tag{16}$$

$$\begin{bmatrix} Y(1) \\ Y(9) \\ Y(13) \\ Y(15) \end{bmatrix}
= \begin{bmatrix}
(x_a(0) + T_a(1))\,\sin(3\alpha) \\
(x_a(0) + T_a(9))\,\sin(19\alpha) \\
(x_a(0) + T_a(13))\,\sin(27\alpha) \\
(x_a(0) + T_a(15))\,\sin(31\alpha)
\end{bmatrix} \tag{17}$$
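Under the above reconstruction, the scalar part of the output stage is short; a minimal C sketch follows, where sin_tab[] is assumed to hold the precomputed factors $\sin((2k+1)\alpha)$, the removed scale factor $\sqrt{2/N}$ is reapplied, and the handling of $Y(0)$ (not covered by the excerpted relations) is left out:

    #include <math.h>
    #define N 17

    /* Output stage: T_a recursion (12)-(13) and final scaling (14)-(17). */
    void dst4_output(const double xa[N], const double T[N],
                     const double sin_tab[N], double Y[N])
    {
        double Ta[N], s = 0.0;
        for (int i = 1; i < N; i++)          /* relation (12) */
            s += (i & 1) ? -xa[i] : xa[i];
        Ta[0] = 2.0 * s;
        for (int i = 1; i < N; i++)          /* relation (13) */
            Ta[i] = T[i] - Ta[i - 1];
        for (int k = 1; k < N; k++)          /* relations (14)-(17) */
            Y[k] = sqrt(2.0 / N) * (xa[0] + Ta[k]) * sin_tab[k];
    }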
In the following sections, we will discuss the efficient parallel implementation of the proposed algorithm using a GPU architecture.

4. Overview of Used NVIDIA Development Boards

In this paper, three distinct NVIDIA CUDA architectures, representing two separate classes of GPU devices, were selected to demonstrate the algorithm’s performance. The first is the consumer class, which focuses on high throughput, compatibility with gaming engines such as Unreal and Unity, and support for machine learning frameworks such as TensorFlow 2 and PyTorch 2.9. The second class of GPU devices is dedicated to edge AI applications, such as intelligent robotics systems. In this way, the reader can obtain a clear and comprehensive overview of the algorithm’s performance compared to the traditional implementation of the DST-IV.
Modern GPUs are designed with varying characteristics, targeting different markets and workloads. As summarized in Table 1, GPUs can be categorized into four levels: embedded (e.g., Jetson Xavier, Orin), consumer-grade GPUs (e.g., GeForce RTX 30/40/50-series), workstation/professional-grade GPUs (e.g., RTX A4000, A5000, A6000), and data center accelerators (e.g., H100, A100, GH200). Each level offers a unique combination of computing power, memory bandwidth, reliability (including ECC memory features), and power efficiency, thereby affecting their appropriateness for different types of workloads.
Table 1. The main classes of NVIDIA GPU devices.
In this research, the developed algorithm was tested on three different types of GPUs produced by NVIDIA. The main differences among these GPUs stem from their architectures: Volta, Ampere, and Ada Lovelace. Although all three were developed by NVIDIA, each represents a different GPU generation: Volta was designed in 2017, Ampere in 2020, and Ada Lovelace in 2022. Each brings technological improvements in performance, power efficiency, and feature set, targeting different market segments.
The NVIDIA Jetson family includes a wide range of embedded computing platforms that vary significantly in computational power, energy efficiency, and physical size. Generally, Jetson Nano boards are intended for beginner AI and robotics projects, where low power consumption and cost savings are more important than high performance. The Jetson lineup features models like Jetson Nano, Jetson NX, and Jetson AGX, which are available in different versions, such as Orin or Xavier. In this hierarchy, the Jetson AGX series—consisting of Jetson AGX Xavier and Jetson AGX Orin—offers the highest performance and most advanced features. These modules provide workstation-level processing in an embedded package, making them perfect for autonomous systems, industrial robots, and edge AI inference. The Jetson NX development boards offer a mid-range option, combining the energy efficiency of the Nano series with the processing power of the AGX line. This makes them a flexible choice for embedded AI applications that need both mobility and high performance.
Each development system has unique features that influence its overall performance, which can sometimes be confusing when comparing the results of the same algorithm run on different development boards. Considering memory capabilities alone, the Jetson AGX Orin system provides transfer rates of 204.8 GB/s, surpassing the Jetson AGX Xavier system’s 136.5 GB/s (Table 2)—an expected advantage of a newer generation that embeds a more powerful GPU. In contrast, the personal computer hosting the most powerful GPU offers a host memory transfer rate of only 120 GB/s (the lowest of the three systems analyzed), although the GPU’s internal memory allows a transfer rate of 1.01 TB/s.
Table 2. The main features of the development systems used in this research.
The Jetson AGX Xavier offers moderate computational capabilities, being equipped with 512 CUDA cores, 64 Tensor Cores, and an 8-core NVIDIA Carmel ARMv8.2 processor, rendering it appropriate for fundamental applications such as object detection, simultaneous localization and mapping (SLAM), and image classification. It achieves up to 32 TOPS of artificial intelligence performance within a configurable power range of 10–30 W. The Jetson AGX Orin represents a significant enhancement in computational power, offering approximately eight times the performance of the Jetson AGX Xavier and attaining up to 275 TOPS at a power range of 15–60 W. It can execute multiple large-scale models, including transformer-based architectures, in real time for three-dimensional perception. The RTX 4090 offers an exceptional memory bandwidth of 1.01 TB/s, suitable for demanding tasks such as large-scale inference, generative workloads (including large language models (LLMs) and diffusion models), 8K gaming, and scientific simulations. The RTX 4090 is equipped with 16,384 CUDA cores, 512 Tensor Cores, and 24 GB of GDDR6X memory, requiring an input power exceeding 450 W.
The Jetson AGX Orin attains an exceptional equilibrium between energy efficiency and inference capability, establishing itself as the preferred platform for power-sensitive and latency-critical applications at the edge. While the Jetson AGX Xavier remains suitable for less complex AI tasks, it is increasingly constrained by its computational and memory limitations. Conversely, the RTX 4090 delivers unparalleled performance for training and inference of large models; however, it is unsuitable for embedded environments due to its substantial thermal and power demands.
As expected, the Jetson Xavier and Orin modules primarily comprise several CPU cores (8 and 12, respectively) and a graphics processing unit (GPU). To utilize standard interfaces such as Ethernet, USB, and General-Purpose Input/Output (GPIO) pin headers—which facilitate access to UART, SPI, CAN, I2C, I2S, and other communication lines—a carrier board is indispensable for development activities.
In NVIDIA’s architectures, the CUDA cores are grouped into Streaming Multiprocessors (SMs). Each SM contains a fixed number of CUDA cores that depends on the architecture; in the Ampere architecture, for example, each SM has 128 CUDA cores. Consequently, the GPU on the Jetson AGX Orin development board has 16 SMs.

5. Materials and Methods

Given that this paper advances the implementation of the DST-IV algorithm, the newly developed algorithm will be compared with the classical algorithm described by relation (1). In Section 7, the newly developed algorithm is also compared with two other recent algorithms that represent alternative state-of-the-art implementations of the DST-IV algorithm [15,19].
The C implementation of the classical DST-IV algorithm [1], as described by relation (1), was executed on each GPU used and served as a reference for all the implemented algorithms. We also conducted an additional reference analysis in which relation (1) was implemented across 17 threads of execution, each operating on a different CUDA core and each computing one value of k from 0 to 16. This decision is motivated by the intention to emphasize solely the performance enhancement attributable to the new implementation, without considering the performance gains resulting from a more advanced architecture or an increased operating frequency of a particular processor type.
In this context, using Equation (1), the sequence of N real numbers, x(i), is transformed through the DST-IV into another output sequence of real numbers, Y(k), with the same length as the input sequence. This traditional implementation was executed on a single CUDA core during performance analysis. In the classical DST-IV implementation, the “sin” function was the default double-precision library implementation, unlike “__nv_fast_sinf”, which offers a faster but less accurate way to calculate the sine.
In this research, the measurement cycle consists of three sequential stages: first, transferring data from the Central Processing Unit (CPU) to the Graphics Processing Unit (GPU); second, performing a DST-IV transform, as defined by relation (1) or by one of the novel Discrete Sine Transform (DST) IV implementations introduced in this study—either the four-section or the eight-section configuration; and third, transferring data from the GPU back to the CPU. Consequently, in all the determinations performed, the measured durations include: (1) the data transfer times between the CPU and GPU in both directions, and (2) the data processing time needed to obtain the result using the new DST-IV algorithm. The latter covers the calculation time within the CUDA cores, as well as the time required to access internal variables and data structures within the GPU needed by the DST-IV algorithm.
The data set used as input for calculating the DST-IV transforms was randomly generated and scaled at the beginning of each measurement cycle using the “rand()” function from the C programming language. Ultimately, all generated values fall within the range of [−1, +1]. On a GPU, arithmetic operations are performed on SIMD (Single Instruction, Multiple Data) hardware. Specifically, several fma.rn.f32 assembly instructions were utilized to implement the program as primary components. These instructions execute a fused multiply–add (fma) operation on 32-bit floating-point numbers. Consequently, a PTX (Parallel Thread Execution) instruction such as fma.rn.f32 d, a, b, c, is equivalent to a fused multiply–add (computing ab + c), followed by rounding the result to the nearest (.rn—round to nearest) representable 32-bit floating-point number (adhering to the IEEE 754 standard) before storing the value in the operand d. All these operations are executed within a single clock cycle per pipeline. As a result, the multiplication and addition operations occur within a constant time interval irrespective of the operand values. Therefore, the data on which the algorithm was tested did not influence its final performance outcomes.
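As a minimal illustration, this contraction can also be requested explicitly in CUDA C through the standard __fmaf_rn() intrinsic; the kernel below is our own example, not code from the measured implementation:

    __global__ void fma_demo(const float *a, const float *b, const float *c, float *d)
    {
        int i = threadIdx.x;
        /* compiles to a single fma.rn.f32 instruction: d = a*b + c, with one
           round-to-nearest step (IEEE 754) */
        d[i] = __fmaf_rn(a[i], b[i], c[i]);
    }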
Based on the measurement cycle outlined above, one determination is the average value of 10,000 sequential measurement cycles. For the statistical analysis of the performance values obtained on each architecture, 100 such determinations were conducted on each development system, and statistical parameters such as the average value, standard deviation, minimum, and maximum were then calculated. All measurements were made at the minimum operating frequency of the GPUs. This approach offers valuable insight into the algorithm’s efficiency and robustness under constrained conditions, supports worst-case analysis, and demonstrates that high-speed performance can be obtained at low power consumption.
The computation time for each measurement cycle was determined using events from the CUDA event API. Functions such as “cudaEventCreate()”, “cudaEventRecord()”, and “cudaEventElapsedTime()” were used to measure the elapsed time. This measurement method offers a resolution of about half a microsecond [36]. This approach to measuring time is well-known and accepted for performance evaluations related to GPUs [18,36,37,38,39].
The DST-IV algorithms (both classical and parallel implementations) ran on the GPUs as kernel functions mapped as one thread per block, such as: DST4_clasic <<<1, 1>>> (in_gpu, out_gpu), DST4_4sections <<<4, 1>>> (in_gpu, out_gpu), and DST4_8sections <<<8, 1>>> (in_gpu, out_gpu). In all measurements, the internal data, constants, input, and output data buffers were stored in global memory unless otherwise specified.
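A condensed sketch of one measurement cycle, combining the CUDA event timing described above with the launch configurations just listed, is shown below (buffer names and the transfer size are illustrative):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    /* stage 1: CPU -> GPU transfer */
    cudaMemcpy(in_gpu, in_cpu, 17 * sizeof(double), cudaMemcpyHostToDevice);
    /* stage 2: DST-IV kernel, here the four-section variant (4 blocks x 1 thread) */
    DST4_4sections<<<4, 1>>>(in_gpu, out_gpu);
    /* stage 3: GPU -> CPU transfer */
    cudaMemcpy(out_cpu, out_gpu, 17 * sizeof(double), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float elapsed_ms = 0.0f;
    cudaEventElapsedTime(&elapsed_ms, start, stop);  /* resolution of ~0.5 us [36] */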
Figure 1 demonstrates the operation of the algorithm on both the Central Processing Unit (CPU) and the four CUDA cores within the Graphics Processing Unit (GPU). It depicts the parallel implementation of the novel DST-IV algorithm, segmented into four sections. Furthermore, it is evident from Figure 1 that the four execution threads, each allocated to a CUDA core on the GPU, neither exchange data among themselves nor synchronize their processes in any way. Each thread computes four distinct coefficients from the result buffer provided by the new algorithm, which is executed on the GPU and subsequently transmitted to the CPU.
Figure 1. A graphical representation of the novel DST-IV (with 4 sections) algorithm implementation.
Additionally, the NVIDIA Nsight Compute package was used to profile the CUDA applications for GPU utilization, data transfer, and memory workload.
The programs were compiled using the NVIDIA CUDA official compiler (“nvcc”), which is part of the NVIDIA JetPack SDK. The SDK provides a comprehensive environment for developing C and C++ GPU-accelerated programs and AI applications optimized for each specific CUDA architecture. To enhance performance, the programs were compiled specifically for each GPU’s architecture and version, see Table 2. This approach ensures that the resulting binary code is compatible with the target architecture and delivers optimal performance.
The following line outlines the NVIDIA compiler arguments:
$ nvcc source_code_name.cu -o binary_name -v -arch=xxx
The xxx value is specific to the target architecture for which the program was compiled; the arguments used are listed in Table 3. The same table also presents the version of the CUDA Toolkit utilized in the development and testing of the applications.
Table 3. The specific architecture flag used to compile the developed applications.
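For illustration, plausible invocations for the three boards are given below; the sm_XX values are our assumption based on the published compute capabilities of these devices (7.2 for the Volta-based Xavier, 8.7 for the Ampere-based Orin, 8.9 for Ada Lovelace), while the flags actually used are those listed in Table 3:

    $ nvcc dst4_8sections.cu -o dst4_8sections -v -arch=sm_72   # Jetson AGX Xavier (Volta)
    $ nvcc dst4_8sections.cu -o dst4_8sections -v -arch=sm_87   # Jetson AGX Orin (Ampere)
    $ nvcc dst4_8sections.cu -o dst4_8sections -v -arch=sm_89   # GeForce RTX 4090 (Ada Lovelace)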

6. Experimental Results

To accurately measure the performance of the parallel implementation of the proposed novel DST-IV algorithm, each section was executed on a different CUDA core, and no other applications were running on any GPU core during these tests. The results obtained by each algorithm are presented in the following tables through these parameters: the mean execution time with its standard deviation (Mean ± SD), and the minimum and maximum execution times for each specific measurement condition (GPU architecture)—listed in red in the tables. All algorithms were implemented using a double-precision numeric format for the data.
The results obtained are presented in Table 4 and Table 5. The reference was always a measurement of the classical implementation of DST-IV (relation (1)), executed on a single core of one of the GPU architectures used.
Table 4. Performances achieved by running the two DST-IV computing algorithms (the classical algorithm and the parallel implementation with four sections) on different GPU architectures.
Table 5. Performances achieved by running the two DST-IV computing algorithms (classical algorithm and the parallel implementation with eight sections) on different GPU architectures.
The acceleration attained by the proposed innovative algorithm compared to the traditional one (which relies on the direct implementation of relation (1)) is presented in the final row of Table 4 and Table 5. This acceleration factor is determined as the ratio of the mean execution time of the traditional algorithm to that of the targeted algorithm, employing either 4 or 8 sections.
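Formally, the speedup reported in the last row of each table is:

$$S = \frac{\bar{t}_{\text{classical}}}{\bar{t}_{\text{proposed}}}$$

where each mean execution time $\bar{t}$ is obtained over the 100 determinations described in Section 5.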
Table 4 shows the results for the algorithm configured with four sections running in parallel, while Table 5 presents the results for the innovative algorithm with eight sections operating in parallel.
According to Table 4, the DST-IV algorithm, when implemented across four sections, exhibits performance enhancement ranging from 22.42 to 50.46. The performance of the DST-IV algorithm, implemented across eight sections, demonstrates improvements in the speedup factor ranging from 24.33 to 65.25. This can be partially explained by the nature of the proposed parallel algorithm, which, in addition to parallel decomposition, incorporates an algorithm characterized by low arithmetic complexity. This is achieved by utilizing the subexpression sharing technique, minimizing memory accesses, and using constant values that are calculated, stored, and used later.
The Ada Lovelace architecture in the GeForce RTX 4090 system shows a greater speedup in executing the new algorithm relative to the classical one than older architectures such as the Ampere GPU in the Jetson AGX Orin board: 50.46 versus 23.35 for the DST-IV algorithm with 4 parallel sections (Table 4), and 65.25 versus 28.16 for the DST-IV algorithm with 8 parallel sections (Table 5). This is a notable observation. It demonstrates that a more powerful, newer architecture achieves higher speed gains with the innovative algorithm than older architectures do, underscoring NVIDIA’s ongoing efforts to enhance its architectures and eliminate the limitations of older designs.
Analyzing relation (1), it is evident that the computation of the classical DST-IV transform for 17 elements necessitates the calculation of 17 sums of products that are mutually independent. Accordingly, this classical algorithm can be efficiently implemented in parallel across 17 distinct segments. Consequently, a pertinent question emerges: what would be the performance of the novel developed algorithm if the reference point is the implementation of the classical algorithm with 17 segments executed concurrently? Table 6 and Table 7 answer this question.
Table 6. Performances achieved by running the two DST-IV computing algorithms (the classical algorithm implemented with 17 sections and the parallel implementation with four sections) on different GPU architectures.
Table 7. Performances achieved by running the two DST-IV computing algorithms (the classical algorithm implemented with 17 sections and the parallel implementation with eight sections) on different GPU architectures.
The performance gains are between 2.49 and 3.45 for the new algorithm implemented with 4 sections, and between 2.72 and 4.54 for the new algorithm with 8 sections. Analyzing these results, presented in Table 6 and Table 7, reveals that the new algorithm remains faster than the classic DST-IV algorithm, which is implemented with 17 sections running in parallel. The methodology employed to acquire these results is similar to that used for obtaining the data presented in Table 4 and Table 5. Furthermore, these results also include the duration necessary for data transfer between the CPU and GPU, and vice versa.
The execution time for each algorithm mainly depends on three factors: (1) the number and type (integer, float, double, bfloat, etc.) of computations performed on specific hardware (CPU, GPU, NPU, etc.), (2) the number of memory accesses and the type of memory accessed, and (3) how the algorithm is implemented (e.g., parallel or sequential).
By counting the operations performed by an algorithm—such as addition, subtraction, division, multiplication, or any other operation involving floating-point values—one can determine its FLOPs (Floating Point Operations). The FLOP count serves as an estimate of the algorithm’s complexity. Table 8 shows the number of FLOPs required by each kernel (counting only the component(s) executed on the GPU) to complete its execution.
Table 8. The number of Floating-Point Operations (FLOPs) required by an algorithm to complete its execution.
The information in Table 8 was derived using NVIDIA Nsight Compute, a profiling tool that provides detailed performance metrics for software components (kernels) running on CUDA cores. To obtain the values presented in the table, the recorded count of DFMA (Double-precision Fused Multiply–Add) instructions, captured by NVIDIA Nsight Compute, was multiplied by two. The classical algorithm consumes by far the most FLOPs, mainly due to its use of the “sin” library function. Additionally, it should be noted that the eight-section algorithm requires slightly fewer instructions to finish than the four-section implementation, and some of these instructions are executed in parallel across the eight sections.
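For reference, such counts can be collected from the command line; the metric below is Nsight Compute’s counter for executed double-precision FMA instructions, and the binary name is illustrative:

    $ ncu --metrics sm__sass_thread_inst_executed_op_dfma_pred_on.sum ./dst4_8sections

Multiplying the reported sum by two yields the FLOPs values in Table 8, since each DFMA performs one multiplication and one addition.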
Another aspect relevant to the results in Table 4 and Table 5 is that an NVIDIA GPU utilizes various types of memory, each with distinct characteristics and purposes: registers, shared memory, surface memory, texture memory, local memory, and global memory. Figure 2 shows a schematic of all the types of memory present in a GPU; in this case, the GPU is part of the Jetson AGX Xavier development board.
Figure 2. A comprehensive analysis of the GPU memory workload utilizing the NVIDIA Nsight Compute tool.
Global memory is the largest type of memory on a GPU, but it also has the slowest access speed. It can be accessed by all threads running on the GPU and the host CPU. Shared memory, located on the GPU chip, provides faster access but is limited to threads within the GPU. Local memory is an abstraction of global memory used for thread-local variables. Texture and surface memory are specialized types optimized for specific data access patterns; however, they are not helpful for the algorithm described in this paper.
The developed algorithms (with 4 and 8 sections) do not use shared memory, surface memory, or texture memory. In Table 9, the number of memory accesses to local and global memories is shown. The data were obtained using the NVIDIA Nsight Compute tool. For the classical implementation of DST-IV, the same information from Table 9 is also displayed in Figure 2.
Table 9. Number of requests from global and local memory generated by the classical algorithm and by the developed algorithms (with 4 and 8 sections).
Memory poses a significant limitation for nearly all applications because it operates more slowly than the CUDA core units. As shown in Table 9, the new algorithm introduced in this study is more efficient in terms of the read/write operations performed. The classical algorithm needs a total of 2310 read requests and 306 write requests to global memory, along with 289 write requests to local memory, to complete its execution. The new algorithms (with 4 and 8 sections) do not utilize local memory and require 64 reads from global memory, along with 16 writes for the 4-section algorithm and 32 writes for the 8-section algorithm, to accomplish their goals.
Considering the substantial influence of memory access on overall performance, all kernel variables and constants were stored in shared memory (located in the L1 cache of the Streaming Multiprocessor) in the following analysis. The performance of the new DST-IV algorithm in this configuration was evaluated in accordance with the methodology detailed in Section 5. The results are presented in Table 10.
Table 10. The analysis of the new algorithm (with 4 and 8 sections) considers the case when the variables and constants of each kernel function are placed in shared memory.
An analysis of the data presented in Table 10 indicates that the inclusion of internal constants and variables of each kernel within shared memory results in marginal enhancements in the execution durations of the revised kernel—0.045407 ms compared to 0.045006 ms in the case of the four-section algorithm, and 0.038822 ms relative to 0.038509 ms for the eight-section algorithm. These slight improvements are primarily attributable to the limited number of memory accesses (as listed in the preceding table, Table 9) performed by the new algorithm. This characteristic constitutes one of the fundamental advancements and distinctive features introduced by the new algorithm.
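For illustration, a minimal sketch of how a kernel can stage its precomputed coefficients in shared memory is given below; the kernel name, table size, and variable names are ours, not the paper’s actual code:

    #define COEF_COUNT 16                       /* illustrative table size */
    __device__ double coef_global[COEF_COUNT];  /* coefficient table in device (global) memory */

    __global__ void DST4_section_shared(const double *in, double *out)
    {
        __shared__ double coef[COEF_COUNT];  /* resides in the SM's L1/shared memory */
        if (threadIdx.x == 0)
            for (int j = 0; j < COEF_COUNT; j++)
                coef[j] = coef_global[j];    /* one-time copy from global memory */
        __syncthreads();                     /* make the copies visible to the block */
        /* ... the section body then reads coef[] instead of global memory ... */
    }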
To evaluate the performance of all three methods for implementing the DST-IV algorithm—namely, the classical approach based on relation (1), the new algorithm with four sections, and the version with eight sections—Figure 3 shows the throughput concerning the GPU’s computing and memory resources during a specific working period outlined in Table 11. All data was collected using the NVIDIA Nsight Compute tool on the Jetson AGX Xavier development board.
Figure 3. The GPU throughput for an SM (Streaming Multiprocessor) unit is shown, with violet representing the classical implementation, green indicating a parallel implementation with 4 sections, and blue showing a parallel implementation with 8 sections.
Table 11. The execution time of each kernel unit implementation.
The results in Figure 3 are presented per SM unit, indicating the utilization percentage relative to the theoretical maximum limit.
The traditional implementation of the DST-IV algorithm achieved 4.13% of the total computing throughput of an SM unit and 0.23% memory throughput during a 2.65 ms kernel execution time, as shown in Table 11. Under the same conditions, the new four-section parallel implementation of the DST IV algorithm reached a computational throughput of 15.93% and 2.17% memory throughput in just 66.24 microseconds. The highest computational throughput was obtained with the eight-section algorithm, which achieved 20.51%.
To attain a thorough understanding and a comprehensive overview of the novel DST-IV algorithm, this algorithm (implemented with four and eight sections) was compared with the traditional implementation of DST-IV executed on a single-threaded CPU core, as well as with a sequential implementation of the new algorithm on a CPU, as detailed in Table 12. The tests were conducted on a Jetson AGX Orin GPU (2048 CUDA cores, operating at 306 MHz) and a Cortex-A78AE CPU (12 cores, operating at 345.6 MHz); the Cortex-A78AE CPU is integrated into the Jetson AGX Orin development board.
Table 12. Algorithm’s performance analysis: CPU versus GPU.
To establish a fair and transparent baseline for comparing the results of algorithms run on a CPU and on a GPU (devices that differ both in architecture and in the main concepts supporting those architectures), the CPU’s operating frequency was chosen to be as close as possible to the GPU’s operating frequency. The CPU’s operating frequency can only be set in software to specific fixed operating points; the points closest to the GPU’s operating frequency (306 MHz) are 268 MHz and 345.6 MHz. From this perspective, the CPU works at a slightly higher frequency than the GPU in this analysis. The execution time is inversely proportional to the CPU frequency. Assuming an inverse linear relationship between execution time and CPU frequency, the execution time for the classical implementation of the DST-IV algorithm on a CPU operating at 306 MHz is estimated at 44.28 ms. Under identical conditions, the execution time for the newly developed algorithm, implemented sequentially on the CPU, is 79.24 ms. In the results shown in Table 12, the time taken for data transfer between the CPU and GPU, in both directions, was also included in the algorithms’ execution times on the GPU.
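The scaling applied above is, explicitly:

$$t_{306\ \text{MHz}} \approx t_{345.6\ \text{MHz}} \times \frac{345.6}{306} \approx 1.129 \times t_{345.6\ \text{MHz}}$$

so the 44.28 ms estimate for the classical CPU implementation corresponds to a measured value of roughly 39.2 ms at 345.6 MHz.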
The time required to execute the classic algorithm on the CPU (44.28 ms) is similar to that of the 4-section algorithm on the GPU (45.41 ms) and longer than that required by the new 8-section algorithm (38.82 ms). In this instance, the overall benefit of the new algorithm (with 4 sections) is derived from an additional factor: even a high-performance processor like the Intel Core i9-14900K, with 24 cores, can execute only 24 conventional DST-IV algorithms concurrently, whereas, within the same timeframe, the GeForce RTX 4090 graphics card, equipped with 16,384 CUDA cores, can perform 4096 DST-IV transformations (with four sections). This analysis presumes that both the CPU and GPU operate at identical frequencies.

7. Discussion

Comparing different algorithms that implement the DST-IV is a difficult task. The various DST implementations run on different development systems (FPGA, ASIC, GPU, or CPU) operating at different working frequencies and having different architectures; some are based on different frameworks (such as Matlab [21], which is known to be a slow development environment); and many rely on technical specifics of the platform on which they run that are not found on other development systems.
For these reasons, we chose to implement two distinct DST-IV algorithms from the literature, namely [15,19], and to evaluate them under conditions identical to those used for testing the algorithm described in this paper. The analytical methodology used to test these algorithms is identical to the one employed in Section 5. Furthermore, both algorithms exhibit conceptual similarities to the algorithm presented herein, being founded on the decomposition of the traditional DST-IV algorithm into four sections [15] and six sections [19] that operate concurrently.
The results are presented in Table 13 and were acquired using the Jetson AGX Orin development board. On the second and third lines, the performances of the novel proposed algorithm for the cases of implementation on 4 and 8 sections are presented. These results are from Table 4 and Table 5. In the last two lines, the performances of the algorithms [15,19] are presented.
Table 13. Performance analysis of several DST-IV implementations (done on Jetson AGX Orin development board) when the GPU frequency was set to the minimum value.
One of the primary differences between the proposed algorithm and the two implemented algorithms from the literature [15,19] pertains to the number of working samples: References [15,19] utilize 13 samples, whereas the proposed algorithm employs 17 samples. This is why the execution times of the classical algorithm in Table 13 differ, one being around 1 ms and the other 0.61 ms.
As a consequence, a direct comparison of execution times is not meaningful, even though the proposed algorithm is faster than the algorithms of [15,19]—specifically, 0.045407 ms and 0.038822 ms versus 0.058482 ms and 0.055502 ms. Instead, the speed improvement of each of the four DST-IV implementations was computed relative to a reference algorithm: the classical implementation of DST-IV, based on relation (1), using the same number of samples as the algorithm it was compared to.
Based on the data provided in Table 13, it is evident that the newly proposed algorithm outperforms the algorithms described in [15,19]. This increase in performance of the new algorithm is impressive, being more than twice that of the algorithms presented in [15,19]—23.35 and 28.16 versus 10.48 and 11.16.
Previously, we implemented a faster parallel execution method for the DCT-IV algorithm [18]. Comparing the performance of the new algorithm with that previous one, the new DST-IV implementation clearly and significantly exceeds the old performance: for example, on the Jetson AGX Xavier development system, the speedup factor for the 4-section algorithm was 12.1 [18], while for the current algorithm it is 22.42.
However, because the novel DST-IV has a different form and was derived using other methods and equations compared to DCT-IV [18], the two algorithms differ. In both cases, we aim to highlight 4 or 8 computational structures that can be computed in parallel. Moreover, in the DST-IV algorithm, we have further reduced the number of multiplications to a minimum, replacing also the multiplications with ½ and ¼ that are used in the DCT-IV algorithm [18] with additions and subtractions only, because matrices A and C contain only 1, −1, and 0—see relations (6) and (7) presented above.
A potential drawback of the novel proposed algorithm is its limitation to parallelization on only four or eight distinct sections. This restriction primarily arises from the mathematical factorization method underlying its implementation. However, due to data communication overhead, the performance enhancements are not substantial when increasing the number of parallel sections, as evidenced by our analysis when transitioning from 4 to 8 sections.

8. Conclusions

All the results shown earlier, especially the comparison with other cutting-edge DST-IV algorithms, clearly demonstrate that the newly developed DST-IV algorithm greatly surpasses all expectations.
The enhancement in execution speed ranges from 22.42 to 50.46 times for the implementation divided into four sections (see Table 4), and from 24.33 to 65.25 times when applying the DST-IV algorithm based on eight independent sections (see Table 5). These performances arise from a combination of several factors: the enhanced factorization technique of the DST-IV algorithm, the reduced number of mathematical operations involved, the a priori calculation of the coefficients, a low number of memory accesses, and their implementation in parallel.
Another significant aspect to consider is that the Ada Lovelace architecture offers considerable enhancements in execution speed compared to the Volta and Ampere architectures, resulting in more than double the performance.
As future work, we are considering investigating how speed performance can be enhanced by increasing the degree of parallelism, with particular attention to the impact of data transfers that may constrain speed improvements as parallelism increases. Additionally, we plan to explore alternative parallel architectures suitable for implementing our algorithms. Furthermore, we are considering implementing other discrete transforms on GPU architectures and examining their specific features to adapt our algorithms accordingly for such parallel systems.
In conclusion, based on the results obtained, it is evident that the novel DST-IV algorithm presented in this paper demonstrates exceptional performance, establishing it as a leading example among the contemporary state-of-the-art DST-IV algorithms.

Author Contributions

Conceptualization, D.F.C. and D.M.D.; methodology, D.F.C. and D.M.D.; software, D.M.D.; validation, D.F.C. and D.M.D.; investigation, D.M.D.; resources, D.M.D.; writing—original draft preparation, D.F.C. and D.M.D.; writing—review and editing, D.M.D. and D.F.C.; supervision, D.F.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ASIC: Application-specific integrated circuit
CPU: Central Processing Unit
CUDA: Compute Unified Device Architecture
DCT: Discrete Cosine Transform
DPU: Deep Learning Processing Unit
DST: Discrete Sine Transform
FP16, 32, 64: Floating-point on 16, 32, or 64 bits
FPGA: Field-programmable gate array
GB: Gigabyte
GHz: Gigahertz
INT1, 4, 8: Integer representation on 1, 4, or 8 bits
GPU: Graphics Processing Unit
MHz: Megahertz
NPU: Neural Processing Unit
TF32: TensorFloat-32
FLOPs: Floating Point Operations
FLOPS: Floating Point Operations Per Second
TFLOPS: Tera FLOPS
TOPS: Tera Operations Per Second
TPU: Tensor Processing Unit
VLSI: Very-large-scale integration

References

  1. Jain, A.K. A sinusoidal family of unitary transforms. IEEE Trans. Pattern Anal. Mach. Intell. 1979, 1, 356–365. [Google Scholar] [CrossRef]
  2. Wang, Y.; Veluvolu, K.C. Time-Frequency Analysis of Non-Stationary Biological Signals with Sparse Linear Regression based Fourier Linear Combiner. Sensors 2017, 17, 1386. [Google Scholar] [CrossRef]
  3. Yan, J.; Laflamme, S.; Singh, P.; Sadhu, A.; Dodson, J. A Comparison of Time-Frequency Methods for Real-Time Application to High-Rate Dynamic Systems. Vibration 2020, 3, 204–216. [Google Scholar] [CrossRef]
  4. Suresh, K.; Sreenivas, T.V. Linear filtering in DCT IV/DST IV and MDCT/MDST domain. Signal Process. 2009, 89, 1081–1089. [Google Scholar] [CrossRef]
  5. Rose, K.; Heiman, A.; Dinstein, I. DCT/DST alternate-transform image coding. IEEE Trans. Commun. 1990, 38, 94–101. [Google Scholar] [CrossRef]
  6. Wang, J.L.; Ding, Z.Q. Discrete sine transform domain LMS adaptive filtering. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Tampa, FL, USA, 26–29 March 1985; pp. 260–263. [Google Scholar]
  7. Shi, J.; Zheng, J.; Liu, X.; Xiang, W.; Zhang, Q. Novel short-time fractional Fourier transform: Theory, implementation, and applications. IEEE Trans. Signal Process. 2020, 68, 3280–3295. [Google Scholar] [CrossRef]
  8. Wang, Z.; Wang, L. Interpolation using the fast discrete sine transform. Signal Process. 1992, 26, 131–137. [Google Scholar] [CrossRef]
  9. Kim, M.; Lee, Y.L. Discrete sine transform-based interpolation filter for video compression. Symmetry 2017, 9, 257. [Google Scholar] [CrossRef]
  10. Cheng, S.N.C. Application of the sine-transform method in time-of-flight positron-emission image reconstruction algorithms. IEEE Trans. Biomed. Eng. 1985, BME-32, 185–192. [Google Scholar] [CrossRef]
  11. Thalmayer, A.; Zeising, S.; Fischer, G.; Kirchner, J. A Robust and Real-Time Capable Envelope-Based Algorithm for Heart Sound Classification: Validation under Different Physiological Conditions. Sensors 2020, 20, 972. [Google Scholar] [CrossRef]
  12. Hassan, M.; Osman, I.M. Facial Feature Extraction Based on Frequency Transforms: Comparative Study. In Proceedings of the International Arab Conference on Information Technology (ACIT), Hammamet, Tunisia, 15–16 October 2008. [Google Scholar]
  13. Sudeep, D.; Thepade Madhura, M.K. Video Classification using Sine, Cosine, and Walsh Transform with Bayes, Function, Lazy, Rule and Tree Data Mining Classifier. Int. J. Comput. Appl. 2015, 110, 18–23. [Google Scholar] [CrossRef]
  14. Thepade, S.; Das, R.; Ghosh, S. Feature Extraction with Ordered Mean Values for Content Based Image Classification. Adv. Comput. Eng. 2014, 15, 454876. [Google Scholar] [CrossRef]
  15. Chiper, D.F.; Cotorobai, L.-T. A New Approach for a Unified Architecture for Type IV DCT/DST with an Efficient Incorporation of Obfuscation Technique. Electronics 2021, 10, 1656. [Google Scholar] [CrossRef]
  16. Kidambi, S.S. Recursive implementation of the DCT-IV and DST-IV. In Proceedings of the IEEE Symposium on Advances in Digital Filtering and Signal Processing, Victoria, BC, Canada, 5–6 June 1998. [Google Scholar]
  17. Chiper, D.F.; Cracan, A. A novel algorithm and architecture for a high-throughput VLSI implementation of DST using short pseudo-cycle convolutions. In Proceedings of the International Symposium on Signals, Circuits and Systems, Iasi, Romania, 13–14 July 2017. [Google Scholar]
  18. Chiper, D.F.; Dobrea, D.M. A Novel Low-Complexity and Parallel Algorithm for DCT IV Transform and Its GPU Implementation. Appl. Sci. 2024, 14, 7491. [Google Scholar] [CrossRef]
  19. Chiper, D.F.; Cracan, A. An Area-Efficient Unified VLSI Architecture for Type IV DCT/DST Having an Efficient Hardware Security with Low Overheads. Electronics 2023, 12, 4471. [Google Scholar] [CrossRef]
  20. Poola, L.; Aparna, P. An efficient parallel-pipelined intra prediction architecture to support DCT/DST engine of HEVC encoder. J. Real-Time Image Proc. 2022, 19, 539–550. [Google Scholar]
  21. Polyakova, M.; Witenberg, A.; Cariov, A. The Fast Type-IV Discrete Sine Transform Algorithms for Short-Length Input Sequences. Bull. Pol. Acad. Sci. Tech. Sci. 2025, 73, 153827. [Google Scholar] [CrossRef]
  22. Britanak, V. The fast DCT-IV/DST-IV computation via the MDCT. Signal Process. 2003, 83, 1803–1813. [Google Scholar] [CrossRef]
  23. Madhukar, B.N.; Sanjay, J. A Duality Theorem for the Discrete Sine Transform-IV (DST-IV). In Proceedings of the 3rd International Conference on Advanced Computing and Communication Systems, Coimbatore, India, 22–23 January 2016. [Google Scholar]
  24. Murthy, N.R.; Swamy, M.N.S. On the on-line computation of DCT-IV and DST-IV transforms. IEEE Trans. Signal Process. 1995, 43, 1249–1251. [Google Scholar] [CrossRef]
  25. Chiang, H.C.; Liu, J.C. A regressive structure for on-line computation of arbitrary length DCT-IV and DST-IV transforms. IEEE Trans. Circ. Syst. Video Tech. 1996, 6, 692–695. [Google Scholar] [CrossRef]
  26. Shao, X.; Johnson, S.G. Type-IV DCT, DST, and MDCT Algorithms with Reduced Numbers of Arithmetic Operations. Signal Process. 2008, 88, 1313–1326. [Google Scholar] [CrossRef]
  27. Kober, V. Fast Hopping Discrete Sine Transform. IEEE Access 2021, 9, 94293–94298. [Google Scholar] [CrossRef]
  28. Murty, M.N.; Nayak, S.S.; Padhy, B.; Rao, B.J. Novel Systolic Architectures for Realization Type-IV Discrete Sine Transform Using Recursive Algorithm. IOSR J. Electron. Commun. Eng. 2020, 15, 42–46. [Google Scholar]
  29. Hnativ, L.O. Fast Integer Sine and Cosine Transforms Type IV of Low Complexity for Video Coding. Cybern. Syst. Anal. 2025, 61, 305–318. [Google Scholar] [CrossRef]
  30. Hnativ, L.O. Fast 16-Point Integer Sine and Cosine Transforms Type IV Low-Complexity for Video Coding. In Proceedings of the Applications of Digital Image Processing XLVII Conference, San Diego, CA, USA, 19–22 August 2024. [Google Scholar]
  31. Perera, S.M.; Lingsch, L.E. Sparse Matrix Based Low-Complexity, Recursive, and Radix-2 Algorithms for Discrete Sine Transforms. IEEE Access 2021, 9, 141181–141198. [Google Scholar] [CrossRef]
  32. Cobrnic, M.; Duspara, A.; Dragic, L.; Piljic, I.; Kovac, M. Highly parallel GPU accelerator for HEVC transform and quantization. In Proceedings of the International Conference on Image, Video Processing and Artificial Intelligence, Shanghai, China, 21–23 August 2020. [Google Scholar]
  33. Montero, P.; Gulías, V.M.; Taibo, J.; Rivas, S. Optimising lossless stages in a GPU-based MPEG encoder. Multimed. Tools. Appl. 2013, 65, 495–520. [Google Scholar] [CrossRef]
  34. Taylor, R.; Li, X. A Code Merging Optimization Technique for GPU. In Languages and Compilers for Parallel Computing. LCPC 2011. Lecture Notes in Computer Science; Rajopadhye, S., Mills Strout, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; Volume 7146, pp. 218–236. [Google Scholar]
  35. Alqudami, N.; Kim, S.D. OpenCL-based optimization methods for utilizing forward DCT and quantization of image compression on a heterogeneous platform. J. Real-Time Imag. Proc. 2016, 12, 219–235. [Google Scholar] [CrossRef]
  36. Harris, M. How to Implement Performance Metrics in CUDA C/C++, Nvidia Developer Technical Blog. Available online: https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/ (accessed on 20 August 2025).
  37. Cheng, J.; Grossman, M.; McKercher, T. Professional CUDA C Programming; John Wiley & Sons, Inc.: Indianapolis, IN, USA, 2014; pp. 273–275. [Google Scholar]
  38. Stokfiszewski, K.; Wieloch, K.; Yatsymirskyy, M. An efficient implementation of one-dimensional discrete wavelet transform algorithms for GPU architectures. J. Supercomput. 2022, 78, 11539–11563. [Google Scholar] [CrossRef]
  39. Keluskar, Y.C.; Singhaniya, N.G.; Vyawahare, V.A.; Jage, C.S.; Patil, P.; Espinosa-Paredes, G. Solution of nonlinear fractional-order models of nuclear reactor with parallel computing: Implementation on GPU platform. Ann. Nucl. Energy 2024, 195, 11013. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
