New Systolic Array Algorithms and VLSI Architectures for 1-D MDST

In this paper, we present two systolic array algorithms for efficient Very-Large-Scale Integration (VLSI) implementations of the 1-D Modified Discrete Sine Transform (MDST). The new algorithms decompose the computation of the MDST into modular and regular computational structures called pseudo-circular correlation and pseudo-cycle convolution. In both cases, the two computational structures obtained have the same form. This feature can be exploited to significantly reduce the hardware complexity, since the two computational structures can be computed on the same linear systolic array. Moreover, the second algorithm further reduces the hardware complexity by replacing the general multipliers of the first one with constant multipliers of significantly lower complexity. The resulting VLSI architectures have all the advantages of cycle convolution and circular correlation based systolic implementations, such as high speed through concurrency, an efficient use of VLSI technology due to a local and regular interconnection topology, and a low I/O cost. Moreover, in both architectures, an obfuscation technique can be applied cost-effectively, with low overheads.


Introduction
The informational age poses multiple technical challenges for data distribution, processing, and representation, such as bandwidth and network congestion, scalability, latency and real-time communication, data synchronization and consistency, content delivery optimization, and data security and privacy [1]. High data volumes, especially for multimedia content such as videos or high-resolution images, can strain network infrastructure and result in slow or unreliable data transfer. Processing and rendering digital media content can consume significant energy, leading to reduced device runtime, while resource-constrained devices may struggle with the processing power required to efficiently encode or decode high-resolution or high-bitrate video, audio, or imaging formats [2,3].
Addressing these technical challenges often requires a combination of efficient algorithms, network protocols, infrastructure optimization, security measures, and continuous monitoring and adaptation to evolving technologies and user requirements. Key research areas such as image/video compression, decoding, accurate real-time data transmission has been reduced significantly. Moreover, such dedicated architectures can be implemented using FPGA, making these hardware implementations almost as flexible as software routines.
Our systolic array works together with a low-cost, low-power host processor: the host processor handles input and output data management, while the hardware accelerator (the systolic array) performs the computationally intensive tasks.
In our approach, to obtain an efficient VLSI accelerator, it was necessary to restructure the basic form of the MDST algorithm in such a way that regular and modular computational structures are obtained, thus allowing an efficient VLSI implementation using systolic arrays.
Systolic arrays have been proved to allow efficient VLSI implementations, as shown in [15][16][17][18][19]. It has been demonstrated that they best satisfy the trade-off between area and execution time for some important discrete transforms, as shown in [20].
It has already been shown that the flow of data within an algorithm is very important from a VLSI implementation point of view, to the extent that, in certain cases, the communication complexity is even more important than the computational one. Thus, regular and modular computational structures can lead to efficient VLSI implementations using distributed arithmetic [21] or systolic arrays [22]. These architectures have several merits over others, especially their regular and local data flow with efficient input/output and data transfer operations, as in the case of systolic array architectures. On this basis, we have obtained efficient VLSI implementations of certain digital signal processing (DSP) algorithms that use cyclic convolutions or circular correlations [23][24][25][26][27], and these have been extended to other regular and modular computational structures, such as skew-circular and pseudo-circular correlations or band-correlations [28][29][30].
In this paper, we propose two new systolic arrays for the 1D MDST based on such regular and modular computational structures. One is based on pseudo-circular correlations and the other on pseudo-cycle convolutions. In both cases, the two structures have the same form and length, which allows an efficient VLSI implementation of our hardware accelerator for the computation of the MDST: they can be computed using a single, appropriately operated linear systolic array. For the first solution, the two structures are computed in an interleaved manner, and for the second one, they are computed one after the other, leading to a reduced hardware complexity while maintaining the high-speed performance specific to systolic array implementations. The obtained VLSI architecture has all the advantages of VLSI implementations based on cyclic convolution and circular correlation structures, such as high speed due to pipelining and parallelism, efficiency due to local interconnections, and a low I/O cost. Moreover, we will show that the proposed VLSI algorithm and architecture allow hardware security techniques to be incorporated efficiently, with low overheads.
Sensors 2023, 23, 6220
The rest of the paper is organized as follows: Section 2 presents the original systolic array algorithm for 1D MDST with a low computational complexity using regular and modular computational structures that is well adapted for an efficient VLSI implementation, as presented at the International Symposium on Electronics and Telecommunications ISETC 2022 [31], and a proposed improved version of the algorithm that allows an implementation with increased performance. Section 3 presents the proposed systolic array architecture that allows a more efficient implementation of the VLSI algorithm with a significant reduction of the hardware complexity, and which allows a more efficient incorporation of the obfuscation technique. Section 4 presents a discussion of the obtained results. In Section 5, we present the conclusions and some directions for future work.

A Systolic Array Algorithm for the Computation of 1D MDST [31]
The 1-D MDST is defined by Equation (1) for k = 0, ..., M − 1, where M = N/2 and the elementary angle is α = π/(2M). As shown in [31], to reformulate the basic form of the algorithm given by Equation (1), we introduce the restructuring input sequences defined below.
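A direct evaluation is useful as a numerical reference when checking a restructured algorithm. The sketch below assumes the widely used textbook form of the MDST, Y(k) = sum over i = 0, ..., N − 1 of x(i) sin[(2i + 1 + N/2)(2k + 1)π/(2N)]; the angle convention and scaling of Equation (1) may differ, and the function name `mdst_direct` is hypothetical.

```python
import math

def mdst_direct(x):
    """Direct O(N*M) evaluation of a textbook MDST (assumed form, see lead-in)."""
    n = len(x)
    if n % 2 != 0:
        raise ValueError("input length N must be even so that M = N/2 is an integer")
    m = n // 2
    out = []
    for k in range(m):
        acc = 0.0
        for i in range(n):
            # Kernel argument: (2i + 1 + N/2)(2k + 1) * pi / (2N)
            angle = (2 * i + 1 + n / 2) * (2 * k + 1) * math.pi / (2 * n)
            acc += x[i] * math.sin(angle)
        out.append(acc)
    return out
```

An input of N samples yields M = N/2 output coefficients, so this routine can serve as a slow but simple check for any faster formulation.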
First, we define auxiliary input sequences in Equations (2) and (3). Using these sequences, we define the additional auxiliary input sequences x_C(i) in Equations (4) and (5) and, finally, the auxiliary input sequences x_a(i) and x_b(i) in Equations (6)-(9), for i = 1, ..., M − 1.
The matrices in Equations (10) and (11) have a particular structure: all the elements along the secondary diagonal of the matrix, or along any line parallel to it, are the same except for the sign. This structure is called a pseudo-circular correlation. It has an important advantage from a VLSI implementation point of view, as it can be efficiently implemented using the systolic array architectural paradigm, which is known to be well suited to efficient VLSI implementation.
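The structure described above can be illustrated with a small sketch. It assumes a kernel `h` of length L and a hypothetical sign table `signs`; the entry in row k and column i is ±h[(k + i) mod L], so every secondary diagonal (taken with wrap-around) carries a single magnitude. The sign pattern and the numeric values are arbitrary and are not those of Equations (10) and (11).

```python
def pseudo_circular_correlation(h, signs, x):
    """Compute y[k] = sum_i signs[k][i] * h[(k + i) % L] * x[i]."""
    L = len(h)
    return [
        sum(signs[k][i] * h[(k + i) % L] * x[i] for i in range(L))
        for k in range(L)
    ]

def antidiagonals_share_magnitude(matrix):
    """Check that |matrix[k][i]| depends only on (k + i) mod L."""
    L = len(matrix)
    seen = {}
    for k in range(L):
        for i in range(L):
            key = (k + i) % L
            if seen.setdefault(key, abs(matrix[k][i])) != abs(matrix[k][i]):
                return False
    return True

h = [3, 5, 7]                                    # illustrative kernel values
signs = [[1, -1, 1], [1, 1, -1], [-1, 1, 1]]     # arbitrary illustrative signs
matrix = [[signs[k][i] * h[(k + i) % 3] for i in range(3)] for k in range(3)]
y = pseudo_circular_correlation(h, signs, [1, 2, 3])
```

Because only the signs differ from an ordinary circular correlation, the same regular data flow can be reused, with the sign handled separately in each multiply-accumulate step.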
The output sequence can be recursively computed using Equations (12) and (13) for k = 1, ..., M − 1, where T(k) are additional auxiliary output sequences computed using Equations (14) and (15), and the auxiliary output sequences Y_a and Y_b are defined in Equations (16) and (17).

An Improvement of the Proposed Algorithm for the Computation of 1D MDST
To reformulate the basic form of the algorithm given by Equation (1), we have used the sequences defined in Equations (2)-(5) and introduced auxiliary input sequences, modified with respect to those in Equations (6)-(9), in order to obtain the desired matrix-vector products. These modified sequences are defined in Equations (18)-(21), for i = 1, ..., M − 1.
Using these auxiliary input sequences and appropriate permutations of the indices, we can reformulate the computation of the MDST as two pseudo-cycle convolutions, as shown in Equations (22) and (23).
The cosine coefficients that appear in these equations are cos 16α, cos 32α, cos 40α, cos 24α, cos 48α, and cos 8α. The matrices in Equations (22) and (23) have been constructed such that they can be efficiently implemented using the systolic array architectural paradigm. The matrix elements are arranged such that the elements along the main diagonal, and along each line parallel to it, are equal in absolute value; the pseudo-cycle convolution computational structure can therefore be used to realize the operations in Equations (22) and (23). As previously shown [31], the pseudo-cycle convolution structure is suitable for an efficient VLSI realization.
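The order of the cosine arguments listed above (16α, 32α, 40α, 24α, 48α, 8α, that is, 8α times 2, 4, 5, 3, 6, 1) coincides with the successive powers of 2 modulo 13, folded into the range 1 to 6. The sketch below reproduces that sequence and shows how such an index permutation gives a matrix in which each line parallel to the main diagonal (with wrap-around) holds a single index, and hence a single magnitude. The modulus 13 and the generator 2 are inferred from this coincidence and are assumptions, not values stated in the text; the helper names are hypothetical.

```python
def folded_power(g, e, p):
    """Return g**e mod p, folded into 1..(p-1)/2.

    For angles of the form 2*pi*m/p, the indices m and p - m give the same
    cosine and sines of opposite sign, so they share one magnitude.
    """
    m = pow(g, e, p)
    return min(m, p - m)

P, G = 13, 2              # assumed prime modulus and generator (see lead-in)
L = (P - 1) // 2          # six distinct magnitudes

# Successive folded powers of the generator.
sequence = [folded_power(G, e, P) for e in range(1, L + 1)]

# Index matrix of a cyclic-convolution-like structure: entry (k, i) depends
# only on (k - i) mod L, so each wrapped diagonal holds one folded index.
index_matrix = [
    [folded_power(G, (k - i) % L, P) for i in range(L)]
    for k in range(L)
]
```

Any sign changes introduced by the folding are what would turn an ordinary cyclic convolution into a "pseudo" one; the sketch tracks magnitudes only.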
The output sequence can be recursively computed using Equations (24) and (25) for k = 1, ..., M − 1, where T(k) is computed using Equations (26) and (27), and Y_a and Y_b are defined in Equations (28) and (29).

Systolic Array Architectures for 1D MDST
The VLSI Architecture for the Algorithm of Section 2.1
As shown in [31], the VLSI architecture can be obtained by mapping Equation (10) onto a linear systolic array using the design procedure proposed in [28] and the tag control mechanism [32]. The same systolic array is obtained by mapping Equation (11), so a single systolic array can be used to compute both equations in an interleaved manner.
The proposed hardware accelerator operates alongside a low-cost and low-power host processor. The host processor is used for input and output data management, while a hardware accelerator using the systolic array can implement the computationally intensive tasks.
Figure 1 presents the hardware core of the VLSI architecture that implements Equation (10). The hardware core consists of a linear systolic array with six processing elements (PEs). The post-processing stage consists of six constant multipliers and six adders/subtracters and implements Equations (16) and (17). The computation of the input sequences in Equations (2)-(9) and of the output sequences in Equations (12)-(15) is executed on the host processor.
The function of the processing elements (PEs) of the systolic array presented in Figure 1 is shown in Figure 2. As explained in [31] and shown in Figure 1, the input sequence x_e is progressively loaded along the processing chain from right to left, starting with the processing element PE_0 and ending with the last processing element, PE_5. The t_s sequence, also known as the tag control sequence, defines the moments at which the input values are sampled and stored in each processing element's internal registers x′_i and x_i, which are subsequently used in the computations. Along the y path of the systolic array, the partial result forwarded from stage to stage accumulates the terms of the dot products that compose the matrix-vector products of Equations (10) and (11) for the vectors T_a and T_b, respectively. The rows of the matrix-vector products are computed in an interleaved manner, based on the state of the t_i input.
Due to the unique characteristics of the utilized computational structure, it becomes feasible to efficiently integrate the obfuscation hardware security technique using methods similar to the ones described in [30].
As argued in [31], this solution retains the advantages that modular and regular computational structures such as cycle convolution and circular correlation bring to a VLSI implementation, namely regularity, modularity, and local interconnections, together with the high throughput that systolic arrays achieve through pipelining and parallelism. As will be seen in the next section, it is possible to further reduce the hardware complexity without affecting the other advantages of the presented solution.

The VLSI Architecture for the New Algorithm of Section 2.2
Using the same design method as in Section 3.1, we have obtained the systolic array from Figure 3, which can be used to compute both Equations (22) and (23). Because the same linear systolic array computes both equations, simply by replacing the input sequence x_a(i, j) with x_b(i, j), a significant reduction of the hardware complexity is achieved.
Figure 3. Systolic array that implements Equation (22) and also Equation (23), with the input sequence x_b(i, j) instead of x_a(i, j).
Figure 4 presents the function of the processing elements of the systolic array from Figure 3. All the processing elements have the same functionality, which is an important advantage from a VLSI implementation point of view. As can be seen from Figure 4, each processing element contains a multiplier, an adder/subtracter, and a MUX controlled by a tag control bit, denoted as sign, that selects the correct sign of the operation. One operand of each multiplier is a constant, which allows a significant reduction in the hardware complexity. Compared to the processing element presented in [33], where integer constants are used for the multipliers, here fixed-point approximate representations of the cosine coefficients are used for the low-complexity multipliers of the processing elements.
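A cycle-by-cycle software model can make this data flow concrete. The sketch below is a generic linear array for a signed cyclic convolution, not the authors' exact design: each processing element holds one constant coefficient, delays the input stream by two registers, and adds or subtracts its product to the partial sum passing through, with the choice made by a sign tag. The function names, the two-register timing of the input stream, the sign table, and the numeric values are assumptions made for illustration.

```python
def pe_step(y_in, x_in, coeff, sign):
    """One processing element: constant multiply, then add or subtract (sign mux)."""
    product = coeff * x_in
    return y_in + product if sign >= 0 else y_in - product

def systolic_pseudo_cyclic_convolution(coeffs, signs, x):
    """Simulate y[k] = sum_j signs[k][j] * coeffs[j] * x[(k - j) % L] on L PEs."""
    L = len(coeffs)
    x1 = [0] * L          # first x delay register of each PE
    x2 = [0] * L          # second x delay register (feeds the next PE)
    y = [0] * L           # partial-sum register at the output of each PE
    result = [0] * L
    for t in range(3 * L - 1):
        # The value leaving the last PE belongs to the sum started at cycle t - L.
        start = t - L
        if start >= L - 1:            # earlier sums saw an unfilled pipeline
            result[start % L] = y[L - 1]
        # Inputs seen by each PE in this cycle; x is streamed in periodically.
        x_in = [x[t % L]] + x2[:-1]
        y_in = [0] + y[:-1]
        # Clock edge: PE j works on output index (t - j) mod L.
        y = [
            pe_step(y_in[j], x_in[j], coeffs[j], signs[(t - j) % L][j])
            for j in range(L)
        ]
        x2 = x1
        x1 = x_in
    return result

coeffs = [2, 3, 5]                                # hypothetical constant coefficients
signs = [[1, -1, 1], [1, 1, -1], [-1, 1, 1]]      # hypothetical sign tags
x = [1, 4, 6]
y = systolic_pseudo_cyclic_convolution(coeffs, signs, x)
```

Each PE communicates only with its neighbours, and a second input sequence can be processed on the same array simply by streaming it in place of `x`.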
In addition to the hardware core consisting of the systolic array from Figure 3, we use a pre-processing and a post-processing stage. The pre-processing stage computes the auxiliary input sequences x_C(i) using Equations (4) and (5), and x_a(i) and x_b(i) using Equations (18)-(21). As our systolic array is used as a hardware accelerator that works together with a host processor, Equations (2) and (3) are computed in the host processor.
The post-processing stage is used to compute the auxiliary output sequences Y_a(k) and Y_b(k) using Equations (28) and (29) and T(k) using Equations (26) and (27). All the multipliers in Equations (28) and (29) have one constant operand and have been implemented with additions/subtractions only. The auxiliary output sequence T(k) is sent back to the host, where the output sequence Y(k) is computed using Equations (24) and (25).
We have synthesized the improved VLSI architecture from Section 3.2 using Cadence Genus 21.14 with the Nangate OpenCell Library and North Carolina State University's 15 nm FreePDK15. Table 1 summarizes the synthesis results in terms of area, power, and delay for that VLSI implementation. With a minimum constrained clock period, the synthesis tool is able to find a solution at a clock frequency of 7.7 GHz, for a delay on the critical path of 130 ps. The area is low, 950 µm², and increases slowly with the clock frequency; the power is 1.25 mW at 100 MHz and increases linearly with the frequency.
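As a quick consistency check on these figures, the maximum clock frequency is the reciprocal of the critical-path delay (setup margins and clock uncertainty are ignored in this back-of-the-envelope relation):

```latex
f_{\max} = \frac{1}{T_{\text{crit}}}
         = \frac{1}{130\ \text{ps}}
         = \frac{1}{130 \times 10^{-12}\ \text{s}}
         \approx 7.69 \times 10^{9}\ \text{Hz}
         \approx 7.7\ \text{GHz}
```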

Discussion
The two VLSI architectures presented in this paper are the first systolic array architectures proposed so far for the 1-D MDST, even though the use of systolic arrays in VLSI implementations offers certain advantages, as can also be seen from this paper.
First of all, we have obtained two new systolic array algorithms for the 1-D MDST that have a low hardware complexity and power consumption and allow an efficient VLSI implementation. Besides the low hardware complexity offered by the systolic array architectural paradigm, the systolic arrays achieve high-speed performance thanks to the low delay on their critical path. Furthermore, the proposed systolic array-based architecture enables an efficient integration of the obfuscation technique with minimal overheads. The overhead incurred by incorporating the obfuscation technique consists of six four-way, one-bit-wide multiplexers, which translates into an area overhead of under 1% of the total chip area. Moreover, the impact on the speed of the MDST core is negligible, as the multiplexers are not placed on the critical data path of the systolic arrays.
For the proposed systolic array algorithms, we have obtained two new VLSI architectures, one for each algorithm. Each systolic array contains only six processing elements and allows the two computational structures (pseudo-circular correlations and pseudo-cycle convolutions, respectively) to be computed on a single linear systolic array, resulting in a significant reduction of the hardware complexity. The second VLSI architecture, developed in Section 3.2, allows a further significant reduction of the hardware complexity, and implicitly of the power consumption, by replacing the general multipliers with multipliers in which one operand is a constant. Each constant multiplier can be implemented using only adders and shift operations, and the shifts do not incur any hardware cost. Besides the significant reduction in hardware complexity, the speed performance is also increased: because only adders/subtracters and shift operations are used, the delay on the critical path is only 3T_a, where T_a is the delay of one adder. As can be seen from Table 2, implementing each constant multiplier requires only three adders and shift operations, with a single exception that requires four adders.
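How a constant multiplier reduces to adders and hardwired shifts can be sketched with a canonical signed-digit (CSD) recoding, one common technique that is not necessarily the one used to obtain Table 2. The constant 227 (about 0.887 with 8 fractional bits) and the word length are assumptions chosen for illustration, and the function names are hypothetical.

```python
def csd_digits(n):
    """Canonical signed-digit recoding of a positive integer.

    Returns digits in {-1, 0, 1}, least significant first.
    """
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)      # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def multiply_by_constant(x, digits):
    """Multiply x by the recoded constant using only shifts, adds and subtracts."""
    acc = 0
    for shift, d in enumerate(digits):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc

def adder_count(digits):
    """Adders/subtracters needed: one fewer than the number of nonzero digits."""
    return sum(1 for d in digits if d != 0) - 1

COEFF_Q8 = 227                      # hypothetical coefficient, about 0.887 * 2**8
digits = csd_digits(COEFF_Q8)       # 227 = 256 - 32 + 4 - 1
```

For this constant the recoding has four nonzero digits, so three adders/subtracters suffice; the shifts are fixed wiring and need no logic.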
As a benefit of using the pipelining mechanism and a short critical path of only 3T a stemming from the simple adder-only implementations of the constant multipliers, the proposed VLSI architecture offers high-speed performances, while maintaining a reduced hardware cost due to the low complexity of the multipliers. Furthermore, the described solution can accommodate with low overheads an effective integration of the obfuscation technique by using only six MUXs while maintaining the speed performances.
Additionally, both proposed solutions share the VLSI implementation benefits offered by cycle convolution and circular correlation topologies due to the regular and modular nature of these architectures, resulting in an efficient VLSI implementation while maintaining a low I/O cost.

Conclusions and Future Works
In this paper, an improvement of a previously reported systolic array algorithm for efficient VLSI implementations of the 1-D Modified Discrete Sine Transform (MDST) has been presented. Using the systolic array architectural paradigm and the proposed systolic array algorithms, low-complexity VLSI implementations of the 1D MDST have been obtained. The new algorithms decompose the computation of the MDST into modular and regular computational structures, called pseudo-circular correlations and pseudo-cycle convolutions, that lead to efficient VLSI implementations. The second proposed algorithm further reduces the hardware complexity by replacing the general multipliers of the first one with constant multipliers of significantly lower complexity. The resulting VLSI architecture can be used to obtain a low hardware complexity implementation with significantly higher speed performance, showing that the systolic array architectural paradigm can be used to overcome the area-speed-power tradeoff. Moreover, in both architectures, a hardware security technique can be applied cost-effectively.
One direction for future work is to use the systolic array architectural paradigm to obtain VLSI implementations of other discrete transforms with a low hardware complexity while maintaining high-speed performance.
Another direction is to use the systolic array architecture to efficiently incorporate hardware security techniques, particularly the obfuscation technique, into other discrete transforms.