Hardware Design and Implementation of a Lightweight Saber Algorithm Based on DRC Method

: With the development of quantum computers, the security of classical cryptosystems is seriously threatened, and the Saber algorithm has become one of the potential candidates for post-quantum cryptosystems (PQCs). To address the problems of long delay and the high power consumption of Saber algorithm hardware implementation, a lightweight Saber algorithm hardware design scheme based on the joint optimization of data readout and clock (DRC) was proposed. Firstly, an analysis was carried out on the hardware architecture, timing overhead and power consumption distribution of the Saber algorithm, and the key circuits that limit the performance of the algorithm were identiﬁed; secondly, a dual-port SRAM parallel reading method was adopted to improve the data reading efﬁciency and reduce the timing overhead of double data reading in the multiplier module. Then, a clock gating technology was used to reduce the dynamic ﬂipping probability of internal registers and reduce the hardware power consumption of the Saber algorithm; ﬁnally, data reading and clock gating were jointly optimized to design a high-speed and low-power Saber algorithm hardware IP core. Lightweight IP cores were integrated into RISC-V SoC systems via APB bus in a TSMC 65 nm process to complete the digital back-end design. The experimental results show an IP core area of 0.99 mm 2 and power consumption of 8.49 mW, which is 33% lower than that reported in the related literature. Under 72 MHz & 1 V operating conditions, the number of clock cycles for the Saber algorithm’s key generation, encryption and decryption are 3315, 9204 and 1420, respectively.


Introduction
Quantum computing breaks through existing computing architectures and uses the principle of state superposition to make quantum computers a research hotspot. For example, in 2019, Google developed a superconducting quantum chip [1] circuit that samples a billion times faster than conventional computers; in 2020, the University of Science and Technology of China built a photonic quantum computing prototype, "Jiuzhang" [2], which is trillions of times faster than existing supercomputers; in 2022, the Institute for Molecular Science in Japan announced that its quantum research has produced the world's highest speed in the quantum computer "2-bit quantum gate". With the development of quantum computers, the number of quantum bits has increased dramatically, and the security of cryptosystems based on mathematical puzzles such as classical large integer decomposition and discrete logarithms has been greatly challenged. In classical computer architecture, the underlying mathematical problems on which traditional cryptography relies are extremely difficult to solve in an efficient time. However, in the face of quantum computers, mathematical difficulties based on public-key cryptosystems will no longer be secure [3], so the information security systems and various applications built by relying on cryptosystems will face serious security problems, and there is an urgent need for

Symbolic Representation
This section introduces some preparatory knowledge, including basic definitions and symbols. We use bold lowercase letters to represent vectors, such as a, and use bold uppercase letters to represent matrices, such as A. For a polynomial f, we write f i for the ith coefficient of x i . p and q are integer powers of 2, p = 2 ε p and q = 2 ε q . We use Z q to denote the ring of integers modulo integer q with representants in [0, q), and for an integer z, z mod q denotes the mode approximate reduction of z in the interval [0, q). R q = Z q [x]/(x n + 1) denotes the ring of integer polynomials modulo (x n + 1), where n is a power of 2 and each coefficient of the ring is in [0, q). For any ring R, R L × K denotes the ring on the matrix L × K. β µ denotes the central binomial distribution, with µ being an even number and a distribution ranging from [−µ/2, µ/2]. The l-value represents the matrix dimension in the Saber algorithm, and the value of l in Saber is 3. The l-value in LightSaber is 2 and in FireSaber is 4, where l = 2 means lightweight encryption, l = 3 means standard security level encryption, and l = 4 means high-security level encryption.

Saber Algorithm Process
The Saber algorithm is a lattice-based encryption and decryption mechanism that resists selective ciphertext attacks. Lattice-based cryptographic algorithms can be constructed from two different problems, namely, the lattice-based shortest vector problem and the lattice-based nearest vector problem. The learning with error (LWE) problem is based on the nearest vector problem on a general lattice, and the learning with rounding (LWR) problem can be reduced to the shortest vector problem on a lattice. In contrast to the LWE problem, the noise of the LWR is generated deterministically, avoiding the randomness attached to the noise. Mod-LWE replaces a single ring element with modulo element on the same ring, and is an improved version of LWE. Similarly, Mod-LWR is an improved version of Mod-LWE with smaller ciphertext bandwidth compared to Mod-LWE. The security of Saber algorithm depends on Mod-LWR and has the advantage of being lightweight. The Saber algorithm implementation includes public key encryption (PKE) and the key encapsulation mechanism (KEM). PKE is an IND-CPA secure encryption scheme, and KEM is an IND-CCA secure key encapsulation mechanism, with both implementations consisting of key generation, encryption and decryption algorithms.
The Saber algorithm PKE key generation pseudo-code is shown in Algorithm 1. PKE key generation starts by reading seed A , which is a key seed signal, through a register, and using the random number expansion module SHAKE128 function to process the seed signal for randomness. The random number matrix A is generated by the gen function based on the seed signal seed A . The matrix A is an l × l matrix and the key s is an l × 1 matrix, where the individual matrix elements are polynomials of order 256. In the hardware implementation of the Saber algorithm, the l value is 3, and the individual elements of s take the value interval [−4: 4]. In Algorithm 1, the random number matrix is transposed and multiplied by the key matrix. The matrix transposition operation, vector-to-string operation, and public key splicing operation can be carried out in hardware by changing the read address bits, which improves the efficiency of the algorithm compared to the software implementation of these operations. The secret key generation phase of PKE generates public key pk and private key sk, matrix b and seed A are stitched to obtain public key pk, and matrix S is assigned to sk after vector-to-string operation.
PKE encryption is shown in Algorithm 2. Firstly, a random number matrix A is generated by the gen function, and a new key matrix s' is generated in the encryption algorithm. The product of matrix A and s' is rounded and shifted to produce b as the [0: 512] bits of the ciphertext, and the [513: 1024] bits of the ciphertext c m are generated by the shift algorithm module and the encryption module.
PKE decryption is shown in Algorithm 3, the received encrypted data are first modulo approximately subtracted and then transposed, and the value after the operation enters the product module. The decryption data are given after the shift operation to realize the operation of extracting the plaintext from the ciphertext.
The above is the IND-CPA encryption process, and the CPA security component can be constructed using the Fujisaki-Okamoto transformation for the IND-CCA security KEM. IND-CCA contains three steps: key generation, encryption and decryption. The algorithm achieves encryption and decryption by invoking the key generation, encryption and decryption operations in IND-CPA. The difference between the KEM algorithm and the PKE encryption algorithm is that the former returns a randomly generated string when the public key decapsulation fails, making the algorithm more secure. The KEM key generation calls the PKE key generation process to obtain the public and private keys through numerical stitching and assignment, which is a non-clock consuming operation in the hardware implementation, and is implemented directly through address bit calls.
KEM key generation, KEM encapsulation and decapsulation are shown in Algorithms 4-6. The plaintext m and the public key pk are randomized in KEM encryption by the SHA3-256 function in the random number expansion module to obtain the processed m and HASH-pk, and kr' is obtained by the data bit splicing operation. The kr' is processed by the SHA3-256 function to obtain the upper half of the session key returned by the KEM encryption operation, and the lower half of the session key returned by the KEM operation is the plaintext generated in the PKE.
The ciphertext is obtained by calling the PKE decryption algorithm in KEM decapsulation, and a verification module is added to the algorithm to verify the correctness of the plaintext. If the public key decapsulation fails, a randomly generated string will be returned to improve the resistance to attacks.

Hardware Implementation of Saber Algorithm
The previous section introduced the Saber algorithm key generation and encryption/decryption algorithm. The algorithm computation module consists of a hash function, a polynomial product function, and a shift function. Since p and q are powers of 2 in the Saber algorithm, the hardware implementation does not require modulo reduction operations, and the Keccak core gas pedal is designed to be low-power and fast in hardware. The algorithm hardware IP core implementation is described as follows, and the design of each module of Saber algorithm is specified as follows.

Saber Algorithm Hardware Architecture
The overall hardware IP architecture of the Saber algorithm is shown in Figure 1. The hardware IP architecture of the Saber algorithm encompasses several key components, including a random number expansion module, a binomial sampling module, a multiplier module, a shift module, a string conversion module, an encryption module, a decryption module, a signal control module, a communication control module, and a memory module. The signal control unit module ensures a uniform distribution of input and output data flow for each module. It connects the enable signal, data signal, and address signal of each module, while intermediate data generated during the algorithm execution are stored in the memory module. When integrating the IP through the APB bus, the external port includes various signals, such as the reset signal, enable signal, write data signal, read data signal, write enable signal, address signal, clock signal, chip select signal, and off-chip output signal. The off-chip output signal is specifically designed to output the encrypted ciphertext. According to the Saber algorithm flow, the hardware data flow state machine is designed.
The hardware IP structure of the low-power Saber algorithm is shown in Figure 2. First, a memory module composed of eight SRAMs is used, and the external circuits include a multiplier module, a sampling module, a SHA-3 random number expansion module, an encryption module, a decryption module, a multiplier module, and a string conversion module. The read/write signals and input/output data of each module are connected to the register control module. According to the algorithm principle, Seed is used as the input signal to the IP core, then the data are processed by each module, and the intermediate ule. The signal control unit module ensures a uniform distribution of input and outpu data flow for each module. It connects the enable signal, data signal, and address signa of each module, while intermediate data generated during the algorithm execution ar stored in the memory module. When integrating the IP through the APB bus, the externa port includes various signals, such as the reset signal, enable signal, write data signal, rea data signal, write enable signal, address signal, clock signal, chip select signal, and off chip output signal. The off-chip output signal is specifically designed to output the en crypted ciphertext. According to the Saber algorithm flow, the hardware data flow stat machine is designed. The hardware IP structure of the low-power Saber algorithm is shown in Figure 2 First, a memory module composed of eight SRAMs is used, and the external circuits in clude a multiplier module, a sampling module, a SHA-3 random number expansion mod ule, an encryption module, a decryption module, a multiplier module, and a string con version module. The read/write signals and input/output data of each module are con nected to the register control module. According to the algorithm principle, Seed is use as the input signal to the IP core, then the data are processed by each module, and th intermediate data are stored in the register module. The algorithm is created based on th data flow of the algorithm and the input/output port control of the memory module.

Single-Cycle Double Data Reading and Storage Module Design
The data storage module is the intermediate value reading/writing unit, which is the hardware core module of the IP core. The algorithm involves performing multiple matrix

Single-Cycle Double Data Reading and Storage Module Design
The data storage module is the intermediate value reading/writing unit, which is the hardware core module of the IP core. The algorithm involves performing multiple matrix multiplications with large random number matrices, such as matrix A T *s in Algorithm 1 and A*s in Algorithm 2. Consequently, the storage module of the algorithm's IP core incurs significant overhead in terms of read clock cycles during the data reading process. The conventional memory module is a single-port memory module which has the problem of excessive timing overhead, incurred by reading a set of data per clock. This paper proposes a dual-port memory module design; by using a dual-port SRAM memory module, two sets of coefficients can be read out in one clock cycle, which results in a two-fold increase in overall performance at the cost of a small area. In practice, the storage and recall of data is controlled by the write address and the read address. The data are processed by different modules, such as multiplier module, and random number expansion to complete the data processing in three modes of key generation, encryption and decryption. Data reading and writing is performed in memory through 64-bit transfer operations. This data selection method facilitates integration with the processor (32-bit or 64-bit).
The memory structure is shown in Figure 3. The whole memory module is divided into four SRAM A and four SRAM B , and the size of an individual SRAM is 1024 × 16 bits. Each SRAM A single clock output has 16 bits, the four SRAMA output values are stitched together, and the overall output is a single clock output of 64 bits. SRAM B operates on the same principle.  The plaintext of the encrypted data is generated by the random number expansion module and entered into a predetermined unit of the memory module. The encrypted and decrypted data will be input to the fixed storage unit of the memory module. For the IP core, the top-level encryption and decryption can be used as the encryption IP or decryption IP separately, and both share the random number expansion module, binomial sampling module, polynomial multiplication module, and string steering module, which can effectively reduce the hardware overhead of the Saber algorithm IP core.  The plaintext of the encrypted data is generated by the random number expansion module and entered into a predetermined unit of the memory module. The encrypted and decrypted data will be input to the fixed storage unit of the memory module. For the IP core, the top-level encryption and decryption can be used as the encryption IP or decryption IP separately, and both share the random number expansion module, binomial sampling module, polynomial multiplication module, and string steering module, which can effectively reduce the hardware overhead of the Saber algorithm IP core.

High-Speed Random Number Expansion Module Circuit Design
The implementation of the Saber algorithm involves SHAKE-128, SHA3-256 and SHA3-512 functions, and a random number function selector is used in the random number module for different computational function selections. The random number generation hardware circuit structure is shown in Figure 4. The random number module is controlled by the data control module, and the control port occupies 6 data bits of control signal, which is the total control signal of Saber IP. These 6 bits of data are used to control the random number generation function type and data length, respectively, where the control signal is customized by the initial module. In the random number module, the Keccak core acceleration module is instantiated with 1600 data bits.  The multiplier module mainly implements the multiplication of matrix A and key matrix S. In the Saber algorithm, the l-value is 3, so matrix A is a 3 × 3 matrix, and matrix S is a 3 × 1 matrix. The mode selection side was added to the product module to support polynomial coefficients of both 13-bit and 10-bit bit widths, as defined by the Saber algorithm. The dual-port SRAM can read multipliers simultaneously in a single cycle, which can further increase the multiplication speed and complete the IP core design for the highspeed Saber algorithm.

Lightweight Encryption and Decryption Hardware Module Design
In algorithm encryption and decryption operations, a mode control signal was added to the IP top layer to control the IP core for encryption and decryption by assigning values to the mode console and the enable side. The IP performs key generation operation when the mode console is 01, encryption operation when the value is 10, and decryption operation when it is 11. The processing data was fed to different units to complete the multiplexing of multiplication modules, random numbers, and other modules in the encryption and decryption process, thus reducing the hardware area of the IP core. The intermediate data will be stored in SRAM after being processed by different modules, and the bit splicing operation will be completed by changing the initial bit of the read address. The core acceleration module has five steps: θ, ρ, ∏, χ, ι. The input and output arrays of the module are states A and B, with a width of 1600 bits, respectively. The states can be represented by a three-dimensional array containing 25 words, each with a length of 64 bits. Words can constitute a cube with x, y and z coordinates. Each bit of the cube can be addressed by [x, y, z]. The states are specified as follows: the state of a word is called a "Lane", the two-dimensional part with a concrete coordinate having a fixed z-value state is called a "Slice", and all "Lanes" with the same x-coordinates are called "Sheets" [19].
Step θ: This step first calculates the state The multiplier module mainly implements the multiplication of matrix A and key matrix S. In the Saber algorithm, the l-value is 3, so matrix A is a 3 × 3 matrix, and matrix S is a 3 × 1 matrix. The mode selection side was added to the product module to support polynomial coefficients of both 13-bit and 10-bit bit widths, as defined by the Saber algorithm. The dual-port SRAM can read multipliers simultaneously in a single cycle, which can further increase the multiplication speed and complete the IP core design for the high-speed Saber algorithm.

Lightweight Encryption and Decryption Hardware Module Design
In algorithm encryption and decryption operations, a mode control signal was added to the IP top layer to control the IP core for encryption and decryption by assigning values to the mode console and the enable side. The IP performs key generation operation when the mode console is 01, encryption operation when the value is 10, and decryption operation when it is 11. The processing data was fed to different units to complete the multiplexing of multiplication modules, random numbers, and other modules in the encryption and decryption process, thus reducing the hardware area of the IP core. The intermediate data will be stored in SRAM after being processed by different modules, and the bit splicing operation will be completed by changing the initial bit of the read address.
The output timing of the encrypted IP core is shown in Figure 5. During the encryption of the IP core, the off-chip port provides the ciphertext data, and the mode value is at this point 01. When the encryption module enable signal jumps from high to low, it indicates that the encryption is complete. When the encryption module enable signal jumps from high to low, it indicates that the encryption is complete. The write enable signal jumps from low to high and begins to read the ciphertext data. The output cipher text of the encryption module is 16 × 64 bits, with 64 bits of data being output in each clock, and after 16 clocks, the off_chip_done signal jumps to a high level, at which point the cipher text is all output.

Processor-Level Low-Power Implementation of the Saber Algorithm IP Core
The hardware IP core of the Saber algorithm is integrated into the System on Chip (SoC) system via the APB bus, and the clock, reset and enable signals of the IP core are given by the SoC, as shown in Figure 6. The design uses the Hummingbird e203 processor. The reduction of dynamic flip power consumption is particularly important in IP core integration. This integration includes a clock gating circuit that connects to the bus, which reduces the dynamic flip power consumption of the entire IP core. The clock gating circuit uses clock and reset hosting signals to control the signal flipping of the Saber algorithm IP core. In the gated clock output timing, the system clock and reset signals are fed into the Saber algorithm IP core when the clock and reset signal register signals jump high, inserting a simultaneous global clock gating in the back-end design.

Processor-Level Low-Power Implementation of the Saber Algorithm IP Core
The hardware IP core of the Saber algorithm is integrated into the System on Chip (SoC) system via the APB bus, and the clock, reset and enable signals of the IP core are given by the SoC, as shown in Figure 6. The design uses the Hummingbird e203 processor. The reduction of dynamic flip power consumption is particularly important in IP core integration. This integration includes a clock gating circuit that connects to the bus, which reduces the dynamic flip power consumption of the entire IP core. The clock gating circuit uses clock and reset hosting signals to control the signal flipping of the Saber algorithm IP core. In the gated clock output timing, the system clock and reset signals are fed into the Saber algorithm IP core when the clock and reset signal register signals jump high, inserting a simultaneous global clock gating in the back-end design.
The memory module register addresses assignments when the Saber algorithm IP core is mounted via APB bus, as shown in Table 1. The table gives the offset addresses for different ports, their read/write permissions and register definitions. When the processor reads the off_chip_done signal indicating that encrypted data is output, the processor hands over other signals, such as the clk signal and Start signal, to the IP core. The segment base address of the APB bus is 0x10007000. With bit 0 of the SABER_CLK_ADDR register being 1, there would be a clock for the IP core, and with bit 0 being 0, there would be no clock. A system reset is used when bit 0 of the SABER_RESETN_ADDR register is 1, and a clock is used when bit 0 is 0. Bit 0 of the START_ADDR register is used to receive the Start signal, and bits [1: 0] of the MODE_ADDR register are used to receive the mode signal.

Processor-Level Low-Power Implementation of the Saber Algorithm IP Core
The hardware IP core of the Saber algorithm is integrated into the System on Ch (SoC) system via the APB bus, and the clock, reset and enable signals of the IP core a given by the SoC, as shown in Figure 6. The design uses the Hummingbird e203 process The reduction of dynamic flip power consumption is particularly important in IP co integration. This integration includes a clock gating circuit that connects to the bus, wh reduces the dynamic flip power consumption of the entire IP core. The clock gating circ uses clock and reset hosting signals to control the signal flipping of the Saber algorith IP core. In the gated clock output timing, the system clock and reset signals are fed in the Saber algorithm IP core when the clock and reset signal register signals jump hig inserting a simultaneous global clock gating in the back-end design. The memory module register addresses assignments when the Saber algorithm core is mounted via APB bus, as shown in Table 1. The table gives the offset addresses different ports, their read/write permissions and register definitions. When the process reads the off_chip_done signal indicating that encrypted data is output, the process hands over other signals, such as the clk signal and Start signal, to the IP core. The segme base address of the APB bus is 0x10007000. With bit 0 of the SABER_CLK_ADDR regis being 1, there would be a clock for the IP core, and with bit 0 being 0, there would be clock. A system reset is used when bit 0 of the SABER_RESETN_ADDR register is 1, a a clock is used when bit 0 is 0. Bit 0 of the START_ADDR register is used to receive t Start signal, and bits [1: 0] of the MODE_ADDR register are used to receive the mo signal.

Experimental Results and Analysis
This section presents the experimental results of the Saber algorithm. The co-processor is described in Verilog language, and the algorithm consists of a data control module and a calculation module. The experimental process uses the EDA simulation tool VCS, a logic synthesis tool design compiler, a physical layout tool IC compiler, the Cadence layout design tool IC5141 and the Mentor graphical interface Calibre to complete the back-end design of this IP based on the TSMC 65 nm process under the standard process tcbn65gplustc, where the IP core storage module consumes eight 1024 × 16 SRAMs.

Algorithm Performance Results
The number of execution cycles and time for the Saber algorithm KEM back-end design in different modes of key generation, encryption, and decryption are shown in Table 2. The key generation phase consumes 3304 clock cycles, the encryption phase consumes 4704 clock cycles, and the decryption consumes 1420 clock cycles. For high-performance data reading, the memory module consists of 8 × 1024 × 16 SRAMs. The proposed IP core was implemented under the operating conditions of 72 MHz clock frequency and 1 V supply voltage. The IP core area is 0.65 mm 2 , and the chip layout is shown in Figure 7. The design of the dual-port SRAM and the clock gating circuitry does not provide a distinct advantage in terms of the area of the IP core compared to existing implementations. However, this design primarily focuses on optimizing the timing and power consumption. The back-end design of this IP core is based on the TSMC 65 nm process, and the equivalent logic gate count of the IP core after conversion is 45 K. The power consumption shares of the multiplier module, register module, random number expansion module, encryption and decryption module, string conversion module, and main control module of the Saber algorithm hardware IP are shown in Figure 7. data reading, the memory module consists of 8 × 1024 × 16 SRAMs. The proposed IP core was implemented under the operating conditions of 72 MHz clock frequency and 1 V supply voltage. The IP core area is 0.65 mm 2 , and the chip layout is shown in Figure 7. The design of the dual-port SRAM and the clock gating circuitry does not provide a distinct advantage in terms of the area of the IP core compared to existing implementations. However, this design primarily focuses on optimizing the timing and power consumption. The back-end design of this IP core is based on the TSMC 65 nm process, and the equivalent logic gate count of the IP core after conversion is 45 K. The power consumption shares of the multiplier module, register module, random number expansion module, encryption and decryption module, string conversion module, and main control module of the Saber algorithm hardware IP are shown in Figure 7.  The power consumption shows that by adding the gated clock unit circuit, the power consumption of the IP core at different frequencies can be reduced by 50%, as shown in Figure 8. Using different process corners to complete the back-end design, the power consumption of the IP core at different frequencies with supply voltages of 1.1 V, 1 V, 0.9 V and 0.8 V is shown in Figure 8. The power consumption of the IP core decreases with a reduction in supply voltage. According to the power consumption report, the power consumption of the IP core does not change much when the supply voltage varies from 1 V to 0.9 V; therefore, this design is chosen to complete the backend design under the standard process tcbn65gplustc, with a voltage of 1 V.

12, x FOR PEER REVIEW 12 of 15
The power consumption shows that by adding the gated clock unit circuit, the power consumption of the IP core at different frequencies can be reduced by 50%, as shown in Figure 8. Using different process corners to complete the back-end design, the power consumption of the IP core at different frequencies with supply voltages of 1.1 V, 1 V, 0.9 V and 0.8 V is shown in Figure 8. The power consumption of the IP core decreases with a reduction in supply voltage. According to the power consumption report, the power consumption of the IP core does not change much when the supply voltage varies from 1 V to 0.9 V; therefore, this design is chosen to complete the backend design under the standard process tcbn65gplustc, with a voltage of 1 V. When the plaintext data are shown on the left side of Table 3, the ciphertext data generated by the encryption module are shown on the right side of Table 3. The bit width of the plaintext data is 256, and the bit width of the ciphertext data is 1024. The ciphertext output is achieved by reading the fixed memory cell of SRAM. When the plaintext data are shown on the left side of Table 3, the ciphertext data generated by the encryption module are shown on the right side of Table 3. The bit width of the plaintext data is 256, and the bit width of the ciphertext data is 1024. The ciphertext output is achieved by reading the fixed memory cell of SRAM.

Power Area Analysis of IP Cores
The back-end design is completed using different process corners at 72 MHz operating frequency with the various processes of tcbn65gplustc0d8, tcbn65gpluswc, and tcbn65gplustc at supply voltages of 0.8 V, 0.9 V, and 1 V, respectively. Within this design, we have analyzed the area and power consumption values for multiple modules, such as Mul (multiplier), SHA (random number expansion), Dpram (memory), Addh (shift), Bol2pol (string conversion), Bin (binomial sampling), Pack (encryption), and Unpack (decryption). These analyses were conducted at different supply voltages of 1 V, 0.9 V, and 0.8 V for various processes. The trends of area and power consumption values of each module of the IP core are shown in Figure 9. Due to the hardware adaptability of the random number expansion module, its power consumption and area consumption are at a low level. The power consumption of multiplier module is the primary concern for the IP core. Our next objective is to decrease the power consumption of the multiplier module.
OR PEER REVIEW 13 of 15 low level. The power consumption of multiplier module is the primary concern for the IP core. Our next objective is to decrease the power consumption of the multiplier module.

Comparison with Related Literature
A comparison of the hardware implementation of this design with other post-quantum cryptographic algorithms is shown in Table 4. The comparison involves different post-quantum cryptographic algorithms, including different implementation methods such as hardware-software co-implementation and hardware implementation, as well as involving different implementation platforms and different operating frequencies. For the different schemes using different platforms and different implementations, the relevant data show that the proposed scheme has significant advantages in terms of speed and

Comparison with Related Literature
A comparison of the hardware implementation of this design with other post-quantum cryptographic algorithms is shown in Table 4. The comparison involves different postquantum cryptographic algorithms, including different implementation methods such as hardware-software co-implementation and hardware implementation, as well as involving different implementation platforms and different operating frequencies. For the different schemes using different platforms and different implementations, the relevant data show that the proposed scheme has significant advantages in terms of speed and power consumption. The all-hardware implementation [20,21] was faster than the hardware-software collaborative design, and the random number acceleration core module, Keccak, ran slower when implemented on a software platform, with the timing overhead being higher when implemented using software code on Keccak. A comparison of the time and power consumption of the back-end design completed using the TSMC 65 nm process with other post-quantum cryptographic algorithm hardware implementations is shown in Table 4. Study [10], study [11], and the current design all involve software/hardware implementations of the Saber algorithm. With the same flow of the algorithm, the comparison focuses on the power consumption and area of the algorithm implementation. While the study referenced in [11] achieves lower running time, it does not hold an advantage in hardware consumption compared to the study referenced in [10]. In contrast, this design leverages ASIC implementation, which offers advantages in terms of the overall running time for secret key generation, encryption, and decryption. Furthermore, it increases the size of the IP chip to some extent, thereby enhancing the algorithm's speed. In terms of experimental environment, the comparison with study [21] is more similar, and the running time of the secret key generation, encryption, and decryption modes is significantly reduced during the double data reading session due to the presence of multiple sizeable random number matrix products in the algorithm process. Including the clock gating unit circuit in the bus integration and the insertion of the global clock gating in the back-end design reduces the power consumption of the IP core significantly, and this design still dominates at 72 MHz operating frequency. The global clock insertion has reduced the dominance of equivalent logic gates, but the algorithm's runtime and power consumption are significantly optimized. In terms of runtime, the overall architecture of this design is nearly 100 times faster compared to the other two design solutions. The results show that this design has a faster speed and relatively lower power consumption.

Conclusions
This study completes the design of a hardware IP core for Saber algorithm. Based on the analysis of the Saber algorithm structure, the data reading efficiency was improved by using a dual-port SRAM parallel reading method for the timing overhead problem caused by the data reading of the multiplier module. To address the problem of the clock dynamic power consumption of the IP core, the clock gating technique was used to reduce the probability of internal register dynamic flipping, thus reducing the hardware power consumption of the IP core. Finally, the digital back-end design was completed by integrating lightweight IP cores into the RISC-V SoC system via the APB bus within the TSMC 65 nm process. The Saber algorithm is a strong contender as a post-quantum cryptographic algorithm due to its omission of the mode about subtraction operation, which means it has a better prospect of achieving lightweight post-quantum cryptographic encryption. In our future work, we plan to conduct a more comprehensive analysis of the application's resistance to known attacks and potential vulnerabilities. This analysis aims to enhance the algorithm's overall resilience against attacks by addressing any identified vulnerabilities from an application perspective.