1. Introduction
Information is fundamental to modern society and must be protected by secure cryptographic systems. Traditional cryptographic systems rely on mathematical problems that are believed to be computationally intractable in polynomial time [1,2]. For instance, the RSA public key encryption algorithm relies on the hardness of integer factorization to protect its private key [3]. However, with the emergence of quantum algorithms and the rapid development of specialized quantum computers, traditional cryptography is expected to become vulnerable to quantum attacks in the foreseeable future [4]. Post-quantum cryptographic algorithms such as CRYSTALS-Kyber have been standardized as a new generation of cryptographic protocols designed to resist quantum attacks [5]. Similarly, the Saber key encapsulation protocol, which is based on lattice problems [6,7], holds substantial value for both research and practical applications in cryptography.
In the third round of the National Institute of Standards and Technology (NIST) evaluation, the parameter sets of Saber were not updated to counter the dual attack, which left its estimated security in the RAM model below that of Kyber. Although eliminated by NIST, the Saber algorithm retains valuable attributes that make it suitable for research and implementation in academic and industrial settings [8]. Therefore, this study chooses the Saber algorithm for a complete hardware implementation of the protocol. The research provides insights for the future deployment of Post-Quantum Cryptography (PQC) in various scenarios [9], which is particularly important given the growing global concern about quantum attacks. While Saber was not selected for standardization by NIST, it retains clear advantages in hardware implementation, transmission bandwidth and security. Ongoing research into the Saber algorithm therefore remains significant, particularly within specialized fields.
This paper focuses on the hardware design and implementation of the Saber key encapsulation protocol. In 2017, D’Anvers et al. designed the first version of the Saber key encapsulation protocol and submitted it to NIST [10]. The main operation of the Saber algorithm is the multiplication of a matrix of ring polynomials by a vector of ring polynomials with modular coefficients. The NTT algorithm performs well in hardware and effectively meets the polynomial multiplication requirements of KEM schemes based on the M-LWE or LWE problem [11]. As a result, polynomial multiplication is no longer the primary bottleneck in this context. The key bottleneck instead lies in the functions responsible for key generation, which exhibit significant latency and consume considerable hardware resources, and therefore need optimization.
Hardware implementation has a clear speed advantage due to its high level of parallelism. As a result, numerous researchers, both domestically and internationally, have turned to FPGA platforms and ASICs. In 2020, Sujoy Sinha Roy et al. designed the first hardware implementation of the Saber algorithm on Xilinx’s Zynq UltraScale FPGA platform [12]. To speed up polynomial multiplication, the authors employed 256 low-level multiply-accumulate operators in parallel, so that a Schoolbook multiplication completes in 256 cycles. This approach resolves the memory access bottleneck with minimal impact on the overall area. A coprocessor architecture is adopted, and a customized 32-bit instruction set is used for operator control, balancing flexibility and performance. The peripheral control circuit of the SHA3 function is optimized as well [13]. The hardware results show a frequency of 250 MHz with only 23.6 k LUTs consumed. In 2022, Zhu et al. implemented the Saber algorithm in the TSMC 28 nm process. The overall chip area is only 3.6 mm², and the chip operates at 500 MHz. This is also the first complete ASIC implementation of the Saber protocol, and it has been verified through tape-out. The ASIC implementation is superior to other cryptographic algorithms in terms of power consumption and security [14]. In 2021, Andrea Basso et al. implemented the multiplication of Saber and Dilithium on an Artix-7 using the same iterative NTT polynomial multiplier. This implementation takes 519 clock cycles, fewer than the NTT multipliers used for other cryptographic algorithms [15]. In addition, Aikata et al. designed a unified coprocessor architecture that can implement both Dilithium and Saber [16]. They aligned the data stream of the random numbers output by the hash function and thoroughly analyzed the impact of the bit width of the polynomial coefficients on the NTT results. Viet Ba Dang et al. presented a high-performance benchmark implementation of the Saber, Kyber and NTRU algorithms [17]. By integrating and optimizing existing implementations, they showed that the Saber protocol at the medium security level can be completed in 48.4 μs. Rentería-Mejía et al. presented the design of LWE cryptoprocessors using NTT cores and Gaussian samplers based on the inverse transform method. The cryptoprocessors were synthesized on a field-programmable gate array and subsequently validated in hardware, exhibiting noteworthy throughput [18]. In follow-up research, they proposed a lattice-based Identity-Based Encryption scheme, together with a lattice-based cryptoprocessor tailored to the encryption and decryption of the proposed CCA Identity-Based Encryption scheme for the security parameters n = 512 and n = 1024. The hardware implementation proved efficient and achieved a commendable level of throughput [19].
In this paper, we propose three types of polynomial multipliers for various application scenarios including the lightweight Schoolbook multiplier, high-throughput multiplier based on TMVP-Schoolbook algorithm and improved pipelined NTT multiplier. Other principal modules of the Saber protocol are designed, encompassing the hash function module, sampling module and functional submodule. Based on our proposed multiplier, we implement the overall hardware circuits of the Saber key encapsulation protocol. Experimental results demonstrate that our overall hardware circuits have different advantages compared with other existing works.
The remainder of this paper is organized as follows. In Section 2, we describe the Saber key encapsulation protocol. In Section 3, we propose three types of polynomial multipliers for various application scenarios: a lightweight Schoolbook multiplier, a high-throughput multiplier based on the TMVP-Schoolbook algorithm and an improved pipelined NTT multiplier. The other principal modules of Saber are also designed, encompassing the hash function module, the sampling module and the functional submodules. In Section 4, we implement the overall hardware circuits of the Saber key encapsulation protocol based on the proposed multipliers. In Section 5, we analyze the results and compare them with existing works. Finally, we conclude and reflect on future work in Section 6.
  2. Preliminaries
The Saber key encapsulation protocol is described in Algorithms 1–3 including key generation, key encapsulation and key decapsulation [
20].
      
| Algorithm 1 KEM.KeyGen() (steps 1–6; step 6 returns the key pair) |
In the key generation phase, Step 1 invokes the key generation function of the public key encryption algorithm to generate a public-private key pair (pk, sk) with IND-CPA security [21]. The public key is then hashed with SHA3-256 to obtain a 256-bit message digest that carries the public key information. In Step 3, z is a 256-bit random number generated by uniform sampling; it is used to derive a random output value in case of decryption failure. The ‖ operator denotes bitwise concatenation, which does not consume any hardware resources.
      
| Algorithm 2 KEM.Encap() (steps 1–9; step 9 returns the ciphertext and the session key) |
The key encapsulation phase takes the public key from the key generation phase and generates a session key. It also generates a ciphertext that carries the random seed used to derive the session key. Unlike public key encryption, no plaintext is involved in the encryption process. The 256-bit message m is obtained by sampling from a uniform distribution and is then hashed. The public key is hashed as well, and its digest is concatenated with the hashed message to form a 512-bit value. The upper 256 bits of the resulting hash value are used to generate the random number r in Step 5, where the public key encryption function is invoked. This process yields an IND-CCA secure ciphertext. The session key is generated through two rounds of SHA3-256 and is stored by the server [22]. Once the communication is closed, the session key is deleted, which effectively protects the key.
      
| Algorithm 3 KEM.Decap() (steps 1–12; an if/else branch spans steps 6–10; step 12 returns the session key) |
In the decapsulation phase, the received ciphertext and the private key from the key generation phase are used to derive the session key corresponding to the ciphertext. The message m is obtained by decrypting the received ciphertext. Then, m is concatenated with the hash value of the public key to form a 512-bit value. In Step 6, the public key encryption algorithm is invoked again to regenerate the ciphertext on the client side. The next step verifies whether the client’s ciphertext matches the server’s ciphertext. If they do not match, the output is derived from the random value z; if they match, it is derived from the recomputed hash value. The session key is then returned.
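The three phases described above can be sketched in a few lines of Python. This is only a minimal illustration of the control flow (hash the public key, derive the randomness, re-encrypt and compare during decapsulation), not the Saber reference code. The functions `toy_pke_keygen`, `toy_pke_enc` and `toy_pke_dec` are hypothetical and insecure stand-ins for the IND-CPA public key encryption scheme, and the byte ordering of the concatenations, as well as the choice of which half of the 512-bit hash feeds r, are assumptions.

```python
import hashlib
import secrets


def sha3_256(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()


def sha3_512(data: bytes) -> bytes:
    return hashlib.sha3_512(data).digest()


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


# --- Toy stand-in for the IND-CPA PKE (NOT Saber.PKE and NOT secure) ---
def toy_pke_keygen():
    key = secrets.token_bytes(32)
    return key, key  # (pk, sk) are identical here, only to keep the demo short


def toy_pke_enc(pk: bytes, m: bytes, r: bytes) -> bytes:
    pad = hashlib.shake_128(pk + r).digest(32)
    return xor(m, pad) + r  # deterministic once the randomness r is fixed


def toy_pke_dec(sk: bytes, c: bytes) -> bytes:
    body, r = c[:32], c[32:]
    return xor(body, hashlib.shake_128(sk + r).digest(32))


# --- KEM layer following the flow described in the text ---
def kem_keygen():
    pk, s = toy_pke_keygen()
    pkh = sha3_256(pk)            # 256-bit digest of the public key
    z = secrets.token_bytes(32)   # random value used on decryption failure
    return pk, (s, z, pkh, pk)


def kem_encap(pk: bytes):
    m = sha3_256(secrets.token_bytes(32))   # hashed 256-bit message
    kr = sha3_512(sha3_256(pk) + m)         # 512-bit hash value
    k_hat, r = kr[:32], kr[32:]             # upper half drives the encryption
    c = toy_pke_enc(pk, m, r)
    return c, sha3_256(k_hat + sha3_256(c))  # two rounds of SHA3-256


def kem_decap(sk, c: bytes) -> bytes:
    s, z, pkh, pk = sk
    m2 = toy_pke_dec(s, c)
    kr = sha3_512(pkh + m2)
    k_hat, r = kr[:32], kr[32:]
    c2 = toy_pke_enc(pk, m2, r)              # re-encryption check
    if secrets.compare_digest(c, c2):
        return sha3_256(k_hat + sha3_256(c))
    return sha3_256(z + sha3_256(c))         # mismatch: key derived from z
```

With this sketch, a matching ciphertext yields the same session key on both sides, while a modified ciphertext silently produces an unrelated key derived from z.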
  4. Overall Hardware Implementation of Saber
Based on the aforementioned work, this section presents the design of the overall Saber protocol circuit. A different multiplication module is selected and optimized for each application scenario. Specifically, the resource consumption and clock cycle overhead of the lightweight Schoolbook multiplier are analyzed, resulting in an efficient implementation of the overall protocol circuit for resource-constrained environments. Subsequently, a higher-throughput multiplier is introduced, and the data exchange between upstream and downstream modules is optimized. Furthermore, the coefficient-wise multiplication (CWM) and data-loading modules of the NTT multiplier are improved. Finally, high-throughput protocol circuits and timing sequence diagrams for the NTT version of the protocol are provided.
  4.1. Hardware Implementation for Resource-Constrained Scenarios of Saber
There are two mainstream approaches to the hardware implementation of the Saber lattice cipher. The first is the coprocessor implementation, which provides a unified interface for each module [12]. Similar to the peripheral devices in a system-on-chip (SoC), these modules are connected to a 64-bit bus. The flow is driven by completion flags, and a dedicated 32-bit instruction set is designed to control the modules, the enable/disable signals, and the read/write starting addresses of the BRAM. External control is achieved through stimulus files, which provides a high degree of flexibility but introduces more redundant cycles. The second approach is a dedicated implementation without a bus [16,29]. The interface width between modules is not fixed, and the overall system is controlled by a predefined state machine. This approach is more efficient, but it can only execute the predetermined flow defined by the designer.
In this study, the two aforementioned implementation methods are combined to design a resource-constrained Saber protocol circuit, as shown in Figure 7. The design follows the specification of the Saber key encapsulation protocol and the official C language source code. The circuit is controlled by a predefined state machine and uses a 64-bit data bus. Although the state machine covers the entire flow of the key encapsulation protocol, communication between the sender and receiver is not implemented due to device limitations. The state machine transitions between states based on the completion flags of each module. It also arbitrates the usage permissions and addresses of the main storage unit according to the state changes.
The overall circuit consists of three lightweight Schoolbook polynomial multipliers, two 64-bit-wide BRAMs ( with a depth of 1024 and  with a depth of 256), a 4-bit-wide dual-port BRAM with a depth of 2048, a hash function module, three  modules, and four other functional sub-modules, including . During the operation of the circuit, in the  phase, the hash function module requires a 256-bit random seed as input. This seed is provided by the external user and is input into . The hash function then reads the seed from  to generate an internal random seed. In the  and  functions, the hash function SHAKE-128 first generates pseudo-random numbers for the secret polynomials, which are stored in . It then generates coefficients for the public polynomials, which are also stored in . At the same time as generating the coefficients for the public polynomials, the binomial distribution sampling module initiates parallel operation. It requires two cycles to read data and in the write-back stage, 8 bits of data are serially shifted out per cycle. The total sampling period requires (2 + 8) × 48 + 1 = 481 cycles, which is smaller than the 687 clock cycles required for generating the coefficients of the public polynomials.
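The sampling step described above can be illustrated with a short software model. The sketch below assumes the centered binomial distribution of the medium Saber parameter set (μ = 8, so each coefficient is the difference of the Hamming weights of two 4-bit halves and lies in [−4, 4]) and draws the pseudo-random bytes from SHAKE-128. The bit ordering inside each byte, the seed handling and all function names are assumptions made for illustration; the cycle model only restates the (2 + 8) × 48 + 1 expression from the text.

```python
import hashlib

MU = 8    # binomial parameter assumed for the medium Saber level
N = 256   # coefficients per polynomial
RANK = 3  # polynomials in the secret vector


def cbd_coeff(byte: int) -> int:
    """Centered binomial sample: HW(low nibble) - HW(high nibble)."""
    low, high = byte & 0x0F, byte >> 4
    return bin(low).count("1") - bin(high).count("1")


def sample_secret(seed: bytes):
    """Expand a seed with SHAKE-128 and sample RANK secret polynomials."""
    stream = hashlib.shake_128(seed).digest(RANK * N * MU // 8)
    coeffs = [cbd_coeff(b) for b in stream]
    return [coeffs[i * N:(i + 1) * N] for i in range(RANK)]


def sampling_cycles(read=2, write=8, chunks=48, final=1) -> int:
    """Cycle estimate from the text: (read + write) per chunk plus one."""
    return (read + write) * chunks + final


PUBLIC_GEN_CYCLES = 687  # cycles for the public coefficients, from the text
```

Since `sampling_cycles()` evaluates to 481, which is below the 687 cycles needed for the public polynomial coefficients, the sampler can run entirely in the shadow of the matrix generation.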
Although the lightweight Schoolbook multiplier does not incur the additional overhead cycles reported in other literature for a single polynomial multiplication, it still requires 16,384 cycles due to its low parallelism. The entire protocol involves 36 polynomial multiplications and therefore takes approximately 590,000 cycles. The delay can be long, especially when the lightweight version runs at a low frequency. Therefore, this paper instantiates three identical lightweight polynomial multipliers in parallel. Each multiplier performs a different row-column multiplication during the matrix-vector multiplication. This reduces the number of sequentially executed multiplications to 18, cutting the cycle count in half. The additional area of the extra multipliers is acceptable. The inputs and outputs of the two additional multipliers must also be considered. During the generation of the public polynomial coefficients, the three polynomials must be stored in 
. This allows for the simultaneous retrieval of the three rows of public polynomials. Furthermore, since the three multipliers need to be synchronized, the reading of secret polynomials only needs to be controlled by the first multiplier. The circuit diagram shown in 
Figure 8 requires only one BRAM to serve the three multipliers. This not only reduces BRAM consumption but also allows the two additional multipliers to omit the logic for address generation and control signal generation. The synthesis results show that the two additional polynomial multipliers consume 488 × 2 LUTs and 243 × 2 FFs. During the inner product of polynomial vectors, certain results of the matrix multiplication are stored in 
, requiring the retrieval of data from 
.
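The multiplication counts quoted above can be reproduced with a small accounting sketch. It assumes a 3 × 3 public matrix (the medium Saber level), that key generation uses one matrix-vector product, encapsulation one matrix-vector product plus one inner product, and decapsulation one decryption inner product followed by a full re-encryption. It further assumes that only the matrix-vector rows are spread over the parallel multipliers while the inner products stay sequential. This is one accounting that matches the figures of 36 and 18, not necessarily the exact schedule of the circuit.

```python
from math import ceil

CYCLES_PER_MUL = 16_384  # lightweight Schoolbook multiplier, from the text
RANK = 3                 # assumed 3 x 3 public matrix


def sequential_mults(rank: int, multipliers: int) -> dict:
    """Sequential polynomial multiplications per protocol phase."""
    # Each multiplier handles one matrix row; a row needs `rank` products.
    mat_vec = rank * ceil(rank / multipliers)
    inner = rank  # inner products are assumed to stay sequential
    return {
        "keygen": mat_vec,
        "encap": mat_vec + inner,
        "decap": inner + mat_vec + inner,
    }


def total_cycles(rank: int, multipliers: int) -> int:
    """Multiplication cycles for the complete protocol."""
    return sum(sequential_mults(rank, multipliers).values()) * CYCLES_PER_MUL
```

For example, `total_cycles(RANK, 1)` gives 36 × 16,384 = 589,824 cycles, while `total_cycles(RANK, 3)` gives 18 × 16,384 = 294,912 cycles, which is the halving stated in the text.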
The main storage unit is continually overwritten with newly computed data and finally stores the recovered session key. Once the overall protocol flow is completed, a completion signal is raised. The overall performance of the Saber protocol circuit in resource-constrained scenarios is presented in Section 5.
  4.2. High-Throughput and Area-Time Balanced Implementation of Saber
An optimized Saber protocol circuit is designed specifically for the high-throughput polynomial multiplier based on TMVP-Schoolbook in 
Figure 8. In this circuit, the “
” signal is used as an output signal, while the “
” signal is used to determine whether to clear the accumulation registers within the multiplier. Only one multiplier and one 
 module is utilized. During the multiplication operation, the circuit simultaneously reads the 4-way common polynomial coefficients from the two BRAMs. Instead of storing the data back into the BRAMs before writing it, the data are directly processed in the 
 module. Additionally, the coefficients of the secret polynomial are read from 
, and the binomial distribution sampling module writes them back to 
. To minimize the number of cycles required for execution, it is necessary to reduce the cycle consumption in data reading and writing between modules and increase parallelism. Therefore, careful arrangement of data interaction between upstream and downstream modules is essential. The timing sequence diagrams for key generation and key encapsulation phases of the key encapsulation protocol are shown in 
Figure 9 and 
Figure 10, respectively. The key generation phase requires 2346 cycles, while the key encapsulation phase requires 3341 cycles.
The timing arrangement for the key decapsulation process is similar to the two diagrams, requiring a total of 4005 cycles. The key decapsulation phase consumes the most cycles due to the additional operation of public key encryption. It can be observed from the timing diagrams that parallelizing the modules and minimizing idle cycles within the functional submodules significantly reduce the overall protocol execution time, thereby increasing the circuit’s throughput rate.
Additionally, the paper selects the pipelined NTT transformation structure for a better balance between speed and area. The radix-2 Multi-path Delay Commutator (MDC) NTT structure, as designed in the referenced work [
18], is considered a classic approach. The architecture has no complex memory control logic and fully utilizes spare cycles in each stage, achieving high throughput. The radix-2 MDC pipelined NTT structure is employed for the Saber algorithm in this paper. The implementation and improvement of the overall NTT multiplier is illustrated in 
Figure 11. Based on a single MDC-NTT/INTT core, two modular multiplication units are added externally for point-wise multiplication. These units are similar to those in the NTT core and reduce a 45-bit value to 23 bits through modular reduction. The loading module for the public polynomial coefficients remains unchanged from the previous description; it enables the continuous reading of 13-bit or 16-bit coefficient data streams during the forward NTT. A select signal chooses the target of the NTT transformation. After the NTT of one polynomial operand, the two sets of coefficients are merged and stored at the same address in the BRAM. During the forward NTT of the other polynomial operand, the previously stored coefficients of the corresponding order are read from the BRAM, point-wise multiplied and accumulated as soon as the per-cycle data output starts. The input to the INTT is the accumulated result of the three point-wise multiplications. The resulting polynomial is then reduced modulo the Saber modulus q or p. Finally, the output sequence is rearranged in bit-reversed order. The protocol circuit based on the NTT multiplier is similar to 
Figure 3, but with the inclusion of two 36K-bit BRAM blocks to store the data after NTT transformation and point-wise multiplication. The state machine of the entire circuit is configured to achieve module parallelism. Similarly, the timing sequence diagrams for key generation and key encapsulation are shown in 
Figure 12 and 
Figure 13, respectively. The key generation phase requires 3060 cycles, the key encapsulation phase requires 4961 cycles, and the key decapsulation phase requires 5910 cycles.
The main cycle optimization lies in performing the NTT on the secret polynomial coefficients concurrently with the generation of the public polynomial coefficients. Therefore, when the NTT of the public polynomial coefficients outputs its first set of data, these can directly enter point-wise multiplication with the transformed secret polynomial coefficients already read from the BRAM. In this way, the CWM cycles are also hidden through parallel operation. The protocol flow for decapsulation is similar to the previous two phases.
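Because the Saber moduli are powers of two, an NTT-based multiplier has to compute the product over a larger NTT-friendly prime and reduce the result afterwards, as described above. The following software sketch illustrates that idea with a recursive radix-2 NTT. The prime is found by search and is simply the smallest suitable one above 2^24, so it is not the 23-bit modulus of the proposed circuit. The coefficient ranges (public coefficients in [0, 2^13), secret coefficients in [−4, 4]) and all function names are assumptions for illustration.

```python
N = 256      # polynomial length used by Saber
Q = 1 << 13  # Saber modulus q = 2^13


def is_prime(m: int) -> bool:
    if m < 2:
        return False
    i = 2
    while i * i <= m:
        if m % i == 0:
            return False
        i += 1
    return True


def find_ntt_prime(n: int, lower: int) -> int:
    """Smallest prime p > lower with p = 1 (mod 2n)."""
    p = (lower // (2 * n) + 1) * 2 * n + 1
    while not is_prime(p):
        p += 2 * n
    return p


def find_psi(n: int, p: int) -> int:
    """A primitive 2n-th root of unity mod p (n is a power of two)."""
    for g in range(2, p):
        psi = pow(g, (p - 1) // (2 * n), p)
        if pow(psi, n, p) == p - 1:
            return psi
    raise ValueError("no root found")


def ntt(a, omega, p):
    """Recursive radix-2 transform: out[k] = sum a[j] * omega^(j*k) mod p."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], omega * omega % p, p)
    odd = ntt(a[1::2], omega * omega % p, p)
    out, w = [0] * n, 1
    for k in range(n // 2):
        t = w * odd[k] % p
        out[k] = (even[k] + t) % p
        out[k + n // 2] = (even[k] - t) % p
        w = w * omega % p
    return out


def negacyclic_mul_ntt(a, b, p, psi):
    """Product in Z_p[x]/(x^n + 1) using the psi-twist and a cyclic NTT."""
    n, omega = len(a), psi * psi % p
    fa = ntt([x * pow(psi, i, p) % p for i, x in enumerate(a)], omega, p)
    fb = ntt([x * pow(psi, i, p) % p for i, x in enumerate(b)], omega, p)
    fc = [x * y % p for x, y in zip(fa, fb)]  # point-wise multiplication
    c = ntt(fc, pow(omega, -1, p), p)         # inverse transform, unscaled
    n_inv, psi_inv = pow(n, -1, p), pow(psi, -1, p)
    return [x * n_inv * pow(psi_inv, i, p) % p for i, x in enumerate(c)]


def negacyclic_mul_schoolbook(a, b, mod):
    """Reference product in Z_mod[x]/(x^n + 1)."""
    n, c = len(a), [0] * len(a)
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - n] -= a[i] * b[j]
    return [x % mod for x in c]


def saber_style_mul(a, s, p, psi):
    """Multiply over the prime field, lift to signed values, reduce mod q."""
    c = negacyclic_mul_ntt(a, s, p, psi)
    return [(x - p if x > p // 2 else x) % Q for x in c]
```

The prime must exceed twice the largest possible magnitude of a product coefficient (here below 2^23), otherwise the signed lift before the reduction modulo q would be ambiguous. This is also why the data path of an NTT multiplier for Saber is wider than the 13-bit coefficients themselves.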
  5. Results
Table 2 reports the results of the three hardware implementations of the overall circuit for the Saber key encapsulation protocol, including their respective performance and resource utilization. The cycle-count column lists the number of cycles for the key generation, key encapsulation and key decapsulation stages, respectively. The “Latency” column shows the corresponding latency in seconds, derived from these cycle counts.
This paper proposes three hardware implementations of the Saber protocol, designed for resource-constrained, high-throughput and balanced scenarios. Since each has advantages and disadvantages tied to its focus area, we use a uniform index, the area-time product (ATP), to compare the actual hardware efficiency of the circuits fairly.
      
For the ATP index, equivalent slices (ES) denotes the FPGA logic resources after conversion into an equivalent number of slices. In circuit design, minimizing delay and resource consumption enhances performance. Therefore, a lower ATP indicates higher hardware efficiency, that is, faster completion of the protocol process with fewer slices.
The only lightweight hardware implementation for resource-constrained scenarios currently available in the literature is that of [31]. It provides two lightweight schemes, one of which adds masking as a side-channel countermeasure. This paper compares the overall protocol circuit described here with the version without side-channel protection, which uses the Schoolbook algorithm with adjustable parameters for polynomial multiplication. However, that design instantiates only one multiplication module, and its storage unit was designed manually. Compared with that design, the clock cycle count decreases by 55.15%, while the equivalent slice count increases by 64% and the ATP index increases by 23.4%. For the high-performance, high-throughput implementation, this design achieves a throughput of 10,988 Kbps and a TPS value of 1.5, the highest among the compared designs. This indicates that the overall protocol circuit maintains high throughput while optimizing hardware efficiency. Its ATP index is also the lowest, indicating that the circuit achieves a good balance. The circuit architecture described in the literature [12] is a coprocessor. As mentioned above, this architecture is highly flexible and can be controlled directly by the user, who can customize specific data within the circuit to realize the desired process. However, compared with a dedicated circuit, it is at a disadvantage in clock cycle count, frequency and overall performance. Although its equivalent slice count is 58% lower than that of this design, its clock cycle count increases by 51.8% and its overall delay by 71.04%. The overall implementation in the literature [30] is very efficient. Polynomial multiplication starts as soon as the hash function generates the public polynomial coefficients. In addition, a ping-pong buffer is used, so that the cycle cost of this part is hidden behind the polynomial multiplication through parallel operation. Therefore, the overall protocol process takes only 4223 cycles, which is 56.4% fewer than this implementation. This shows that the implementation in [30] also includes specific optimizations for the hash module, particularly in the input and output sections. However, due to the critical path problem mentioned earlier, its frequency reaches only 160 MHz, resulting in a 12.3% higher latency. Its heavy use of DSPs also results in a 47.1% increase in equivalent slices and a 53.3% increase in ATP compared with this implementation. Compared with the literature [16], which also uses an NTT multiplier, this design exhibits a 43.9% increase in frequency, a 70.8% reduction in clock cycles and a 28.6% increase in equivalent slice consumption. Nevertheless, a disadvantage of NTT-based Saber remains: the multiplier requires a wider data path and consumes more resources than in other lattice-based schemes that are natively compatible with NTT.
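The comparison metrics used in this section can be reproduced from the cycle counts reported in Section 4. The sketch below sums the per-phase cycles, converts cycles to latency, and expresses a cycle difference as a percentage. The ATP function assumes the usual definition, equivalent slices multiplied by latency, since the exact formula is not reproduced here; all function names are hypothetical.

```python
def total_cycles(keygen: int, encap: int, decap: int) -> int:
    """Cycles for one complete run of the three protocol phases."""
    return keygen + encap + decap


def latency_us(cycles: int, freq_mhz: float) -> float:
    """Latency in microseconds for a cycle count at a clock in MHz."""
    return cycles / freq_mhz


def reduction_pct(reference: int, other: int) -> float:
    """How much smaller `other` is than `reference`, in percent."""
    return 100.0 * (reference - other) / reference


def atp(equivalent_slices: float, latency: float) -> float:
    """Area-time product, assumed to be ES x latency (lower is better)."""
    return equivalent_slices * latency


# Cycle counts taken from Sections 4.2 and 5 of the text.
HIGH_THROUGHPUT = total_cycles(2346, 3341, 4005)
NTT_BALANCED = total_cycles(3060, 4961, 5910)
REF30_CYCLES = 4223
```

With these numbers, the high-throughput circuit needs 9692 cycles in total, `reduction_pct(HIGH_THROUGHPUT, REF30_CYCLES)` evaluates to about 56.4, consistent with the figure quoted for [30], and 4223 cycles at 160 MHz correspond to roughly 26.4 μs.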