Review

Review of Modular Multiplication Algorithms over Prime Fields for Public-Key Cryptosystems

School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China
*
Author to whom correspondence should be addressed.
Cryptography 2025, 9(2), 46; https://doi.org/10.3390/cryptography9020046
Submission received: 13 May 2025 / Revised: 9 June 2025 / Accepted: 12 June 2025 / Published: 17 June 2025
(This article belongs to the Section Cryptography Reviews)

Abstract

Modular multiplication is a pivotal operation in public-key cryptosystems such as RSA, ElGamal, and ECC. Modular multiplication design is crucial for improving overall system performance due to the large-bit-width operation with high computational complexity. This paper provides a classification of integer multiplication algorithms based on their implementation principles. Furthermore, the core concepts, implementation challenges, and research advancements of multiplication algorithms are systematically summarized. This paper also gives a brief overview of modular reduction algorithms for various types of moduli and discusses the implementation principles, application scenarios, and current research results. Finally, the detailed research development of modular multiplication algorithms in four major classes over prime fields is deeply analyzed and summarized, making it essential as a guide for future research.

1. Introduction

Information security problems are becoming increasingly severe in the era of big data, and cryptography has become the fundamental technology and essential support for network and information security. Public-key cryptosystems have been widely adopted because their security rests on hard mathematical problems: RSA is based on the integer factorization problem, while ElGamal and ECC are based on the discrete logarithm problem. In practical applications, the 2014 Apple “goto fail” vulnerability exposed critical flaws in RSA signature verification implementations [1], driving substantial improvements in associated security protocols. Similarly, the 2018 ROCA vulnerability revealed fundamental weaknesses in RSA key generation algorithms [2], significantly accelerating the real-world adoption of Elliptic Curve Cryptography (ECC). These incidents not only demonstrate the persistent need for rigorous optimization of RSA implementations but also highlight ECC’s superior capability in mitigating contemporary security threats. Concurrently, the rapid development of quantum computing presents existential risks to conventional public-key cryptosystems. In response, the U.S. National Institute of Standards and Technology (NIST) has initiated and progressed the standardization process for Post-Quantum Cryptography (PQC). During this transitional period, traditional public-key algorithms will maintain their critical role within hybrid cryptographic architectures [3], ensuring a smooth transition toward quantum-resistant systems while providing essential backward compatibility.
Existing encryption algorithms rely on binary extension fields GF($2^m$) or prime fields GF($p$), where $m$ denotes the extension degree and $p$ is a prime number. Compared to binary extension fields, cryptographic algorithms based on prime fields offer more flexibility and security. However, they have high computational complexity and are difficult to implement in hardware. Modular multiplication is the core operation of public-key cryptosystems, and even PQC schemes designed to resist quantum attacks involve modular multiplication operations. Therefore, the study of modular multiplication algorithms is extremely important for improving the performance of cryptographic algorithms.
The modular multiplication can be represented as:
$$C = (A \times B) \bmod p, \quad 0 \le A, B < p \tag{1}$$
Modular multiplication consists of a multiplication that computes the product of two operands and a modular reduction that computes the remainder. It is implemented in one of two ways: multiplication followed by modular reduction, or interleaving of multiplication and modular reduction. Due to the large operand bit-widths in cryptographic algorithms, modular multiplication operations commonly involve efficient multiplication and modular reduction algorithms. Multiplication algorithms in public-key cryptosystems include Schoolbook multiplication, Comba multiplication [4], Karatsuba multiplication [5], Toom–Cook multiplication [6,7], and NTT-based polynomial multiplication [8], as well as Booth encoding [9] and Redundant Signed Digit (RSD) representation [10] for optimizing the multiplication process. Modular reduction algorithms include Barrett reduction [11], Montgomery reduction [12], and fast reduction for Mersenne primes [13] and NTT polynomial coefficients. Based on their underlying principle, modular multiplication algorithms can be divided into addition and multiplication types. The addition type refers to the interleaved modular multiplication algorithm based on Blakley [14]. The multiplication type includes the Barrett modular multiplication algorithm, which adopts quotient estimation techniques, the Montgomery modular multiplication algorithm, which is based on the residue system, and the fast modular multiplication algorithms for specific prime numbers. The classification of modular multiplication algorithms over prime fields for public-key cryptosystems is shown in Figure 1.
In this paper, we systematically investigate the modular multiplication algorithms over prime fields for public-key cryptosystems, comprehensively discuss various multiplication algorithms and modular reduction algorithms, and summarize the references from the dimensions of algorithmic principles, implementation difficulties, improvement schemes, and application scenarios. In addition, this paper clarifies the relationships among the various algorithms while also considering the modular multiplication algorithms used in PQC. The main contributions are summarized as follows:
(1) The commonly utilized multiplication algorithms are systematically analyzed and classified into four categories according to their implementation principles. Furthermore, the advantages and disadvantages of each multiplication algorithm are summarized.
(2) A summary of modular reduction algorithms for modular multiplication algorithms is provided, with two categories based on different modulus forms and a focus on modular reduction algorithms for specific prime numbers. Furthermore, the advantages and disadvantages of each modular reduction algorithm are discussed.
(3) Four types of modular multiplication algorithms are analyzed and reviewed, outlining the research progress for each algorithm. Furthermore, the hardware implementations of modular multiplication algorithms are analyzed and compared to provide some guidance for the analysis, design, and future research of modular multiplication algorithms.

2. Multiplication Algorithms

2.1. Basic Algorithms

2.1.1. Schoolbook Multiplication

The core idea of the Schoolbook algorithm is to multiply each bit of the multiplicand by the multiplier and sum the resulting partial products. For two $n$-bit operands, the Schoolbook algorithm requires approximately $n^2$ multiplication operations, $4n^2$ addition operations, and $n^2 + n$ store operations, resulting in a computational complexity of $O(n^2)$. If a bit of the multiplicand is 0, the corresponding computation can be skipped, thereby reducing the number of computations. The Schoolbook algorithm is the simplest method for performing multi-precision multiplication, but handling carries complicates parallelization.

2.1.2. Comba Multiplication

The Comba algorithm still has a computational complexity of $O(n^2)$. It differs in that it computes in columns, first accumulating each whole column and then handling the carry once for the entire column, thus reducing the number of carry-handling steps from $n^2$ to $2n$. In addition, the Comba algorithm requires only $2n - 1$ partial products to be computed and stored in memory, reducing the number of writes to memory compared to the Schoolbook algorithm. The Comba algorithm is often combined with other algorithms as a scheduling technique [15] to optimize the order of product calculation, and the design challenge is to improve parallelism. The difference between the Schoolbook algorithm and the Comba algorithm mainly matters for software design and is negligible for hardware design. Specifically, the Comba algorithm is suitable for assembly language, while the Schoolbook algorithm is more suitable for high-level languages that support double-precision 32-bit integer data types. The Comba algorithm is shown in Algorithm 1.
Algorithm 1. Comba multiplication
Input: $x$, $y$
Output: $z = x \times y$
1:     for $i = 0$ to $(2n - 2)$ do
2:         if $i < n$ then
3:             $pa_i = \sum_{j=0}^{i} (x_j \times y_{i-j})$
4:         else
5:             $pa_i = \sum_{j=i-(n-1)}^{n-1} (x_j \times y_{i-j})$
6:         end if
7:     end for
8:     $z = \sum_{i=0}^{2n-2} (pa_i \times 2^i)$
return $z$
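The column-wise accumulation of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: it assumes little-endian lists of $w$-bit limbs instead of single bits, and the names `comba_multiply`, `to_limbs`, and `from_limbs` are hypothetical helpers introduced only for this example.

```python
def comba_multiply(x, y, w=32):
    """Multiply two equal-length little-endian limb lists column by column."""
    n = len(x)
    assert len(y) == n
    mask = (1 << w) - 1
    z = [0] * (2 * n)
    acc = 0  # column accumulator; Python integers never overflow
    for i in range(2 * n - 1):
        # Column i collects every product x[j] * y[i - j] with valid indices.
        lo = max(0, i - (n - 1))
        hi = min(i, n - 1)
        for j in range(lo, hi + 1):
            acc += x[j] * y[i - j]
        z[i] = acc & mask  # the carry is handled once per column
        acc >>= w
    z[2 * n - 1] = acc  # what remains is the most significant limb
    return z


def to_limbs(value, n, w=32):
    """Split a non-negative integer into n little-endian w-bit limbs."""
    return [(value >> (w * i)) & ((1 << w) - 1) for i in range(n)]


def from_limbs(limbs, w=32):
    """Recombine little-endian w-bit limbs into an integer."""
    return sum(limb << (w * i) for i, limb in enumerate(limbs))
```

In a fixed-width implementation the accumulator would need roughly $2w + \log_2 n$ bits, which is one reason the method maps well to assembly code with explicit carry registers.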

2.2. Algorithms Based on Divide and Conquer

2.2.1. Karatsuba Multiplication

The Karatsuba algorithm uses a divide-and-conquer approach, splitting two $n$-bit operands into $n/2$-bit halves, recursively computing their products, and combining the results. The algorithm reduces the number of multiplication operations at the cost of additional additions: it requires approximately $3n^2/4$ multiplication operations, $3n^2 + 4n + 2$ addition operations, and $3n^2/4 + 11n/2 + 1$ storage operations, decreasing the computational complexity from $O(n^2)$ to $O(n^{\log_2 3})$. As the input bit-width increases, the cost of addition decreases relative to multiplication. The Karatsuba algorithm exhibits better performance than the basic algorithms when the input bit-width exceeds 128 bits. To support different scenarios, Kang et al. designed a Karatsuba multiplier that supports variable-size large integers [16]. In addition, Rafferty et al. analyzed the hardware implementation of combined schemes [17]. The Karatsuba–Schoolbook combination is suitable for input bit-widths of less than 64 bits. In contrast, the Karatsuba–Comba combination achieves the lowest computational complexity for input bit-widths ranging from 64 to 256 bits and the lowest latency for input bit-widths of more than 512 bits. The challenge in designing the Karatsuba algorithm involves determining the appropriate layering strategy for various bit-widths and optimizing the path delay introduced by addition operations. The Karatsuba algorithm for large integer multiplication is shown in Algorithm 2.
Algorithm 2. Karatsuba multiplication
Input: $x$, $y$, with $x = x_1 \times 2^{n/2} + x_0$, $y = y_1 \times 2^{n/2} + y_0$
Output: $z = x \times y$
1:     $z_0 = x_0 y_0$
2:     $z_1 = x_1 y_1$
3:     $s_1 = x_1 + x_0$
4:     $s_2 = y_1 + y_0$
5:     $s_{12} = s_1 \times s_2$
6:     $z_2 = s_{12} - z_0 - z_1$
7:     $z = z_0 + z_2 \times 2^{n/2} + z_1 \times 2^{n}$
return $z$
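A recursive Python sketch of Algorithm 2 is given below for illustration. The function name `karatsuba` and the cut-over parameter `threshold_bits` are assumptions made for this example; below the threshold the sketch simply falls back to the built-in multiplication, standing in for a Schoolbook or Comba base case.

```python
def karatsuba(x, y, threshold_bits=64):
    """Multiply non-negative integers with the three-product Karatsuba split."""
    # Base case: small operands are multiplied directly.
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask  # x = x1 * 2^half + x0
    y1, y0 = y >> half, y & mask  # y = y1 * 2^half + y0
    z0 = karatsuba(x0, y0, threshold_bits)
    z1 = karatsuba(x1, y1, threshold_bits)
    s12 = karatsuba(x1 + x0, y1 + y0, threshold_bits)
    z2 = s12 - z0 - z1  # equals x1*y0 + x0*y1
    return z0 + (z2 << half) + (z1 << (2 * half))
```

Only three recursive products are formed per level instead of four, which is where the $O(n^{\log_2 3})$ complexity comes from.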
Polynomial multiplication can also be efficiently performed using the Karatsuba algorithm. First, the two input polynomials are evaluated at a set of points, and the values of the product polynomial at these points are obtained by pointwise multiplication. Finally, the product polynomial is reconstructed using interpolation. When multiplying two polynomials of degree $n - 1$, $2n - 1$ evaluation points need to be selected, with $0$, $1$, and $\infty$ being commonly used points. Further details on the Karatsuba algorithm and its optimal usage with minimal computation cost can be found in [18]. In recent years, many studies have employed the Karatsuba method to design polynomial multiplication architectures for hardware implementation. Wong et al. analyzed the advantages of Karatsuba-based polynomial multipliers in Saber [19]. Furthermore, they proposed a four-layer Karatsuba architecture that achieved low hardware resource consumption and high throughput by employing optimization strategies such as full parallelism and reuse of intermediate registers. Heidarpur et al. combined the Schoolbook algorithm with overlap-free Karatsuba multiplication to reduce the critical path delay and designed a high-speed polynomial multiplier based on lookup tables [20].

2.2.2. Toom–Cook Multiplication

The general form of the Toom–Cook algorithm is Toom–Cook-$k$, which divides each operand into $k$ segments and forms a polynomial of degree $k-1$ from the $k$ coefficients. For large $k$ values, the procedure becomes extremely complex, so commonly used $k$ values are below 6. To obtain the product of degree $2(k-1)$ quickly, appropriate evaluation points should be chosen [21], converting the multiplication problem into the problem of solving a system of linear equations. After the unknown coefficients of this system are solved, shifting and summing them yields the product. The Toom–Cook algorithm further reduces the number of multiplication operations and adopts a divide-and-conquer strategy, lowering the computational complexity to $O(n^{\log_3 5}) \approx O(n^{1.465})$ for $k = 3$. However, compared to the Karatsuba algorithm, it introduces more addition operations and involves division operations that are difficult to implement in hardware. As a result, it is suitable for software implementations in which the resource overhead is not a concern.
Since the beginning of the NIST PQC Standardization process, the Toom–Cook algorithm has attracted attention due to its applicability to polynomial multiplication over rings. Bermudo et al. utilized precomputation and lazy interpolation techniques to reduce the overhead of evaluation and interpolation, which were applied to Saber [22]. Wang et al. utilized the Winograd algorithm to reduce the computation of the Low-degree Schoolbook step and reduce the density of the interpolation matrix as well as the merge post-processing to reduce the number of multiplications. The method can be applied to lattice-based cryptosystems with a modulus less than or equal to 32-bit [23]. In addition, Li et al. analyzed the leakage characteristics of the Toom–Cook algorithm for the first time and proposed some countermeasures to prevent side-channel attacks [24].

2.2.3. NTT Multiplication

The FFT method is more advantageous for high-degree polynomials because its evaluation points allow both evaluation and interpolation to be computed efficiently. It uses the $n$-th complex roots of unity as evaluation points and transforms the polynomial coefficient representation into the point-value representation. Two efficient FFT algorithms are the Decimation-In-Time (DIT) Cooley–Tukey algorithm and the Decimation-In-Frequency (DIF) Gentleman–Sande algorithm. Both algorithms consist of multiple levels of iterative butterfly computations, utilizing the divide-and-conquer strategy to rapidly compute the Discrete Fourier Transform (DFT) with a computational complexity of $O(n \log n)$. The butterfly unit of the DIT FFT algorithm performs modular multiplication followed by modular addition and subtraction. In contrast, the butterfly unit of the DIF FFT performs modular addition and subtraction followed by modular multiplication. Radix-2 [25], radix-4 [26], and split-radix FFT [27] algorithms can be selected based on the specific application scenario.
The Number Theoretic Transform (NTT) is a variant of the DFT over a finite field. Its operational process and computational complexity are equivalent to those of the FFT, with the difference that primitive $n$-th roots of unity in the finite field are used as evaluation points. The NTT algorithm eliminates the floating-point operations of the FFT, making it more suitable for hardware implementation. The NTT-based polynomial multiplication algorithm is shown in Algorithm 3. First, the high-order parts of the two input vectors are zero-padded to avoid the overflow of high-degree terms. Subsequently, the NTT is applied to each of the two extended vectors, the two resulting sequences are multiplied pointwise, and the INTT (Inverse Number Theoretic Transform) is applied to the pointwise product. Finally, carry operations are performed to obtain the multiplication result.
Algorithm 3. NTT-based polynomial multiplication
Input: $a$, $b$, with $a, b \in \mathbb{Z}_p[x]$
Output: $z = a \times b$
1:     $a = (a_0, a_1, \ldots, a_{n-1}, 0^{m-1})$, $b = (b_0, b_1, \ldots, b_{m-1}, 0^{n-1})$
2:     $\hat{a} = NTT(a)$, $\hat{b} = NTT(b)$
3:     $\hat{z} = \hat{a} \circ \hat{b}$
4:     $z = INTT(\hat{z})$
return $z$
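The four steps of Algorithm 3 can be sketched in Python as below. This is an illustrative sketch under several assumptions: the transform length is rounded up to a power of two (rather than exactly $n + m - 1$) so that a recursive radix-2 DIT transform applies, the default modulus 7681 is chosen only because $p - 1$ is divisible by $2^9$, and the helper names `find_root_of_unity`, `ntt`, and `ntt_multiply` are hypothetical. The modular inverse via `pow(x, -1, p)` requires Python 3.8 or later.

```python
def find_root_of_unity(n, p):
    """Return an element of order exactly n modulo prime p (n a power of two)."""
    assert (p - 1) % n == 0
    for x in range(2, p):
        w = pow(x, (p - 1) // n, p)
        # For a power-of-two n, w has order n exactly when w^(n/2) = -1.
        if n == 1 or pow(w, n // 2, p) == p - 1:
            return w
    raise ValueError("no root of unity of the requested order")


def ntt(a, w, p):
    """Recursive radix-2 decimation-in-time transform of a with root w."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], w * w % p, p)
    odd = ntt(a[1::2], w * w % p, p)
    out = [0] * n
    t = 1  # twiddle factor w^k
    for k in range(n // 2):
        u, v = even[k], t * odd[k] % p
        out[k] = (u + v) % p           # butterfly: multiply, then add
        out[k + n // 2] = (u - v) % p  # and subtract
        t = t * w % p
    return out


def ntt_multiply(a, b, p=7681):
    """Multiply coefficient lists a and b in Z_p[x] via zero-padded NTTs."""
    size = len(a) + len(b) - 1
    n = 1
    while n < size:
        n *= 2
    w = find_root_of_unity(n, p)
    fa = ntt(a + [0] * (n - len(a)), w, p)     # step 1 and 2: pad, transform
    fb = ntt(b + [0] * (n - len(b)), w, p)
    fz = [x * y % p for x, y in zip(fa, fb)]   # step 3: pointwise product
    z = ntt(fz, pow(w, -1, p), p)              # step 4: inverse transform
    n_inv = pow(n, -1, p)                      # scaling factor n^{-1}
    return [c * n_inv % p for c in z][:size]
```

For example, `ntt_multiply([1, 2, 3], [4, 5])` returns the coefficients of $(1 + 2x + 3x^2)(4 + 5x)$ reduced modulo 7681.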
Currently, research on hardware acceleration for NTT-based polynomial multiplication of PQC schemes is of great significance. There are three main methods to optimize NTT computational units: developing low-computational-complexity NTT/INTT algorithms to reduce the number of modular multiplications; optimizing precomputed twiddle factors to reduce the storage overhead; and designing a unified parallel NTT/INTT hardware architecture to minimize resource consumption. For the quotient ring $\mathbb{Z}_p[x]/f(x)$, $\mathbb{Z}_p[x]$ represents the set of all polynomials with coefficients in the ring of integers modulo a prime number $p$. If $f(x) = x^n - 1$, positive wrapped convolution (PWC) can be employed to avoid zero-padding. If $f(x) = x^n + 1$, negative wrapped convolution (NWC) can be utilized to achieve the same purpose. However, pre-processing (consisting of $N$ modular multiplications) and post-processing (consisting of $2N$ modular multiplications) are necessary. Zhang et al. proposed a low-computational-complexity NTT/INTT algorithm that eliminates the pre-processing and post-processing steps while removing the scaling factor $N^{-1}$ in the INTT [28]. Xing et al. eliminated the bit-reverse operation in the NTT/INTT by adjusting the loop structure [29]. Utilizing the optimization strategies in [28,29], the number of cycles required to complete one polynomial multiplication operation is reduced from $1.5 n \lg n + 8n$ to $1.5 n \lg n + n$. Guo et al. investigated the advantages of the split-radix NTT algorithm in Kyber and maximized the utilization of the memory bandwidth of radix-2 butterfly units by adjusting the calculation order [30]. Li et al. proposed a mapping strategy to optimize the data flow between NTT and INTT stages, simplifying the routing overhead among processing units [31]. Mu et al. proposed scalable NTT hardware accelerators for any polynomial length and modulus, as well as any depth and width of the processing element (PE) array [32]. They designed a conflict-free memory access scheme that supported variable PE arrays to meet different security parameters. Chen et al. eliminated the bit-reverse operation and the additional memory overhead in NTT/INTT by readjusting the loop structure and reusing the twiddle factors [33]. Furthermore, they proposed a conflict-free memory mapping scheme that supported an arbitrary radix of NTT. However, the in-place NTT algorithm leads to complex conflict-free memory access patterns, which is not favorable to the scalability of multicore architectures. To address this problem, the constant-geometry (CG) NTT algorithm uses the same memory access pattern at each stage, but at the cost of doubling the memory overhead. Su et al. proposed a unified reconfigurable multicore CG NTT architecture that supported variable PEs [34]. In addition, a cyclic-sharing memory management pattern was designed to reduce the memory overhead by 25%. Liu et al. also proposed a conflict-free memory access pattern with a capacity of $1.5N$ but further decreased the utilization of BRAM by 66.7% [35].

2.3. Partial Product Optimization Techniques

2.3.1. Booth Encoding

The Booth encoding algorithm recodes runs of consecutive 0s and 1s to reduce the number of partial products. In multiplication computations, applying the Booth transformation to a multiplier with a relatively high frequency of consecutive 0s can reduce the number of non-zero partial products, thereby optimizing the accumulation process of partial products. A brief description of the Booth algorithm is provided below:
The two's complement representation of the multiplier $y$ can be written as:
$$y = -y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} y_i \times 2^i \tag{2}$$
By appending an auxiliary bit $y_{-1}$ with the value of 0 below the Least Significant Bit (LSB) of the multiplier $y$, Equation (2) is equivalently transformed into Equations (3) and (4):
$$y = -y_{n-1} 2^{n-1} + y_{n-2} 2^{n-2} + y_{n-3} 2^{n-3} + \cdots + y_1 2^1 + y_0 2^0 + y_{-1} = (y_{n-2} - y_{n-1}) 2^{n-1} + (y_{n-3} - y_{n-2}) 2^{n-2} + \cdots + (y_1 - y_2) 2^2 + (y_0 - y_1) 2^1 + (y_{-1} - y_0) 2^0 \tag{3}$$
$$y = (-2 y_{n-1} + y_{n-2} + y_{n-3}) 2^{n-2} + (-2 y_{n-3} + y_{n-4} + y_{n-5}) 2^{n-4} + \cdots + (-2 y_1 + y_0 + y_{-1}) 2^0 \tag{4}$$
Equation (3) represents radix-2 Booth encoding, where the polynomial radix coefficients are derived by subtracting the adjacent bits of the multiplier $y$. This method reduces the number of non-zero partial products without changing the number of terms in the polynomial expression, indicating that the actual number of partial products remains unchanged.
Equation (4) represents radix-4 Booth encoding. Starting from the LSB of the multiplier, groups of three bits are formed (the LSB of the first group is the auxiliary bit $y_{-1}$). Adjacent groups overlap by one bit, meaning the Most Significant Bit (MSB) of the lower group overlaps with the LSB of the higher group. The three bits in each group form a polynomial radix coefficient in the set $\{-2, -1, 0, 1, 2\}$, so the partial product operation corresponding to every value combination of three adjacent bits can be obtained.
When $n$-bit multiplication is performed using radix-2 Booth encoding, $n$ partial products are generated. In contrast, radix-4 Booth encoding is equivalent to generating one partial product from two bits of the multiplier at each step, thereby reducing the number of partial products by half. Other high-radix encoding methods can be derived using a method similar to Equation (4). Although high-radix encoding techniques can reduce the number of partial products, the resulting circuits are more complex [36]. This complexity can be mitigated by introducing additional encoding parameters to improve the circuit logic.
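The radix-4 recoding described above can be illustrated with a short Python sketch. It assumes an even bit-width $n$ and a multiplier given as a Python integer in the two's complement range $[-2^{n-1}, 2^{n-1})$; the names `booth_radix4_digits` and `booth_multiply` are hypothetical and introduced only for this example.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's complement multiplier into n/2 digits in {-2..2}.

    Digits are returned least significant first; digit k has weight 4**k.
    """
    assert n % 2 == 0 and -(1 << (n - 1)) <= y < (1 << (n - 1))
    # Python's >> is arithmetic, so this yields the two's complement bits.
    bits = [(y >> i) & 1 for i in range(n)]
    digits = []
    prev = 0  # auxiliary bit y_{-1} = 0
    for i in range(0, n, 2):
        # Overlapping group (y_{i+1}, y_i, y_{i-1}) gives -2*y_{i+1} + y_i + y_{i-1}.
        digits.append(-2 * bits[i + 1] + bits[i] + prev)
        prev = bits[i + 1]
    return digits


def booth_multiply(x, y, n):
    """Multiply x by the n-bit signed multiplier y using n/2 partial products."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, n)))
```

Each digit selects one of the partial products $0$, $\pm x$, or $\pm 2x$, all of which can be formed in hardware with shifts and negation.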

2.3.2. RSD Representation

Common redundant representations include the Carry-Save (C-S) representation and the RSD representation.
In the C-S representation, a signed number $A$ can be expressed as the sum of two signed numbers:
$$A = A_c + A_s = \sum_{i=0}^{n-1} a_i 2^i = \sum_{i=0}^{n-1} (a_c + a_s) 2^i \tag{5}$$
In the RSD representation, a signed number $A$ can be expressed as the difference between two unsigned numbers:
$$A = A_p - A_m = \sum_{i=0}^{n-1} a_i 2^i = \sum_{i=0}^{n-1} (a_p - a_m) 2^i \tag{6}$$
The RSD representation of a signed number $A$ is denoted as $a = \{a_{n-1}, a_{n-2}, \ldots, a_0\}$. Each digit is represented by the two bits $a_p$ and $a_m$, where $a_p$ and $a_m$ take the value 0 or 1, and $a_i$ takes the value −1, 0, or 1. In the RSD representation system, a number is negated by changing the sign of the non-zero digits in the encoding, toggling between −1 and 1.
The representation of signed numbers can reduce the carry propagation delay of the accumulation operations in multiplication. However, it requires additional bits in the signed number systems, and the signed numbers need to be converted back to the original number system after computation. Compared to the C-S representation, the RSD representation offers the following advantages [37]: the RSD encoding represents the difference between two unsigned numbers without requiring additional sign bits; the number 0 in the RSD encoding representation is easy to analyze and determine; and the negation operation in the RSD encoding representation only needs to exchange the positions of $a_p$ and $a_m$, facilitating the implementation of subtraction operations.
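The properties listed above can be made concrete with a small Python sketch. It assumes that an RSD number is stored as a pair of non-negative integers `(ap, am)` whose bits are the $a_p$ and $a_m$ components of each digit; this storage choice and the function names are assumptions for illustration only.

```python
def rsd_value(ap, am):
    """Value of an RSD number stored as the pair (A_p, A_m): A = A_p - A_m."""
    return ap - am


def rsd_digits(ap, am, n):
    """Signed digits a_i = a_p - a_m in {-1, 0, 1}, least significant first."""
    return [((ap >> i) & 1) - ((am >> i) & 1) for i in range(n)]


def rsd_negate(ap, am):
    """Negation only exchanges the roles of the two components."""
    return am, ap


def rsd_is_zero(ap, am):
    """The value is zero exactly when both components are equal."""
    return ap == am
```

For instance, the pair `(0b1001, 0b0110)` encodes the digits $(1, -1, -1, 1)$ from least to most significant, that is, $1 - 2 - 4 + 8 = 3$.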

2.4. Comparative Analysis of Multiplication Algorithms

This paper summarizes the multiplication algorithms in public-key cryptography and classifies them into three types. Schoolbook multiplication and Comba multiplication belong to basic algorithms. The Karatsuba, Toom–Cook, and NTT multiplication algorithms utilize the divide-and-conquer recursive approach. Booth encoding and RSD representation are partial product optimization techniques. Table 1 summarizes the advantages and disadvantages of these multiplication algorithms.

3. Modular Reduction Algorithms

3.1. Modular Reduction for General Moduli

3.1.1. Barrett Reduction

Barrett reduction is the first estimator reduction algorithm without division operations. The central principle is to use multiplication and precomputation instead of division. Assuming the radix is denoted by $b = 2^L$ and the word length is represented by $L$, the Barrett reduction algorithm is shown in Algorithm 4.
Algorithm 4. Barrett reduction
Input: $0 \le z < b^{2k}$, $b^{k-1} < p < b^{k}$, $b \ge 3$, $\mu = \lfloor b^{2k} / p \rfloor$
Output: $r = z \bmod p$
1:     $\hat{q} = \lfloor \lfloor z / b^{k-1} \rfloor \cdot \mu / b^{k+1} \rfloor$
2:     $r = (z \bmod b^{k+1}) - (\hat{q} p \bmod b^{k+1})$
3:     if $r < 0$ then
4:         $r = r + b^{k+1}$
5:     end if
6:     while $r \ge p$ do
7:         $r = r - p$
8:     end while
return $r$
The value of $L$ in the Barrett reduction algorithm can be selected based on the modulus $p$. Since the modulus $p$ is fixed throughout a modular multiplication, $\mu = \lfloor b^{2k} / p \rfloor$ can be precomputed, and at most two subtractions of $p$ are required in the final correction loop because the intermediate result fits within $k + 1$ digits.
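A Python sketch of Algorithm 4 is shown below. It is a minimal illustration that assumes $k$ is the number of base-$b$ digits of $p$ and that $b > 3$; the function names `barrett_setup` and `barrett_reduce` are hypothetical. The floor divisions by powers of $b$ would be plain word shifts in a real implementation.

```python
def barrett_setup(p, L=32):
    """Precompute radix b = 2^L, digit count k of p, and mu = floor(b^(2k) / p)."""
    b = 1 << L
    k = 1
    while b ** k <= p:  # smallest k with p < b^k
        k += 1
    mu = b ** (2 * k) // p  # the only true division, done once per modulus
    return b, k, mu


def barrett_reduce(z, p, b, k, mu):
    """Return z mod p for 0 <= z < b^(2k) using the precomputed mu."""
    assert 0 <= z < b ** (2 * k)
    # Quotient estimate; it undershoots the true quotient by at most 2.
    q_hat = ((z // b ** (k - 1)) * mu) // b ** (k + 1)
    m = b ** (k + 1)
    r = (z % m) - ((q_hat * p) % m)  # only the low k+1 digits are needed
    if r < 0:
        r += m
    while r >= p:  # final correction loop
        r -= p
    return r
```

As a usage example, `b, k, mu = barrett_setup(2**255 - 19)` followed by `barrett_reduce(x * y, 2**255 - 19, b, k, mu)` reduces the product of two field elements.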

3.1.2. Montgomery Reduction

The Montgomery reduction algorithm computes $T R^{-1} \bmod p$, where $p$ represents the modulus and $R > p$ is coprime to $p$. In practical implementations, $R$ is typically chosen as a power of 2 so that the time-consuming division operations are replaced by simple shift operations. The Montgomery reduction algorithm is shown in Algorithm 5.
Algorithm 5. Montgomery reduction
Input: $0 \le T < Rp$, $p' = (-p^{-1}) \bmod R$
Output: $r = T R^{-1} \bmod p$
1:     $m = (T \bmod R) \, p' \bmod R$
2:     $r = (T + m p) / R$
3:     if $r \ge p$ then
4:         $r = r - p$
5:     end if
return $r$
Since $m < R$ and, for a product of two reduced operands, $T < p^2$ with $p < R$, it follows that $(T + m p) / R < (p^2 + R p) / R < (R p + R p) / R = 2p$, so the final result can be constrained within the range $[0, p)$ with at most one subtraction.
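Algorithm 5 can be sketched in Python as follows, assuming $R = 2^{\text{r\_bits}}$ and an odd modulus $p < R$. The function names are hypothetical, and `pow(p, -1, R)` (Python 3.8 or later) is used here only to precompute $p'$.

```python
def montgomery_setup(p, r_bits):
    """Return R = 2^r_bits and p' = (-p^{-1}) mod R for an odd modulus p < R."""
    R = 1 << r_bits
    assert p % 2 == 1 and p < R
    p_prime = (-pow(p, -1, R)) % R
    return R, p_prime


def montgomery_reduce(T, p, r_bits, p_prime):
    """Return T * R^{-1} mod p for 0 <= T < R * p, using shifts and masks only."""
    r_mask = (1 << r_bits) - 1
    m = ((T & r_mask) * p_prime) & r_mask  # m = (T mod R) * p' mod R
    r = (T + m * p) >> r_bits              # exact division: T + m*p = 0 mod R
    if r >= p:                             # r < 2p, so one subtraction suffices
        r -= p
    return r


def montgomery_mulmod(a, b, p, r_bits):
    """Compute a * b mod p by working in the Montgomery domain."""
    R, p_prime = montgomery_setup(p, r_bits)
    a_bar = (a * R) % p  # conversion into the Montgomery domain
    b_bar = (b * R) % p
    c_bar = montgomery_reduce(a_bar * b_bar, p, r_bits, p_prime)
    return montgomery_reduce(c_bar, p, r_bits, p_prime)  # conversion back
```

The conversions into the Montgomery domain use an ordinary `%` here for brevity; in practice they are amortized over many multiplications, for example across a whole modular exponentiation.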

3.2. Modular Reduction for Specific Prime Numbers

3.2.1. Fast Modular Reduction for Mersenne Primes

Mersenne primes have the form $p = 2^n - 1$. Modular reduction can be performed quickly by utilizing the congruence $2^n \equiv 1 \pmod{p}$. Assume the modulus $p$ is $n$ bits wide and the input operand $Z$ is $2n$ bits wide, with $Z_H$ denoting the upper $n$ bits of $Z$ and $Z_L$ denoting the lower $n$ bits. The intermediate result $t = Z_H + Z_L$ can then be obtained using the congruence, and the final result can be constrained within the range $[0, p)$ with at most one subtraction. Among the five prime field elliptic curves recommended by NIST, $p_{521} = 2^{521} - 1$ satisfies the Mersenne prime form. In addition, there are two variants of Mersenne primes: pseudo-Mersenne primes and generalized Mersenne primes.
Pseudo-Mersenne primes are represented as $p = 2^n - c$, where $c$ is a small odd number. The modulus $p = 2^{255} - 19$ of Curve25519 [38] satisfies the pseudo-Mersenne prime form, transforming modular reduction into multiplication and addition operations through the equivalence relation $2^{256} \equiv 38 \pmod{p}$. Yu et al. proposed a fast modular reduction algorithm for Curve25519, representing the modulus $p = 2^{255} - 19$ as $2^{255} - 2^4 - 2^2 + 1$ and transforming the 510-bit $Z$ into the form $Z_{254} 2^{508} + Z_{253} 2^{506} + Z_{252} 2^{504} + \cdots + Z_1 2^2 + Z_0$, where the width of each $Z_i$ is 2 bits [39]. This algorithm involves two rounds: the first round only processes the upper 254 bits to obtain the residue term and $S$, while the second round reduces the highest two terms of $S$.
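The folding step for Mersenne and pseudo-Mersenne primes described above can be sketched in Python as follows. This is a generic illustration, not the two-round algorithm of Yu et al. [39]: it assumes the input is the product of two reduced operands, and for Curve25519 it folds with $2^{255} \equiv 19 \pmod{p}$, which is the relation $2^{256} \equiv 38$ divided by two. The function names are hypothetical.

```python
def reduce_mersenne(z, n):
    """Reduce z <= (p - 1)^2 modulo the Mersenne number p = 2^n - 1."""
    p = (1 << n) - 1
    assert 0 <= z <= (p - 1) ** 2
    t = (z >> n) + (z & p)  # Z_H + Z_L, valid because 2^n = 1 (mod p)
    if t >= p:              # t < 2p, so one subtraction suffices
        t -= p
    return t


P25519 = (1 << 255) - 19


def reduce_25519(z):
    """Reduce z <= (p - 1)^2 modulo p = 2^255 - 19 by folding the high bits twice."""
    assert 0 <= z <= (P25519 - 1) ** 2
    mask = (1 << 255) - 1
    t = (z & mask) + 19 * (z >> 255)  # first fold: t < 20 * 2^255
    t = (t & mask) + 19 * (t >> 255)  # second fold: t < 2^255 + 19 * 19
    if t >= P25519:                   # one conditional subtraction remains
        t -= P25519
    return t
```

The multiplication by the small constant 19 is what distinguishes the pseudo-Mersenne case from the pure Mersenne case, where the fold is a single addition.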
Generalized Mersenne primes are denoted as $p = 2^n - c_1 2^{n-1} - \cdots - c_i 2^{n-i} - \cdots - c_n$, where each $c_i$ is an integer with a small absolute value. The modular reduction operation can be transformed into a series of additions and subtractions using the congruence relation. Compared with the reduction process of pseudo-Mersenne primes, shift and addition operations replace the multiplication operation with high complexity. In addition to the $p_{521}$ introduced under Mersenne primes, there are 192-bit, 256-bit, and 384-bit generalized Mersenne primes. Take the NIST curve $p_{256}$ as an example, where two integers are multiplied to get the 512-bit result $Z = z_{15} 2^{480} + z_{14} 2^{448} + \cdots + z_1 2^{32} + z_0$. The fast modular reduction algorithm of NIST curve $p_{256}$ is shown in Algorithm 6.
Algorithm 6. Fast modular reduction for $p_{256}$
Input: $Z = (z_{15}, z_{14}, z_{13}, \ldots, z_2, z_1, z_0)_{2^{32}}$
Output: $r = Z \bmod p_{256}$
1:     $T = (z_7, z_6, z_5, z_4, z_3, z_2, z_1, z_0)_{2^{32}}$
2:     $S_1 = (z_{15}, z_{14}, z_{13}, z_{12}, z_{11}, 0, 0, 0)_{2^{32}}$
3:     $S_2 = (0, z_{15}, z_{14}, z_{13}, z_{12}, 0, 0, 0)_{2^{32}}$
4:     $S_3 = (z_{15}, z_{14}, 0, 0, 0, z_{10}, z_9, z_8)_{2^{32}}$
5:     $S_4 = (z_8, z_{13}, z_{15}, z_{14}, z_{13}, z_{11}, z_{10}, z_9)_{2^{32}}$
6:     $D_1 = (z_{10}, z_8, 0, 0, 0, z_{13}, z_{12}, z_{11})_{2^{32}}$
7:     $D_2 = (z_{11}, z_9, 0, 0, z_{15}, z_{14}, z_{13}, z_{12})_{2^{32}}$
8:     $D_3 = (z_{12}, 0, z_{10}, z_9, z_8, z_{15}, z_{14}, z_{13})_{2^{32}}$
9:     $D_4 = (z_{13}, 0, z_{11}, z_{10}, z_9, 0, z_{15}, z_{14})_{2^{32}}$
10:   $r = (T + 2 S_1 + 2 S_2 + S_3 + S_4 - D_1 - D_2 - D_3 - D_4) \bmod p_{256}$
return $r$
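Algorithm 6 can be transcribed into Python almost line by line, as sketched below. The helper `words` and the function name `reduce_p256` are hypothetical; the prime $p_{256} = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$ is the NIST P-256 field prime. The final `%` stands in for the few conditional additions or subtractions of $p_{256}$ that a hardware design would use, since the combined sum is only a small multiple of $p_{256}$ away from the result.

```python
P256 = 2 ** 256 - 2 ** 224 + 2 ** 192 + 2 ** 96 - 1


def words(*ws):
    """Assemble 32-bit words, listed most significant first as in Algorithm 6."""
    value = 0
    for w in ws:
        value = (value << 32) | w
    return value


def reduce_p256(Z):
    """Reduce a 512-bit integer modulo the NIST P-256 prime with word shuffles."""
    assert 0 <= Z < 1 << 512
    z = [(Z >> (32 * i)) & 0xFFFFFFFF for i in range(16)]  # z[0] is the LSW
    T = words(z[7], z[6], z[5], z[4], z[3], z[2], z[1], z[0])
    S1 = words(z[15], z[14], z[13], z[12], z[11], 0, 0, 0)
    S2 = words(0, z[15], z[14], z[13], z[12], 0, 0, 0)
    S3 = words(z[15], z[14], 0, 0, 0, z[10], z[9], z[8])
    S4 = words(z[8], z[13], z[15], z[14], z[13], z[11], z[10], z[9])
    D1 = words(z[10], z[8], 0, 0, 0, z[13], z[12], z[11])
    D2 = words(z[11], z[9], 0, 0, z[15], z[14], z[13], z[12])
    D3 = words(z[12], 0, z[10], z[9], z[8], z[15], z[14], z[13])
    D4 = words(z[13], 0, z[11], z[10], z[9], 0, z[15], z[14])
    # The signed sum is congruent to Z; only a small correction remains.
    return (T + 2 * S1 + 2 * S2 + S3 + S4 - D1 - D2 - D3 - D4) % P256
```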
The fast modular reduction algorithm may require appending zeros to the higher bits due to carry or misalignment of the data undergoing reduction. The algorithm requires a large number of modular addition operations. In order to reduce the computational complexity, Choi et al. iteratively reduced the high part of the product result [40]. In addition, Choi et al. also proposed a novel method combining partial modular reduction with Montgomery reduction [41].

3.2.2. Modular Reduction in NTT Multiplication

Modular reduction is the core computational module of PQC algorithms based on RLWE and MLWE. In lattice-based cryptographic schemes, common moduli such as 3329, 7681, and 12289 can be reduced efficiently by exploiting their special forms. Yaman et al. proposed a constant-time modular reduction algorithm, utilizing the property $2^{12} \equiv 2^9 + 2^8 - 1 \pmod{3329}$ to reduce higher-order bits to smaller bit-widths [42].
Aikata et al. proposed a unified architecture to perform modular reduction in Kyber and Dilithium. For the modulus 3329, they utilized the properties $2^{12} \equiv 2^9 + 2^8 - 1$ and $2^{11} \equiv -2^{10} - 2^8 - 1$ to generate partial results. For the modulus 8380417, they recursively utilized the property $2^{23} \equiv 2^{13} - 1$ [43].
Longa et al. proposed an efficient modular reduction method suitable for the NTT computation process, applicable to moduli of the form $p = k \cdot 2^m + 1$, in which $k$ is odd and $2^m > k$ [44]. This method is a kind of incomplete modular reduction algorithm and comes in two forms, K-RED and K-RED-2x. The K-RED algorithm is shown in Algorithm 7, where $Z$ is written as $Z = Z_h 2^m + Z_l$, and fast reduction can be achieved by utilizing the property $k \cdot 2^m \equiv -1 \pmod{p}$. The algorithm requires one multiplication, one subtraction, two shifts, and one modular operation to complete a modular reduction operation, generating an output that is congruent to $kZ$ modulo $p$ and lies approximately within the range $(-kp, p)$.
Algorithm 7. K-RED reduction
Input: $Z \in [0, p^2)$
Output: $r \equiv Z k \pmod{p}$
1:     $Z_l = Z \bmod 2^m$
2:     $Z_h = \lfloor Z / 2^m \rfloor$
3:     $r = k Z_l - Z_h$
return $r$
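Algorithm 7 can be sketched in a few lines of Python, using the Kyber modulus $3329 = 13 \cdot 2^8 + 1$ as an example. The function name `k_red` is hypothetical. Note that the output is only congruent to $kZ$ and may be negative, so the extra factor $k$ has to be compensated elsewhere, for example in precomputed twiddle factors.

```python
def k_red(z, k, m):
    """Incomplete reduction for p = k * 2^m + 1: returns r with r = k*z (mod p)."""
    z_l = z & ((1 << m) - 1)  # Z_l = Z mod 2^m
    z_h = z >> m              # Z_h = floor(Z / 2^m)
    # k * 2^m = -1 (mod p), hence k*Z = k*Z_l + (k * 2^m) * Z_h = k*Z_l - Z_h.
    return k * z_l - z_h


# Example parameters for the Kyber modulus (k = 13, m = 8).
KYBER_P, KYBER_K, KYBER_M = 3329, 13, 8
assert KYBER_K * (1 << KYBER_M) + 1 == KYBER_P
```

A call such as `k_red(a * b, KYBER_K, KYBER_M)` therefore yields a short representative of $13ab$ modulo 3329 without any division.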
The K-RED-2x algorithm can also reduce intermediate results that would otherwise lead to overflow, where $Z$ is represented as $Z_h 2^{2m} + Z_m 2^m + Z_l$ $(0 \le Z_l, Z_m < 2^m)$. Bisheh-Niasar et al. proposed a K2-RED modular reduction algorithm, which performs K-RED operations twice [45]. For the modulus 3329 in Kyber, K2-RED saves one shift and one addition operation compared to K-RED-2x. Li et al. optimized the parameters of the K2-RED algorithm for use in Dilithium, reducing hardware resource consumption by more than 50% compared to Barrett reduction and Montgomery reduction [46].

3.3. Comparative Analysis of Modular Reduction Algorithms

The traditional modular reduction algorithm obtains the remainder less than the modulus by time-consuming division operations, which is unsuitable for hardware implementation. This paper summarizes the modular reduction algorithms in public-key cryptosystems and divides them into two categories. It briefly discusses algorithm principles and improved forms of modular reduction for general moduli and fast modular reduction for specific prime numbers. Table 2 summarizes the advantages and disadvantages of the above-mentioned modular reduction algorithms.

4. Modular Multiplication Algorithms

4.1. Blakley-Based Interleaved Modular Multiplication

The Blakley-based interleaved modular multiplication algorithm, as shown in Algorithm 8, transforms modular multiplication into a series of addition operations by alternately performing multiplication and modular reduction. In each iteration, comparison and subtraction operations are required to ensure that the intermediate results remain within the modulus range.
Algorithm 8. Interleaved modular multiplication
Input: $A, B \in [1, p-1]$, $p$
Output: $C = (A \times B) \bmod p$
1:     $C = 0$; $p_2 = 2 \times p$;
2:     for $i = m - 1$ down to 0 do
3:         $C_1 = C$; $C_2 = 2 \times C_1$;
4:         $I_1 = A[i] \times B$
5:         $C_3 = C_2 + I_1$; $C_4 = C_3 - p$; $C_5 = C_3 - p_2$;
6:         if $p \le C_3 < p_2$ then
7:             $C_6 = C_4$
8:         else if $C_3 \ge p_2$ then
9:             $C_6 = C_5$
10:       else $C_6 = C_3$
11:       end if
12:       $C = C_6$
13:   end for
return $C$
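A minimal Python sketch of the interleaved scheme follows. It scans the bits of $A$ from the most significant end and keeps the accumulator below $p$ after every step. The function name `interleaved_modmul` is an assumption, and the choice among $C_3$, $C_3 - p$, and $C_3 - 2p$ is written as a plain if chain rather than as parallel hardware candidates.

```python
def interleaved_modmul(a: int, b: int, p: int) -> int:
    """Compute (a * b) mod p by interleaving shift-add with reduction."""
    m = p.bit_length()          # number of bits of a to scan (a < p)
    c = 0
    for i in range(m - 1, -1, -1):
        a_i = (a >> i) & 1
        c3 = 2 * c + a_i * b    # doubling step plus partial product
        # Since c < p and b < p, c3 < 3p: at most 2p has to be removed.
        if c3 >= 2 * p:
            c = c3 - 2 * p
        elif c3 >= p:
            c = c3 - p
        else:
            c = c3
    return c
```

No intermediate value ever exceeds $3p$, which is why the method maps onto adders of roughly operand width.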
The interleaved modular multiplication algorithm requires multiple iterations and addition operations. In recent years, researchers have utilized parallel techniques, Booth encoding, and high-radix forms to improve algorithmic performance. Hossain et al. proposed a radix-2 interleaved modular multiplication algorithm with precomputation, which performs two subtractions and two comparisons for each iteration and computes intermediate results in parallel [47]. Islam et al. utilized loop operations to reduce the computational complexity of the radix-2 interleaved modular multiplication algorithm. In addition, they designed a hardware architecture supporting five prime field elliptic curves for lightweight ECC processors [48]. Javeed et al. applied parallel techniques to the radix-2 interleaved modular multiplication algorithm at a lower hardware cost by reducing data dependencies between critical operations [49]. In addition, they utilized Booth encoding to decrease the number of iterations. Rahman et al. proposed an efficient interleaved modular multiplication algorithm and hardware architecture over 256-bit prime fields, achieving faster speed and higher throughput with minimum area consumption [50]. Kudithi et al. proposed simplifying the comparison chain in the radix-4 interleaved algorithm by comparing with 0 to select the smallest positive number [51]. Lin et al. employed a high-radix interleaved modular multiplication algorithm, calculating four bits at each iteration and adding the radix-4 Booth encoding values to the accumulator, thereby balancing overall computation time and hardware area [52]. Li et al. also adopted the Booth encoding technique and further optimized the serial addition structure by reducing the critical path to a single adder and two multiplexers, thereby significantly reducing the computation time [53].

4.2. Barrett Modular Multiplication

The Barrett modular multiplication algorithm, as shown in Algorithm 9, utilizes quotient estimation techniques to convert division operations into multiplication and shift operations. However, this algorithm has unique constraints for the modulus and requires precomputation.
Algorithm 9. Barrett modular multiplication
Input: $A, B < p$, $k = \lceil \log_2 p \rceil$, $\mu = \lfloor 4^k / p \rfloor$
Output: $C = (A \times B) \bmod p$
1:     $T = A \times B$
2:     $T_h = \lfloor T / 2^{k-1} \rfloor$
3:     $q = \lfloor (T_h \times \mu) / 2^{k+1} \rfloor$
4:     $r = T - q \times p$
5:     if $r > 2p$ then
6:         $C = r - 2p$
7:     else if $r > p$ then
8:         $C = r - p$
9:     else $C = r$
10:   end if
return $C$
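The quotient estimation can be sketched in Python as below. The helper names `barrett_setup` and `barrett_modmul` are assumptions for this example. The comparisons use `>=` so that the sketch also returns a canonical result when the product is a multiple of $p$.

```python
def barrett_setup(p: int):
    """Precompute k = ceil(log2 p) and mu = floor(4^k / p) for a fixed modulus."""
    k = p.bit_length()          # equals ceil(log2 p) when p is not a power of two
    mu = (1 << (2 * k)) // p
    return k, mu


def barrett_modmul(a: int, b: int, p: int, k: int, mu: int) -> int:
    """Compute (a * b) mod p with shifts and multiplications only."""
    t = a * b
    t_h = t >> (k - 1)
    q = (t_h * mu) >> (k + 1)   # estimate of floor(t / p), too small by at most 2
    r = t - q * p               # 0 <= r < 3p
    if r >= 2 * p:
        return r - 2 * p
    if r >= p:
        return r - p
    return r
```

The only division is in `barrett_setup`, which runs once per modulus. This is why the method pays off when the modulus is fixed.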
In recent years, researchers have utilized modulus-subset optimization, multiplication decomposition, and full-width pipelined multiplication techniques to improve algorithmic performance. Dhem et al. further reduced the computational complexity of the Barrett modular multiplication algorithm by optimizing parameters, and most subsequent research has built on this foundation [54]. Hao et al. effectively applied the Toom–Cook algorithm to Barrett modular multiplication by adding redundant factors and modifying interpolation operations. In addition, they proposed a precomputation method to eliminate errors arising from redundant factors [55]. Yu et al. analyzed the Ed25519 algorithm of elliptic curve signature schemes and proposed a Barrett modular multiplication design scheme by reusing the 257-bit multiplier [56]. Agrawal et al. investigated the ECDSA algorithm of elliptic curve signature schemes, in which the modular multiplication unit first combined the Schoolbook algorithm and the Karatsuba algorithm to perform 258-bit integer multiplication, followed by the Barrett reduction algorithm [57]. Zhang et al. utilized C-S representation to optimize multiplication operations in each iteration and operation scheduling techniques to increase parallelism, thereby avoiding complex multiplication and large-bit-width addition operations in the Barrett modular multiplication algorithm [58]. Zhang et al. proposed an interleaved Barrett modular multiplication algorithm that leveraged compression and encoding techniques to replace multiplications and additions in each iteration, enhancing overall performance through parallel computation of quotient and intermediate results [59]. Zhang et al. designed a universal Barrett modular multiplier architecture for encryption schemes based on residue number systems, using binary expansion to avoid two cascaded large-bit-width multipliers [60]. Zhang et al. applied the four-term Karatsuba algorithm to Barrett modular multiplication, reducing multiplication and addition operations on the critical path [61]. Compared with the Montgomery modular multiplication algorithm, the Barrett modular multiplication algorithm is more suitable for scenarios of fixed modulus [62,63]. For the modulus $33550337 = 2^{25} - 2^{12} + 1$ in Saber, Xu et al. utilized Barrett reduction in the NTT calculation, which converted the multiplication operation into a shift operation by controlling the input operand range and transforming the modulus [62]. Guo et al. proposed a mixed-radix NTT algorithm for the modulus 3329 in Kyber [63]. The modular multiplication unit first reduced the bit-width of the product from 24 bits to 15 bits and then performed Barrett reduction.

4.3. Montgomery Modular Multiplication

The central principle of the Montgomery modular multiplication algorithm is to transform the modular operation into simple shift and addition operations by constructing a residue system of modulus. However, converting the original operands to the Montgomery domain incurs additional overhead.
In 1996, Koc et al. grouped the implementation methods of the Montgomery modular multiplication algorithm into five categories: Separated Operand Scanning (SOS), Finely Integrated Product Scanning (FIPS), Coarsely Integrated Operand Scanning (CIOS), Coarsely Integrated Hybrid Scanning (CIHS), and Finely Integrated Operand Scanning (FIOS). Of these, the CIOS and FIPS methods exhibit superior performance [64]. Assuming the lengths of the input operands and modulus are $k$ bytes, the FIPS method requires $6k^2$ addition, $2k^2 + k$ multiplication, and $2k + 1$ storage operations, while the CIOS method requires $8k^2 + 4k$ addition, $2k^2 + k$ multiplication, and $2k^2 + 3k$ storage operations. At present, most research focuses on these two implementation methods. Mrabet et al. designed a CIOS method with a systolic structure that reduced the number of modular multiplication cycles, suitable for RSA, ECC, and pairing-based cryptography [65]. Gallin et al. improved the CIOS method without final subtraction, leading to higher parallelism [66]. However, the hardware architecture was complex and failed to consider the number of pipeline stages for the digit multiplier. Botrel et al. optimized the CIOS method, reducing addition operations when the MSB of the modulus is 0 [67]. Buhrow et al. optimized the FIPS method to reduce the number of data loads and stores, achieving greater latency hiding by removing data dependency [68].
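For orientation, the CIOS method can be sketched in Python as follows. The 32-bit word size `W`, the word count `s`, and the names `to_words` and `cios_montmul` are assumptions for this example. The sketch follows the usual textbook formulation of CIOS and makes no attempt to reproduce the operation counts quoted above.

```python
W = 32                      # assumed word size in bits
MASK = (1 << W) - 1


def to_words(v: int, s: int):
    """Split v into s little-endian words of W bits."""
    return [(v >> (W * j)) & MASK for j in range(s)]


def cios_montmul(a: int, b: int, n: int, s: int) -> int:
    """Return a*b*2^(-W*s) mod n for odd n < 2^(W*s) and a, b < n."""
    aw, bw, nw = to_words(a, s), to_words(b, s), to_words(n, s)
    n0_inv = (-pow(n, -1, 1 << W)) & MASK      # -n^(-1) mod 2^W
    t = [0] * (s + 2)
    for i in range(s):
        # Multiplication pass: t += a * b[i]
        c = 0
        for j in range(s):
            cs = t[j] + aw[j] * bw[i] + c
            t[j], c = cs & MASK, cs >> W
        cs = t[s] + c
        t[s], t[s + 1] = cs & MASK, cs >> W
        # Reduction pass: t = (t + m*n) / 2^W, where m clears the lowest word
        m = (t[0] * n0_inv) & MASK
        c = (t[0] + m * nw[0]) >> W
        for j in range(1, s):
            cs = t[j] + m * nw[j] + c
            t[j - 1], c = cs & MASK, cs >> W
        cs = t[s] + c
        t[s - 1], c = cs & MASK, cs >> W
        t[s] = t[s + 1] + c
    r = sum(t[j] << (W * j) for j in range(s + 1))
    return r - n if r >= n else r              # final conditional subtraction
```

The multiplication and reduction passes alternate inside one outer loop, so the working array needs only $s + 2$ words.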
For the radix-2 Montgomery modular multiplication algorithm, Walter proved that the final subtraction can be avoided [69]. By zero-padding the highest bits of $X$ and $Y$ and setting $R = 2^{l}$ ($l \ge n + 2$), here with $l = n + 2$, the final result is constrained to the range $[0, 2N)$. In order to reduce the carry propagation delay, intermediate results are usually stored in C-S form. The radix-2 Montgomery modular multiplication algorithm is shown in Algorithm 10.
Algorithm 10. Radix-2 Montgomery modular multiplication
Input: $X = \sum_{i=0}^{n+1} x_i 2^i$, $Y = \sum_{i=0}^{n+1} y_i 2^i$, $0 \le X, Y < 2N$, $R = 2^{n+2}$
Output: $Z = XY \cdot 2^{-(n+2)} \pmod{N}$
1:     $SS[0] = 0$; $SC[0] = 0$
2:     for $i = 0$ to $n + 1$ do
3:         $q_i = (SS[i]_0 + SC[i]_0 + x_i y_0) \bmod 2$
4:         $(SS[i+1], SC[i+1]) = (SS[i] + SC[i] + x_i Y + q_i N) / 2$
5:     end for
6:     $Z = SS[n+2] + SC[n+2]$
return $Z$
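A Python sketch of this loop is given below. For readability the carry-save pair $(SS, SC)$ is merged into a single integer `s`, so the sketch models the arithmetic but not the carry-save hardware. The function name `mont_radix2` is an assumption.

```python
def mont_radix2(x: int, y: int, N: int, n: int) -> int:
    """Return z with z congruent to x*y*2^-(n+2) mod N and 0 <= z < 2N.

    Assumes N is odd, N < 2^n, and 0 <= x, y < 2N.
    """
    s = 0
    for i in range(n + 2):
        x_i = (x >> i) & 1
        q_i = (s + x_i * y) & 1          # makes s + x_i*y + q_i*N even (N is odd)
        s = (s + x_i * y + q_i * N) >> 1
    return s
```

Because the output already lies in $[0, 2N)$, it can be fed directly into the next multiplication, which is the practical benefit of dropping the final subtraction.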
Kuang et al. introduced the semi-carry-save (SCS) strategy into the radix-2 Montgomery modular multiplication algorithm, which reduced the critical path delay and clock cycles by improving the first-level CSA structure [70]. Coliban utilized fewer iterations to compute the radix-2 Montgomery modular multiplication while designing a three-input adder structure instead of two adders [71]. The method significantly improves the computation time, throughput, and maximum operating frequency. Abirami et al. proposed a radix-2 Montgomery modular multiplication algorithm for lightweight elliptic curve encryption systems [72]. The algorithm reduces loop iterations by precomputing an addition, avoiding unnecessary multiplication and subtraction operations.
In 1999, Tenca et al. proposed the multi-word radix-2 Montgomery modular multiplication algorithm and hardware implementation for the first time [73]. When the operand is $n$-bit, one Montgomery modular multiplication takes $2n$ cycles. The central principle of the word-based Montgomery modular multiplication algorithm involves segmenting the multiplier, multiplicand, and modulus into short integers for computation and subsequently concatenating these short integers. The algorithm mainly consists of two steps: the multiplicand $Y$ and the modulus $N$ are scanned word by word, and the multiplier $X$ is scanned bit by bit. Assuming $Z$, $Y$, and $N$ are split into $e$ words of $w$ bits, with $C$ as the carry, and $Z_i^{(k)}$ representing the $i$-th bit of the $k$-th word, the multi-word radix-2 Montgomery modular multiplication algorithm is shown in Algorithm 11.
Algorithm 11. Multi-word radix-2 Montgomery modular multiplication
Input: $X = \sum_{i=0}^{n-1} x_i 2^i$, $Y = \sum_{j=0}^{e-1} Y^{(j)} 2^{wj}$, $N = \sum_{j=0}^{e-1} N^{(j)} 2^{wj}$
     $0 \le X, Y < N$, $R = 2^n$, $e = \lceil (n+1)/w \rceil$
Output: $Z = \sum_{j=0}^{e-1} Z^{(j)} 2^{wj} = XYR^{-1} \pmod{N}$, $0 \le Z < 2N$
1:     $Z = 0$
2:     for $i = 0$ to $n - 1$ do
3:         $q_i = (x_i Y_0^{(0)} + Z_0^{(0)}) \bmod 2$
4:         $(C, Z^{(0)}) = x_i Y^{(0)} + q_i N^{(0)} + Z^{(0)}$
5:         for $j = 1$ to $e$ do
6:             $(C, Z^{(j)}) = C + x_i Y^{(j)} + q_i N^{(j)} + Z^{(j)}$
7:             $Z^{(j-1)} = (Z_0^{(j)}, Z_{w-1 \ldots 1}^{(j-1)})$
8:             $Z^{(e)} = 0$
9:         end for
10:   end for
return $Z$
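The word-by-word scan and the one-bit right shift that crosses word boundaries can be sketched in Python as below. The function name `mwr2mm` and the list-based word storage are assumptions for this example. The words $Y^{(e)}$ and $N^{(e)}$ are taken to be zero.

```python
def mwr2mm(x: int, y: int, N: int, n: int, w: int) -> int:
    """Return z with z congruent to x*y*2^(-n) mod N and 0 <= z < 2N.

    Assumes N is odd, N < 2^n, and 0 <= x, y < N.
    """
    e = -(-(n + 1) // w)                         # ceil((n + 1) / w)
    mask = (1 << w) - 1
    Yw = [(y >> (w * j)) & mask for j in range(e + 1)]   # Yw[e] == 0
    Nw = [(N >> (w * j)) & mask for j in range(e + 1)]   # Nw[e] == 0
    Z = [0] * (e + 1)
    for i in range(n):
        x_i = (x >> i) & 1
        q_i = (x_i * (Yw[0] & 1) + (Z[0] & 1)) & 1
        t = x_i * Yw[0] + q_i * Nw[0] + Z[0]
        C, Z[0] = t >> w, t & mask
        for j in range(1, e + 1):
            t = C + x_i * Yw[j] + q_i * Nw[j] + Z[j]
            C, Z[j] = t >> w, t & mask
            # Shift word j-1 right by one bit, pulling in bit 0 of word j.
            Z[j - 1] = ((Z[j] & 1) << (w - 1)) | (Z[j - 1] >> 1)
        Z[e] = 0
    return sum(Z[j] << (w * j) for j in range(e))
```

The word size `w` can be changed without touching the control flow. Only the number of inner iterations changes, which is the scalability property the word-based formulation is meant to provide.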
Li et al. implemented a multi-word radix-2 Montgomery interleaved modular multiplication algorithm based on the pipelined structure by decomposing operands into small-bit-width words [74]. For NTT-friendly primes ($q = q_H \cdot 2^{\log_2(2n)} + 1$), Mert et al. improved the word-based Montgomery modular multiplication algorithm to support various polynomial lengths and modulus bit-widths [75].
The high-radix Montgomery modular multiplication algorithm [76] can further reduce the complexity of the multi-word radix-2 Montgomery modular multiplication algorithm. Assuming the radix is $2^w$ ($w$ is the word length) and $R = 2^{we}$, $N'$ can be precomputed as $N' = (-N)^{-1} \bmod 2^w = (-(N^{(0)})^{-1}) \bmod 2^w$. The high-radix Montgomery modular multiplication algorithm is shown in Algorithm 12.
Algorithm 12. High-radix Montgomery modular multiplication
Input: $X = \sum_{i=0}^{e-1} X^{(i)} 2^{wi}$, $Y = \sum_{j=0}^{e-1} Y^{(j)} 2^{wj}$, $N = \sum_{j=0}^{e-1} N^{(j)} 2^{wj}$
     $0 \le X, Y < 2N$, $R = 2^{we}$, $e = \lceil (n+1)/w \rceil$, $n = \lfloor \log_2 N \rfloor + 1$
Output: $Z = \sum_{j=0}^{e-1} Z^{(j)} 2^{wj} = XYR^{-1} \pmod{N}$, $0 \le Z < 2N$
1:     $Z = 0$
2:     for $i = 0$ to $e - 1$ do
3:         $q_i = (Z^{(0)} + X^{(i)} Y^{(0)}) N' \bmod 2^w$
4:         $(C, Z^{(0)}) = X^{(i)} Y^{(0)} + q_i N^{(0)} + Z^{(0)}$
5:         for $j = 1$ to $e - 1$ do
6:             $(C, Z^{(j)}) = C + X^{(i)} Y^{(j)} + q_i N^{(j)} + Z^{(j)}$
7:             $Z^{(j-1)} = Z^{(j)}$
8:         end for
9:         $Z^{(e-1)} = C$
10:   end for
return $Z$
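The radix-$2^w$ iteration can be sketched in Python at the level of whole integers, leaving out the word-level carry chain. The function name `mont_high_radix` is an assumption. The output bound $Z < 2N$ is only guaranteed here under the extra assumption $R = 2^{we} > 4N$.

```python
def mont_high_radix(x: int, y: int, N: int, w: int, e: int) -> int:
    """Return z with z congruent to x*y*2^(-w*e) mod N.

    Assumes N is odd and 0 <= x, y < 2N; z < 2N holds when 2^(w*e) > 4N.
    """
    mask = (1 << w) - 1
    n_prime = (-pow(N, -1, 1 << w)) & mask     # N' = -N^(-1) mod 2^w
    z = 0
    for i in range(e):
        x_i = (x >> (w * i)) & mask            # i-th radix-2^w digit of x
        # q_i makes the lowest w bits of z + x_i*y + q_i*N vanish.
        q_i = (((z & mask) + x_i * (y & mask)) * n_prime) & mask
        z = (z + x_i * y + q_i * N) >> w       # exact division by 2^w
    return z
```

Compared with the radix-2 loop, the number of iterations drops from about $n$ to $e$, at the cost of a $w$-bit by operand-width multiplication in every step.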
Due to the requirement of domain conversion in Montgomery modular multiplication, it is more suitable for numerous continuous modular multiplication applications, such as the RSA cryptosystem. Wu proposed a high-radix Montgomery modular multiplication algorithm with precomputation for the RSA cryptosystem [77]. Xiao et al. proposed a variable segmentation Montgomery modular multiplication algorithm, which can divide operands into segments of arbitrary bit-widths to match the given data width of DSPs [78]. Kolagatla et al. utilized lookup table techniques to compute the reduction part in high-radix Montgomery modular multiplication [79]. This method addresses the issue of the AT metric increasing with the radix for a given large modulus. Zhang et al. proposed a low-delay, high-radix, and scalable Montgomery modular multiplication algorithm, which simplified the multiplication operation in each iteration and employed Booth encoding to reduce the number of partial products [80]. Wu et al. proposed a Montgomery modular multiplication algorithm and structure with precomputation. By analyzing the relationship between the number of clock cycles of multiplication operation, the pipelined structure, and the number of multipliers, they maximized the utilization of multipliers and adders, significantly reducing the computation time [81]. Liu et al. employed the radix-$2^{32}$ Montgomery modular multiplication algorithm to design a modular multiplication unit that supported operand bit-widths of up to 544 bits [82]. Li et al. applied Redundant Binary Representation (RBR) to the Montgomery modular multiplication algorithm to eliminate the long carry propagation delay in multiplication calculations [83]. However, achieving an optimal balance between area and time becomes increasingly challenging as the input bit-widths increase. Zhang et al. proposed a Montgomery modular multiplication algorithm with parallel precomputation to reduce the area complexity when dealing with large input bit-widths [84].
In addition, researchers also apply the efficient multiplication algorithm to Montgomery modular multiplication. Zhang et al. applied the KO-3 algorithm to Montgomery modular multiplication and designed a 256-bit modular multiplier [85]. Gu et al. applied the division-free Toom–Cook-4 algorithm to Montgomery modular multiplication and designed 256-bit and 1024-bit modular multipliers [86]. Zhao et al. proposed a radix-2 Montgomery modular multiplication algorithm based on RSD, which improved the operation speed by reducing carry propagation and the number of addition rounds [87]. They further extended this approach to develop a high-performance radix-4 Montgomery algorithm, also leveraging RSD to eliminate long carry propagation delays [88]. To reduce idle cycles in critical computational units, Wang et al. designed a parallel and pipelined Montgomery multiplier that effectively shortens the critical path by employing carry-save adders and the Karatsuba algorithm [89].

4.4. Fast Modular Multiplication for Specific Prime Numbers

Barrett reduction and Montgomery reduction can perform modular reduction for any prime number, but specialized fast modular reduction algorithms can be designed for specific prime numbers to improve computational speed.
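As a small illustration of why a special prime form helps, the sketch below reduces a product modulo the NIST P-192 prime $2^{192} - 2^{64} - 1$ using only shifts, additions, and one conditional subtraction. It is a generic folding loop written for this example, not the word-level reduction of any of the cited designs, and the name `reduce_p192` is an assumption.

```python
P192 = 2**192 - 2**64 - 1          # NIST P-192 prime
MASK192 = (1 << 192) - 1


def reduce_p192(c: int) -> int:
    """Reduce a non-negative integer modulo P192 without division."""
    while c >> 192:
        hi, lo = c >> 192, c & MASK192
        # 2^192 is congruent to 2^64 + 1 (mod P192), so the high part
        # is folded back as hi * (2^64 + 1).
        c = lo + (hi << 64) + hi
    # Here c < 2^192 < 2 * P192, so one conditional subtraction suffices.
    return c - P192 if c >= P192 else c
```

For a 384-bit product the loop runs only a few times. Hardware implementations typically unroll it into a fixed pattern of word additions.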
Marzouqi et al. introduced RSD encoding into the Karatsuba algorithm for the NIST curve P-256 and implemented an RSD-based ECC processor on FPGA for the first time [90]. Ding et al. combined the Karatsuba algorithm with a fast modular reduction algorithm for signed numbers. Furthermore, they employed a compact pipeline scheduling method to realize a reconfigurable ECC processor that supported elliptic curves over arbitrary prime fields [91]. In addition, Ding et al. combined the division-free Toom–Cook algorithm with a fast reduction algorithm. Furthermore, they introduced the NLP form to implement a high-speed ECC processor that supported primes recommended by NIST [92]. Park et al. proposed a fast modular multiplication algorithm for the NIST curve P-256. By introducing the Range-Shifted Representation (RSR), the reduction operation was integrated into the Karatsuba algorithm to optimize the intermediate results during the accumulation process, thereby reducing unnecessary memory accesses [93]. Hu et al. implemented an ECC processor that supported the NIST curve P-192, where the modular multiplication unit was completed using a divide-and-conquer Karatsuba multiplier, followed by fast modular reduction [94]. In addition, Hu et al. also proposed a two-stage fast modular reduction algorithm over the SCA-256 prime field, which reduced numerous repetitive addition and subtraction operations [95]. Liu et al. improved the operational scheduling of the Karatsuba algorithm and designed a fast reduction algorithm based on bit reorganization. Furthermore, they implemented a fast modular multiplier that supported secp256k1, secp256r1, and SCA-256 [96]. For specific moduli in Kyber, Zhang et al. designed a pipelined modular multiplier utilizing a fast lookup table. However, this structure required adjustments to the size of the lookup table and the division of the pipeline according to different moduli, limiting its flexibility [97]. Nguyen et al. employed low-complexity NTT and INTT to eliminate pre-processing and post-processing steps and proposed the Exact-KRED reduction algorithm to solve the resulting overflow [98]. For NTT-friendly primes, Hu et al. proposed a low-complexity fast modular multiplication algorithm suitable for moduli of the form $q = 2^{2N} - \delta$ ($\delta \le 2^{4N/3}$) [99]. This algorithm expanded the range of available primes with low hardware overhead. Plantard proposed a modular multiplication algorithm for small moduli that saves one multiplication compared to Barrett modular multiplication and Montgomery modular multiplication, suitable for scenarios involving constant multiplication [100]. Huang et al. analyzed the shortcomings of the original Plantard modular multiplication algorithm in lattice-based cryptographic schemes and proposed a signed Plantard modular multiplication algorithm [101]. This algorithm expanded the input range while narrowing the output range, facilitating lazy reduction. Subsequently, Huang et al. further expanded the input range and utilized memory optimization techniques to accelerate Kyber on Cortex-M3 and RISC-V platforms [102]. The Plantard modular multiplication algorithm has demonstrated excellent performance on software platforms, but its hardware implementation still requires further research.
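The unsigned Plantard multiplication can be sketched in Python as follows, for an odd modulus $p$ with $p^2 + 2^n p < 2^{2n}$ (equivalently, $p$ below $2^n$ divided by the golden ratio). The names `plantard_setup` and `plantard_mul` and the choice $n = 16$ for Kyber's 3329 are assumptions for this example. The signed variant of [101] is not shown.

```python
def plantard_setup(p: int, n: int) -> int:
    """Return r = p^(-1) mod 2^(2n) for an odd modulus p that is small enough."""
    assert p % 2 == 1 and p * p + (p << n) < (1 << (2 * n))
    return pow(p, -1, 1 << (2 * n))


def plantard_mul(a: int, b: int, p: int, n: int, r: int) -> int:
    """Return c in [0, p) with c congruent to -a*b*2^(-2n) mod p, for 0 <= a, b <= p."""
    u = (a * b * r) & ((1 << (2 * n)) - 1)     # a*b*r mod 2^(2n)
    return (((u >> n) + 1) * p) >> n


# Constant multiplication: fold the factor -2^(2n) and r into the constant
# once, so each product costs two multiplications instead of three.
def plantard_const(b: int, p: int, n: int, r: int) -> int:
    b_scaled = (-b << (2 * n)) % p             # b * (-2^(2n)) mod p
    return (b_scaled * r) & ((1 << (2 * n)) - 1)


def plantard_mul_const(a: int, b_pre: int, p: int, n: int) -> int:
    u = (a * b_pre) & ((1 << (2 * n)) - 1)
    return (((u >> n) + 1) * p) >> n
```

When one operand is a fixed constant such as an NTT twiddle factor, `plantard_mul_const` returns the plain product $ab \bmod p$ directly, which matches the constant-multiplication scenario mentioned above.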

4.5. Comparative Analysis of Modular Multiplication

Based on the design of various modular multiplication algorithms, it can be observed that fewer hardware resources and higher processing speed are two conflicting goals. Some studies attempt to minimize hardware utilization, some focus on reducing latency, and others strive to balance performance and area. Table 3 presents a performance comparison of several modular multipliers implemented on Field-Programmable Gate Arrays (FPGAs). Specifically, Refs. [47,48,53] all utilized interleaved modular multiplication. Ref. [47] implemented a compact modular multiplier for ECC processors using Jacobian coordinates, while Ref. [48] focused on optimizing the trade-off between performance and area over five NIST prime fields for lightweight ECC. Ref. [53] adopts the radix-4 Booth encoding technique, achieving a balance between clock cycles and operating frequency. However, the area slightly increases due to architectural optimizations required to support reconfigurability for arbitrary prime fields and operand widths. Refs. [59,61] utilized Barrett modular multiplication. Ref. [59] employs novel compression and encoding techniques to optimize the critical path delay, demonstrating particularly superior performance in high-radix modular multiplication implementations. Ref. [61] implemented a high-throughput, area-efficient Barrett modular multiplier, tailored to the bit-width requirements of fully homomorphic encryption systems. Refs. [71,78,81,89] all utilized Montgomery modular multiplication. Ref. [71] aimed to reduce computation time by optimizing the adder architecture and demonstrated that the proposed modular multiplication design achieved high performance over five NIST prime fields for ECC. Ref. [78] proposed a variable segmentation approach to design a high-throughput Montgomery modular multiplier for implementing RSA cryptosystems. Ref. [81] addressed the inefficiency of high-radix Montgomery modular multiplication in low-bit computations and achieved a high operating frequency by making full use of the multiplier. Ref. [89] presented an optimized modular multiplication implementation for 256-bit, 384-bit, and 512-bit operands. The design exhibits significant throughput improvements with increasing operand sizes, making it particularly suitable for security-sensitive applications.
In contrast, Table 4 provides a performance comparison of modular multipliers implemented on Application-Specific Integrated Circuits (ASICs). Ref. [55] applied Toom–Cook multiplication to Barrett modular multiplication, making it suitable for handling larger input bit-widths in ECC and RSA cryptographic systems. Refs. [83,84] utilized RBR-based Montgomery modular multiplication. Ref. [83] targeted a small area design, achieving smaller AT values at 256-bit and 8192-bit widths when the radix was 4 and at 1024-bit and 2048-bit widths when the radix was 16. Ref. [84] attained faster speed at the cost of hardware resources, achieving smaller AT values at a 1024-bit width when the radix was $2^{16}$ and at 2048-bit and 8192-bit widths when the radix was $2^{32}$. Ref. [85] applied multi-cycle Karatsuba multiplication to design a 256-bit pipelined Montgomery modular multiplier, achieving the computation in only 17 clock cycles.
Table 5 compares and analyzes four modular multiplication algorithms in public-key cryptosystems.

5. Conclusions

Public-key cryptographic algorithms involve large-bit-width operations, such as modular multiplication, modular exponentiation, and modular inversion. Due to their high computational complexity, large resource consumption, and long processing time, these operations have become the key design targets. Since modular exponentiation and modular inversion operations are typically implemented via multiple modular multiplication operations, the efficient implementation of modular multiplication algorithms is crucial for improving the performance of cryptographic algorithms. This paper first provides a comprehensive review and analysis of multiplication algorithms in public-key cryptosystems and the encoding methods used to optimize multiplication algorithms. Subsequently, the modular reduction algorithms for general moduli and fast modular reduction algorithms for specific prime numbers in public-key cryptosystems are summarized and analyzed. Finally, four types of modular multiplication algorithms are summarized: the Blakley-based interleaved modular multiplication algorithm, the Barrett modular multiplication algorithm employing quotient estimation, the Montgomery modular multiplication algorithm based on a residue system of the modulus, and the fast modular multiplication algorithm for specific prime numbers. A comprehensive review is conducted from the perspectives of algorithm principles, improved forms, hardware implementations, and application scenarios. Additionally, this paper summarizes the design difficulties of the algorithms above and compares their advantages and disadvantages, providing a reference for further research on modular multiplication algorithms over prime fields for public-key cryptosystems.
Future research on modular multiplication algorithms in public-key cryptosystems should focus on the following areas: (1) With increasing security requirements, designing modular multipliers that support larger bit-widths will be critical to meeting the performance demands of next-generation cryptographic schemes. (2) Although current optimization techniques have improved the efficiency of modular multiplication, they still fall short in addressing idle cycles in the multiplier core caused by data dependencies. Future work should focus on optimizing data scheduling and dynamic resource allocation. (3) Future research should focus on the efficient application of high-radix Booth encoding and RSD representation techniques to reduce carry propagation delay without significantly increasing hardware complexity. (4) Most current research has focused on the efficient implementation of modular multiplication algorithms. Future work should explore power-efficient and high-throughput modular multiplication designs. (5) Future research should focus on the design optimization of NTT computation modules in post-quantum cryptographic algorithms. The study of modular multiplication units should emphasize efficient modular reduction techniques for moduli with specific prime structures, as well as the development of reconfigurable multiplication schemes, such as the combination of Barrett and Plantard modular multiplication algorithms. (6) Future research should continue to address side-channel attacks and defenses, with a particular focus on developing lightweight protection schemes to ensure the security of cryptographic algorithms.

Author Contributions

Conceptualization, H.H. and J.Z.; methodology, H.H., J.Z. and S.Z.; validation, Z.C., H.W. and B.Y.; formal analysis, Z.L.; investigation, J.Z. and S.Z.; writing—original draft preparation, J.Z.; writing—review and editing, H.H. and S.Z.; visualization, Z.C.; supervision, H.H. and S.Z.; project administration, S.Z.; funding acquisition, H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the National Key R&D Program of China (2023YFB4403500).

Data Availability Statement

No new data were generated or analyzed in support of this research.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. van Deursen, A. Learning from Apple’s #gotofail Security Bug. Available online: https://avandeursen.com/2014/02/22/gotofail-security/ (accessed on 11 June 2025).
  2. Nemec, M.; Sys, M.; Svenda, P.; Klinec, D.; Matyas, V. The return of coppersmith’s attack: Practical factorization of widely used RSA moduli. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, 30 October–3 November 2017; pp. 1631–1648.
  3. Saiyed, A.I. Hybrid Quantum-Classical Cryptographic Protocols: Enhancing Security in the Era of Quantum Supremacy. Spectr. Res. 2025, 5.
  4. Comba, P.G. Exponentiation cryptosystems on the IBM PC. IBM Syst. J. 1990, 29, 526–538.
  5. Karatsuba, A.A.; Ofman, Y.P. Multiplication of many-digital numbers by automatic computers. In Proceedings of the Doklady Akademii Nauk, Moscow, Russia, 18 March 1962; Russian Academy of Sciences: Moscow, Russia, 1962; pp. 293–294.
  6. Toom, A.L. The complexity of a scheme of functional elements simulating the multiplication of integers. In Proceedings of the Doklady Akademii Nauk, Moscow, Russia, 18 March 1963; Russian Academy of Sciences: Moscow, Russia, 1963; pp. 496–498.
  7. Cook, S.A.; Aanderaa, S.O. On the minimum computation time of functions. Trans. Am. Math. Soc. 1969, 142, 291–314.
  8. Cooley, J.W.; Tukey, J.W. An algorithm for the machine calculation of complex Fourier series. Math. Comput. 1965, 19, 297–301.
  9. Booth, A.D. A signed binary multiplication technique. Q. J. Mech. Appl. Math. 1951, 4, 236–240.
  10. Avizienis, A. Signed-digit number representations for fast parallel arithmetic. IRE Trans. Electron. Comput. 1961, EC-10, 389–400.
  11. Barrett, P. Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In Proceedings of the Conference on the Theory and Application of Cryptographic Techniques, Berlin/Heidelberg, Germany, August 1986; Springer: Berlin/Heidelberg, Germany, 1986; pp. 311–323.
  12. Montgomery, P.L. Modular multiplication without trial division. Math. Comput. 1985, 44, 519–521.
  13. Ananyi, K.; Alrimeih, H.; Rakhmatov, D. Flexible hardware processor for elliptic curve cryptography over NIST prime fields. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2009, 17, 1099–1112.
  14. Blakley, G.R. A computer algorithm for calculating the product AB modulo M. IEEE Trans. Comput. 1983, C-32, 497–500.
  15. Zoni, D.; Galimberti, A.; Fornaciari, W. Flexible and scalable FPGA-oriented design of multipliers for large binary polynomials. IEEE Access 2020, 8, 75809–75821.
  16. Kang, B.; Cho, H. Flexka: A flexible karatsuba multiplier hardware architecture for variable-sized large integers. IEEE Access 2023, 11, 55212–55222.
  17. Rafferty, C.; O’Neill, M.; Hanley, N. Evaluation of large integer multiplication methods on hardware. IEEE Trans. Comput. 2017, 66, 1369–1382.
  18. Weimerskirch, A.; Paar, C. Generalizations of the Karatsuba algorithm for efficient implementations. Cryptol. Eprint Arch. 2006. Available online: http://eprint.iacr.org/2006/224 (accessed on 11 June 2025).
  19. Wong, Z.Y.; Wong, D.C.K.; Lee, W.K.; Mok, K.M.; Yap, W.S.; Khalid, A. KaratSaber: New speed records for saber polynomial multiplication using efficient Karatsuba FPGA architecture. IEEE Trans. Comput. 2023, 72, 1830–1842.
  20. Heidarpur, M.; Mirhassani, M. An efficient and high-speed overlap-free Karatsuba-based finite-field multiplier for FPGA implementation. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2021, 29, 667–676.
  21. Bodrato, M. Towards optimal Toom-Cook multiplication for univariate and multivariate polynomials in characteristic 2 and 0. In Proceedings of the Arithmetic of Finite Fields: First International Workshop, Madrid, Spain, 21–22 June 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 116–133.
  22. Bermudo Mera, M.; Karmakar, A.; Verbauwhede, I. Time-memory trade-off in Toom-Cook multiplication: An application to module-lattice based cryptography. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020, 2020, 222–244.
  23. Wang, J.; Yang, C.; Zhang, F.; Meng, Y.; Xiang, S.; Su, Y. A high-throughput Toom-Cook-4 polynomial multiplier for lattice-based cryptography using a novel winograd-schoolbook algorithm. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 71, 359–372.
  24. Li, Y.; Zhu, J.; Huang, Y.; Liu, Z.; Tang, M. Single-trace side-channel attacks on the toom-cook: The case study of saber. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022, 2022, 285–310.
  25. Zhou, X.; Chen, X.; He, Y.; Mou, X. A flexible-channel MDF architecture for pipelined radix-2 FFT. IEEE Access 2023, 11, 38023–38033.
  26. Yang, C.; Wu Xiang, S.; Liang, L.; Geng, L. A high-throughput and flexible architecture based on a reconfigurable mixed-radix FFT with twiddle factor compression and conflict-free access. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2023, 31, 1472–1485.
  27. Liu, M.; Zhao, P.; Wu, T.; Parhi, K.K.; Zeng, X.; Chen, Y. A low-power twiddle factor addressing architecture for split-radix FFT processor. Microelectron. J. 2021, 117, 105276.
  28. Zhang, N.; Yang, B.; Chen, C.; Yin, S.; Wei, S.; Liu, L. Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020, 2020, 49–72.
  29. Xing, Y.; Li, S. A compact hardware implementation of CCA-secure key exchange mechanism CRYSTALS-KYBER on FPGA. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 2021, 328–356.
  30. Guo, W.; Li, S. Split-radix based compact hardware architecture for CRYSTALS-Kyber. IEEE Trans. Comput. 2023, 73, 97–108.
  31. Li, D.; Pakala, A.; Yang, K. MeNTT: A compact and efficient processing-in-memory number theoretic transform (NTT) accelerator. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2022, 30, 579–588.
  32. Mu, J.; Ren, Y.; Wang, W.; Hu, Y.; Chen, S.; Chang, C.-H.; Fan, J.; Ye, J.; Cao, Y.; Li, H.; et al. Scalable and conflict-free NTT hardware accelerator design: Methodology, proof, and implementation. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 42, 1504–1517.
  33. Chen, X.; Yang, B.; Yin, S.; Wei, S.; Liu, L. CFNTT: Scalable radix-2/4 NTT multiplication architecture with an efficient conflict-free memory mapping scheme. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022, 2022, 94–126.
  34. Su, Y.; Yang, B.L.; Yang, C.; Yang, Z.P.; Liu, Y.W. A highly unified reconfigurable multicore architecture to speed up NTT/INTT for homomorphic polynomial multiplication. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2022, 30, 993–1006.
  35. Liu, S.H.; Kuo, C.Y.; Mo, Y.N.; Su, T. An area-efficient, conflict-free, and configurable architecture for accelerating NTT/INTT. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2023, 32, 519–529. [Google Scholar] [CrossRef]
  36. RamaLakshmi, B.V.; Noorbasha, F. FPGA Implementation of Optimized Radix 4 and Radix 8 Booth Algorithm. Int. J. Perform. Eng. 2021, 17, 552. [Google Scholar] [CrossRef]
  37. Zhu, D.; Zhang, R.; Ou, L.; Tian Wang, Z. Low-latency design and implementation of the squaring in class groups for verifiable delay function using redundant representation. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2023, 2023, 438–462. [Google Scholar] [CrossRef]
  38. Bernstein, D.J. Curve25519: New Diffie-Hellman speed records. In Proceedings of the Public Key Cryptography-PKC 2006: 9th International Conference on Theory and Practice in Public-Key Cryptography, New York, NY, USA, 24–26 April 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 207–228. [Google Scholar]
  39. Huang, H.; Liu, Z. Design and implementation of high-speed scalar multiplier for multi-elliptic curve. J. Commun. 2020, 41, 100–109. [Google Scholar]
  40. Choi, P.; Lee, M.K.; Kim, H.; Kim, D.K. Low-complexity elliptic curve cryptography processor based on configurable partial modular reduction over NIST prime fields. IEEE Trans. Circuits Syst. II Express Briefs 2017, 65, 1703–1707. [Google Scholar] [CrossRef]
  41. Choi, P.; Lee, M.K.; Kim, D.K. ECC coprocessor over a NIST prime field using fast partial Montgomery reduction. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 68, 1206–1216. [Google Scholar] [CrossRef]
  42. Yaman, F.; Mert, A.C.; Öztürk, E.; Savaş, E. A hardware accelerator for polynomial multiplication operation of CRYSTALS-KYBER PQC scheme. In Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 1–5 February 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1020–1025. [Google Scholar] [CrossRef]
  43. Aikata, A.; Mert, A.C.; Imran, M.; Pagliarini, S.; Roy, S.S. KaLi: A crystal for post-quantum security using Kyber and Dilithium. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 70, 747–758. [Google Scholar] [CrossRef]
  44. Longa, P.; Naehrig, M. Speeding up the number theoretic transform for faster ideal lattice-based cryptography. In Proceedings of the International Conference on Cryptology and Network Security, Milan, Italy, 14–16 November 2016; Springer International Publishing: Cham, Switzerland, 2016; pp. 124–139. [Google Scholar]
  45. Bisheh-Niasar, M.; Azarderakhsh, R.; Mozaffari-Kermani, M. High-speed NTT-based polynomial multiplication accelerator for post-quantum cryptography. In Proceedings of the 2021 IEEE 28th Symposium on Computer Arithmetic (ARITH), Lyngby, Denmark, 14–16 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 94–101. [Google Scholar] [CrossRef]
  46. Li, L.; Tian, Q.; Qin, G.; Chen, S.; Wang, W. Compact Instruction Set Extensions for Dilithium. ACM Trans. Embed. Comput. Syst. 2024, 23, 1–21. [Google Scholar] [CrossRef]
  47. Hossain, M.S.; Kong, Y.; Saeedi, E.; Vayalil, N.C. High-performance elliptic curve cryptography processor over NIST prime fields. IET Comput. Digit. Tech. 2017, 11, 33–42. [Google Scholar] [CrossRef]
  48. Islam, M.M.; Hossain, M.S.; Shahjalal, M.D.; Hasan, M.K.; Jang, Y.M. Area-time efficient hardware implementation of modular multiplication for elliptic curve cryptography. IEEE Access 2020, 8, 73898–73906. [Google Scholar] [CrossRef]
  49. Javeed, K.; El-Moursy, A.; Gregg, D. EC-crypto: Highly efficient area-delay optimized elliptic curve cryptography processor. IEEE Access 2023, 11, 56649–56662. [Google Scholar] [CrossRef]
  50. Rahman, M.S.; Halder, K.K. Area-Time Effective Modular Multiplication for Elliptic Curve Cryptography. In Proceedings of the 2023 International Conference on Next-Generation Computing, IoT and Machine Learning (NCIM), Gazipur, Bangladesh, 16–17 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
  51. Kudithi, T.; Potdar, M.; Sakthivel, R. Radix-4 interleaved modular multiplication for cryptographic applications. In Proceedings of the 2019 International Conference on Vision Towards Emerging Trends in Communication and Networking (ViTECoN), Vellore, India, 30–31 March 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5. [Google Scholar] [CrossRef]
  52. Lin, L.; Zheng, P.Y.; Chao, P.C.P. A new ECC implemented by FPGA with favorable combined performance of speed and area for lightweight IoT edge devices. Microsyst. Technol. 2024, 30, 1537–1546. [Google Scholar] [CrossRef]
  53. Madani, B.; Azzaz, M.S.; Sadoudi, S.; Kaibou, R. High-Speed FPGA Implementation of Modular Multiplication Over Prime Field. In Proceedings of the 2024 1st International Conference on Electrical, Computer, Telecommunication and Energy Technologies (ECTE-Tech), Oum El Bouaghi, Algeria, 17–18 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar] [CrossRef]
  54. Dhem, F.; Quisquater, J. Recent results on modular multiplications for smart cards. In Proceedings of the International Conference on Smart Card Research and Advanced Applications, Amsterdam, The Netherlands, 16–18 September 1998; Springer: Berlin/Heidelberg, Germany, 1998; pp. 336–352. [Google Scholar] [CrossRef]
  55. Hao, Y.; Wang, W.; Dang, H.; Wang, G. Efficient Barrett modular multiplication based on Toom-Cook multiplication. IEEE Trans. Circuits Syst. II Express Briefs 2023, 71, 862–866. [Google Scholar] [CrossRef]
  56. Yu, B.; Huang, H.; Liu, Z.; Zhao, S.; Na, N. High-performance hardware architecture design and implementation of Ed25519 algorithm. J. Electron. Inf. Technol. 2021, 43, 1821–1827. [Google Scholar] [CrossRef]
  57. Agrawal, R.; Yang Javaid, H. Efficient FPGA-based ECDSA verification engine for permissioned blockchains. In Proceedings of the 2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP), Gothenburg, Sweden, 12–14 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 148–155. [Google Scholar] [CrossRef]
  58. Zhang, B.; Cheng, Z.; Pedram, M. A high-performance low-power Barrett modular multiplier for cryptosystems. In Proceedings of the 2021 IEEE/ACM Int. Symposium on Low Power Electronics and Design (ISLPED), Boston, MA, USA, 26–28 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar] [CrossRef]
  59. Zhang, B.; Cheng, Z.; Pedram, M. Design of a high-performance iterative Barrett modular multiplier for crypto systems. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2024, 32, 897–910. [Google Scholar] [CrossRef]
  60. Zhang, Q.; He, W.; Yang, R. Efficient configurable modular multiplier for RNS. In Proceedings of the 2023 8th International Conference on Integrated Circuits and Microsystems (ICICM), Nanjing, China, 20–23 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 224–228. [Google Scholar] [CrossRef]
  61. Zhang, B.; Yan, S. Area-efficient Barrett modular multiplication with optimized Karatsuba algorithm. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 4626–4639. [Google Scholar] [CrossRef]
  62. Xu, T.; Cui, Y.; Liu, D.; Wang, C.; Liu, W. Lightweight and efficient hardware implementation for saber using NTT multiplication. In Proceedings of the 2022 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), Shenzhen, China, 11–13 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 601–605. [Google Scholar] [CrossRef]
  63. Guo, W.; Li, S. Highly-efficient hardware architecture for CRYSTALS-Kyber with a novel conflict-free memory access pattern. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 70, 4505–4515. [Google Scholar] [CrossRef]
  64. Koc, C.K.; Acar, T.; Kaliski, B.S. Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro 1996, 16, 26–33. [Google Scholar] [CrossRef]
  65. Mrabet, A.; El-Mrabet, N.; Lashermes, R.; Rigaud, B.; Bouallegue, B.; Mesnager, S.; Machhout, M. A scalable and systolic architectures of Montgomery modular multiplication for public key cryptosystems based on DSPs. J. Hardw. Syst. Secur. 2017, 1, 219–236. [Google Scholar] [CrossRef]
  66. Gallin, G.; Tisserand, A. Generation of finely-pipelined GF (P) multipliers for flexible curve based cryptography on FPGAs. IEEE Trans. Comput. 2019, 68, 1612–1622. [Google Scholar] [CrossRef]
  67. Botrel, G.; El Housni, Y. Faster Montgomery multiplication and multi-scalar-multiplication for SNARKs. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2023, 2023, 504–521. [Google Scholar] [CrossRef]
  68. Buhrow, B.; Gilbert, B.; Haider, C. Parallel modular multiplication using 512-bit advanced vector instructions: RSA fault-injection countermeasure via interleaved parallel multiplication. J. Cryptogr. Eng. 2022, 12, 95–105. [Google Scholar] [CrossRef]
  69. Walter, C.D. Montgomery exponentiation needs no final subtractions. Electron. Lett. 1999, 35, 1831–1832. [Google Scholar] [CrossRef]
  70. Kuang, S.R.; Wu, K.Y.; Lu, R.Y. Low-cost high-performance VLSI architecture for Montgomery modular multiplication. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2015, 24, 434–443. [Google Scholar] [CrossRef]
  71. Coliban, R.M. Fast Radix-2 Montgomery modular multiplication on FPGA using ternary adder. In Proceedings of the 2022 International Conference on Computing, Electronics & Communications Engineering (iCCECE), Southend, UK, 17–18 August 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–5. [Google Scholar] [CrossRef]
  72. Abirami, T.; Saravanan, S.; Rajeshkumar, A.; Santhosh, K.M. FPGA–based Optimized Design of Montgomery Modular Multiplier using Karatsuba Algorithm. In Proceedings of the 2023 Second International Conference on Electronics and Renewable Systems (ICEARS), Tuticorin, India, 2–4 March 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 131–135. [Google Scholar] [CrossRef]
  73. Tenca, A.F.; Koç, Ç.K. A scalable architecture for modular multiplication based on Montgomery’s algorithm. IEEE Trans. Comput. 2003, 52, 1215–1221. [Google Scholar] [CrossRef]
  74. Li, H.; Ren, S.; Wang, W.; Zhang Wang, X. A low-cost high-performance Montgomery modular multiplier based on pipeline interleaving for IoT devices. Electronics 2023, 12, 3241. [Google Scholar] [CrossRef]
  75. Mert, A.C.; Karabulut, E.; Öztürk, E.; Savaş, E.; Becchi, M.; Aysu, A. A flexible and scalable NTT hardware: Applications from homomorphically encrypted deep learning to post-quantum cryptography. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 346–351. [Google Scholar] [CrossRef]
  76. Satoh, A.; Takano, K. A scalable dual-field elliptic curve cryptographic processor. IEEE Trans. Comput. 2003, 52, 449–460. [Google Scholar] [CrossRef]
  77. Wu, T. Radix-16 CSA-based low-latency non-Montgomery modular multiplier. J. Eng. 2022, 2022, 244–248. [Google Scholar] [CrossRef]
  78. Xiao, H.; Yu, S.; Cheng, B.; Liu, G. FPGA-based high-throughput Montgomery modular multipliers for RSA cryptosystems. IEICE Electron. Express 2022, 19, 20220101. [Google Scholar] [CrossRef]
  79. Kolagatla, V.R.; Desalphine, V.; Selvakumar, D. Area-time scalable high radix Montgomery modular multiplier for large modulus. In Proceedings of the 2021 25th International Symposium on VLSI Design and Test (VDAT), Surat, India, 16–18 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–4. [Google Scholar] [CrossRef]
  80. Zhang, B.; Cheng, Z.; Pedram, M. High-radix design of a scalable Montgomery modular multiplier with low latency. IEEE Trans. Comput. 2021, 71, 436–449. [Google Scholar] [CrossRef]
  81. Wu, R.; Xu, M.; Yang, Y.; Tian, G.; Yu, P.; Zhao, Y.; Lian, B.; Ma, L. Efficient high-radix GF (p) Montgomery modular multiplication via deep use of multipliers. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 5099–5103. [Google Scholar] [CrossRef]
  82. Liu, Z.; Liu, L.; Huang, H.; Zhang, Q.; Yu, B.; Zhao, S.; Cui, J. Multi-curve-oriented general high-performance ECC processor design. Acta Electronica Sin. 2023, 51, 1562–1571. [Google Scholar] [CrossRef]
  83. Li, B.; Wang, J.; Ding, G.; Fu, H.; Lei, B.; Yang, H.; Bi, J.; Lei, S. A high-performance and low-cost Montgomery modular multiplication based on redundant binary representation. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 2660–2664. [Google Scholar] [CrossRef]
  84. Zhang, Z.; Zhang, P. A scalable Montgomery modular multiplication architecture with low area-time product based on redundant binary representation. Electronics 2022, 11, 3712. [Google Scholar] [CrossRef]
  85. Zhang, S.; Li, S. An implementation of Montgomery modular multiplier based on KO-3 multiplication. In Proceedings of the 2022 4th International Conference on Communications, Information System and Computer Engineering (CISCE), Shenzhen, China, 27–29 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 596–600. [Google Scholar] [CrossRef]
  86. Gu, Z.; Li, S. A division-free Toom–Cook multiplication-based Montgomery modular multiplication. IEEE Trans. Circuits Syst. II Express Briefs 2018, 66, 1401–1405. [Google Scholar] [CrossRef]
  87. Zhao, S.; Huang, H.; Liu, Z.; Yu, B.; Yu, B. An efficient signed digit Montgomery modular multiplication algorithm. Microelectron. J. 2021, 114, 105099. [Google Scholar] [CrossRef]
  88. Zhao, S.; Zheng, J.; Shao, Y.; Huang, H.; Liu, Z.; Yu, B.; Zhang, Z. RSD-based high-performance radix-4 Montgomery Modular Multiplication for Elliptic Curve Cryptography. Microelectron. J. 2024, 153, 106433. [Google Scholar] [CrossRef]
  89. Wang, J.; Wang, X.; Liu, W.; Xing, Q.; Tang, X.; Deng, T.; Cao, R.; Huang, M. A parallel and pipelined high speed Montgomery modular multiplier for IoT devices. Comput. Netw. 2025, 265, 111282. [Google Scholar] [CrossRef]
  90. Marzouqi, H.; Al-Qutayri, M.; Salah, K.; Schinianakis, D.; Stouraitis, T. A high-speed FPGA implementation of an RSD-based ECC processor. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2015, 24, 151–164. [Google Scholar] [CrossRef]
  91. Ding, J.; Li, S. A reconfigurable high-speed ECC processor over NIST primes. In Proceedings of the 2017 IEEE Trustcom/BigDataSE/ICESS, Sydney, NSW, Australia, 1–4 August 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1064–1069. [Google Scholar] [CrossRef]
  92. Ding, J.; Li, S.; Gu, Z. High-speed ECC processor over NIST prime fields applied with Toom–Cook multiplication. IEEE Trans. Circuits Syst. I Regul. Pap. 2018, 66, 1003–1016. [Google Scholar] [CrossRef]
  93. Park, D.W.; Hong, S.; Chang, N.S.; Cho, S.M. Efficient implementation of modular multiplication over 192-bit NIST prime for 8-bit AVR-based sensor node. J. Supercomput. 2021, 77, 4852–4870. [Google Scholar] [CrossRef]
  94. Hu, X.; Li, X.; Zheng, X.; Liu, Y.; Xiong, X. A high speed processor for elliptic curve cryptography over NIST prime field. IET Circuits Devices Syst. 2022, 16, 350–359. [Google Scholar] [CrossRef]
  95. Hu, X.; Zheng, X.; Zhang, S.; Li, W.; Cai, S.; Xiong, X. A high-performance elliptic curve cryptographic processor of SM2 over GF (p). Electronics 2019, 8, 431. [Google Scholar] [CrossRef]
  96. Liu, Z.; Zhang, Q.; Huang, H.; Yang, X.; Chen, G.; Zhao, S.; Yu, B. Design of high area efficiency elliptic curve scalar multiplier based on fast modulo reduction of bit reorganization. J. Electron. Inf. Technol. 2024, 46, 344–352. [Google Scholar] [CrossRef]
  97. Zhang, C.; Liu, D.; Liu, X.; Zou, X.; Niu, G.; Liu, B.; Jiang, Q. Towards efficient hardware implementation of NTT for Kyber on FPGAs. In Proceedings of the 2021 IEEE International Symposium on Circuits and Systems (ISCAS), Daegu, Republic of Korea, 22–28 May 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–5. [Google Scholar] [CrossRef]
  98. Nguyen, H.; Tran, L. Design of polynomial NTT and INTT accelerator for post-quantum cryptography CRYSTALS-Kyber. Arab. J. Sci. Eng. 2023, 48, 1527–1536. [Google Scholar] [CrossRef]
  99. Hu, X.; Tian Li, M.; Wang, Z. AC-PM: An area-efficient and configurable polynomial multiplier for lattice based cryptography. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 70, 719–732. [Google Scholar] [CrossRef]
  100. Plantard, T. Efficient word size modular arithmetic. IEEE Trans. Emerg. Top. Comput. 2021, 9, 1506–1518. [Google Scholar] [CrossRef]
  101. Huang, J.; Zhang, J.; Zhao, H.; Liu, Z.; Cheung, R.C.C.; Koç, Ç.K.; Chen, D. Improved Plantard arithmetic for lattice-based cryptography. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022, 2022, 614–636. [Google Scholar] [CrossRef]
  102. Huang, J.; Zhao, H.; Zhang, J.; Dai, W.; Zhou, L.; Cheung, R.C.C.; Koç, Ç.K.; Chen, D. Yet another improvement of Plantard arithmetic for faster Kyber on low-end 32-bit IoT devices. IEEE Trans. Inf. Forensics Secur. 2024, 19, 3800–3813. [Google Scholar] [CrossRef]
Figure 1. Classification of modular multiplication algorithms over prime fields for public-key cryptosystems.
Table 1. Advantages and disadvantages of multiplication algorithms.

| Category | Multiplication | Advantages | Disadvantages |
|---|---|---|---|
| Basic algorithms | Schoolbook multiplication | The hardware implementation is straightforward. | Computational complexity is high, and parallel computation is difficult. |
| | Comba multiplication | The number of carry propagations and memory accesses is reduced. | |
| Algorithms based on divide and conquer | Karatsuba multiplication | Binary recursion is used to reduce computational complexity. | Additional recursion and addition operations are introduced. |
| | Toom–Cook multiplication | Dynamic divide-and-conquer recursion is used to further reduce computational complexity. | The hardware implementation of division operations is relatively challenging. |
| | NTT multiplication | The lowest computational complexity is achieved due to the point-value representation. | The length of the input operand and the modulus are limited. |
| Partial product optimization techniques | Booth encoding | The number of partial products is decreased. | The operational efficiency is reduced for multipliers containing non-contiguous 1s or 0s. |
| | RSD representation | Carry propagation in accumulation operations is decreased. | Redundancy is easily introduced. |
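
The Karatsuba row of Table 1 can be made concrete with a small sketch. The Python function below is only an illustration of the binary split (the name `karatsuba` and the parameter `threshold_bits` are hypothetical, not taken from any cited design): three half-size multiplications replace four, at the cost of the extra additions and subtractions that the table lists as a disadvantage.

```python
def karatsuba(x: int, y: int, threshold_bits: int = 64) -> int:
    """Multiply two non-negative integers using Karatsuba's binary split."""
    # Base case: small operands fall back to the built-in multiplication.
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        return x * y

    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask

    low = karatsuba(x_lo, y_lo, threshold_bits)
    high = karatsuba(x_hi, y_hi, threshold_bits)
    # One multiplication of the sums replaces the two cross products;
    # the price is the additional additions and subtractions.
    mid = karatsuba(x_lo + x_hi, y_lo + y_hi, threshold_bits) - low - high

    return (high << (2 * half)) + (mid << half) + low


a = 3**400 + 12345
b = 7**300 + 678
assert karatsuba(a, b) == a * b
```

The threshold is a software convenience; a hardware design would instead fix the recursion depth to match the available multiplier width.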
Table 2. Advantages and disadvantages of modular reduction algorithms.

| Modular Reduction | Advantages | Disadvantages |
|---|---|---|
| Barrett reduction | Convert division operations into multiplication operations by utilizing quotient estimation techniques. One modular reduction operation requires two multiplications, one shift, and one subtraction. | Multiple result corrections need to be performed. |
| Montgomery reduction | Convert modular operations into shift operations by constructing residue systems. One modular reduction operation requires two multiplications, one shift, one addition, and two modular and subtraction operations. | The domain transformation operation is required. |
| Fast modular reduction for Mersenne primes | Convert modular operations into addition and subtraction operations after data reorganization. | Only applicable to specific types of moduli. |
| Modular reduction in NTT multiplication | Convert modular operations into addition, subtraction, and shift operations. | |
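
The first three rows of Table 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any cited work: the function names are hypothetical, the modulus is the Mersenne prime 2^521 − 1 (chosen so that all three methods apply to the same input), and the Barrett variant is the simple one with a 2k-bit shift. `pow(p, -1, R)` requires Python 3.8 or later.

```python
def barrett_reduce(x: int, p: int, k: int, mu: int) -> int:
    """Reduce 0 <= x < 2**(2*k) modulo p, with mu = floor(2**(2*k) / p)."""
    q = (x * mu) >> (2 * k)   # multiplication and shift: quotient estimate
    r = x - q * p             # multiplication and subtraction
    while r >= p:             # result correction for the estimation error
        r -= p
    return r


def montgomery_reduce(t: int, p: int, k: int, p_neg_inv: int) -> int:
    """Return t * 2**(-k) mod p for 0 <= t < p * 2**k, with p odd."""
    mask = (1 << k) - 1
    m = ((t & mask) * p_neg_inv) & mask   # "mod R" is a bit mask since R = 2**k
    u = (t + m * p) >> k                  # exact division by R is a shift
    return u - p if u >= p else u


def mersenne_reduce(x: int, k: int) -> int:
    """Reduce x >= 0 modulo 2**k - 1 by folding high bits onto low bits."""
    p = (1 << k) - 1
    while x > p:
        x = (x & p) + (x >> k)   # uses 2**k = 1 (mod p)
    return 0 if x == p else x


k = 521
p = (1 << k) - 1                                # Mersenne prime 2**521 - 1
mu = (1 << (2 * k)) // p                        # Barrett constant
p_neg_inv = (-pow(p, -1, 1 << k)) % (1 << k)    # -p^{-1} mod R

a, b = 3**300 % p, 7**180 % p
expected = (a * b) % p

assert barrett_reduce(a * b, p, k, mu) == expected
assert mersenne_reduce(a * b, k) == expected

# Montgomery needs the domain transformation a -> a*R mod p and back.
a_m, b_m = (a << k) % p, (b << k) % p
product_m = montgomery_reduce(a_m * b_m, p, k, p_neg_inv)
assert montgomery_reduce(product_m, p, k, p_neg_inv) == expected
```

The transformation into and out of the Montgomery domain in the last lines is the overhead that Table 2 lists as the disadvantage of Montgomery reduction; it is amortized when many multiplications are chained, as in exponentiation.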
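Table 3. Performance comparison of modular multipliers implemented on FPGAs.

| Reference | Year | Algorithm | Platform | Bit-Width | Clock Cycles | Frequency (MHz) | Area (SLICEs) | Area (LUTs) | Area (DSPs) | Time (µs) | AT (SLICEs × ms) | AT (LUTs × ms) | Throughput (Mbps) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [47] | 2017 | Radix-2 IMM | Kintex-7 | 224 | 225 | 130.49 | 365 | - | - | 1.71 | 0.63 | - | 130.99 |
| | | | | 256 | 257 | 135.89 | 397 | - | - | 1.88 | 0.76 | - | 136.17 |
| [48] | 2020 | Radix-2 IMM | Virtex-7 | 192 | 193 | 207.1 | 386 | 1151 | - | 0.93 | 0.36 | 1.07 | 206.0 |
| | | | | 224 | 225 | 190.7 | 490 | 1409 | - | 1.18 | 0.58 | 1.66 | 189.9 |
| | | | | 256 | 257 | 177.3 | 514 | 1491 | - | 1.45 | 0.75 | 2.16 | 176.6 |
| | | | | 384 | 385 | 137.6 | 820 | 2355 | - | 2.80 | 2.30 | 6.59 | 137.2 |
| | | | | 521 | 522 | 111.2 | 975 | 2496 | - | 4.69 | 4.57 | 11.71 | 111.0 |
| [53] | 2024 | Radix-4 Booth-IMM | Virtex-7 | 192 | 97 | 238 | 586 | 1367 | - | 0.41 | 0.24 | 0.56 | 471 |
| | | | | 256 | 129 | 210 | 741 | 2292 | - | 0.62 | 0.46 | 1.42 | 417 |
| | | | | 384 | 193 | 168 | 1071 | 2527 | - | 1.15 | 1.23 | 2.91 | 334 |
| | | | | 521 | 261 | 142 | 1268 | 2986 | - | 1.84 | 2.33 | 5.50 | 284 |
| [59] | 2024 | Radix-2^8 BMM | Virtex-7 | 256 | 41 | - | - | 6459 | - | 0.13 | - | 0.84 | 80,757 |
| | | | | 1024 | 137 | - | - | 25,034 | - | 0.50 | - | 12.52 | 279,019 |
| [61] | 2024 | Karatsuba-BMM | Virtex-7 | 32 | 7 | - | - | 3862 | - | 0.023 | - | 0.09 | 9726 |
| | | | | 64 | 8 | - | - | 10,032 | - | 0.027 | - | 0.28 | 18,659 |
| [71] | 2022 | Radix-2 MMM | Virtex-7 | 192 | - | 310.0 | - | 717 | - | 0.62 | - | 0.45 | 308.47 |
| | | | | 224 | - | 284.17 | - | 835 | - | 0.79 | - | 0.66 | 282.9 |
| | | | | 256 | - | 271.29 | - | 955 | - | 0.94 | - | 0.90 | 270.24 |
| | | | | 384 | - | 253.35 | - | 1499 | - | 1.51 | - | 2.26 | 252.69 |
| | | | | 521 | - | 238.6 | - | 2414 | - | 2.18 | - | 5.26 | 238.14 |
| [78] | 2022 | Radix-2^52 MMM | UltraScale-XCKU115 | 256 | 47 | 285.7 | - | 1223 | 39 | 0.17 | - | 0.20 | 24,899.7 |
| | | | | 512 | 168 | 285.7 | - | 2348 | 71 | 0.59 | - | 1.38 | 13,931.9 |
| | | | | 1024 | 345 | 285.7 | - | 4209 | 131 | 1.21 | - | 5.08 | 13,562.9 |
| [81] | 2022 | Radix-2^5 MMM | Virtex-7 | 256 | - | 345 | - | 2900 | - | 0.32 | - | 0.93 | 802 |
| | | | | 512 | - | 299 | - | 3700 | - | 1.04 | - | 3.84 | 494 |
| | | Radix-2^6 MMM | | 256 | - | 290 | - | 5500 | - | 0.21 | - | 1.18 | 1196 |
| | | | | 512 | - | 290 | - | 9500 | - | 0.45 | - | 4.26 | 1143 |
| [89] | 2025 | PPMMM | Virtex-7 | 256 | 28 | 228 | - | 2800 | 32 | 0.12 | - | 0.34 | 2098.36 |
| | | | | 384 | 28 | 187 | - | 5164 | 48 | 0.15 | - | 0.77 | 2565.13 |
| | | | | 512 | 28 | 147 | - | 5302 | 128 | 0.19 | - | 1.01 | 2688.09 |

IMM: interleaved modular multiplication; BMM: Barrett modular multiplication; MMM: Montgomery modular multiplication; PP: parallel and pipelined.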
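
The derived columns of Table 3 appear to follow from the measured ones. The sketch below assumes the relations time = clock cycles / frequency, AT = area × time, and throughput = bit-width / time; these relations are an inference from the table headers, not a statement from the cited papers, and the helper name `fpga_metrics` is hypothetical. It is checked against the 192-bit entry of [53] (97 cycles, 238 MHz, 586 SLICEs, 1367 LUTs).

```python
def fpga_metrics(bit_width: int, clock_cycles: int, freq_mhz: float, area: int):
    """Return (time in µs, area-time product in area × ms, throughput in Mbps)."""
    time_us = clock_cycles / freq_mhz        # cycles / MHz gives microseconds
    area_time = area * time_us / 1000.0      # time converted from µs to ms
    throughput_mbps = bit_width / time_us    # bits per microsecond equals Mbps
    return time_us, area_time, throughput_mbps


# 192-bit Radix-4 Booth-IMM entry of reference [53]
time_us, at_slices, throughput = fpga_metrics(192, 97, 238.0, 586)
_, at_luts, _ = fpga_metrics(192, 97, 238.0, 1367)

assert round(time_us, 2) == 0.41
assert round(at_slices, 2) == 0.24
assert round(at_luts, 2) == 0.56
assert round(throughput) == 471
```

Some entries, such as the throughput values reported for [59], [61], and [78], do not match bit-width / time, so those works presumably use a different throughput definition (for example, one that accounts for pipelining).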
Table 4. Performance comparison of modular multipliers implemented on ASICs.

| Reference | Year | Algorithm | Technology | Bit-Width | Clock Cycles | Frequency (MHz) | Area | Time (ns) | AT |
|---|---|---|---|---|---|---|---|---|---|
| [47] | 2017 | Radix-2 IMM | 65 nm | 224 | 225 | 549.45 | 12.8 KGates | 409 | 5.23 KGates × µs |
| | | | | 256 | 257 | 549.45 | 13.3 KGates | 468 | 6.21 KGates × µs |
| [55] | 2023 | Toom–Cook-BMM | 40 nm | 256 | - | 613 | 64,877 µm² | 57.05 | 3701 µm² × µs |
| | | | 65 nm | 256 | - | 581 | 113,565 µm² | 60.2 | 6836 µm² × µs |
| | | | 40 nm | 1024 | - | 555 | 397,457 µm² | 63 | 25,040 µm² × µs |
| [83] | 2021 | RBR-MMM | 65 nm | 256 | 130 | 1000 | 21.5 KGates | 130 | 2.80 KGates × µs |
| | | | | 1024 | 258 | 685 | 121.8 KGates | 377 | 45.92 KGates × µs |
| | | | | 2048 | 514 | 654 | 224.8 KGates | 786 | 176.70 KGates × µs |
| | | | | 8192 | 4098 | 870 | 568.7 KGates | 4713 | 2680.29 KGates × µs |
| [84] | 2022 | RBR-MMM | 65 nm | 1024 | 114 | 617 | 222.4 KGates | 185 | 41.09 KGates × µs |
| | | | | 2048 | 114 | 485 | 707.1 KGates | 235 | 166.22 KGates × µs |
| | | | | 8192 | 788 | 428 | 1493.8 KGates | 1842 | 2751.53 KGates × µs |
| [85] | 2022 | Karatsuba-MMM | 65 nm | 256 | 17 | - | 145,281 µm² | 37.4 | 5430 µm² × µs |
Table 5. Comparative analysis of modular multiplication.

| Modular Multiplication | Analysis |
|---|---|
| Blakley-based interleaved modular multiplication | Modular multiplication is transformed into a series of addition operations, but the low parallelism and large-bit-width addition operations result in a high path delay. This method is suitable for low-power scenarios. |
| Barrett modular multiplication | Modular multiplication is transformed into multiplication and shift operations, but estimation errors may result in multiple reduction operations to correct the results. This method is suitable for small-bit-width scenarios. |
| Montgomery modular multiplication | By utilizing residue systems, division can be converted into shift operations, but more resources are required to control the domain transformation. This method is suitable for large-bit-width scenarios. |
| Fast modular multiplication for specific prime numbers | Adopting a specially designed fast modular reduction algorithm results in high computational efficiency, but the hardware structure lacks generality. This method is suitable for specific modulus scenarios. |
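
The first row of Table 5 describes interleaved modular multiplication as a series of additions. A minimal radix-2 sketch in Python, assuming operands already reduced modulo p, is given below; the function name `interleaved_modmul` is hypothetical, and the modulus 2^255 − 19 is used only as a convenient test value.

```python
def interleaved_modmul(a: int, b: int, p: int) -> int:
    """Compute a * b mod p bit-serially, for 0 <= a, b < p."""
    assert 0 <= a < p and 0 <= b < p
    acc = 0
    # Scan the multiplier a from the most significant bit down.
    for i in reversed(range(a.bit_length())):
        acc <<= 1                 # doubling: acc < p, so 2*acc < 2p
        if (a >> i) & 1:
            acc += b              # conditional addition: acc < 3p
        # Reduction is interleaved with accumulation: at most two subtractions
        # bring acc back below p, so intermediate values never exceed 3p.
        if acc >= p:
            acc -= p
        if acc >= p:
            acc -= p
    return acc


p = 2**255 - 19
a = 3**150 % p
b = 5**100 % p
assert interleaved_modmul(a, b, p) == (a * b) % p
```

Each loop iteration contains a full-width addition and comparison, which corresponds to the long critical path that Table 5 attributes to this method; the radix-4 and Booth-encoded variants in Table 3 halve the number of iterations by processing two multiplier bits at a time.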
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Huang, H.; Zheng, J.; Chen, Z.; Zhao, S.; Wu, H.; Yu, B.; Liu, Z. Review of Modular Multiplication Algorithms over Prime Fields for Public-Key Cryptosystems. Cryptography 2025, 9, 46. https://doi.org/10.3390/cryptography9020046
