A Scalable Montgomery Modular Multiplication Architecture with Low Area-Time Product Based on Redundant Binary Representation

Abstract: Montgomery modular multiplication is an integral operation in public key cryptographic algorithms. Previous work achieved good performance at low input widths by combining Redundant Binary Representation (RBR) with Montgomery modular multiplication, but it is difficult to strike a good balance between area and time as the input bit width increases. To solve this problem, we propose a flexible, pipelined hardware implementation of Montgomery modular multiplication based on the redundant Montgomery modular multiplication. The proposed structure guarantees a single-cycle delay between two pipeline stages and shortens the critical path by redistributing the data paths between the pipeline stages and preprocessing the input in the loop. Our analysis of the structure and the comparison with related work show that it achieves a lower area-time product with a small and controllable area consumption. Synthesis results under different Taiwan Semiconductor Manufacturing Company (TSMC) processes demonstrate the advantages of our structure in terms of flexibility and area-time product.


Introduction
Many public key cryptosystems, such as Rivest-Shamir-Adleman (RSA), elliptic curve cryptography (ECC), and SM9, require a large number of modular multiplications. Conventional modular multiplication is expensive because it requires division, which costs considerable time and area in hardware. Moreover, RSA operands can reach 8192 bits, for which traditional modular multiplication methods offer no suitable solution. To solve this problem, Montgomery Modular Multiplication (MMM) [1] was proposed; it is an efficient method for large-integer modular multiplication that replaces the modular reduction with simple shift operations, which suits hardware systems.
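The shift-based reduction mentioned above can be illustrated with a minimal Python sketch of the textbook (non-redundant, single-step) Montgomery multiplication. It is not the architecture proposed in this paper; the function names and the parameter `r_bits` (so that R = 2^r_bits exceeds the modulus) are chosen here for illustration only.

```python
def montgomery_setup(modulus: int, r_bits: int) -> int:
    """Return M' = -modulus^{-1} mod 2^r_bits (the modulus must be odd)."""
    r = 1 << r_bits
    return (-pow(modulus, -1, r)) % r


def montgomery_multiply(a: int, b: int, modulus: int, r_bits: int, m_prime: int) -> int:
    """Return a * b * R^{-1} mod modulus with R = 2^r_bits.

    Assumes a, b < modulus < R. Only multiplications, masks and shifts
    are used; no division by the modulus takes place.
    """
    mask = (1 << r_bits) - 1
    t = a * b
    # Quotient digit: chosen so that t + q * modulus is divisible by R.
    q = ((t & mask) * m_prime) & mask
    # The division by R is exact, so it reduces to a right shift.
    u = (t + q * modulus) >> r_bits
    # u < 2 * modulus, so one conditional subtraction completes the reduction.
    return u - modulus if u >= modulus else u
```

To multiply ordinary residues, the operands are first mapped into the Montgomery domain (multiplied by R mod M); a final multiplication by 1 maps the result back. Cryptosystems that chain many multiplications pay this conversion only once.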
Several efforts have been made to optimize the hardware implementation of MMM. Some works focus on different multiplication algorithms, such as the Karatsuba multiplier [2] or the Toom-Cook multiplier [3], which lead to high-performance word-based MMM designs. Fast Fourier Transform (FFT) [4], Residue Number System (RNS) [5], and Non-least Positive Form [2] have also been introduced for acceleration. Meanwhile, other works trade off performance against area, because a large input width causes an extremely high hardware cost. Digit-serial implementations have been proposed using high-radix MMM, such as radix-4 architectures using lookup tables [6] or Booth encoding [7], a radix-8 architecture using Booth encoding [8], and improved schemes based on the carry-save adder (CSA) for the radix-2 system [9,10]. Some optimizations have also been made to the CSA architecture [11][12][13]. However, many low-area implementations suffer from long carry chains, which strongly degrade the performance of MMM. Although several high-performance adders, such as the Kogge-Stone adder [14], carry-skip adder [15], carry-select adder [16], and Reverse Carry Propagate adder [17], attempt to solve this problem, they either demand more area or cannot meet all radix requirements. Some optimizations [18,19] achieve a one-cycle-latency pipelined word-based MMM by rearranging the basic pipeline algorithm in [20], but they do not change the iteration of the basic algorithm and spend much effort on scheduling. To address this, Ref. [21] proposed a Redundant Binary Representation MMM (RBR-MMM) algorithm, which solves the carry-chain problem by bringing the operations of MMM into the redundant system; this eliminates the long carry chain and improves the performance of MMM.
Although the RBR-MMM method balances performance and cost when the input width is 256 bits or 1024 bits, its area cost becomes enormous for large input widths because of its parallel design, which requires a large number of multipliers. The critical path delay of RBR-MMM depends mainly on the parameter k. When k is small, the design can reach a high frequency; however, to reach low latency in advanced technologies, a larger optimized multiplier is usually chosen, and the Area-Time Product (ATP, in kGates × µs) of RBR-MMM then becomes unacceptable. Moreover, its high memory bandwidth requirement becomes critical at large input bit widths.
To solve these problems of RBR-MMM, in this paper we first propose a hardware-friendly pipelined RBR-MMM (PRBR-MMM) to lessen the high area cost when the input width is large. Then, we optimize the RBR-MMM to reach a one-cycle delay between two pipeline stages. Finally, we split the critical path of each stage into two parts and precalculate the first part using our pipeline buffer, which shortens the critical path to almost 2/3 of that of the basic RBR-MMM and compensates for a large part of the delay introduced by pipelining. As a result, our pipelined, precalculated RBR-MMM design (PPCRBR-MMM) has the following features:

• Low and customizable area cost: the area cost is mainly determined by the number of pipeline stages. When a memory outside our module is used, the area cost is lowered further by storing the temporary results externally.

• Small critical path: the critical path is shortened by splitting the calculation into precalculate-quotient logic and multiplication-shift logic.

• Low memory bandwidth requirement: compared with RBR-MMM, which needs the whole operands simultaneously, our design reads only two words per cycle and writes only one word out. The word size is determined by the multiplier used in our design.

• Low and customizable latency: we adapt the algorithm to the pipeline to avoid the 2-cycle delay between stages. The latency is determined by the number of stages and the input width.

Algorithm Fundamentals
A redundant binary representation is a numeral system that uses more bits than needed to represent a single binary digit. As a result, an RBR allows addition without a typical carry, which avoids the long carry chain in the MMM system [16]. Moreover, unlike another redundant system, RNS [5], RBR is easy to convert from the normal representation, which makes the translation of MMM operands simple. A typical $2^{2k}$ RBR number $X$ with components $x_i$ can be expressed as (1).
When converting an ordinary operand into the above-mentioned system, we only need to divide the data into 2k bits and let the Most Significant Bit (MSB) be zero. The RBR-MMM applies this representation to the MMM system and avoids any transformation between redundant and nonredundant numbers during the iteration. As many public-key systems call MMM several times, only one additional cycle of transformation is needed to get the final result. The RBR-MMM algorithm is described in Algorithm 1. By representing the operands of the calculation in redundant form, RBR-MMM stores the carries produced during the iteration directly in the intermediate result, without additional conversion. At the same time, because o itself is also a redundant binary representation, no extra carry is generated when o is calculated in each iteration. The carry propagation problem in modular multiplication is thus solved by absorbing the carry into a redundant number. The proof of Algorithm 1 is provided in [21]; in this paper we only discuss its critical path and area consumption.
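The carry-free property described above can be illustrated with a small Python sketch. The digit layout is an assumption made for illustration: digit i carries weight 2^(k·i) and initially holds k significant bits, so a wider word has headroom to absorb sums without passing carries to its neighbour. The exact layout of (1) in the paper may differ, and the function names are hypothetical.

```python
def to_rbr(value: int, k: int, n: int) -> list:
    """Split a non-negative integer into n digits of weight 2^(k*i).

    Each digit initially holds only k significant bits; the upper bits of
    its (wider) storage word are zero and serve as carry headroom.
    """
    assert value >> (k * n) == 0, "n digits are too few for this value"
    mask = (1 << k) - 1
    return [(value >> (k * i)) & mask for i in range(n)]


def rbr_add(a: list, b: list) -> list:
    """Digit-wise addition: no carry travels from one digit to the next."""
    return [x + y for x, y in zip(a, b)]


def from_rbr(digits: list, k: int) -> int:
    """Resolve the redundancy once: the weighted sum propagates all carries."""
    return sum(d << (k * i) for i, d in enumerate(digits))
```

In this sketch, every digit of a sum may exceed k bits, yet the represented value is unchanged; carry propagation is deferred to the single final conversion.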
Algorithm 1. RBR-MMM [21] (recoverable steps of the listing)
x = x_0 mod 2^k;
//computing quotient logic
6: for i = 0 to n − 1 do
8: end for
//parallel computing shift right logic to calculate new_o
10: for i = 0 to n − 1 do
11: end for
//convert y to binary representation

In Algorithm 1, M is precalculated because MMM will be called several times with the same M, so there is no need to calculate M inside the algorithm; n is an integer determined by M. Usually, we take $n = \left[\frac{m+2}{2k}\right] + 1$, where m is the bit width of M. Thus, the number of cycles needed by hardware to perform a single RBR-MMM with the precalculated M can be expressed as (2), which is determined by the value of k.
Meanwhile, we can obtain the critical path of RBR-MMM by reviewing the main operations in Algorithm 1; take the computing quotient logic as an example. From Figure 1, we can see that the maximum delay is created by two k-bit multipliers and a half adder; thus its path delay can be roughly expressed in terms of $T_{FA}$, the delay of a full adder, and the full critical path follows as (3). Finally, consider the area consumption of RBR-MMM: the computing quotient logic uses about two k-bit multipliers and one k-bit adder, because the quotient is computed once per j-loop iteration. However, for the parallel computing of s and the shift-right part, n k-bit × 2k-bit multipliers, k-bit × (2k + 1)-bit multipliers, (3k + 1)-bit full adders, and 2k-bit adders are required.
The above analysis shows that the critical path length of RBR-MMM depends only on the value of k. This means that for a low input bit width, the parallelized RBR-MMM algorithm can achieve a better frequency and a lower area through a lower k. For a high input bit width, the algorithm can also meet higher frequency requirements through a lower k value. However, when the input bit width is high, which generally occurs for RSA in high-speed encryption systems, the area cost of RBR-MMM cannot be decreased because of its parallel implementation. If high speed is needed, the required number of 2k-bit multipliers may not be available to this unit. Another problem is that RBR-MMM needs additional cycles to read its input and write its output back to memory. An area-scalable structure that prevents high cost and keeps a good ATP is therefore desirable, which leads us to combine the typical pipeline design with RBR-MMM. Our PPCRBR-MMM system is discussed in the next section.

Pipeline Precalculate Redundant Binary Representation Montgomery Modular Multiplication
According to the analysis in the previous section, a typical pipelined MMM [20] is needed to get a better ATP. However, when we go through the parallel computing in RBR-MMM, we find that it is hard to implement a fully pipelined RBR-MMM because of the strong dependency between the j = n and j = n + 1 loops: a new $o_i$ needs both $o_{i+1}$ and $o_i$, which costs two cycles in a typical pipeline design. We therefore rearrange the algorithm to fit a pipeline design with a one-cycle delay between two stages. For the computing quotient logic, we split it from the parallel computing, as q does not change within the same j loop, and spending one more cycle leads to a shorter critical path. This split has another advantage: when a buffer is needed to temporarily store the result of the last stage, we can call the pre-q logic to perform the computing quotient logic without an additional cycle; this will be discussed later. Our PPCRBR-MMM is described in Algorithm 2.
Algorithm 2. PPCRBR-MMM (recoverable steps of the listing)
x = x_0 mod 2^k;
//computing quotient logic
6: pre_q_j = (((ys_j · x) mod 2^k) · M) mod 2^k;
//inner-loop: for each pipeline, n cycles are needed
7: for i = 0 to n − 1 do
8: end for
20: return O

A typical pipeline design often has a problem with carries. In RBR-MMM, it occurs between two adjacent j-loops. Let $o^j_i$ denote the output of the j-th pipeline stage, together with the corresponding temporary output of that stage; we then obtain (4).
In this equation, every $o^j_i$ needs the result $o^{j-1}_i$ of the previous stage, which itself requires both the $i$-th and the $(i+1)$-th temporary outputs of stage $j-1$ to be resolved, so there is a two-cycle delay between pipeline stages. We therefore rearrange the algorithm by moving the contribution of $o^{j-1}_{i+1}$ to the next cycle. We first ignore the $o^{j-1}_{i+1}$ term when calculating $o^{j-1}_i$ and denote the resulting output by new(output), which gives (5). Conveniently, the calculation of q stays the same, because the omitted addition only affects the upper k + 1 bits of $o^{j-1}_i$; we then only need to account for the missing $o^{j-1}_{i+1}$ term in $o^j_i$. From (4) and (5), the temporary result can be calculated as (6) and (7).
To avoid dependencies on the upper-level unit, we transfer the addition of new($o^j_{i+1}$)[k − 1 : 0] to the calculation of the next-level unit. Finally, the output of the next stage is computed as (8). From (8), only the result of the previous stage from one cycle earlier is needed to get a new output, with one additional addition of the outputs of the two preceding stages; thus, we can build a pipelined RBR-MMM design with a single-cycle delay. Meanwhile, to prevent a long critical path when k is large, we precalculate the quotient, or part of the quotient, in the first cycle of each stage, which costs only one more cycle in a pipeline design. This yields our Algorithm 2: PPCRBR-MMM.
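Why part of the quotient can be computed ahead of time can be seen in a minimal Python sketch of the standard non-redundant, radix-2^k word-serial Montgomery multiplication. This is an illustration under stated assumptions, not the authors' Algorithm 2: it uses ordinary integers instead of RBR digits, assumes the constant in the quotient formula is the usual Montgomery constant −M⁻¹ mod 2^k (named `m_prime` here), and the function name is hypothetical. The quotient is split into `pre_q`, which depends only on the inputs, and a correction term that depends on the running accumulator.

```python
def mmm_word_serial(x: int, y: int, modulus: int, k: int, n: int) -> int:
    """Return x * y * 2^(-k*n) mod modulus, scanning y in k-bit words.

    Requires an odd modulus and x, y < modulus < 2^(k*n).
    """
    mask = (1 << k) - 1
    m_prime = (-pow(modulus, -1, 1 << k)) % (1 << k)  # -modulus^{-1} mod 2^k
    x0 = x & mask
    acc = 0
    for j in range(n):
        y_j = (y >> (k * j)) & mask
        # Part 1: depends only on the inputs, so it can be prepared early.
        pre_q = (((y_j * x0) & mask) * m_prime) & mask
        # Part 2: correction from the low word of the running accumulator.
        q = (pre_q + (acc & mask) * m_prime) & mask
        # Sanity check: the split quotient equals the one-piece quotient.
        assert q == (((acc + y_j * x) & mask) * m_prime) & mask
        # The low k bits of the sum are zero, so the shift is exact.
        acc = (acc + y_j * x + q * modulus) >> k
    # acc stays below 2 * modulus, so one conditional subtraction suffices.
    return acc - modulus if acc >= modulus else acc
```

Because `pre_q` does not depend on the accumulator, it can be taken off the path that the accumulator update must traverse, which is the kind of path splitting the text describes.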
Based on Algorithm 2, we propose an RBR-MMM-based pipeline architecture that achieves a better ATP than RBR-MMM. The hardware implementation of our algorithm consists of several processing elements (PEs), pre-calculate quotient units, data registers, and a small amount of control logic. To clarify the structure, a dependency graph is given in Figure 2: each Q job calculates the pre_q value of Algorithm 2, and the inner loop is performed by the I jobs. By rearranging the data chain of the RBR-MMM algorithm, each PE can produce its result from the outputs of the previous cycle alone; in total, three inputs are needed, from the left PE, the PE to the left of the left PE, and the PE itself. In the first cycle of each PE, the quotient is calculated, which costs only one more cycle of delay than before.
Figure 3 shows a block diagram of our pipelined RBR-MMM. The kernel part consists of p PEs parameterized by k. Each PE contains the logic of (8) and some registers to hold the input values and the temporary result. For common and low-area usage, p will often be lower than n, which means that some PEs perform more than one loop calculation. In that case, PE p produces its output before PE 1 has finished its n-cycle computation, and the output must be queued in a buffer until PE 1 becomes available again. The buffer depth is determined by n − p, and its width is the sum of the width of $m_i$ and the width of $x_i$ (2k + 1 bits). This buffer is an inevitable but acceptable cost compared with the basic RBR-MMM design, and the register cost can be moved to an external memory if needed. Furthermore, the quotient value can also be pre-calculated in the buffer. In that case, the o · M mod 2^k part is not needed in (8) and no additional cycle is required; since this optimization has no effect on the frequency, and to keep the equation simple, we still perform this part in our design. The other parts of the block are control logic, the select logic of the buffer, the final output, and the ys storage.
Figure 3. Block diagram of the pipelined RBR-MMM (block labels: temp data buffer; output or store in buffer).

Analysis of PPCRBR-MMM
In this section, the critical path, area, and cycle count of PPCRBR-MMM are analyzed and compared with the basic RBR-MMM implementation.

Timing Analysis
Typically, we take p smaller than n to avoid a high-cost design; then, from Figure 4, each PE takes n + 1 cycles to compute its first loop result. In total, 2n iterations are needed to obtain the final O in Algorithm 2; thus [2n/p] inner loops are needed for PE0 to PE(2n%p), and the final loop requires 2n%p + n + 2 − 1 cycles. Finally, we get Equation (9).
In comparison with (2), we obtain the cycle_ratio in (10). As mentioned previously, the whole operand cannot be read or written in a single cycle, which inevitably also affects the basic RBR-MMM. Thus, in a hardware design, this ratio will be smaller than the value given by (10).

Critical Path Analysis
To better explain our path delay and area cost, a detailed circuit schematic is given in Figure 5. The partial computing output $O_i$ consists of k × 2k, k × (2k + 1), and k × k multipliers, which are independent, so their additions can be fused using CSA (i.e., 4:2 compressors). To simplify the delay calculation, we model this part of the critical path as the half-adder path of a k × (2k + 1)-bit multiplier with 7:2 compressors and a (3k + 1)-bit Carry Propagate Adder (CPA); the o · M and calculate-o parts can be reduced with 4:2 compressors. The MSB of $u^j_{i-1}$ is calculated simultaneously and has no impact on the critical path. Finally, a 2k-bit carry propagate adder resolves $v^j_i$. Thus our critical path is given by (11). The pre-calculate quotient logic shortens the critical path; the resulting ratio can be expressed as (12).

Area Analysis
For the area analysis, we mainly compare the area cost of a single PE with that of the parallel computation module in the RBR-MMM system. From Figure 5, we can summarize a rough area cost for a single PE: one 2k-bit carry propagate adder for $v^j_i$; one (3k + 2)-bit carry propagate adder, $5k^2 + k$ AND gates, and some compressors for $u^j_i$; and one carry-bit calculation logic for the MSB. The last one can be implemented with k AND gates and a k-bit adder. Our additional logic therefore does not consume much more area for a single PE.
Considering the area cost, the number of PEs will not be large; in that case, the high area cost for large inputs is eliminated. For an intuitive explanation, if we take p = n/2, we obtain the rough ATP ratio shown in (13). Our PPCRBR-MMM design leads to a better ATP, as the area_ratio is nearly 1/2 in that case.
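The reasoning behind this ATP comparison can be written out as a short worked relation. The notation is introduced here for illustration and is not taken from the paper: $A$ is area, $N_{\mathrm{cycles}}$ the cycle count, $T_{\mathrm{cp}}$ the critical path delay, and primed quantities refer to PPCRBR-MMM while unprimed ones refer to RBR-MMM. The ratios are all taken as PPCRBR-MMM over RBR-MMM, which may differ from the orientation used in (10), (12), and (13).

```latex
\begin{align*}
\mathrm{ATP} &= A \cdot N_{\mathrm{cycles}} \cdot T_{\mathrm{cp}},\\
\mathrm{ATP\_ratio}
  &= \frac{A'}{A}\cdot\frac{N'_{\mathrm{cycles}}}{N_{\mathrm{cycles}}}\cdot\frac{T'_{\mathrm{cp}}}{T_{\mathrm{cp}}}
   = \mathrm{area\_ratio}\cdot\mathrm{cycle\_ratio}\cdot\mathrm{path\_ratio},\\
\mathrm{area\_ratio}\approx\tfrac{1}{2}
  &\;\Longrightarrow\;
  \bigl(\mathrm{ATP\_ratio} < 1 \iff \mathrm{cycle\_ratio}\cdot\mathrm{path\_ratio} < 2\bigr).
\end{align*}
```

Under these definitions, halving the area leaves room for up to a twofold increase in total execution time before the ATP advantage is lost, and a shorter critical path relaxes the bound on the cycle count further.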

Experiments of PPCRBR-MMM and RBR-MMM Algorithms
To better demonstrate the advantages of PPCRBR-MMM over the RBR-MMM algorithm, we implemented both algorithms in Verilog and synthesized them with TSMC process libraries to obtain area and delay information under different parameters. According to the critical path and area analysis in the previous section, when the input bit width is large, the proportion of pipeline control logic in the total area decreases, which better reflects the superiority of the PPCRBR-MMM algorithm in ATP and customizability. At the same time, to make the comparison general, we implemented the two designs with different parameters for inputs of 1024 bits, 2048 bits, and 8192 bits. For better performance, we apply k = 16 to both algorithms for the 1024-bit width and k = 32 for the 2048-bit input. For the 8192-bit input, although a higher k value can bring higher performance and a larger improvement from our algorithm, it also brings too high an area requirement, so we still choose k = 32. To show the advantages of customization, we analyzed the PPCRBR-MMM algorithm with 16, 24, 32, 48, and 64 PEs under the 8192-bit input.

Results and Comparisons
Table 1 compares the area and time consumption of RBR-MMM and PPCRBR-MMM under different input parameters. When the input is 1024 bits or 2048 bits, the RBR-MMM algorithm must accept a higher area consumption to obtain better performance, whereas the pipelined algorithm achieves better delay and area with fewer pipeline stages, thereby improving ATP by about 20%. When the input is 8192 bits, our algorithm still has a 10% advantage in ATP. Compared with the lower input bit widths, the area consumed by the buffer that stores the intermediate results becomes larger, so the magnitude of the improvement becomes smaller. This area loss can be reduced by using a memory module external to the algorithm module, as mentioned above. Figure 6 compares the area and time consumption of RBR-MMM and PPCRBR-MMM with different parameters under the 8192-bit input. When the PE number is low, our algorithm provides a lower area and a higher latency. When the PE number becomes high, the total area increases almost linearly, since the PEs still make up the main part of the area. At the same time, because the critical path becomes longer due to high fan-out, the frequency decreases significantly when the number of PEs is large, so the delay cannot be reduced linearly. The final ATP results are shown in Figure 6b. The ATP improvement of the algorithm is greatest when the PE number is increased to 48.
Similarly, to further show the advantages of our algorithm in ATP, we compared our design under different parameters with several other modules. The results are given in Table 2. Ref. [22] introduces the half-carry-save form to reduce the bit width of the operand in the CPA, thereby reducing the delay of the critical path. However, it still has a long carry chain compared with the redundant number system, so it cannot achieve good performance at a lower area. Ref. [23] uses the full-carry-save method to parallelize the results of modular multiplication, but the intermediate results go through many splitting steps, resulting in a less outstanding performance over multiple loops.
Refs. [10,24] achieve a higher frequency by applying a customized CSA design, but their designs require more cycles to complete more iterations, and in the end cannot achieve higher performance or a better ATP value. Ref. [25] reorganizes the operands on the basis of [19] to achieve low memory bandwidth and high frequency while keeping the number of iterations unchanged, but its delay chain under high input bit width contains two-stage multiplication and addition modules, so a balance between frequency and total number of cycles cannot be achieved. Although [11] also uses the full-carry-save method, it uses a CPA to complete the data conversion in the iterative process, which reduces the overall running frequency. Ref. [5] uses an RNS-based modular multiplication system to achieve better frequency and ATP, but, like [21], it does not adopt a pipelined design. In this paper, RBR is introduced to shorten the critical path in the calculating units, and by modifying the algorithm, a full pipeline is achieved during iteration. We then decompose the critical path of the q-calculation logic using the convenience of the pipeline, thus increasing the overall frequency. As a result, lower latency and better ATP results are achieved.
Algorithm 2 (fragment). //in total 2n iterations; each j corresponds to one pipeline stage
3: for j = 0 to 2n − 1 do

Figure 5. Brief schematic of the PE unit in PPCRBR-MMM. (a) Computing temp result logic. (b) Computing MSB bits. (c) Computing output of PE.

Figure 6. ATP result of PPCRBR-MMM for the 8192-bit design. (a) Area and time cost for the two algorithms. (b) ATP comparison of the two algorithms with different PE numbers.