A Low-Cost High-Performance Montgomery Modular Multiplier Based on Pipeline Interleaving for IoT Devices

Modular multiplication is a crucial operation in public-key cryptography systems such as RSA and ECC. In this study, we analyze and improve the iteration steps of the classic Montgomery modular multiplication (MMM) algorithm and propose an interleaved pipeline (IP) structure, which meets the high-performance and low-cost requirements of Internet of Things devices. Compared to the classic pipeline structure, the IP does not require multiplexing processing elements (PEs), which helps shorten the data path of intermediate results. We further introduce a break in the critical path so that one iterative step of the MMM algorithm completes in two clock cycles. Our proposed hardware architecture is implemented on a Xilinx Virtex-7 series FPGA, using DSP48E1 blocks to realize the multiplier. The implementation results show that 1024-bit and 2048-bit modular multiplications require 1.03 µs and 2.13 µs, respectively. Moreover, our area–time–product analysis reveals a favorable outcome compared to state-of-the-art designs for both 1024-bit and 2048-bit moduli.


Introduction

Research Background
With the rapid advancement of communication technology, the Internet of Things (IoT) represents a technological revolution that makes future computing and communications different [1]. IoT devices, ranging from wearable devices and smartphones to wireless sensors, offer a multitude of applications across various fields, including big data, business analytics, and information sharing [2]. However, the diverse nature of IoT devices and the vast amount of sensitive data they handle pose challenges in terms of consumer privacy and secure data transfer [3]. To address these concerns, the adoption of cryptographic solutions is imperative to ensure user authentication and data security. A public-key cryptography system (PCS) plays a fundamental role in information security [4]. There are various PCS-based communication protocols and sensitive applications (e.g., the transport layer security (TLS) protocol [5]), which are widely used in Internet communications. Based on the TLS and HTTPS protocols, a cloud server is able to authenticate IoT devices. Additionally, in the IoT field, blockchain is a popular public-key-cryptography-based technology that protects IoT devices from attacks and synchronizes them [6]. However, due to limited resources, IoT devices require cheap and efficient implementations of PCS. Software implementations can achieve basic PCS functions but suffer from limited memory, battery power, and computing power [7,8]. Hardware implementations can relieve the computational burden and limited-memory problems, since they perform better and do not occupy the computational resources of a central processor. Some existing works [9][10][11] are dedicated to efficient hardware implementations of PCS with low area.
Modern PCSs are represented by Rivest–Shamir–Adleman (RSA), proposed in 1978 [12], and elliptic curve cryptography (ECC), proposed by Miller [13] and Koblitz [14] in 1986. A common drawback of such designs is that these multiplication systems require a significant amount of hardware resources, especially when dealing with large input sizes. Regarding [26,27], their systems achieve efficient multiplication without carry propagation, improving the processing speed of MMM itself. Even so, the area cost of conversion between weighted binary numbers and residue numbers is still high and results in low overall area efficiency. This category of work, with complex multiplication algorithms or number systems, is not suitable for IoT devices where resources are highly restricted.
Even though various works have focused on optimizing the multiplication operations of MMM, there have been fewer inventive modifications to the iteration steps of the MMM algorithm itself. Among the mentioned works, almost all employed the classic pipeline form of MMM introduced in [28] in 1999. Although [19] proposed a different iterative MMM algorithm based on encoding and compression methods, the basic bit-wise scanning steps were not changed. Several other works attempted to modify the classic MMM pipeline to reduce the total clock cycles or increase the maximum clock frequency. Ref. [29] modified the input data paths of each processing element, aiming to enhance the pipeline structure. Ref. [30] relaxed the data dependency by reducing the operands, leading to a new pipeline form. Ref. [9] introduced a separate iterative MMM that needs pre-computation, making the calculation process more efficient. However, refs. [9,29,30] did not modify the core iteration steps and needed extra resources for the carry-save adders. Ref. [31] introduced inventive changes to both the data dependency and the iteration steps of the MMM algorithm. Although this modification aimed to improve the overall performance, it resulted in a considerable increase in the critical path length due to the extended data paths.
Previous research has indeed emphasized adder performance optimizations and the use of higher-performing multiplication algorithms. Adder optimizations have involved compression and encoding methods as well as different addition systems, while multiplication algorithm optimizations have primarily focused on scheduling the classic iteration steps of MMM. In our work, however, we approach optimization differently. We do not consider methods such as encoding, data compression, or data preprocessing, as they do not yield significant benefits compared to algorithm optimization. Instead, our motivation lies in optimizing performance and reducing area costs by modifying the classic iteration steps of the MMM algorithm. We also explore a new pipeline form and a comprehensive schedule specifically tailored to hardware implementations.

Paper Contributions
In this paper, we present a high-performance, low-cost Montgomery modular multiplier based on the proposed interleaved pipeline of MMM. The main contributions of this paper are as follows: (1) We modify the iteration steps of the classic MMM algorithm and propose the interleaved pipeline multiple-word radix-2^k Montgomery multiplication (IP-MWR2kMM) algorithm. This modification allows us to reduce the data path length of intermediate results by eliminating the need to reuse processing elements (PEs). The execution steps of the interleaved pipeline (IP) form are also presented in our work.
(2) To improve the operating frequency, we schedule an iterative step of the IP-MWR2kMM algorithm to execute in two clock cycles. By doing so, the calculation of the coefficient Q[j] in the IP-MWR2kMM algorithm is given one extra clock cycle instead of having to complete within a single cycle. This reduces the critical path delay (CPD) and the overall computation time of MMM.
(3) We provide a comprehensive hardware structure for our proposed algorithm, including the design of each PE and the overall architecture. The implementation utilizes DSP48E1 blocks on the Xilinx Virtex-7 FPGA series. Additionally, we performed a detailed analysis of the performance and area costs, demonstrating that our approach achieves superior performance at a lower area cost.
The remainder of this paper is organized as follows. Section 2 presents the preliminaries of radix-2 Montgomery multiplication, high-radix Montgomery multiplication, and the pipeline form of high-radix MMM. Section 3 introduces the proposed IP-MWR2kMM algorithm and its corresponding pipeline form. It also presents the hardware architecture of each PE and the overall system. In Section 4, we analyze the performance and area cost and provide a comparison with state-of-the-art implementations. Finally, we conclude this paper in Section 5.

Preliminaries
This section provides an overview of the classic radix-2 Montgomery multiplication (R2MM) algorithm and the classic multiple-word radix-2^k Montgomery multiplication (MWR2kMM) algorithm, covering their backgrounds and basic notations. It also analyzes the data dependency and limitations of the classic pipeline of the MWR2kMM algorithm when implemented on hardware.

Radix-2 Montgomery Modular Multiplication Algorithm
The MMM algorithm has two forms based on the radix in which the multiplicand is scanned: the R2MM form and the MWR2kMM form. The R2MM form only requires add operations, as each scanned bit of the multiplicand merely determines whether the multiplier is added to the result. This form is widely used in resource-limited systems due to its simplicity. The MWR2kMM form performs actual multiplication, because both the multiplicand and multiplier are scanned over multiple bits. At the cost of requiring multipliers, the MWR2kMM form can reach higher performance than the R2MM form but uses more resources. Using DSP blocks on FPGAs is a convenient way to meet the need for multipliers. Algorithm 1 presents the detailed pseudo-code of the classic R2MM.
In Algorithm 1, S is the n-bit result of R2MM, and Q is a variable that determines whether M needs to be added to S in step 4. Steps 3 and 4 are always performed within a loop and can be implemented on hardware in a pipelined style. Nevertheless, S needs to be compared to M and reduced by subtraction if it is greater than M in step 7 (the last reduction). Thus, the pipeline must stall if we want to use the result S to perform R2MM continuously.
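The bit-serial loop described above can be sketched in software as follows. This is a plain-Python illustration of steps 3, 4, and 7, not the paper's hardware description; the function name `r2mm` is ours.

```python
def r2mm(x, y, m, n):
    """Radix-2 Montgomery multiplication: x * y * 2^(-n) mod m.

    m must be odd; the multiplicand x is scanned bit by bit over n iterations.
    """
    s = 0
    for i in range(n):
        xi = (x >> i) & 1          # scan one bit of the multiplicand X
        s = s + xi * y             # step 3: conditional add of the multiplier Y
        q = s & 1                  # Q: must M be added to keep S even?
        s = (s + q * m) >> 1       # step 4: add M if needed, then halve
    if s >= m:                     # step 7: last reduction
        s -= m
    return s
```

Since s stays below 2m throughout the loop, one conditional subtraction at the end suffices; that final comparison is exactly the step that forces the pipeline stall discussed above.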

Multiple-Word Radix-2^k Montgomery Modular Multiplication Algorithm
In [32], an optimized radix-2^k MMM algorithm is provided that avoids the final quotient determination by simply adding a zero-value word at the most significant bit (MSB) of the multiplicand X. Therefore, the pipeline does not need to stall and can perform MMM continuously right after the results are calculated. Here, we provide a classic MWR2kMM algorithm without the final subtraction, as shown in Algorithm 2, based on the MWR2kMM algorithm proposed in [16].
In Algorithm 2, M′ is the negative modular multiplicative inverse of the modulus M and is treated as a pre-calculated parameter, because it is determined only by M, which is constant during the calculation of MWR2kMM. The initial precision of M is n bits. However, during the computation, both Y and M have g = (n/w + 1) words, where an extra zero-value word is added, since the result S needs an extra word of precision to reach the correct value [16]. According to [32], X is also extended with an extra zero-value word to avoid the final subtraction. The carry C needs to be considered in MWR2kMM: it represents the carry bits that are propagated from the computation of one word to the next. C[i]_j represents the j-th word of C in the i-th loop. The concatenation of vectors C and S is represented as {C, S}. S is calculated after scanning X once, with an r-bit right shift in each iteration.
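The word-level arithmetic behind Algorithm 2 can be checked with a digit-serial sketch: X is scanned in k digits of r bits, and the pre-computed parameter M′ = −M⁻¹ mod 2^r yields each quotient digit Q. This is our own simplified model of the arithmetic — it collapses the w-bit word partitioning and the carry vector C into plain Python integers — so the name `high_radix_mont` and its interface are assumptions, not the paper's notation.

```python
def high_radix_mont(x, y, m, r, k):
    """Digit-serial high-radix Montgomery product: x * y * 2^(-r*k) mod m.

    m must be odd; m_prime plays the role of the pre-computed M' in the text.
    The final subtraction is omitted, so the result lies in [0, 2m).
    """
    R = 1 << r
    m_prime = (-pow(m, -1, R)) % R      # M' = -M^(-1) mod 2^r, depends only on M
    s = 0
    for j in range(k):                  # the top digit(s) of X are zero padding
        xj = (x >> (r * j)) & (R - 1)   # j-th r-bit digit of X
        s = s + xj * y                  # partial product
        q = (s * m_prime) % R           # quotient digit Q[j]
        s = (s + q * m) >> r            # the lower r bits become zero
    return s
```

Because the final subtraction is omitted, the caller reduces modulo m only when a canonical value is needed, mirroring the zero-word extension of X described in [32].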

Pipeline of the Classic MWR2kMM Algorithm
A classic pipeline form suitable for the multiple-word radix-2 Montgomery multiplication (MWR2MM) algorithm was introduced in [28]. Based on the pipeline of MWR2MM, the pipeline form for MWR2kMM can be obtained with a slight modification, as shown in Figure 1.
In this classic high-radix pipeline (CHRP), a column represents one PE, which is one pipeline stage (PS), and a row represents one clock cycle (CC). Each PE has two calculation states, A and B. The A state is the first clock period in which the PE starts to calculate; in this cycle, a PE does not require the carry bits C. According to the natural characteristics of the classic MWR2kMM algorithm, the radix size r cannot exceed the word size w. Therefore, the number of PSs in CHRP cannot exceed g/2; otherwise, adding more PEs may degrade performance by costing more CCs. This is concluded from [16], where the performance is analyzed thoroughly for different numbers of PSs and word sizes.
Since the number of PSs is limited, the upper limit of the processing speed of a CHRP design depends on the word size w and the performance of the multipliers. Moreover, under CHRP, reusing PEs is necessary to achieve high parallelism and improve computation efficiency. However, the requirement of reusing PEs introduces the challenge of passing the intermediate result S from the last PE back to the first PE. This can lead to a high net delay when implemented on hardware, potentially impacting the overall performance. One possible approach to mitigate the net delay is to insert buffers on the result S within the pipeline. By inserting buffers, the net delay can be reduced, but this comes at the cost of additional CCs and increased complexity in managing the data flow. Finding the right balance between parallelism (in CCs) and frequency (in net delay) is a trade-off that requires careful optimization while taking into account the available resources and other limitations.

Proposed Interleaved Pipeline Design
In this section, we introduce the proposed IP-MWR2kMM algorithm, which aims to improve the performance and efficiency of the classic MWR2kMM algorithm by modifying its iteration steps. Based on the IP-MWR2kMM algorithm, we present a novel pipeline form, the IP. This pipeline form takes advantage of the modified iteration steps and data dependency to achieve better performance, since the long data path caused by reusing PEs is avoided. Notably, we introduce a break in the critical path by adding a pipeline stage within a PE to compute the intermediate result S. This approach allows a higher operating frequency and faster computation. Furthermore, we present the hardware architecture of the PEs and the overall design of the MMM multiplier based on the IP-MWR2kMM algorithm.

Proposed IP-MWR2kMM Algorithm
To modify the iteration steps of the classic MWR2kMM algorithm, we reverse the data path of S[i + 1] and C[i] at PE[i] and propose a novel IP-MWR2kMM algorithm, as shown in Algorithm 3.
The initial precision of M is n bits. The multiplicand X is extended with extra zero-value words to omit the final reduction of S. Different from classic MWR2kMM, the number of zero-value words added to the MSB of Y and M depends on the parity of n/w, so as to ensure that g is even. Thus, we can always calculate S twice with two adjacent words when scanning X, which helps simplify the architecture of the pipeline. The lower (w − r) bits of a w-bit word are denoted low{}, and the higher r bits are denoted high{}. S is shifted in steps 20 to 26 and concatenated to the correct position by checking the value of t. To illustrate Algorithm 3, we provide an execution example in Figure 2, where n = 16, r = 2, and w = 4.

Algorithm 3
Proposed interleaved pipeline MWR2kMM.
In Algorithm 3, t denotes the number of PSs. In Figure 2, we provide a high-level overview of the pipeline structure based on the IP-MWR2kMM algorithm, where steps 17 and 18 are the key computation steps, represented by the multiplication of x_j and y_t. The data dependency and the results of every step are discussed in the following subsection. X and Y are divided and expanded into k = 9 words and g = 6 words, respectively. The upper bound t/2 of the inner loop is 0 (when i is 0), 1 (when i is 1 or 9), or 2 (when i is from 2 to 8), and it is determined by steps 8 to 14 of Algorithm 3. Within an inner loop, S is computed twice with two neighboring words of Y. The core difference of the IP-MWR2kMM algorithm is that the scanning steps of X are reversed: X is scanned only between different inner loops, while the classic MWR2kMM algorithm performs this in the outer loop.
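The parameters of the Figure 2 example (n = 16, r = 2, k = 9, g = 6) can be sanity-checked numerically. The sketch below verifies only the arithmetic outcome — that scanning X in nine 2-bit digits yields X·Y·2^(−18) mod M — not the interleaved word schedule itself; the modulus and operands are arbitrary values of our own choosing.

```python
n, r, k = 16, 2, 9                 # parameters from the Figure 2 example
m = 40961                          # an arbitrary odd 16-bit modulus (ours)
x, y = 51234, 40000                # arbitrary 16-bit operands, y < m

R = 1 << r                         # radix 2^r = 4
m_prime = (-pow(m, -1, R)) % R     # pre-computed M' = -M^(-1) mod 2^r
s = 0
for j in range(k):                 # the ninth digit of X is the zero extension
    xj = (x >> (r * j)) & (R - 1)
    q = ((s + xj * y) * m_prime) % R
    s = (s + xj * y + q * m) >> r

expected = (x * y * pow(pow(2, r * k, m), -1, m)) % m
assert s % m == expected           # s is the Montgomery product, still < 2m
```

The zero top digit of X keeps the un-subtracted result below 2m, which is why the final reduction can be omitted when MMMs are chained.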

Parallel Computation of the IP-MWR2kMM Algorithm
In this subsection, the proposed IP form is presented, along with an analysis of its data dependency and computation efficiency. Figure 3 provides an illustration of the IP structure.
In Figure 3, each column represents a PE, which can be seen as a PS. The computation of the inner loop with i = 0 in Algorithm 3 is represented in CC0 and CC1, where only PE[0] is active. When i = 1, the computation of the inner loop is represented in CC2 and CC3, where both PE[0] and PE[1] are active. Figure 3 also shows the detailed data dependency and the transmission direction of the carry bits (C) and results (S) through the PEs. A PE has two states: A and B. When PE[i] is in the A state, it receives the carry bits C passed from the previous PE[i − 1], as well as the result S from itself. In the B state, PE[i] needs the result S transferred from the next PE[i + 1] and the carry bits C from itself. Thus, the data dependency changes in every CC, resulting in the interleaved pipeline architecture. The coefficient Q that is multiplied by M is computed on the fly, which means that PE[0] is also responsible for calculating Q when in the A state.
Because of the data dependency of S, the architecture of the IP is non-scalable, and the total number of PEs is required to be as follows: Since a PE scans two neighboring words of Y and needs two CCs per scanned word of X, the total computation time T_IP (measured in CCs) is as follows: Reviewing the CHRP, we find that its maximum total number of PEs is as follows: Under this condition, the total computation time T_CHRP (measured in CCs) is as follows: Comparing the total numbers of PEs and the computation times of CHRP and IP, we find that the computation efficiency is the same in terms of CCs and resources. However, the reachable maximum frequencies of the two pipeline forms differ because of the data dependency of S. In our proposed IP form, the reuse of PEs is not a must, and the results S are only passed between two neighboring PEs, whereas PEs must be reused in CHRP, resulting in a long path for S from the last PE back to the first one. Another improvement of the IP is that, when performing MMMs continuously, the final result can be passed directly to the neighboring PEs to start the next computation. In CHRP, PE[0] must wait idle whenever 2k/g is not an integer. To improve the operating frequency, we split one CC into two to allow separate computations of x_j · y_t and Q[j] · m_t. With this modification, one state of a PE occupies two CCs instead of one, as in the original interleaved design shown in Figure 3.
Hence, the benefit of this modification is a reduction in the lengths of the carry chains when computing the result S, which improves the overall computational speed of the MMM. An inner loop of Algorithm 3 is performed in four CCs instead of the original two CCs within a PE, because of the modification that computes x_j · y_t and Q[j] · m_t separately. The situations of the inputs i_m_a and i_m_b are shown in the following equations: The computation result is {o_C, o_S}, which represents the updated values of {C[j]_t, S[j + 1]_t}. In the modified pipeline design, PE[0] starts computing 5 CCs earlier than PE[1]. By adding an extra CC, we can compute x_j · y_t earlier, because Q[j] needs two CCs to output, while the result S[j + 1]_t only needs one CC. For example, x_0 · y_0 is calculated at CC0, x_1 · y_0 is computed at CC4, and then PE[1] is activated at CC5.
The other PEs share the same hardware architecture but are simpler, with only a DSP, a full adder, a 2-to-1 multiplexer, and a register, as shown in Figure 5. These PEs all require four CCs to compute a step of the inner loop in Algorithm 3.

Overall Architecture
Figure 6 depicts the overall architecture, where X, Y, and M are sequentially fed into shift registers. The computation of Q is performed by PE[0] and subsequently fed into another shift register. Note that data only pass between neighboring PEs, and the last PE is not responsible for passing the result S back to the first PE. Consequently, the net delay is reduced, and the need for additional buffers to store S is eliminated. The last PE needs an additional register to store S for one CC, because C and S are generated simultaneously but used in different CCs. The proposed architecture requires more multiplexers than CHRP, because the PEs perform different computations in different CCs. Thanks to the concise architecture of the PEs, all we need to control are the select ports of the multiplexers and the shifting of the registers, which are determined only by the PE states.

Implementation Results and Comparison
In this section, we conduct a comprehensive analysis of the timing, critical path, and area costs of our proposed design. We present the implementation results for the 1024-bit and 2048-bit modulus scenarios. Furthermore, we compare our design with existing works to demonstrate its performance advantages.

Timing Analysis
As shown in Figure 3, we split one CC into two to implement a deeper pipeline; thus, the total computation time T_IP (measured in CCs) is as follows: Despite the increase in total CCs, the original CPD of the proposed IP is reduced, because the long addition chain in (S[j]_t + x_j · y_t + Q[j] · m_t + C[j]_{t−1}) is cut from three additions to two. Moreover, the input bit width of the DSP48E1 module is 18 × 25 including the sign bits. Since our design exclusively uses unsigned numbers, the effective bit width of the inputs is 17 × 24.
The critical path of our hardware architecture depends on the radix r and the word size w. When the bit width of r is below 17 and w is below 24, the critical path runs from the output of x_j · y_t to the output of Q[j]. Optimizing this critical path is challenging, since it involves an addition followed by a multiplication. However, if r is still below 17 but w exceeds 24, more than one DSP block must be employed to ensure that the multiplication x_j · y_t is computed within a single CC. In this case, the critical path arises in the path of the multiplication followed by the addition of i_m_a and i_m_b, with the carry bit being propagated from one DSP to another. Nevertheless, the net delay of the computation of Q[j] remains unaffected, as it only requires the lower r bits of S[j], so the number of DSP blocks it requires is one.
Furthermore, there is a great need to perform MMM operations continuously in PCSs such as RSA and ECC. In this scenario, the average computation time of a single MMM operation is reduced due to the uninterrupted execution of the pipeline. Given the number of MMM operations to be performed (t_MMM), the average computation time T_IPA (in CCs) of one MMM is shown in Equation (7),
where T_IPA is near (4k + 1) when t_MMM > (2g − 4), revealing that the more MMM operations are performed, the fewer CCs each one costs on average. We adopt T_IPA as the evaluation of the clock cycles required to perform one MMM operation in our implementation, because our IP form can calculate MMMs continuously without any stall between operations. Thus, T_IPA represents the actual performance when our design is used in applications such as point multiplication in ECC and modular exponentiation in RSA.
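Back-to-back MMMs of this kind arise directly in modular exponentiation. The sketch below, with a bit-serial `mmm` helper of our own (not the proposed pipeline), shows how one RSA-style exponentiation chains Montgomery multiplications with no other modular operations in between.

```python
def mont_exp(base, e, m, n):
    """Left-to-right square-and-multiply using only Montgomery products.

    m is an odd n-bit modulus; R = 2^n.  Every loop step is one MMM, which
    is why uninterrupted pipelining of MMMs pays off in RSA and ECC.
    """
    R = 1 << n

    def mmm(a, b):                      # bit-serial a * b * R^(-1) mod m
        s = 0
        for i in range(n):
            s += ((a >> i) & 1) * b
            s = (s + (s & 1) * m) >> 1
        return s - m if s >= m else s

    x = (base * R) % m                  # map the base into the Montgomery domain
    acc = R % m                         # Montgomery representation of 1
    for bit in bin(e)[2:]:
        acc = mmm(acc, acc)             # square
        if bit == '1':
            acc = mmm(acc, x)           # multiply
    return mmm(acc, 1)                  # map the result back out
```

Every squaring and multiplication consumes the previous MMM result immediately, so a pipeline that accepts a new MMM without stalling, as the IP form does, directly lowers the exponentiation latency.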

Area Analysis
The area of the proposed architecture can be evaluated by considering the number of PEs, as shown in Equation (1). However, it is important to note that the resource requirements within a single PE vary depending on the radix r and the word size w. Specifically, within a PE, the number of implemented DSPs (A_DSP) can be calculated as follows: Equation (8) shows the trade-off that can be made between the DSP area and the performance. The higher the radix, the more computation is done within a CC and the lower the total CC count; however, the number of DSPs increases at the same time. When the radix r ≤ 17 and the word size w ≤ 24, the multiplication of i_m_a by i_m_b can be implemented using a single DSP. In this case, the total number of DSPs required is g/2 + 2, which equals N_IP + 2. It is also possible to cascade DSPs when w > 24, as long as r is less than or equal to 24. However, the computational efficiency is reduced when r exceeds 24, because w ≥ r, and this would require cascading multiple DSPs and adding their results to obtain the correct value. Therefore, we can conclude from Equation (8) that, when implementing designs with DSPs, the condition r ≤ 17 and w ≤ 24 gives the smallest DSP area. We implement r = 16 and w = 24 in our design to cut down the area.
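Under the stated condition (r ≤ 17, w ≤ 24, one DSP per multiplier, g/2 + 2 DSPs in total), the DSP budget can be estimated as below. The helper is our own reconstruction from the text, not the paper's Equation (8); in particular, the word-count formula g = ⌈n/w⌉ + 1, padded to an even value, is an assumption pieced together from Sections 2 and 3.

```python
import math

def dsp_count(n, w=24, r=16):
    """Estimated DSP48E1 count for an n-bit modulus under r <= 17, w <= 24.

    g is reconstructed from the text: ceil(n/w) words plus one zero word,
    padded to an even value; the total g/2 + 2 (i.e., N_IP + 2) is as stated.
    """
    assert r <= 17 and w <= 24, "one DSP per multiplier only holds here"
    g = math.ceil(n / w) + 1           # words of Y and M, incl. the zero word
    if g % 2:                          # pad g to an even value
        g += 1
    return g // 2 + 2
```

With r = 16 and w = 24 as chosen in the paper, this estimate grows roughly linearly in the modulus size n, which is the behavior the trade-off discussion above predicts.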
The resource usage of a full adder is evaluated by the number of LUTs it occupies. A one-bit full adder has three inputs and two outputs, corresponding to two LUT3 resources in an FPGA or, equivalently, one LUT6 resource. The total number of LUT resources occupied by the full adders, denoted A_aLUT, can be calculated in the following two cases:

Results Comparison and Discussion
In the proposed hardware implementation of MMM, we use a radix r of 16 and a word size w of 24. The design was implemented using Vivado 2022 on the Xilinx Virtex-7 FPGA series, with part number XC7V585TFFG1157-3. To provide a comprehensive evaluation, we present the implementation results for four different modulus sizes: 256 bits, 512 bits, 1024 bits, and 2048 bits. These results are compared with other existing works in Table 1.
Here, #LUT is the number of LUTs. The number in the SLICE column is marked with an asterisk if it was measured in #LUTs; otherwise, it is the actual SLICE count reported in the corresponding work.
In order to evaluate the trade-off between performance and area in a fair manner and to compare the different works, we utilize the area–time–product (ATP) metric. The ATP is calculated by multiplying the total processing latency by the total area (in SEC). By comparing the ATP values of different works, we can assess the overall efficiency and effectiveness of each design in terms of both performance and area utilization. Comprehensive comparisons are shown in Figure 7. Ref. [27] employed the residue number system in their design, achieving a total latency 56% lower than ours. However, their work consumed large DSP and SLICE areas because of the extra logic required by their number system, resulting in low area efficiency. Our implementation improves the area by 85% and the ATP by 90%. Their area also explodes as the bit width of the modulus increases, so their design is not suitable for a system like RSA, which has a large modulus size. Ref. [9] outperforms ours by 21% in latency, but we have an advantage of 26% in area and 6% in ATP for a 256-bit modulus. The advantage of our ATP grows as the modulus size increases: we have an advantage of 16% in ATP for a 512-bit modulus. From this comparison, we can see that our computational efficiency persists even for large modulus sizes.
Although the platform used in [31] may be considered outdated, it is still valuable for comparison purposes due to its adoption of a new pipeline form. The architecture in [31] shares a similar approach to ours in regard to passing intermediate results, but it suffers from a long data dependency path, resulting in a significant increase in the number of CCs required for computation. Despite the differences in platform frequencies, our work demonstrates a significant advantage in terms of the number of CCs, achieving a 76% improvement compared to [31]. Ref. [30] aimed to relax data dependencies through reduced operands and proposed a new pipeline form. However, their implementation still exhibited a high total area (in SEC) and total latency, indicating that their approach to the new pipeline design was less effective. Compared to [30], our design improves the ATP by 86%.
The design presented in [25] is based on the FFT and requires a significant amount of BRAM resources. While FPGAs often have sufficient BRAM resources to accommodate this overhead, the ATP of [25] is relatively inferior when considering SEC. Although our design may have an advantage in terms of the platform, it is important to note our leading positions in both total latency and SEC. Latency is improved by 74% for the 1024-bit modulus and by 73% for the 2048-bit modulus. In terms of SEC, our design achieves a 40% improvement in the 1024-bit modulus configuration, while having a slightly higher SEC (by 5% in the 2048-bit modulus design) compared to [25].
The designs in [17,18] are all based on digit-serial MMM and utilize optimized adders or encoding methods. Among them, ref. [18] stands out as a competitive design, with a smaller ATP and total SEC compared to ours; however, we maintain an advantage in total latency. It is worth noting that the superior performance of [18] under the 1024-bit modulus does not carry over well to the 2048-bit modulus design. In the case of the 2048-bit modulus, we achieve a 10% advantage in total latency, a 32% advantage in total SEC, and a 40% advantage in ATP over [18]. Compared to [17], we have advantages in total latency, SEC, and ATP. Specifically, for the 1024-bit modulus, we achieve an 8% advantage in total latency, a 35% advantage in SEC, and a 41% advantage in ATP. For the 2048-bit modulus, we have a similar total latency but a 41% advantage in ATP. The design in [19] excels in total latency for both the 1024-bit and 2048-bit moduli; however, our design takes the lead in ATP for both modulus sizes, and as the modulus size increases, it maintains its advantage in terms of SEC and ATP. Compared with [10], our design has a similar total latency but a 25% advantage in area for the 1024-bit modulus; for the 2048-bit modulus, our implementation outperforms it by 14% in latency, 29% in area, and 39% in ATP. Compared to [22], we have an absolute advantage in both latency and area, represented by an 83% advantage in ATP.
Indeed, comparing and selecting the best design for MMM can be challenging due to the diverse structures and optimizations employed by the listed designs. Each design incorporates different techniques, such as compression, encoding, and modified pipelines, as well as different multiplication algorithms. Given the available data, our design stands out with the best ATP and low latency in the 2048-bit modulus scenario, and it demonstrates competitive performance in the 1024-bit modulus scenario. In general, our design shows notable strengths in terms of ATP and latency, making it a compelling choice for Montgomery modular multiplication.

Conclusions
This paper proposes a high-performance and low-cost implementation of Montgomery modular multiplication. The proposed interleaved pipeline form of MWR2kMM effectively addresses the issue of long data dependency paths, thereby reducing delay. To further enhance the clock frequency, we divided each iteration step of the MMM algorithm into two CCs, effectively shortening the critical path and the overall delay. As a result, the proposed MMM algorithm achieves an average completion time of 4k + 1 CCs. The multiplier in our hardware architecture is implemented using DSP48E1 blocks on the Xilinx Virtex-7 FPGA series. The processing elements in our design feature a simple control structure, which enhances the ease of operation. The implementation results highlight the superior ATP and overall performance of our proposed algorithm and pipeline form, making it an attractive solution for IoT devices requiring high performance at a low cost. Moving forward, our future work will aim to further improve performance and area efficiency by leveraging digit-serial MMM and exploring additional optimizations of our proposed IP. Since IoT applications vary, IoT chip architectures also vary in both size and power consumption; therefore, we may replace the DSP logic with compatible fast-adder logic to make the design more portable and flexible for different applications.

Proposed Hardware Architecture

Processing Elements

PE[0] is different from the other PEs, as it needs to perform the computation of Q on the fly, as shown in Figure 4, where the multipliers are implemented with DSPs on the Xilinx Virtex-7 FPGA series.