MDPI - Publisher of Open Access Journals

16 pages, 1263 KiB

Open Access Article

Accelerating CRYSTALS-Kyber: High-Speed NTT Design with Optimized Pipelining and Modular Reduction

by Omar S. Sonbul, Muhammad Rashid and Amar Y. Jaffar

Electronics 2025, 14(11), 2122; https://doi.org/10.3390/electronics14112122 - 23 May 2025

Viewed by 790

The Number Theoretic Transform (NTT) is a cornerstone of efficient polynomial multiplication, which is fundamental to lattice-based cryptographic algorithms such as CRYSTALS-Kyber, a leading candidate in post-quantum cryptography (PQC). However, existing NTT accelerators often rely on integer multiplier-based modular reduction techniques, such as Barrett or Montgomery reduction, which introduce significant computational overhead and hardware resource consumption. These accelerators also lack optimized unified architectures for forward (FNTT) and inverse (INTT) transformations. Addressing these research gaps, this paper introduces a novel, high-speed NTT accelerator tailored specifically for CRYSTALS-Kyber. The proposed design employs a shift-add modular reduction mechanism that eliminates the need for integer multipliers, thereby reducing critical path delay and raising the achievable clock frequency. A unified pipelined butterfly unit, capable of performing FNTT and INTT operations through Cooley–Tukey and Gentleman–Sande configurations, is integrated into the architecture. Additionally, a data handling mechanism based on register banks supports seamless memory access, ensuring continuous and parallel processing. The complete architecture, implemented in Verilog HDL, has been evaluated on FPGA platforms (Virtex-5, Virtex-6, and Virtex-7). Post-place-and-route results demonstrate a maximum operating frequency of 261 MHz on Virtex-7, achieving a throughput of 290.69 Kbps, which is 1.45× and 1.24× higher than its performance on Virtex-5 and Virtex-6, respectively. Furthermore, the design achieves a throughput-per-slice metric of 111.63, underscoring its resource efficiency. With a 1.27× reduction in computation time compared to state-of-the-art single butterfly unit-based NTT accelerators, this work establishes a new benchmark in advancing secure and scalable cryptographic hardware solutions.
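As a rough software illustration of the two butterfly configurations named in the abstract, the sketch below shows the arithmetic each one performs modulo the Kyber prime q = 3329. It is not the paper's hardware design: the function names are hypothetical, and the `%` operator stands in for whatever reduction circuit an implementation would use.

```python
Q = 3329  # CRYSTALS-Kyber modulus


def ct_butterfly(a, b, w, q=Q):
    """Cooley-Tukey butterfly (forward NTT): multiply first, then add/subtract."""
    t = (w * b) % q
    return (a + t) % q, (a - t) % q


def gs_butterfly(a, b, w_inv, q=Q):
    """Gentleman-Sande butterfly (inverse NTT): add/subtract first, then multiply."""
    return (a + b) % q, ((a - b) * w_inv) % q


if __name__ == "__main__":
    w = 17                 # primitive 256th root of unity mod 3329 (Kyber's zeta)
    w_inv = pow(w, -1, Q)  # modular inverse (Python 3.8+)
    u, v = ct_butterfly(1234, 567, w)
    # The GS butterfly with the inverse twiddle undoes the CT butterfly
    # up to a factor of 2, which the inverse NTT removes with a final scaling.
    print(gs_butterfly(u, v, w_inv))  # (2468, 1134)
```

Because both butterflies use one multiplier, one adder, and one subtractor in a different order, a unified unit can plausibly share those operators and switch the data path with multiplexers.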


19 pages, 4017 KiB

Open Access Article

Efficient Large-Width Montgomery Modular Multiplier Design Based on Toom–Cook-5

by Kuanhao Liu, Xiaohua Wang, Yue Hao, Jingqi Zhang and Weijiang Wang

Electronics 2025, 14(7), 1402; https://doi.org/10.3390/electronics14071402 - 31 Mar 2025

Viewed by 405

Abstract

Toom–Cook-n multiplication is an efficient large-width multiplication algorithm based on a divide-and-conquer strategy, widely used in modular multiplication operations for cryptographic algorithms. Theoretically, as the degree n increases, Toom–Cook-n can split the multiplicands into more sub-terms to further enhance the performance of the multiplier. However, because the interpolation matrix grows with the degree and makes the interpolation step increasingly costly, current research predominantly focuses on Toom–Cook-3 and Toom–Cook-4. This paper proposes a Montgomery modular multiplication design based on Toom–Cook-5, which alleviates the computational difficulty of the interpolation step by introducing an interpolation matrix pre-simplification strategy. Additionally, the design incorporates and optimizes carry–save adders and Karatsuba multiplication, enabling Toom–Cook-5 multiplication to be applied in practical and efficient hardware implementations. This paper presents the ASIC implementation results of the hardware architecture under a 90 nm process, demonstrating superior performance compared to previous works.


8 pages, 1195 KiB

Open Access Article

Cost-Efficient Pipelined Modular Polynomial Multiplier for Post-Quantum Cryptography Saber

by Hua Li

Quantum Rep. 2025, 7(1), 10; https://doi.org/10.3390/quantum7010010 - 20 Feb 2025

Viewed by 932

Abstract

The development of quantum computers presents a great challenge for current cryptographic algorithms. Post-quantum cryptography has been proposed to secure against quantum computers in the near future. Modular polynomial multiplication is a frequent arithmetic operation in post-quantum cryptography. In this paper, a low-cost and efficient pipelined architecture for modular polynomial multiplication in Saber is proposed and synthesized for the Virtex UltraScale+ xcu200-fsgd2104-2-e board. It achieves a frequency of 250 MHz and uses only 11,499 LUTs, 7034 FFs, and 32 IOs.


17 pages, 595 KiB

Open Access Feature Paper Article

Hardware Optimized Modular Reduction

by Alexander Magyari and Yuhua Chen

Electronics 2025, 14(3), 550; https://doi.org/10.3390/electronics14030550 - 29 Jan 2025

Viewed by 1063

Abstract

We introduce a modular reduction method that is optimized for hardware and outperforms conventional approaches. By leveraging calculated reduction cycles and combinatorial logic, we achieve a 30% reduction in power usage, a 27% reduction in Configurable Logic Blocks (CLBs), and 42% fewer look-up tables (LUTs) than the conventional implementation. Our Hardware-Optimized Modular Reduction (HOM-R) system can condense a 256-bit input to a four-bit base within a single 250 MHz clock cycle. Further, our method stands out from prevalent techniques, such as Barrett and Montgomery reduction, by eliminating the need for multipliers or dividers and relying solely on addition and customizable LUTs. This method frees up FPGA resources typically consumed by power-intensive DSPs, offering a low-power, low-latency alternative for diverse design needs.

(This article belongs to the Special Issue Emerging Applications of FPGAs and Reconfigurable Computing System)
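To make the idea of multiplier-free reduction concrete, here is a minimal sketch of one generic way to reduce a wide input using only table lookups and additions. It is an assumption-laden illustration, not the HOM-R method from the paper: the chunking scheme, the function names `build_luts` and `lut_reduce`, and the final subtract loop are all hypothetical choices.

```python
def build_luts(modulus, total_bits=256, chunk_bits=4):
    """Precompute LUT[i][v] = (v << (chunk_bits * i)) mod modulus.

    In hardware these tables would be fixed at design time, so the
    '%' used here never appears in the run-time data path.
    Assumes total_bits is a multiple of chunk_bits.
    """
    n_chunks = total_bits // chunk_bits
    return [
        [(v << (chunk_bits * i)) % modulus for v in range(1 << chunk_bits)]
        for i in range(n_chunks)
    ]


def lut_reduce(x, modulus, luts, chunk_bits=4):
    """Reduce x (0 <= x < 2**total_bits) using lookups and additions only."""
    mask = (1 << chunk_bits) - 1
    acc = 0
    for i, table in enumerate(luts):
        # Each chunk contributes its precomputed residue.
        acc += table[(x >> (chunk_bits * i)) & mask]
    # acc is now a small number; finish with conditional subtractions.
    while acc >= modulus:
        acc -= modulus
    return acc


if __name__ == "__main__":
    m = 13  # example modulus that fits in four bits
    luts = build_luts(m)
    x = (1 << 255) + 12345
    print(lut_reduce(x, m, luts) == x % m)  # True
```

The design choice worth noting is that every residue is computed offline, so the run-time path contains only adders and memory reads; a hardware version would likely replace the final loop with a further small lookup.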


10 pages, 638 KiB

Open Access Article

Efficient Quantization and Data Access for Accelerating Homomorphic Encrypted CNNs

by Kai Chen, Xinyu Wang, Yuxiang Fu and Li Li

Electronics 2025, 14(3), 464; https://doi.org/10.3390/electronics14030464 - 23 Jan 2025

Cited by 1 | Viewed by 746

Abstract

Due to the ability to perform computations directly on encrypted data, homomorphic encryption (HE) has recently become an important branch of privacy-preserving machine learning (PPML) implementation. Nevertheless, existing implementations of HE-based convolutional neural network (HCNN) applications are not satisfactory in inference latency and area efficiency compared to the unencrypted version. In this work, we first improve the additive powers-of-two (APoT) quantization method for HCNN to achieve a better tradeoff between the complexity of modular multiplication and the network accuracy. An efficient multiplicationless modular multiplier–accumulator (M-MAC) unit is accordingly designed. Furthermore, a batch-processing HCNN accelerator with M-MACs is implemented, in which we propose an advanced data partition scheme to avoid multiple moves of the large-size ciphertext polynomials. Compared to the latest FPGA design, our accelerator can achieve 11× resource reduction of an M-MAC and 2.36× speedup in inference latency for a widely used CNN-11 network to process 8K images. The speedup of our design is also significant compared to the latest CPU and GPU implementations of the batch-processing HCNN models.


18 pages, 779 KiB

Open Access Article

A Pipelined Hardware Design of FNTT and INTT of CRYSTALS-Kyber PQC Algorithm

by Muhammad Rashid, Omar S. Sonbul, Sajjad Shaukat Jamal, Amar Y. Jaffar and Azamat Kakhorov

Information 2025, 16(1), 17; https://doi.org/10.3390/info16010017 - 31 Dec 2024

Cited by 1 | Viewed by 1274

Abstract

Lattice-based post-quantum cryptography (PQC) algorithms demand number theoretic transform (NTT)-based polynomial multiplications. NTT-based polynomial multiplication relies on the computation of the forward number theoretic transform (FNTT) and the inverse number theoretic transform (INTT). Therefore, this work presents a unified NTT hardware accelerator architecture to facilitate the polynomial multiplications of the CRYSTALS-Kyber PQC algorithm. Moreover, a unified butterfly unit design of Cooley–Tukey and Gentleman–Sande configurations is proposed to implement the FNTT and INTT operations using one adder, one multiplier, and one subtractor, sharing four routing multiplexers and one Barrett-based modular reduction unit. The critical path of the proposed butterfly unit is minimized using pipelining. An efficient controller is implemented for control functionalities. Post-place-and-route simulation results are provided for Xilinx Virtex-6 and Virtex-7 field-programmable gate array devices. Also, the proposed design is physically implemented for validation on a Virtex-7 FPGA. The number of slices utilized on Virtex-6 and Virtex-7 devices is 398 and 312, the required number of clock cycles for one set of FNTT and INTT computations is 1410 and 1540, and the maximum operating frequency is 256 and 290 MHz, respectively. The average figure of merit (FoM), defined as the ratio of throughput to slices, shows 62% better performance than the most relevant NTT design from the literature.

(This article belongs to the Special Issue Feature Papers in Information in 2024–2025)
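For readers unfamiliar with the Barrett-based reduction mentioned in the abstract, the following is a minimal textbook-style sketch for the Kyber modulus q = 3329. The constant `K = 24` and the function name are assumptions chosen for illustration; the paper's actual reduction unit may use different widths and pipelining.

```python
Q = 3329            # CRYSTALS-Kyber modulus
K = 24              # assumed shift width; 2**K must exceed the largest input
M = (1 << K) // Q   # precomputed floor(2**K / Q)


def barrett_reduce(x):
    """Return x mod Q for 0 <= x < 2**K, with no run-time division."""
    t = (x * M) >> K   # estimate of floor(x / Q), low by at most 1
    r = x - t * Q      # therefore 0 <= r < 2*Q
    if r >= Q:         # a single conditional subtraction completes the reduction
        r -= Q
    return r


if __name__ == "__main__":
    a, b = 3000, 3328
    print(barrett_reduce(a * b) == (a * b) % Q)  # True
```

Since (Q - 1)^2 is below 2^24, any product of two reduced coefficients is a valid input, which is the case a butterfly multiplier produces.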


18 pages, 818 KiB

Open Access Article

An Optimized Flexible Accelerator for Elliptic Curve Point Multiplication over NIST Binary Fields

by Amer Aljaedi, Muhammad Rashid, Sajjad Shaukat Jamal, Adel R. Alharbi and Mohammed Alotaibi

Appl. Sci. 2023, 13(19), 10882; https://doi.org/10.3390/app131910882 - 30 Sep 2023

Cited by 2 | Viewed by 1001

Abstract

This article proposes a flexible hardware accelerator optimized from a throughput and area point of view for the computationally intensive part of elliptic curve cryptography. The target binary fields, defined by the National Institute of Standards and Technology, are $GF(2^{163})$, $GF(2^{233})$, $GF(2^{283})$, $GF(2^{409})$, and $GF(2^{571})$. For the optimization of throughput, the proposed accelerator employs a digit-parallel multiplier with a digit size of 41 bits. For area optimization, the proposed accelerator reuses the multiplication and squaring circuits to compute modular inversions. Flexibility is provided by three additional buffers on top of the proposed accelerator architecture, which load different input parameters. Finally, a dedicated controller is used to optimize control signal handling. The architecture is modeled in Verilog and implemented up to the post-place-and-route level on a Xilinx Virtex-7 field-programmable gate array. The area utilization of our accelerator in slices is 1479, 1998, 2573, 3271, and 4469 for $m = 163$ to 571. The time needed to perform one point multiplication is 7.15, 10.60, 13.26, 20.96, and 30.42 μs. Similarly, the throughput-over-area figures for the same key lengths are 94.56, 47.21, 29.30, 14.58, and 7.35. Consequently, the achieved results and a comprehensive performance comparison show the suitability of the proposed design for constrained environments that demand throughput/area-efficient implementations.

(This article belongs to the Section Electrical, Electronics and Communications Engineering)


17 pages, 1791 KiB

Open Access Article

A High-Efficiency Modular Multiplication Digital Signal Processing for Lattice-Based Post-Quantum Cryptography

by Trong-Hung Nguyen, Cong-Kha Pham and Trong-Thuc Hoang

Cryptography 2023, 7(4), 46; https://doi.org/10.3390/cryptography7040046 - 25 Sep 2023

Cited by 13 | Viewed by 4727

Abstract

The Number Theoretic Transform (NTT) has been widely used to speed up polynomial multiplication in lattice-based post-quantum algorithms. All NTT operations use modular arithmetic, especially modular multiplication, which significantly influences NTT hardware implementation efficiency. Until now, most hardware implementations have used Digital Signal Processing (DSP) blocks to multiply two integers and then performed the modulo computation on the multiplication product. This paper presents a customized Lattice-DSP (L-DSP) for modular multiplication based on the Karatsuba algorithm, Vedic multiplier, and modular reduction methods. The proposed L-DSP performs both integer multiplication and modular reduction simultaneously for lattice-based cryptography. As a result, the speed and area efficiency of the L-DSPs are 283 MHz for 77 SLICEs, 272 MHz for 87 SLICEs, and 256 MHz for 101 SLICEs with the parameters q of 3329, 7681, and 12,289, respectively. In addition, the $N^{-1}$ multiplier in the Inverse-NTT (INTT) calculation is also eliminated, reducing the size of the Butterfly Unit (BU) in CRYSTALS-Kyber to about 104 SLICEs, equivalent to a conventional multiplication in the other studies. Based on the proposed DSP, a Point-Wise Matrix Multiplication (PWMM) architecture for CRYSTALS-Kyber is designed on a hardware footprint equivalent to 386 SLICEs. Furthermore, this research presents the first DSP designed for lattice-based Post-quantum Cryptography (PQC) modular multiplication.

(This article belongs to the Special Issue Feature Papers in Hardware Security II)


16 pages, 1206 KiB

Open Access Article

Throughput/Area-Efficient Accelerator of Elliptic Curve Point Multiplication over GF(2²³³) on FPGA

by Muhammad Rashid, Omar S. Sonbul, Muhammad Yousuf Irfan Zia, Muhammad Arif, Asher Sajid and Saud S. Alotaibi

Electronics 2023, 12(17), 3611; https://doi.org/10.3390/electronics12173611 - 26 Aug 2023

Cited by 4 | Viewed by 1411

Abstract

This paper presents a throughput/area-efficient hardware accelerator architecture for elliptic curve point multiplication (ECPM) computation over $GF(2^{233})$. The throughput of the proposed accelerator design is optimized by reducing the total clock cycles using a bit-parallel Karatsuba modular multiplier. We employ two techniques to minimize the hardware resources: (i) a consolidated arithmetic unit where we combine a single modular adder, multiplier, and square block instead of having multiple modular operators, and (ii) an Itoh–Tsujii inversion algorithm that leverages the existing hardware resources of the multiplier and square units for multiplicative inverse computation. An efficient finite-state-machine (FSM) controller is implemented to facilitate control functionalities. To evaluate and compare the results of the proposed accelerator architecture against state-of-the-art solutions, a figure-of-merit (FoM) metric in terms of throughput/area is defined. Post-place-and-route implementation results are reported for reconfigurable field-programmable gate array (FPGA) devices. On the Virtex-7 FPGA specifically, the accelerator utilizes 3584 slices, needs 7208 clock cycles, operates at a maximum frequency of 350 MHz, computes one ECPM operation in 20.59 μs, and achieves a calculated FoM of 13.54. Consequently, the results and comparisons reveal that our accelerator suits applications that demand throughput and area-optimized ECPM implementations.
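The Karatsuba multiplier mentioned above can be sketched in software as follows, using Python integers as bit vectors of polynomial coefficients over GF(2). This is only an illustration of the algorithm, not the paper's bit-parallel circuit: the recursion threshold of 16 bits and all function names are assumptions. The reduction polynomial x^233 + x^74 + 1 is the standard NIST choice for this field.

```python
def clmul_schoolbook(a, b):
    """Carry-less (GF(2)) polynomial multiplication, one shifted XOR per set bit."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba_gf2(a, b, n):
    """Multiply two GF(2) polynomials of degree < n with three half-size products."""
    if n <= 16:  # assumed threshold below which schoolbook is used
        return clmul_schoolbook(a, b)
    h = (n + 1) // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = karatsuba_gf2(a0, b0, h)
    hi = karatsuba_gf2(a1, b1, n - h)
    mid = karatsuba_gf2(a0 ^ a1, b0 ^ b1, h)
    # Over GF(2), subtraction is XOR, so the middle term is mid ^ lo ^ hi.
    return lo ^ ((mid ^ lo ^ hi) << h) ^ (hi << (2 * h))


def reduce_233(c):
    """Reduce c modulo f(x) = x^233 + x^74 + 1."""
    f = (1 << 233) | (1 << 74) | 1
    for i in range(c.bit_length() - 1, 232, -1):
        if (c >> i) & 1:
            c ^= f << (i - 233)  # clears bit i, touches only lower bits
    return c


def gf233_mul(a, b):
    """Field multiplication in GF(2^233)."""
    return reduce_233(karatsuba_gf2(a, b, 233))
```

A squaring in this field is `gf233_mul(a, a)`, and an Itoh–Tsujii inversion is built from repeated squarings and a short chain of multiplications, which is why reusing one multiplier and one squarer can cover inversion as well.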


17 pages, 591 KiB

Open Access Article

A Low-Cost High-Performance Montgomery Modular Multiplier Based on Pipeline Interleaving for IoT Devices

by Hongshuo Li, Shiwei Ren, Weijiang Wang, Jingqi Zhang and Xiaohua Wang

Electronics 2023, 12(15), 3241; https://doi.org/10.3390/electronics12153241 - 27 Jul 2023

Cited by 5 | Viewed by 2065

Abstract

Modular multiplication is a crucial operation in public-key cryptography systems such as RSA and ECC. In this study, we analyze and improve the iteration steps of the classic Montgomery modular multiplication (MMM) algorithm and propose an interleaved pipeline (IP) structure, which meets the high-performance and low-cost requirements of Internet of Things devices. Compared to the classic pipeline structure, the IP does not require a multiplexing processing element (PE), which helps shorten the data path of intermediate results. We further break the critical path so that an iterative step of the MMM algorithm completes in two clock cycles. Our proposed hardware architecture is implemented on a Xilinx Virtex-7 Series FPGA, using DSP48E1 blocks to realize the multiplier. The implementation results show that 1024-bit and 2048-bit modular multiplications require 1.03 μs and 2.13 μs, respectively. Moreover, our area–time-product analysis reveals a favorable outcome compared to state-of-the-art designs for both 1024-bit and 2048-bit moduli.

(This article belongs to the Special Issue Computer-Aided Design for Hardware Security and Trust)
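As background for the iteration step the abstract refers to, the sketch below shows the classic radix-2 (bit-serial) Montgomery modular multiplication loop. It is the textbook algorithm rather than the paper's interleaved pipeline, and the function name and argument order are hypothetical.

```python
def montgomery_multiply(a, b, m, n):
    """Return a * b * 2**(-n) mod m using only additions and right shifts.

    Assumes m is odd, m < 2**n, and 0 <= a, b < m.
    """
    s = 0
    for i in range(n):
        s += ((a >> i) & 1) * b  # add b when bit i of a is set
        if s & 1:                # make s even by adding the odd modulus
            s += m
        s >>= 1                  # exact division by 2
    # The loop keeps s below 2*m, so one conditional subtraction suffices.
    if s >= m:
        s -= m
    return s


if __name__ == "__main__":
    m, n = (1 << 127) - 1, 127
    a, b = 123456789, 987654321
    expected = (a * b * pow(2, -n, m)) % m  # pow with negative exponent: Python 3.8+
    print(montgomery_multiply(a, b, m, n) == expected)  # True
```

Each loop iteration is one step of the kind a pipelined design would spread across clock cycles; the result carries an extra factor of 2^(-n), which implementations typically cancel by keeping operands in Montgomery form.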


12 pages, 1170 KiB

Open Access Article

New Systolic Array Algorithms and VLSI Architectures for 1-D MDST

by Doru Florin Chiper and Arcadie Cracan

Sensors 2023, 23(13), 6220; https://doi.org/10.3390/s23136220 - 7 Jul 2023

Viewed by 1818

Abstract

In this paper, we present two systolic array algorithms for efficient Very-Large-Scale Integration (VLSI) implementations of the 1-D Modified Discrete Sine Transform (MDST). The new algorithms decompose the computation of the MDST into modular and regular computational structures called pseudo-circular correlation and pseudo-cycle convolution. The two computational structures have the same form. This feature can be exploited to significantly reduce the hardware complexity, since both structures can be computed on the same linear systolic array. Moreover, the second algorithm can be used to further reduce the hardware complexity by replacing the general multipliers of the first one with constant multipliers, which have significantly reduced complexity. The resulting VLSI architectures have all the advantages of cycle-convolution-based and circular-correlation-based systolic implementations, such as high speed through concurrency, efficient use of VLSI technology due to their local and regular interconnection topology, and low I/O cost. Moreover, in both architectures, an obfuscation technique can be applied cost-effectively with low overhead.

(This article belongs to the Special Issue Selected Papers from International Symposium on Electronics and Telecommunications ISETC 2022)


16 pages, 871 KiB

Open Access Article

Area-Efficient Realization of Binary Elliptic Curve Point Multiplication Processor for Cryptographic Applications

by Amer Aljaedi, Sajjad Shaukat Jamal, Muhammad Rashid, Adel R. Alharbi, Mohammed Alotaibi and Dalal J. Alanazi

Appl. Sci. 2023, 13(12), 7018; https://doi.org/10.3390/app13127018 - 10 Jun 2023

Cited by 4 | Viewed by 1899

Abstract

This paper proposes a novel hardware design for a compact crypto processor devoted to elliptic-curve point multiplication over $GF(2^{233})$. We focus on minimizing hardware usage, which we obtain using an iterative bit-serial finite field modular multiplier for polynomial coefficient multiplication. The same multiplier is also used for modular squares and inversion computations, further optimizing the hardware footprint. Our design offers flexibility by permitting users to load different curve parameters and secret keys while keeping a low-area hardware design. To efficiently generate the control signals, we utilize a finite-state-machine-based controller. We have implemented the proposed crypto processor on Virtex-6 and Virtex-7 FPGA devices, and we have evaluated its performance at clock frequencies of 100, 50, and 10 MHz. Specifically, for one point multiplication computation on Virtex-7 FPGA, our crypto processor uses 391 slices, attains a maximum frequency of 161 MHz, has a latency of 4.45 ms, and consumes 77 mW of power. These results, along with a comparison to state-of-the-art designs, clearly demonstrate the practicality of our crypto processor for applications requiring efficient and compact cryptographic computations.

(This article belongs to the Section Electrical, Electronics and Communications Engineering)


17 pages, 48424 KiB

Open Access Article

A Unified Point Multiplication Architecture of Weierstrass, Edward and Huff Elliptic Curves on FPGA

by Muhammad Arif, Omar S. Sonbul, Muhammad Rashid, Mohsin Murad and Mohammed H. Sinky

Appl. Sci. 2023, 13(7), 4194; https://doi.org/10.3390/app13074194 - 25 Mar 2023

Cited by 3 | Viewed by 1817

Abstract

This article presents an area-aware unified hardware accelerator of Weierstrass, Edward, and Huff curves over $GF(2^{233})$ for the point multiplication step in elliptic curve cryptography (ECC). The target implementation platform is a field-programmable gate array (FPGA). In order to explore the design space between processing time and various protection levels, this work employs two different point multiplication algorithms. The first is the Montgomery point multiplication algorithm for the Weierstrass and Edward curves. The second is the Double and Add algorithm for the Binary Huff curve. The area complexity is reduced by efficiently replacing storage elements, which results in a 1.93 times decrease in the size of the memory needed. An efficient Karatsuba modular multiplier hardware accelerator is implemented to compute polynomial multiplications. We utilized the square arithmetic unit after the Karatsuba multiplier to execute the quad-block variant of a modular inversion, which preserves lower hardware resources and also reduces clock cycles. Finally, to support three different curves, an efficient controller is implemented. Our unified architecture can operate at a maximum of 294 MHz and utilizes 7423 slices on Virtex-7 FPGA. It takes less computation time than most recent state-of-the-art implementations. Thus, combining different security curves (Weierstrass, Edward, and Huff) in a single design is practical for applications that demand different reliability/security levels.


21 pages, 667 KiB

Open Access Article

Towards High-Performance Supersingular Isogeny Cryptographic Hardware Accelerator Design

by Guantong Su and Guoqiang Bai

Electronics 2023, 12(5), 1235; https://doi.org/10.3390/electronics12051235 - 4 Mar 2023

Cited by 2 | Viewed by 2196

Abstract

Cryptosystems based on supersingular isogeny are a novel tool in post-quantum cryptography. One compelling characteristic is their concise keys and ciphertexts. However, the performance of supersingular isogeny computation is currently worse than that of other schemes. This is primarily due to the following factors. Firstly, the underlying field is a quadratic extension of the finite field, resulting in higher computational complexity. Secondly, the strategy for large-degree isogeny evaluation is complex and dependent on the elementary arithmetic units employed. Thirdly, adapting the same hardware to different parameters is challenging. Considering the evolution of similar curve-based cryptosystems, we believe proper algorithm optimization and hardware acceleration will reduce its speed overhead. This paper describes a high-performance and flexible hardware architecture that accelerates isogeny computation. Specifically, we optimize the design by creating a dedicated quadratic Montgomery multiplier and an efficient scheduling strategy that are suitable for supersingular isogeny. The multiplier operates on $\mathbb{F}_{p^{2}}$ under projective coordinate formulas, and the scheduling is tailored to it. By exploiting additional parallelism through replicated multipliers and concurrent isogeny subroutines, our 65 nm SMIC technology cryptographic accelerator can generate ephemeral public keys in 2.40 ms for Alice and 2.79 ms for Bob with a 751-bit prime setting. Sharing the secret key costs another 2.04 ms and 2.35 ms, respectively.


14 pages, 14164 KiB

Open Access Article

Large Field-Size Elliptic Curve Processor for Area-Constrained Applications

by Muhammad Rashid, Omar S. Sonbul, Muhammad Yousuf Irfan Zia, Nadeem Kafi, Mohammed H. Sinky and Muhammad Arif

Appl. Sci. 2023, 13(3), 1240; https://doi.org/10.3390/app13031240 - 17 Jan 2023

Cited by 7 | Viewed by 2214

Abstract

This article proposes an efficient area-optimized elliptic curve cryptographic processor architecture over $GF(2^{409})$ and $GF(2^{571})$. The proposed architecture employs Lopez-Dahab projective point arithmetic operations. To do this, a hybrid Karatsuba multiplier of 4-split polynomials is proposed. The proposed multiplier combines general Karatsuba and traditional schoolbook multiplication approaches. Moreover, the multiplier resources are reused to implement the modular squares and addition chains of the Itoh-Tsujii algorithm for inverse computations. This reuse of resources reduces the overall area requirements. The implementation is written in Verilog HDL, and results are provided for a Xilinx Virtex-7 device. In addition, the performance of the proposed design is evaluated on a 65 nm ASIC process technology. A figure-of-merit is constructed to compare the FPGA and ASIC implementations. An exhaustive comparison to existing designs in the literature shows that the proposed architecture utilizes less area. Therefore, the proposed design is well suited to area-constrained cryptographic applications.


Search Results (35)
