ParaSM2: Enhancing SM2 Cryptographic Performance via Parallel Restructuring of KDF and HASH

Kang, Hongjuan; Guo, Bing; Sun, Yufang; Zhao, Mingjie; Chen, Xin; Ye, Kui

doi:10.3390/cryptography10030042

by Hongjuan Kang ¹,²,*, Bing Guo ¹,*, Yufang Sun ¹, Mingjie Zhao ¹, Xin Chen ¹ and Kui Ye ¹

¹ College of Computer Science, Sichuan University, Chengdu 610065, China

² Sichuan Changhong Electronic Holdings Limited, Changhong (China), Mianyang 621000, China

* Authors to whom correspondence should be addressed.

Cryptography 2026, 10(3), 42; https://doi.org/10.3390/cryptography10030042 (registering DOI)

Submission received: 30 April 2026 / Revised: 13 June 2026 / Accepted: 17 June 2026 / Published: 22 June 2026

(This article belongs to the Topic Cybersecurity Symmetry: Encryption, AI, and Attack Patterns)

Abstract

In the past decade, the high computational overhead of asymmetric cryptography has remained a central challenge in end-to-end secure communication systems. To mitigate the performance bottlenecks inherent in the full SM2 encryption and decryption workflow, this paper introduces ParaSM2, a parallel restructuring optimization framework tailored for SM2-based cryptographic operations. ParaSM2 exploits the observed 2:1 processing ratio between KDF and HASH to perform cross-component parallel restructuring and applies fixed-prefix reuse together with dynamic task parallelism to eliminate 39.7% of redundant KDF computations. Furthermore, a vectorized reconstruction of the HASH message extension is incorporated to leverage SIMD parallel acceleration, thereby substantially enhancing throughput. Experimental evaluations against SM4-GCM and SM4-CBC on data blocks larger than 64 KB demonstrate that ParaSM2 achieves up to a 5.1× performance improvement on both x86 and ARM architectures, effectively reducing end-to-end latency and providing a scalable pathway for algorithmic optimization in cryptography across heterogeneous platforms.

Keywords:

secure communication; SM2 algorithm; Key Derivation Function (KDF); hash function; SIMD parallel acceleration

1. Introduction

In industrial applications such as the Internet of Things (IoT), Industrial Internet of Things (IIoT), and Internet of Vehicles (IoV), a widely adopted solution for achieving end-to-end secure communication involves the use of a hybrid cryptographic framework that integrates asymmetric and symmetric cryptographic systems. This approach effectively reconciles the requirements of both strong security and high operational efficiency. Specifically, asymmetric cryptosystems represented by Elliptic Curve Cryptography (ECC) and its Chinese national standard implementation SM2 are typically deployed for key establishment and identity authentication, whereas symmetric cryptosystems, such as Advanced Encryption Standard (AES) and SM4, are utilized for efficient data encryption during transmission.

Recent studies have further explored hybrid or improved cryptographic designs to meet the constraints of industrial communication environments. The MRA mode designed by Chang et al. [1] optimizes energy consumption and computational load for low-power LoRaWAN devices by reducing AES encryption rounds and modifying RSA’s prime structure. Lilhore et al. [2] integrated Improved Elliptic Curve Cryptography (IECC), AES-256, and Dynamic Key Management (DKM) to propose a lightweight IoV encryption architecture, achieving a 30% reduction in message transmission time. Karmous et al. [3] proposed a hybrid scheme based on ECC-256 and AES-256, reducing key generation and data transmission latency through elliptic-curve-based key exchange. To address cryptographic efficiency requirements in IoT devices, Zheng et al. [4] developed a co-design approach for SM2, SM3, and SM4 integration, demonstrating substantial improvements in both security and system performance. Many researchers are exploring the use of asymmetric cryptography to complete both key agreement and secure data encryption in an end-to-end manner. To overcome the computational bottleneck of asymmetric encryption and support long-data encryption scenarios [5], current systems commonly adopt the combined use of SM2 and SM4. Existing work includes optimization and application studies of SM2 [6,7], as well as investigations into the KDF used in ECC-based encryption and decryption [8,9]. Li et al. [10] proposed an SM2-based offline/online integrity verification scheme for IoT data storage, which concentrates on upper-layer data authentication rather than low-level KDF/HASH acceleration using SIMD instruction sets. May et al. [11] constructed smooth auxiliary curves for 13 standardized elliptic curves including SM2, focusing on the pure mathematical derivation of elliptic curves rather than practical runtime optimization via CPU vector instructions. Han et al. [12] developed a two-party SM2 protocol using Beaver’s multiplication to reduce computational overhead. Zhu et al. [13] optimized SM2 implementations for IoT environments. Bhati et al. [14] introduced Skye, a secure and efficient KDF based on the extract-then-expand paradigm, consisting of a deterministic randomness extractor and an expansion function.

Although recent studies have advanced point-operation optimization in SM2, efficiency improvements for ciphertext-generation components remain fragmented. Existing HASH optimizations do not resolve the computational resource contention between KDF and HASH. Through an in-depth examination of SM2 encryption bottlenecks, we observe that prior works overlooked the intrinsic structural and data-dependency linkage between KDF and HASH—both fundamentally rely on the HASH compression function (CF) as their core computation unit. However, traditional serial execution breaks this natural coupling into separate tasks and limits overall efficiency. This paper systematically optimizes the KDF and HASH components within SM2 and introduces a parallel acceleration framework that provides both theoretical advancements and practical engineering value.

We mathematically reveal the universal rule of first-block reuse in KDF computation and globally reuse the initial compression output to eliminate redundant operations.
We design a dual-path acceleration scheme: SIMD-based multi-message compression parallelism, and a pre-computed sparse-matrix representation of the second-block message expansion, reducing storage overhead by 39.7%.
We identify a natural 2:1 computational matching pattern between KDF and HASH and propose a dynamic task-bundling scheduler that groups two KDF blocks with one HASH block into atomic units, enabling cross-component parallel execution via SIMD and significantly improving resource utilization.

The rest of the paper is organized as follows. Section 2 describes the SM2 algorithm together with KDF, HASH, and SIMD. Section 3 presents the core contribution, a triple parallel acceleration framework. Section 4 establishes a theoretical evaluation model to assess the execution efficiency of the proposed modes. Section 5 reports the experiments and analyzes the performance results. Finally, Section 6 summarizes the work.

2. Background

SM2 Algorithm. The SM2 public key encryption algorithm [15] integrates the mathematical characteristics of elliptic curves, a key derivation mechanism, and a cryptographic hash function, forming an asymmetric encryption scheme that combines high security and high performance. The detailed process is shown in Figure 1 and Figure 2. The full input set of SM2 encryption consists of the standardized domain parameters (p, a, b, G, n, h), the recipient’s public key P_B, the plaintext payload M, and a cryptographically secure random scalar k. In this notation, G denotes the prescribed base point of the elliptic curve, P_B = [d_B]G is the receiver’s public key generated from the private key d_B, h is the curve cofactor, and [·] represents standard elliptic curve scalar multiplication. The cryptographic security of SM2 fundamentally relies on the computational hardness of the Elliptic Curve Discrete Logarithm Problem (ECDLP). The standard recommends a 256-bit prime field curve defined by the equation y² ≡ x³ + ax + b (mod p), where p is a 256-bit generalized Mersenne prime, p = 2²⁵⁶ − 2²²⁴ − 2⁹⁶ + 2⁶⁴ − 1. SM2 encryption adopts a randomized security mechanism: the random number k ensures that repeated encryptions of the same plaintext yield different ciphertexts, and C₁ changes with k, which resists chosen-plaintext attacks. In the SM2 ciphertext, C₁ carries the ephemeral public key, C₂ protects the confidentiality of the plaintext, and C₃ provides message integrity. This three-component structure balances efficiency and security and supports independent verification of the ciphertext components.
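As a quick sanity check on the prime given above, the following Python lines evaluate the closed form with arbitrary-precision integers. The name `p` follows the text. The hexadecimal constant `P_HEX` is not taken from this article; it is the recommended-curve prime as we understand it and is included only for comparison.

```python
# Evaluate the generalized Mersenne form of the SM2 field prime.
p = 2**256 - 2**224 - 2**96 + 2**64 - 1

# Hexadecimal form of the recommended 256-bit prime (assumed, for comparison only).
P_HEX = (
    "FFFFFFFE" "FFFFFFFF" "FFFFFFFF" "FFFFFFFF"
    "FFFFFFFF" "00000000" "FFFFFFFF" "FFFFFFFF"
)

print(p.bit_length())        # number of bits in p
print(p == int(P_HEX, 16))   # whether the closed form matches the hex constant
```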

Key Derivation Function (KDF). As a core cryptographic component of SM2, the KDF is responsible for extending the fixed-length shared secret into a key stream of variable length. Its design directly determines the confidentiality strength and engineering efficiency. The KDF takes as input a bit string Z and the expected bit length klen of the derived key stream, where klen must be less than (2³² − 1)·hlen and hlen is the output length of the hash function. In essence, the KDF is an iterative extension of a Pseudorandom Function (PRF) that builds a bit stream of variable output length from the hash function. The hash algorithm underlying the KDF provides collision resistance at the 2¹²⁸ security level, so that Z cannot feasibly be inferred from the output. The counter mode makes the hash outputs independent of one another, meeting the cryptographic requirements for a PRF. The number of iterations ⌈klen/hlen⌉ scales the output dynamically, which overcomes the fixed output length of the hash function and ensures length adaptability. As the cryptographic bridge connecting elliptic curve operations and symmetric encryption, the KDF directly affects the practical performance of SM2 through the efficiency of its implementation. As verified by Bhati et al. [14], restructuring the KDF with expandable PRFs removes the sequential hash bottleneck of traditional key derivation and reduces computational overhead, offering a feasible route to optimizing the standard SM2 implementation.
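The counter-mode construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: SHA-256 is used as a stand-in hash only because it ships with Python and also has a 256-bit digest, and the function name `kdf` and the placeholder input `z` are hypothetical.

```python
import hashlib

HLEN_BITS = 256  # digest size assumed by the KDF description


def kdf(z: bytes, klen_bits: int, hash_fn=hashlib.sha256) -> bytes:
    """Counter-mode KDF: leftmost klen bits of H(Z||[1]_4) || H(Z||[2]_4) || ...

    hash_fn is a stand-in (SHA-256); SM2 itself specifies its own hash.
    """
    if klen_bits % 8 != 0:
        raise ValueError("this sketch only handles whole bytes")
    rounds = -(-klen_bits // HLEN_BITS)      # ceil(klen / hlen)
    stream = b""
    for i in range(1, rounds + 1):           # the counter starts at 1
        ct = i.to_bytes(4, "big")            # [i]_4: 32-bit big-endian counter
        stream += hash_fn(z + ct).digest()
    return stream[: klen_bits // 8]          # keep the leftmost klen bits


z = bytes(range(64))                         # placeholder for the shared secret Z
key_stream = kdf(z, 1000 * 8)
print(len(key_stream))                       # 1000
```

Because each hash call depends only on Z and its own counter value, the calls are independent of one another, which is the property the later parallelization relies on.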

HASH Algorithm. For a message m of length l bits (l < 2⁶⁴), the hash algorithm generates a 256-bit hash value through padding and iterative compression [16]. As a core cryptographic component of SM2, it integrates the Merkle–Damgård iterative framework with a dedicated Boolean function design, demonstrating unique engineering optimization characteristics while ensuring 128-bit security strength.

Single Instruction Multiple Data (SIMD) is an important parallel computing architecture. It significantly enhances the performance of computer systems by allowing a single instruction to operate on multiple data elements simultaneously. In practical applications, the development of SIMD technology is closely related to the progress of processor architecture. Intel launched the Streaming SIMD Extensions (SSE) instruction set in 1999. In 2008, Intel introduced the Advanced Vector Extensions (AVX) instruction set, extending the width of SIMD registers to 256 bits and further enhancing the parallel processing capability. In 2017, the release of the AVX512 instruction set further expanded the register width to 512 bits, supporting more complex vector operations and more efficient data processing. In recent years, as hardware vector instruction sets keep evolving, SIMD vectorization has been widely adopted to speed up heavy mathematical computations of cryptographic algorithms, which builds a feasible technical basis for vectorized optimization targeting practical cipher implementations such as [17,18,19,20].

3. Method

This paper proposes a parallel acceleration framework, with the aim of systematically addressing the performance bottlenecks of the KDF and HASH in the SM2 encryption and decryption process.

3.1. Encryption Optimization of ParaSM2 (A5&A7: Component-Level Parallel Scheduling of KDF and HASH)

To alleviate the computing bottleneck, a computing-stream fusion technique is proposed for the encryption process, as shown in Figure 3. By deconstructing the intrinsic correlation between KDF and HASH, ParaSM2 realizes cross-component parallel task scheduling and resource reuse. The core breakthrough lies in discovering the computational isomorphism between the counter block processing of KDF and the message block partitioning of HASH, and in using a dynamic load balancing mechanism to eliminate the idle resources found in existing implementations. Moreover, ParaSM2 can be implemented even when only basic SSE instructions are available.

Theorem 1.

In the SM2 encryption process, the computational loads of the KDF and the C₃ generation module exhibit a natural ratio in a mathematical sense. The number T₁ of executions of the HASH block processing function CF by KDF-HASH on the second message block and the number T₂ of executions of CF when calculating C₃ stand in a ratio of approximately 2:1.

Proof.

While generating the key stream with the KDF function, HASH performs the CF iteration V_i⁽²⁾ = CF(V⁽¹⁾, B_i⁽¹⁾), i = 1, 2, …, ⌈klen/hlen⌉ on the second message block, so the number of CF executions is T₁ = ⌈klen/hlen⌉ = ⌈klen/256⌉. Let klen = 256a + r with 0 ≤ r < 256. When r > 0, ⌈klen/256⌉ = a + 1; when r = 0, ⌈klen/256⌉ = a. When calculating C₃ = hash(x₂||M||y₂) for a plaintext M of klen bits, the total number of CF executions performed by HASH is T₂ = 1 + ⌈(klen + 65)/512⌉. Let klen = 512c + r with 0 ≤ r < 512. When 0 ≤ r < 448, ⌈(klen + 65)/512⌉ = c + 1; when 448 ≤ r < 512, ⌈(klen + 65)/512⌉ = c + 2. Comparing the two expressions for klen = 512t + r, 0 ≤ r < 512, gives the relationship shown in Table 1, namely ⌈⌈klen/256⌉/2⌉ − ⌈(klen + 65)/512⌉ ∈ {−1, 0}. Substituting T₁ = ⌈klen/256⌉ and T₂ = 1 + ⌈(klen + 65)/512⌉ yields ⌈T₁/2⌉ − T₂ ∈ {−2, −1}. □
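The relationship between T₁ and T₂ can also be checked numerically. The sketch below evaluates both counts from the formulas in the proof and collects every value of T₂ − ⌈T₁/2⌉ over a range of plaintext lengths. The helper names `t1`, `t2`, and `ceil_div` are illustrative only.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def t1(klen: int) -> int:
    """CF calls made by the KDF on second message blocks."""
    return ceil_div(klen, 256)


def t2(klen: int) -> int:
    """CF calls needed to hash x2 || M || y2 when M has klen bits."""
    return 1 + ceil_div(klen + 65, 512)


# Collect every value of T2 - ceil(T1 / 2) over a range of plaintext lengths.
gaps = {t2(klen) - ceil_div(t1(klen), 2) for klen in range(1, 1 << 16)}
print(sorted(gaps))  # [1, 2]
```

So after pairing two KDF blocks with each HASH block, one or two HASH blocks are always left over, whatever the plaintext length.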

Based on the natural 2:1 ratio of computing load between the KDF and HASH components established above, ParaSM2 adopts a dynamic task bundling strategy that bundles two KDF computing tasks (CF operations on the second message block) with one HASH block CF operation into an integrated execution unit. The parallel execution of CF functions across modules is achieved with SIMD instructions, and the data of these three tasks are processed simultaneously in SSE registers. In the unified computing core, three CF functions are executed in parallel, and the multi-issue capability of the superscalar pipeline is utilized to eliminate resource contention. When the CF tasks of the KDF are completed, only a few message blocks remain at the end of the hash input of C₃, and their CF functions can be executed independently. In summary, this feature stems from the fact that both the counter block processing of the KDF and the calculation of C₃ call the CF function of HASH.
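The bundling strategy can be illustrated with a small scheduling sketch. It only plans which tasks share a bundle and performs no hashing. The function `bundle_schedule` and its return format are hypothetical, chosen for illustration rather than taken from the authors' code.

```python
def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def bundle_schedule(klen: int):
    """Group KDF counter blocks and C3 message blocks into SIMD bundles.

    Returns (bundles, leftover): each bundle is (kdf_counters, hash_block),
    and leftover lists the C3 block indices that run as plain CF calls.
    """
    kdf_blocks = ceil_div(klen, 256)             # CF calls on KDF counter blocks
    hash_blocks = 1 + ceil_div(klen + 65, 512)   # CF calls for x2 || M || y2
    bundles = []
    for b in range(ceil_div(kdf_blocks, 2)):
        first = 2 * b + 1                        # KDF counters start at 1
        counters = [c for c in (first, first + 1) if c <= kdf_blocks]
        bundles.append((counters, b))            # paired with the b-th C3 block
    leftover = list(range(len(bundles), hash_blocks))
    return bundles, leftover


bundles, leftover = bundle_schedule(1024)
print(bundles)   # [([1, 2], 0), ([3, 4], 1)]
print(leftover)  # [2, 3]
```

Bundles are issued in order, so the chained C₃ blocks are still compressed sequentially while the two KDF counter blocks in each bundle fill the otherwise idle lanes.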

3.2. Decryption Optimization of ParaSM2 (B4: Hierarchical Parallel Optimization of KDF)

In the decryption process of the SM2, as shown in Figure 4, the performance bottleneck of the KDF primarily arises from its serial iterative structure and redundant computational operations. ParaSM2 centers on reconstructing the computational logic of KDF, which achieves a significant efficiency improvement through the adoption of fixed prefix reuse and dynamic parallelization techniques. Such a design strategy not only ensures compatibility with existing cryptographic hardware but also specifically targets and mitigates the computing power bottleneck inherent in KDF.

3.2.1. KDF Computational Bottleneck and Optimization Principle

The core task of the KDF is to expand the 256-bit coordinates of the shared point generated by SM2 into a key stream of the same length as the plaintext: KDF(Z, klen) = MSB(klen, hash(Z||[1]₄)||…||hash(Z||[⌈klen/hlen⌉]₄)). Its execution requires iteratively invoking the hash algorithm to compute H_i = hash(Z||[i]₄), i = 1, 2, …, ⌈klen/hlen⌉, where Z = x₂||y₂ represents the coordinates of the shared point and [i]₄ is the 32-bit big-endian encoding of the counter i. The hash algorithm first pads the message m_i = x₂||y₂||[i]₄ to an integer multiple of 512 bits and then splits it into two complete message blocks B_i⁽⁰⁾||B_i⁽¹⁾. The first message block is B_i⁽⁰⁾ = x₂||y₂, and the second is B_i⁽¹⁾ = [i]₄||[0x80]₁||[0x00]₅₁||[l]₈, where l is the length of the hash input m_i, namely 544 bits. Next, the CF function is executed on the two message blocks B_i⁽⁰⁾ and B_i⁽¹⁾: V_i^(j+1) = CF(V^(j), B_i^(j)) = COMP(V^(j), EXT(B_i^(j))), j = 0, 1. Finally, the result V_i⁽²⁾ of the second iteration is output.
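The block layout described above can be reproduced directly. The sketch below pads m_i = x₂||y₂||[i]₄ with a single 1 bit, zero bytes, and a 64-bit big-endian length, then splits the result into two 64-byte blocks. The function name `pad_kdf_message` and the placeholder coordinates are assumptions made for illustration.

```python
def pad_kdf_message(x2: bytes, y2: bytes, i: int):
    """Pad m_i = x2 || y2 || [i]_4 and split it into two 64-byte blocks."""
    if len(x2) != 32 or len(y2) != 32:
        raise ValueError("coordinates must be 32 bytes each")
    m = x2 + y2 + i.to_bytes(4, "big")
    bit_len = 8 * len(m)                           # 544 bits
    padded = m + b"\x80"                           # a single 1 bit, then zeros
    padded += b"\x00" * ((56 - len(padded)) % 64)  # 51 zero bytes in this case
    padded += bit_len.to_bytes(8, "big")           # 64-bit message length
    return padded[:64], padded[64:]


x2, y2 = bytes(32), bytes([0xFF]) * 32             # placeholder coordinates
b0, b1 = pad_kdf_message(x2, y2, 7)
print(b0 == x2 + y2)                               # True
print(b1.hex())
```

Only the first four bytes of the second block change with the counter; the remaining 60 bytes are the same for every i.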

3.2.2. KDF Optimizes the Iteration of the First Message Block

The existing scheme has a significant drawback when performing the CF iteration of the first message block as described above: redundant computation. Throughout the KDF’s hash executions, Z = x₂||y₂ is a fixed prefix. In the existing process, this same fixed prefix (i.e., the first message block) is recomputed by the HASH-CF function in every iteration. Evaluating V⁽¹⁾ = CF(IV, B_i⁽⁰⁾) with B⁽⁰⁾ = B_i⁽⁰⁾ = x₂||y₂ for every i = 1, 2, …, ⌈klen/hlen⌉ leads to a large number of redundant CF operations. To address this problem, this mode proposes a fixed-prefix pre-computation strategy for the first message block. As mentioned earlier, the first message block is fixed at 64 bytes, B_i⁽⁰⁾ = x₂||y₂, throughout the entire KDF calculation. Therefore, the CF function of the first message block needs to be calculated only once: V⁽¹⁾ = CF(V⁽⁰⁾, B⁽⁰⁾) is computed in full and cached in the first iteration only, where V⁽⁰⁾ is the initial value constant IV of the hash algorithm. In subsequent iterations, V⁽¹⁾ is reused directly, and the CF function is executed only on the changing counter block B_i⁽¹⁾: V_i⁽²⁾ = CF(V⁽¹⁾, B_i⁽¹⁾).
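The effect of fixed-prefix reuse can be illustrated with Python's `hashlib`, whose hash objects support `copy()`. This is a sketch under a loudly stated assumption: SHA-256 stands in for the SM2 hash because it is available in the standard library and is also a Merkle–Damgård hash with 64-byte blocks. The function names are hypothetical.

```python
import hashlib


def kdf_blocks_naive(z: bytes, rounds: int):
    """Hash Z || [i]_4 from scratch for every counter value."""
    return [hashlib.sha256(z + i.to_bytes(4, "big")).digest()
            for i in range(1, rounds + 1)]


def kdf_blocks_prefix_reuse(z: bytes, rounds: int):
    """Absorb the fixed prefix Z once, then clone that state per counter."""
    prefix_state = hashlib.sha256(z)       # plays the role of the cached V(1)
    out = []
    for i in range(1, rounds + 1):
        h = prefix_state.copy()            # reuse instead of recomputing
        h.update(i.to_bytes(4, "big"))     # only the counter block is new
        out.append(h.digest())
    return out


z = bytes(range(64))                       # placeholder for x2 || y2
print(kdf_blocks_naive(z, 8) == kdf_blocks_prefix_reuse(z, 8))  # True
```

With n = ⌈klen/hlen⌉ counter values, the naive path compresses 2n blocks, whereas the reuse path compresses the prefix once and then one block per counter, for n + 1 in total.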

3.2.3. KDF Optimizes the Iteration of the Second Message Block

A serial dependency issue arises when the traditional implementation processes the second message block as described above. The existing scheme processes the second message blocks B_i⁽¹⁾ = [i]₄||[0x80]₁||[0x00]₅₁||[l]₈, which contain the counter i, strictly sequentially and does not exploit the parallel computing capabilities of modern CPUs.

To address these issues, this mode parallelizes the second-message-block computation V_i⁽²⁾ = CF(V⁽¹⁾, B_i⁽¹⁾) and designs a dual-path acceleration scheme. On the one hand, SM2-KDF uses a counter iteration mode without any feedback dependency, so the HASH-CF tasks V_i⁽²⁾ = CF(V⁽¹⁾, B_i⁽¹⁾) for the second message blocks B_i⁽¹⁾ containing the counter i can be split into independent sub-tasks, completely decoupling the hash calculations of different counter blocks. Executing the CF functions of these different message blocks in vectorized form with a SIMD instruction set lays the foundation for large-scale parallelization. SIMD registers with a width of m bits (m is usually a power of 2) hold m/32 32-bit words and therefore allow m/32 CF computations to run in parallel. For example, 128-bit SSE instructions can execute four CF computing tasks in parallel, 256-bit AVX2 instructions eight, and 512-bit AVX512 instructions 16. On the other hand, since the counter i is known in advance, the result of the message extension function EXT for the second message block B_i⁽¹⁾ = [i]₄||[0x80]₁||[0x00]₅₁||[l]₈ is pre-computed and encoded as a sparse matrix for storage. The CF function of the second message block is thereby simplified to executing the compression function COMP on the pre-stored EXT values and the reused V⁽¹⁾. This solution trades space for time. From the hash extended-word calculation scheme, it follows that not all of the 68 extended words change with the counter i.

Theorem 2.

For the message block B_i⁽¹⁾ = [i]₄||[0x80]₁||[0x00]₅₁||[l]₈ with l = 544 and i = 1, 2, …, 2³² − 1, the hash message extension computes 68 extended words (W₀, W₁, …, W₆₇), of which only 41 words change with the value of i.

Proof.

According to the formula for the message extension words, the message block B_i⁽¹⁾ is first divided into the first 16 message extension words, W₀||W₁||…||W₁₅ = B_i⁽¹⁾, expressed in hexadecimal big-endian order. Their values are W₀ = [i]₄, W₁ = 80000000, W₂ = … = W₁₄ = 00000000, W₁₅ = 00000220. Among these first 16 message extension words, only W₀ is related to the value of i. Based on the formula for the last 52 message extension words, W_j = P₁(W_j₋₁₆ ⊕ W_j₋₉ ⊕ (W_j₋₃ <<< 15)) ⊕ (W_j₋₁₃ <<< 7) ⊕ W_j₋₆, 16 ≤ j < 68, a table can be established that lists the dependencies required for calculating W_j. The specific approach is as follows.

Step 1. Initialize j = 16. Since W₀ depends on the value of i, W₀ is marked in bold wherever it appears in the subsequent steps.
Step 2. Perform the following dependency check and marking for j. List the words that W_j depends on according to the message extension formula. For example, W₁₆ depends on W₀, W₃, W₇, W₁₀, and W₁₃. Then check whether any of these words is marked in bold. If so, mark W_j in bold. For instance, W₁₆ depends on the bold W₀, meaning that W₁₆ is related to the value of i, so W₁₆ is also marked in bold.
Step 3. If j has reached 67, terminate the procedure. Otherwise, increment j by 1 and return to Step 2.

Table 2 shows the result of the above procedure, where bold entries depend on the counter value i and regular entries do not. Of the 68 extended words (W₀, W₁, …, W₆₇), 27 are independent of the value of i, namely W₁, …, W₁₅, W₁₇, W₁₈, W₂₀, W₂₁, W₂₃, W₂₄, W₂₆, W₂₇, W₃₀, W₃₃, W₃₆ and W₃₉. In other words, the counter block yields 68 message extension words, but only 41 change dynamically with the value of i, while the remaining 27 words form a fixed padding template. By constructing a sparse matrix storage structure, only the 41 dynamically changing message extension words are stored, and the intermediate state of the static template is pre-computed. Compared with the full-word storage scheme, the storage footprint is reduced by 39.7% (27/68). □
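The marking procedure in Steps 1 to 3 amounts to propagating a "depends on i" flag through the recurrence. A minimal Python sketch follows. It tracks only structural dependencies (which indices feed W_j), not actual word values, and the function name is hypothetical.

```python
def words_depending_on_counter():
    """Propagate 'depends on i' through the message-extension recurrence."""
    depends = [False] * 68
    depends[0] = True                       # W0 = [i]_4 is the only seed
    for j in range(16, 68):
        sources = (j - 16, j - 9, j - 3, j - 13, j - 6)
        depends[j] = any(depends[s] for s in sources)
    return depends


depends = words_depending_on_counter()
fixed = [j for j, d in enumerate(depends) if not d]
print(sum(depends), len(fixed))             # 41 27
print(fixed)
print(round(100 * len(fixed) / 68, 1))      # 39.7
```

The printed list of fixed indices can be compared against the 27 words named in the proof.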

3.3. Decryption Optimization of ParaSM2 (B6: Vectorized Reconstruction of Hash Message Extension)

Furthermore, ParaSM2 applies specialized SSE instruction-level optimization to the message extension function (EXT) in hash computing and builds a hierarchical acceleration architecture. It can be used for hashing when SM2 computes C₃. This design does not conflict with the core advantages of KDF precomputation and parallelization. By reconstructing the computing path of the EXT function, the performance bottleneck of HASH is relieved, forming a complete solution for collaborative module optimization. The message extension function EXT of the hash expands each 512-bit input block into 68 32-bit words (W₀ to W₆₇). Its core bottleneck lies in the strong data dependency W_j = P₁(W_j₋₁₆ ⊕ W_j₋₉ ⊕ (W_j₋₃ <<< 15)) ⊕ (W_j₋₁₃ <<< 7) ⊕ W_j₋₆, 16 ≤ j < 68. The existing serial implementation requires 68 rounds of sequential iteration.

Because W_j depends on W_j₋₃, four consecutive words cannot be expanded in parallel, so ParaSM2 falls back to three-word parallel expansion. Based on the SSE instruction set, a three-word parallel architecture divides the last 52 words (16 ≤ j < 68) into 22 three-word groups. The three words within a group do not depend on one another, so a 128-bit SSE register can be used directly to expand three 32-bit words simultaneously. ParaSM2 adopts this method for the hash when calculating C₃ with SM2. Accelerating the EXT of HASH reduces part of the time consumption, and the overall performance bottleneck shifts to the function COMP.
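Why three consecutive words can be expanded together can be seen in a scalar model. The sketch below implements the recurrence once serially and once in steps of three, where each step reads only words computed in earlier steps. It assumes the hash is SM3 and takes the permutation P₁(X) = X ⊕ (X <<< 15) ⊕ (X <<< 23) from the SM3 specification. Plain Python integers stand in for SSE lanes, so this shows the dependency structure rather than the speed-up.

```python
MASK = 0xFFFFFFFF


def rotl(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK


def p1(x: int) -> int:
    # Permutation P1 from the SM3 specification (assumed hash).
    return x ^ rotl(x, 15) ^ rotl(x, 23)


def next_word(w, j: int) -> int:
    return (p1(w[j - 16] ^ w[j - 9] ^ rotl(w[j - 3], 15))
            ^ rotl(w[j - 13], 7) ^ w[j - 6])


def load_block(block: bytes):
    return [int.from_bytes(block[4 * k:4 * k + 4], "big") for k in range(16)]


def expand_serial(block: bytes):
    w = load_block(block)
    for j in range(16, 68):
        w.append(next_word(w, j))
    return w


def expand_three_wide(block: bytes):
    w = load_block(block)
    for j in range(16, 68, 3):
        lanes = range(j, min(j + 3, 68))
        # Every lane reads only indices below j, so the lanes are independent.
        new = [next_word(w, t) for t in lanes]
        w.extend(new)
    return w


# Single padded block for the 3-byte message "abc".
block = b"abc" + b"\x80" + bytes(52) + (24).to_bytes(8, "big")
print(expand_serial(block) == expand_three_wide(block))  # True
```

A fourth lane would need W_j itself (through the W_j₋₃ term of W_j₊₃), which is why the group width stops at three.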

4. Experiments

This section systematically demonstrates the feasibility of achieving a performance leap through formal methods and theoretical models and reveals the applicable boundaries in different application scenarios.

4.1. Correctness Analysis

The optimization mode only modifies the execution sequence without changing the execution data or main steps. The optimization plan is equivalent to the original one, and its correctness can be guaranteed. The specific description is as follows.

In the encryption process, acceleration is achieved by implementing the CF functions of KDF and C₃ in parallel. As mentioned earlier, any parallel implementation scheme for CF functions across multiple message blocks must ensure the correctness of its computational results. Thus, the parallel implementation of CF does not alter the computed results, and the correctness of ParaSM2 is thereby guaranteed.

As for the decryption process, B4 adjusts only the execution order and manner of the KDF. Specifically, the way the first message block is processed in the KDF remains unchanged; its computational result is simply reused. This adjustment does not affect the result for the first message block. When the second message block is executed in parallel, the prerequisite is that any parallel implementation of the CF function across multiple message blocks must produce correct results, i.e., results consistent with executing CF on each message block individually. Similarly, when the results of the EXT function for the second message block are encoded into a sparse matrix for storage, the correctly computed EXT results are stored, so correctness is not compromised. In summary, the KDF result computed in ParaSM2 is consistent with that of the original scheme. Moreover, B6 adjusts only the hash computation of C₃: the ordinary C implementation of the message extension function EXT is replaced with an SSE instruction-level parallel implementation, without compromising correctness. Therefore, the C₃ computed in ParaSM2 is consistent with that of the original scheme.

4.2. Theoretical Analysis

The theoretical model is based on the number of executions of the iterative function CF: T_total = N_CF × T_CF, where T_CF is the time consumed by a single CF execution and N_CF is the number of CF executions. The symbols T₁, T₄, T₈, T₁₆, and T’₃ denote, respectively, the time consumed by executing the HASH-CF function once in the ordinary way, by one parallel CF execution over four message blocks using SSE instructions, by one over eight message blocks using AVX2 instructions, by one over 16 message blocks using AVX512 instructions, and by one CF execution in which EXT is optimized with three-word parallel SSE instructions while COMP is left unoptimized.

Original implementation scheme. To calculate the key stream t using the KDF, the hash values H_i = hash(x₂||y₂||[i]₄), i = 1, 2, …, ⌈klen/256⌉ must be computed, which requires a total of 2⌈klen/256⌉ executions of the CF function. To calculate C₃ = hash(x₂||M||y₂), the CF function must be executed 1 + ⌈(klen + 65)/512⌉ times. The original scheme therefore requires 1 + ⌈(klen + 65)/512⌉ + 2⌈klen/256⌉ executions of the CF function, and the total time it consumes for calculating C₂ and C₃ is (1 + ⌈(klen + 65)/512⌉ + 2⌈klen/256⌉)T₁.

In the ParaSM2 encryption process, the ordinary CF function is executed once to process the first message block of the KDF, which takes T₁. The parallel CF function is then executed ⌈⌈klen/256⌉/2⌉ times to process the second message blocks of the KDF together with the hash iteration of C₃, which takes ⌈⌈klen/256⌉/2⌉T₄ (SSE implementation). Finally, after the CF calculations required by the KDF are completed, the hash iteration of C₃ still has 1 + ⌈(klen + 65)/512⌉ − ⌈⌈klen/256⌉/2⌉ blocks on which the CF function must be executed separately. By Theorem 1, 1 + ⌈(klen + 65)/512⌉ − ⌈⌈klen/256⌉/2⌉ ∈ {1, 2}, that is, the hash iteration of C₃ requires one or two additional executions of the ordinary CF function. In total, the time consumption is T₁ + ⌈⌈klen/256⌉/2⌉T₄ + (1 + ⌈(klen + 65)/512⌉ − ⌈⌈klen/256⌉/2⌉)T₁ ≤ 3T₁ + ⌈⌈klen/256⌉/2⌉T₄.

In the ParaSM2 decryption process, only the calculation of the key stream t is optimized for the KDF function. The first message block uses the fixed-prefix precomputation mechanism, which directly reduces the ⌈klen/256⌉ CF operations required for the first message block to a single one. When the second message block of the KDF is processed with SIMD-parallel CF functions, the ⌈klen/256⌉ CF executions on the second message block are reduced to ⌈⌈klen/256⌉/p⌉ parallel CF executions over p message blocks each, and each such parallel execution takes T_p. In total, the time consumption of the KDF is T₁ + ⌈⌈klen/256⌉/p⌉T_p. For the HASH function, calculating C₃ still requires 1 + ⌈(klen + 65)/512⌉ executions of the CF function. However, because the message extension function (EXT) of each CF function is optimized at the SSE instruction level, the time consumption of a single CF function is T’₃. In total, the time consumed by the hash algorithm in calculating C₃ is (1 + ⌈(klen + 65)/512⌉)T’₃.
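The cost model of this section can be written out as three small functions that follow the step-by-step description above. This is a sketch only. The function and parameter names (`cost_original`, `cost_enc`, `cost_dec`, `t3_ext`) are illustrative, and the timing arguments are left to the caller.

```python
def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def cost_original(klen: int, t1: float) -> float:
    """Serial scheme: two CF calls per KDF counter plus the C3 blocks."""
    return (1 + ceil_div(klen + 65, 512) + 2 * ceil_div(klen, 256)) * t1


def cost_enc(klen: int, t1: float, t4: float) -> float:
    """ParaSM2 encryption: one plain CF, the bundled CFs, then the C3 tail."""
    bundles = ceil_div(ceil_div(klen, 256), 2)
    tail = 1 + ceil_div(klen + 65, 512) - bundles   # one or two blocks
    return t1 + bundles * t4 + tail * t1


def cost_dec(klen: int, t1: float, tp: float, p: int, t3_ext: float) -> float:
    """ParaSM2 decryption: KDF with p-way CF, C3 with the optimized EXT."""
    kdf = t1 + ceil_div(ceil_div(klen, 256), p) * tp
    c3 = (1 + ceil_div(klen + 65, 512)) * t3_ext
    return kdf + c3


# With unit costs the functions simply count (possibly parallel) CF invocations.
print(cost_original(1024, 1), cost_enc(1024, 1, 1), cost_dec(1024, 1, 1, 4, 1))
# 12 5 6
```

Passing measured timings instead of unit costs turns the same functions into a predicted time for a given plaintext length.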

5. Experimental Evaluation

This section evaluates the effectiveness of ParaSM2 through systematic testing on the x86 architecture and demonstrates that the solution is also applicable to the ARM architecture. The optimization mechanism is then analyzed in depth using microarchitecture performance counters. Finally, the impact of the AVX* instructions on the performance improvement of ParaSM2 is tested. To ensure the accuracy of the performance results, all experiments were conducted in a strictly isolated environment, and the median of 21 runs is reported.

5.1. Efficiency Analysis

The processor of our testbed is an Intel Core i7-11700 (Rocket Lake-S architecture) with eight physical cores and 16 threads. The base frequency is 2.5 GHz, and the single-core turbo frequency reaches 4.9 GHz. Each core has its own 512 KB L2 cache and 64 KB L1 cache. Its instruction set covers SSE, AVX2, and AVX512, which significantly improves the parallel processing ability of cryptographic operations. The platform is equipped with 32 GB of DDR4 memory running at a 3200 MHz reference frequency in dual-channel mode. The operating system is Windows 10 Professional, the development environment is Visual Studio 2019, and the key compilation options are /O2 (maximize speed) and /Ot (favor fast code).

5.2. Experimental Results and Discussion

The performance of the basic components under different implementation methods, namely encrypting one message packet with SM4 and iteratively compressing one message packet with the hash function, was tested using OpenSSL v3.5.1. The time consumption T_SM₄ for one 128-bit block encryption with the SM4 algorithm was 98.467 milliseconds (ms). For the iterative function CF executed by the hash algorithm, the time consumption T₁ for executing the HASH-CF function once on a 512-bit message packet is 318.371 ms. The parallel computing time of the CF function for four message packets implemented with SSE instructions is T₄ = 367.922 ms; for eight message packets with AVX2 instructions, T₈ = 389.953 ms; and for 16 message packets with AVX512 instructions, T₁₆ = 414.797 ms. In addition, when only the EXT function inside CF is optimized with SSE instructions for three rounds of iterations (the remaining Comp is implemented in the ordinary way), a single CF takes T’₃ = 238.531 ms. Computing the CF function for four packets in parallel with SSE instructions (T₄) is slightly slower than one ordinary CF call (T₁), but it processes four packets in that time. Therefore, when more than one packet has to be processed, the parallel implementation is clearly better. The same holds for the other SIMD implementations.
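To make the comparison between T₁ and the SIMD timings concrete, the following Python sketch divides each batch time by the number of packets it covers. It uses only the timings reported above (in the same unit as the source); the helper names are hypothetical, and the result is an idealized estimate that assumes every SIMD batch is completely filled.

```python
# Timings reported in the text above, in the same unit as the source.
T1 = 318.371                       # one ordinary CF call on one packet
SIMD = {
    "SSE": (4, 367.922),           # (packets per batch, batch time T4)
    "AVX2": (8, 389.953),          # T8
    "AVX512": (16, 414.797),       # T16
}

def per_packet_cost(lanes, batch_time):
    """Amortized time per packet when a batch is completely filled."""
    return batch_time / lanes

def amortized_speedup(lanes, batch_time, scalar_time=T1):
    """Ratio of the scalar CF time to the amortized SIMD time per packet."""
    return scalar_time / per_packet_cost(lanes, batch_time)

for name, (lanes, batch_time) in SIMD.items():
    print(f"{name}: {per_packet_cost(lanes, batch_time):.3f} per packet, "
          f"{amortized_speedup(lanes, batch_time):.2f}x vs. ordinary CF")
```

A partially filled batch costs the full batch time, so the amortized advantage shrinks when the number of packets is not a multiple of the lane count.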

Encryption Performance. Given the universality of SSE instructions, we tested the encryption performance of different modes under SSE instructions on the x86 architecture, including SM4 direct encryption (SM4-GCM [21], SM4-CBC [22]) and our ParaSM2, as shown in Figure 5A. Note that SM4-GCM already includes a HASH process, whereas for SM4-CBC we added a HASH process for the experiments. In addition, we tested the impact of three different instruction sets (SSE, AVX2, AVX512) on encryption performance, as shown in Figure 6A. For example, “ParaSM2-SSE” indicates the use of SSE instructions.

Decryption Performance. Similarly, we conducted decryption performance tests in different modes, as shown in Figure 5B. The impact of different instruction sets on decryption performance is shown in Figure 6B.

Architectural Adaptability. To illustrate the adaptability of ParaSM2, we also conducted encryption and decryption performance tests on the ARM architecture (Kunpeng 920), as shown in Figure 7. The Kunpeng server was chosen because of its growing and important position in the cloud service and end-side service markets. Because the AVX2 and AVX512 instruction sets are not available on the ARM architecture, the performance of ParaSM2 with different instruction sets on ARM is beyond the scope of this paper.

End-to-End Performance. The previous experiments have evaluated the pure encryption and decryption latency of ParaSM2 on both x86 and ARM architectures. This section further conducts end-to-end latency comparison experiments between ParaSM2 and the industrial mainstream SM2 digital envelope, as illustrated in Table 3 and Table 4. The experimental results indicate that the performance improvement of ParaSM2 is limited for small data payloads due to the overhead of task scheduling initialization. In contrast, when the data size reaches 64 KB or above, ParaSM2 fully exploits the advantages of parallel optimization and achieves prominent and stable end-to-end acceleration performance.

Result Discussion. SM2 and SM4 adopt different cryptographic mechanisms, design goals, and application modes, so their performance cannot be compared directly. To evaluate ParaSM2 fully under practical working conditions, we set up two groups of comparative experiments. First, we compare the pure encryption and decryption latency of ParaSM2 with the industrial mainstream SM4 to verify its performance at the level of basic cryptographic operations. Second, we conduct end-to-end tests against the SM2 digital envelope scheme to evaluate its practical service performance. The results show, firstly, that our scheme exhibits a similar performance trend on both the ARM and x86 architectures, which indicates that ParaSM2 adapts to the two currently popular architectures. Secondly, when the data volume is not large (≤1 MB), the three modes have similar overhead. When the data volume reaches 4 MB, however, the efficiency improvement of ParaSM2 approaches its optimization limit, and ParaSM2 significantly outperforms both SM4 modes. Moreover, although the higher-order AVX512 instructions do perform better, the encryption and decryption time with ordinary SSE instructions does not exceed 360 ms. The advantages of ParaSM2 therefore become apparent at larger data volumes. Depending on the hardware configuration and scenario requirements, users can choose ParaSM2 with the appropriate instruction set.

6. Conclusions

This research achieves two key advancements in SM2 optimization. Firstly, it provides a new paradigm for cross-component task integration and scheduling at the cryptographic system level. By bundling KDF and HASH tasks, binding two KDF sub-blocks and one HASH sub-block as an atomic unit, and using SIMD to parallelize the compression of heterogeneous tasks, this technique breaks the performance ceiling of traditional isolated module optimization, boosting encryption/decryption efficiency by 4.3 times and providing a novel methodology for cryptographic system engineering. Secondly, it provides a notable improvement in KDF efficiency. Based on the fixed-prefix reuse property of KDF (i.e., the first compression function input of all counter blocks is the shared point coordinates (x₂, y₂)), a dual acceleration mechanism combining precomputation and SIMD parallelism is proposed. This mechanism globally reuses the first block’s compression result, eliminates half of the redundant computations, and uses the AVX512 instruction set to compute the message expansion function EXT and the compression function COMP of 16 counter blocks in parallel. Measurements show that this optimization increases KDF throughput by 5.1 times (in 16 MB encryption and 4 MB decryption scenarios) and reduces end-to-end latency significantly. ParaSM2 establishes a viable alternative to the prevalent SM2-SM4 hybrid cryptographic paradigm, facilitating standalone end-to-end encryption for data payloads exceeding 64 KB and delivering a novel implementation route for cryptographic engineering in industrial scenarios. Finally, experimental results show that, compared with SM4-GCM and SM4-CBC, ParaSM2 performs better on both x86 and ARM architectures and can be further improved through instruction set selection.
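The fixed-prefix reuse idea can be illustrated with a minimal Python sketch. Several things here are assumptions and not the paper's implementation: SHA-256 from `hashlib` stands in for the SM2 hash function because `hashlib` does not guarantee an SM3 backend, the counter is taken to be a 32-bit big-endian value starting at 1, `klen` is assumed to be a multiple of 8, and the function names are hypothetical. The sketch shows only the reuse of the shared prefix state, not the SIMD batching.

```python
import hashlib

def kdf_naive(z: bytes, klen_bits: int) -> bytes:
    """Counter-mode KDF that rehashes the shared prefix z for every counter."""
    blocks = -(-klen_bits // 256)              # ceil(klen / 256)
    out = b""
    for ct in range(1, blocks + 1):
        out += hashlib.sha256(z + ct.to_bytes(4, "big")).digest()
    return out[: klen_bits // 8]               # assumes klen is a multiple of 8

def kdf_prefix_reuse(z: bytes, klen_bits: int) -> bytes:
    """Same output, but the hash state after absorbing z is computed once."""
    blocks = -(-klen_bits // 256)
    prefix_state = hashlib.sha256(z)           # shared work on (x2 || y2)
    out = b""
    for ct in range(1, blocks + 1):
        h = prefix_state.copy()                # clone the state, do not rehash z
        h.update(ct.to_bytes(4, "big"))
        out += h.digest()
    return out[: klen_bits // 8]
```

Both functions return identical key streams; the second one lets the library reuse whatever work it has already done on the 64-byte prefix, which is the effect the precomputation described above aims for.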

Moreover, several directions remain to be explored. On the one hand, as the current solution relies on general SIMD (e.g., AVX512) and underutilizes dedicated cryptographic instructions, future work will consider HASH-specific instruction extensions. On the other hand, side-channel attack resistance should be enhanced. While existing schemes resist basic timing attacks via constant-time design, protection against high-order side channels (e.g., cache timing attacks, fault injection) is insufficient. Integrating multiple mechanisms will be critical, such as reconstructing EXT function branches and table lookups into branch-free masked selections, eliminating microarchitecture dependencies, and dynamically perturbing precomputed base addresses to disrupt cache template attacks. Our work enables a leap in SM2 performance and establishes a new paradigm for cross-module task integration at the cryptographic system level. This framework can be extended to other public-key encryption schemes, such as ECIES, or to the joint optimization of RSA-OAEP and AES-GCM.

Author Contributions

Conceptualization, H.K. and B.G.; methodology, H.K.; software, H.K.; validation, H.K., Y.S. and M.Z.; formal analysis, X.C.; investigation, K.Y.; resources, H.K.; data curation, H.K.; writing—original draft preparation, H.K.; writing—review and editing, B.G.; visualization, H.K.; supervision, B.G.; project administration, H.K.; funding acquisition, B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China under Grant No. U2268204 and 62172061; National Key R&D Program of China under Grant No. 2023YFB3308300; the Science and Technology Project of Sichuan Province under Grant No. 2024ZDZX0012, 2023ZHCG0011, 2021YFG0152.

Data Availability Statement

The data generated and analyzed during this study are not publicly available due to confidentiality restrictions. Any reasonable request for the relevant data can be made by contacting the corresponding author via email.

Conflicts of Interest

Author Hongjuan Kang was employed by the company Sichuan Changhong Electronic Holdings Limited. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Chang, Q.; Ma, T.; Yang, W. Low power IoT device communication through hybrid AES-RSA encryption in MRA mode. Sci. Rep. 2025, 15, 14485.
2. Lilhore, U.K.; Simaiya, S.; Dalal, S.; Sharma, Y.K.; Tomar, S.; Hashmi, A. Secure WSN Architecture Utilizing Hybrid Encryption with DKM to Ensure Consistent IoV Communication. Wirel. Pers. Commun. 2024.
3. Karmous, N.; Hizem, M.; Dhiab, Y.B.; Aoueileyine, M.O.-E.; Bouallegue, R.; Youssef, N. Hybrid Cryptographic End-to-End Encryption Method for Protecting IoT Devices Against MitM Attacks. Radio Eng. 2024, 33, 583–592.
4. Zheng, X.; Xu, C.Y.; Hu, X.H.; Zhang, Y.; Xiong, X. The software/hardware co-design and implementation of SM2/3/4 encryption/decryption and digital signature system. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 2055–2066.
5. Li, P.; Ou, W.; Liang, H.; Han, W.; Zhang, Q.; Zeng, G. A zero trust and blockchain-based defense model for smart electric vehicle chargers. J. Netw. Comput. Appl. 2023, 213, 103599.
6. Liu, Z.; Liang, T.; Lyu, J.; Lang, D. A security-enhanced scheme for MQTT protocol based on domestic cryptographic algorithm. Comput. Commun. 2024, 221, 1–9.
7. Hu, A.; Wu, H.; Liu, C. A Novel Weakness of SM2 Algorithm. In Proceedings of the 2024 14th International Conference on Information Technology in Medicine and Education (ITME), Guiyang, China, 13–15 September 2024; pp. 878–880.
8. Backendal, M.; Clermont, S.; Fischlin, M.; Günther, F. Key derivation functions without a grain of salt. In Proceedings of the Annual International Conference on the Theory and Applications of Cryptographic Techniques, Madrid, Spain, 4–8 May 2025; pp. 393–426.
9. Nair, V.; Song, D. Multi-Factor Key Derivation Function (MFKDF) for Fast, Flexible, Secure, & Practical Key Management. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23), Anaheim, CA, USA, 9–11 August 2023; pp. 2097–2114.
10. Li, X.; Yi, Z.; Li, R.; Wang, X.-A.; Li, H.; Yang, X. SM2-based offline/online efficient data integrity verification scheme for multiple application scenarios. Sensors 2023, 23, 4307.
11. May, A.; Schneider, C. Dlog is Practically as Hard (or Easy) as DH-Solving Dlogs via DH Oracles on EC Standards. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2023, 2023, 146–166.
12. Han, G.; Bai, X.; Geng, S.; Qin, B. Efficient two-party SM2 signing protocol based on secret sharing. J. Syst. Archit. 2022, 132, 102738.
13. Zhu, H.; Li, D.; Sun, Y.; Chen, Q.; Tian, Z.; Song, Y. Optimization of SM2 algorithm based on polynomial segmentation and parallel computing. Electronics 2024, 13, 4661.
14. Bhati, A.S.; Dufka, A.; Andreeva, E.; Roy, A.; Preneel, B. Skye: An Expanding PRF based Fast KDF and its Applications. In Proceedings of the 19th ACM Asia Conference on Computer and Communications Security; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1082–1098.
15. GB/T 32918-2016; Information Security Technology—Public Key Cryptographic Algorithm SM2 Based on Elliptic Curves. General Administration of Quality Supervision, Inspection and Quarantine of P.R. China; Standardization Administration of China. China Standards Press: Beijing, China, 2016.
16. GB/T 32905-2016; Information Security Techniques—SM3 Cryptographic Hash Algorithm. General Administration of Quality Supervision, Inspection and Quarantine of P.R. China; Standardization Administration of China. China Standards Press: Beijing, China, 2016.
17. Cheng, Y. Study on the Encryption and Decryption of a Hybrid Domestic Cryptographic Algorithm in Secure Transmission of Data Communication. Int. J. Netw. Secur. 2022, 24, 947–952.
18. Ye, Z.; Song, R.; Zhang, H.; Chen, D.; Cheung, R.C.-C.; Huang, K. A Highly-efficient Lattice-based Post-Quantum Cryptography Processor for IoT Applications. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2024, 2024, 130–153.
19. Chen, L.; Tang, Y.; Zhao, L.; Gong, Z. SIMD Optimizations of White-Box Block Cipher Implementations with the Self-equivalence Framework. In Proceedings of the International Conference on Information Security and Cryptology, Seoul, Republic of Korea, 20–22 November 2024; pp. 129–149.
20. Polubelova, M.; Bhargavan, K.; Protzenko, J.; Beurdouche, B.; Fromherz, A.; Kulatova, N.; Zanella-Béguelin, S. HACLxN: Verified Generic SIMD Crypto (for all your favourite platforms). In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security; Association for Computing Machinery: New York, NY, USA, 2020; pp. 899–918.
21. NIST SP 800-38D[EB/OL]; Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC. National Institute of Standards and Technology: Gaithersburg, MD, USA, 2001.
22. NIST SP 800-38A[EB/OL]; Recommendation for Block Cipher Modes of Operation: Methods and Techniques. National Institute of Standards and Technology: Gaithersburg, MD, USA, 2001.

Figure 1. SM2 encryption process.

Figure 2. SM2 decryption process.

Figure 3. Action areas of the ParaSM2 optimization in the encryption process. Based on the natural 2:1 ratio observed between A5 and A7, the blocks are processed in parallel to enhance execution efficiency.

Figure 4. Action areas of the ParaSM2 optimization in the decryption process. Execution efficiency is improved by optimizing the KDF and HASH functions in B4 and B6.

Figure 5. The encryption and decryption performance of ParaSM2 on x86. (A) Encryption performance; (B) Decryption performance.

Figure 6. The encryption and decryption performance of ParaSM2 with the SSE, AVX2, and AVX512 instruction sets. (A) Encryption performance; (B) Decryption performance.

Figure 7. The encryption and decryption performance of ParaSM2 on ARM. (A) Encryption performance; (B) Decryption performance.

Table 1. Values of ⌈klen/256⌉ and ⌈(klen + 65)/512⌉.

r	⌈klen/256⌉	⌈(klen + 65)/512⌉
r = 0	2t	t + 1
1 ≤ r ≤ 255	2t + 1	t + 1
256 ≤ r ≤ 447	2t + 2	t + 1
448 ≤ r ≤ 511	2t + 2	t + 1

Table 2. Dependencies for Calculating W_j.

W_j	Words Required for Calculating W_j
W₁₆	W₀	W₃	W₇	W₁₀	W₁₃
W₁₇	W₁	W₄	W₈	W₁₁	W₁₄
W₁₈	W₂	W₅	W₉	W₁₂	W₁₅
W₁₉	W₃	W₆	W₁₀	W₁₃	W₁₆
W₂₀	W₄	W₇	W₁₁	W₁₄	W₁₇
W₂₁	W₅	W₈	W₁₂	W₁₅	W₁₈
W₂₂	W₆	W₉	W₁₃	W₁₆	W₁₉
W₂₃	W₇	W₁₀	W₁₄	W₁₇	W₂₀
W₂₄	W₈	W₁₁	W₁₅	W₁₈	W₂₁
W₂₅	W₉	W₁₂	W₁₆	W₁₉	W₂₂
W₂₆	W₁₀	W₁₃	W₁₇	W₂₀	W₂₃
W₂₇	W₁₁	W₁₄	W₁₈	W₂₁	W₂₄
W₂₈	W₁₂	W₁₅	W₁₉	W₂₂	W₂₅
W₂₉	W₁₃	W₁₆	W₂₀	W₂₃	W₂₆
W₃₀	W₁₄	W₁₇	W₂₁	W₂₄	W₂₇
W₃₁	W₁₅	W₁₈	W₂₂	W₂₅	W₂₈
W₃₂	W₁₆	W₁₉	W₂₃	W₂₆	W₂₉
W₃₃	W₁₇	W₂₀	W₂₄	W₂₇	W₃₀
W₃₄	W₁₈	W₂₁	W₂₅	W₂₈	W₃₁
W₃₅	W₁₉	W₂₂	W₂₆	W₂₉	W₃₂
W₃₆	W₂₀	W₂₃	W₂₇	W₃₀	W₃₃
W₃₇	W₂₁	W₂₄	W₂₈	W₃₁	W₃₄
W₃₈	W₂₂	W₂₅	W₂₉	W₃₂	W₃₅
W₃₉	W₂₃	W₂₆	W₃₀	W₃₃	W₃₆
W₄₀	W₂₄	W₂₇	W₃₁	W₃₄	W₃₇
W₄₁	W₂₅	W₂₈	W₃₂	W₃₅	W₃₈
W₄₂	W₂₆	W₂₉	W₃₃	W₃₆	W₃₉
W₄₃	W₂₇	W₃₀	W₃₄	W₃₇	W₄₀
W₄₄	W₂₈	W₃₁	W₃₅	W₃₈	W₄₁
W₄₅	W₂₉	W₃₂	W₃₆	W₃₉	W₄₂
W₄₆	W₃₀	W₃₃	W₃₇	W₄₀	W₄₃
W₄₇	W₃₁	W₃₄	W₃₈	W₄₁	W₄₄
W₄₈	W₃₂	W₃₅	W₃₉	W₄₂	W₄₅
W₄₉	W₃₃	W₃₆	W₄₀	W₄₃	W₄₆
W₅₀	W₃₄	W₃₇	W₄₁	W₄₄	W₄₇
W₅₁	W₃₅	W₃₈	W₄₂	W₄₅	W₄₈
W₅₂	W₃₆	W₃₉	W₄₃	W₄₆	W₄₉
W₅₃	W₃₇	W₄₀	W₄₄	W₄₇	W₅₀
W₅₄	W₃₈	W₄₁	W₄₅	W₄₈	W₅₁
W₅₅	W₃₉	W₄₂	W₄₆	W₄₉	W₅₂
W₅₆	W₄₀	W₄₃	W₄₇	W₅₀	W₅₃
W₅₇	W₄₁	W₄₄	W₄₈	W₅₁	W₅₄
W₅₈	W₄₂	W₄₅	W₄₉	W₅₂	W₅₅
W₅₉	W₄₃	W₄₆	W₅₀	W₅₃	W₅₆
W₆₀	W₄₄	W₄₇	W₅₁	W₅₄	W₅₇
W₆₁	W₄₅	W₄₈	W₅₂	W₅₅	W₅₈
W₆₂	W₄₆	W₄₉	W₅₃	W₅₆	W₅₉
W₆₃	W₄₇	W₅₀	W₅₄	W₅₇	W₆₀
W₆₄	W₄₈	W₅₁	W₅₅	W₅₈	W₆₁
W₆₅	W₄₉	W₅₂	W₅₆	W₅₉	W₆₂
W₆₆	W₅₀	W₅₃	W₅₇	W₆₀	W₆₃
W₆₇	W₅₁	W₅₄	W₅₈	W₆₁	W₆₄
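The dependency pattern in Table 2 follows from the SM3 message expansion recurrence, W_j = P₁(W_{j−16} ⊕ W_{j−9} ⊕ (W_{j−3} ⋘ 15)) ⊕ (W_{j−13} ⋘ 7) ⊕ W_{j−6}. The Python sketch below reproduces the table rows and checks that words can be expanded three at a time, since the nearest input is W_{j−3}. The function names are hypothetical, and reading the three-way grouping as the reason behind the SSE optimization of EXT is an assumption, not a statement from the paper.

```python
MASK = 0xFFFFFFFF

def rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def p1(x):
    """SM3 permutation P1 used in message expansion."""
    return x ^ rotl(x, 15) ^ rotl(x, 23)

def deps(j):
    """Indices of the words needed for W_j (one row of Table 2)."""
    return [j - 16, j - 13, j - 9, j - 6, j - 3]

def expand_word(w, j):
    return p1(w[j - 16] ^ w[j - 9] ^ rotl(w[j - 3], 15)) ^ rotl(w[j - 13], 7) ^ w[j - 6]

def expand_sequential(block_words):
    """Reference expansion of W_0..W_15 to W_0..W_67, one word at a time."""
    w = list(block_words)
    for j in range(16, 68):
        w.append(expand_word(w, j))
    return w

def expand_in_groups(block_words, lanes=3):
    """Expand `lanes` consecutive words per step, as a SIMD unit would."""
    w = list(block_words)
    j = 16
    while j < 68:
        group = range(j, min(j + lanes, 68))
        # Every lane may only read words that existed before this step.
        if any(d >= j for k in group for d in deps(k)):
            raise ValueError("lanes in this group depend on each other")
        w.extend([expand_word(w, k) for k in group])
        j += lanes
    return w
```

With `lanes=3` the grouped expansion matches the sequential one, while `lanes=4` raises `ValueError` because the fourth word of a group would need the first word of the same group.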

Table 3. End-to-end encryption delay (ms) of the schemes on the x86 and ARM architectures, by data size.

Scheme	64 KB	256 KB	1 MB	4 MB	16 MB	64 MB
SM2 Digital Envelope@x86	1.26	2.923	9.136	33.649	131.882	518.54
ParaSM2@x86	0.946	1.844	5.403	19.779	77.049	301.14
SM2 Digital Envelope@ARM	1.141	2.476	8.857	31.623	123.959	482.85
ParaSM2@ARM	0.869	1.631	4.892	17.596	69.551	271.08

Table 4. End-to-end decryption delay (ms) of the schemes on the x86 and ARM architectures, by data size.

Scheme	64 KB	256 KB	1 MB	4 MB	16 MB	64 MB
SM2 Digital Envelope@x86	0.962	2.715	8.023	30.776	124.829	511.93
ParaSM2@x86	0.676	1.689	5.87	22.55	88.42	353.68
SM2 Digital Envelope@ARM	0.916	2.501	7.365	28.621	118.019	472.09
ParaSM2@ARM	0.695	1.818	5.898	22.532	88.953	351.21
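The delays in Tables 3 and 4 can be turned into speedup ratios with a short Python sketch. The numbers are copied from the two tables; the dictionary layout and the `speedups` helper are hypothetical names introduced only for this illustration.

```python
SIZES = ["64 KB", "256 KB", "1 MB", "4 MB", "16 MB", "64 MB"]

# (SM2 digital envelope delays, ParaSM2 delays) in ms, from Tables 3 and 4.
ENCRYPTION = {
    "x86": ([1.26, 2.923, 9.136, 33.649, 131.882, 518.54],
            [0.946, 1.844, 5.403, 19.779, 77.049, 301.14]),
    "ARM": ([1.141, 2.476, 8.857, 31.623, 123.959, 482.85],
            [0.869, 1.631, 4.892, 17.596, 69.551, 271.08]),
}
DECRYPTION = {
    "x86": ([0.962, 2.715, 8.023, 30.776, 124.829, 511.93],
            [0.676, 1.689, 5.87, 22.55, 88.42, 353.68]),
    "ARM": ([0.916, 2.501, 7.365, 28.621, 118.019, 472.09],
            [0.695, 1.818, 5.898, 22.532, 88.953, 351.21]),
}

def speedups(envelope, parasm2):
    """Element-wise ratio of digital-envelope delay to ParaSM2 delay."""
    return [e / p for e, p in zip(envelope, parasm2)]

for label, table in (("encryption", ENCRYPTION), ("decryption", DECRYPTION)):
    for arch, (envelope, parasm2) in table.items():
        ratios = ", ".join(f"{size}: {s:.2f}x"
                           for size, s in zip(SIZES, speedups(envelope, parasm2)))
        print(f"{label} {arch}: {ratios}")
```

Every ratio is above 1, so ParaSM2 has the lower end-to-end delay at each listed size on both architectures.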

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kang, H.; Guo, B.; Sun, Y.; Zhao, M.; Chen, X.; Ye, K. ParaSM2: Enhancing SM2 Cryptographic Performance via Parallel Restructuring of KDF and HASH. Cryptography 2026, 10, 42. https://doi.org/10.3390/cryptography10030042

