Article

ParaSM2: Enhancing SM2 Cryptographic Performance via Parallel Restructuring of KDF and HASH

1
College of Computer Science, Sichuan University, Chengdu 610065, China
2
Sichuan Changhong Electronic Holdings Limited, Changhong (China), Mianyang 621000, China
*
Authors to whom correspondence should be addressed.
Cryptography 2026, 10(3), 42; https://doi.org/10.3390/cryptography10030042 (registering DOI)
Submission received: 30 April 2026 / Revised: 13 June 2026 / Accepted: 17 June 2026 / Published: 22 June 2026

Abstract

In the past decade, the high computational overhead of asymmetric cryptography has remained a central challenge in end-to-end secure communication systems. To mitigate the performance bottlenecks inherent in the full SM2 encryption and decryption workflow, this paper introduces ParaSM2, a parallel restructuring optimization framework tailored for SM2-based cryptographic operations. ParaSM2 exploits the observed 2:1 processing ratio between KDF and HASH to perform cross-component parallel restructuring and applies fixed-prefix reuse together with dynamic task parallelism to eliminate 39.7% of redundant KDF computations. Furthermore, a vectorized reconstruction of the HASH message extension is incorporated to leverage SIMD parallel acceleration, thereby substantially enhancing throughput. Experimental evaluations against SM4-GCM and SM4-CBC on data blocks larger than 64 KB demonstrate that ParaSM2 achieves up to a 5.1× performance improvement on both x86 and ARM architectures, effectively reducing end-to-end latency and providing a scalable pathway for algorithmic optimization in cryptography across heterogeneous platforms.

1. Introduction

In industrial applications such as the Internet of Things (IoT), Industrial Internet of Things (IIoT), and Internet of Vehicles (IoV), a widely adopted solution for achieving end-to-end secure communication involves the use of a hybrid cryptographic framework that integrates asymmetric and symmetric cryptographic systems. This approach effectively reconciles the requirements of both strong security and high operational efficiency. Specifically, asymmetric cryptosystems represented by Elliptic Curve Cryptography (ECC) and its Chinese national standard implementation SM2 are typically deployed for key establishment and identity authentication, whereas symmetric cryptosystems, such as Advanced Encryption Standard (AES) and SM4, are utilized for efficient data encryption during transmission.
Recent studies have further explored hybrid or improved cryptographic designs to meet the constraints of industrial communication environments. The MRA mode designed by Chang et al. [1] optimizes energy consumption and computational load for low-power LoRaWAN devices by reducing AES encryption rounds and modifying RSA’s prime structure. Lilhore et al. [2] integrated Improved Elliptic Curve Cryptography (IECC), AES-256, and Dynamic Key Management (DKM) to propose a lightweight IoV encryption architecture, achieving a 30% reduction in message transmission time. Karmous et al. [3] proposed a hybrid scheme based on ECC-256 and AES-256, reducing key generation and data transmission latency through elliptic-curve-based key exchange. To address cryptographic efficiency requirements in IoT devices, Zheng et al. [4] developed a co-design approach for SM2, SM3, and SM4 integration, demonstrating substantial improvements in both security and system performance. Many researchers are exploring the use of asymmetric cryptography to complete both key agreement and secure data encryption in an end-to-end manner. To overcome the computational bottleneck of asymmetric encryption and support long-data encryption scenarios [5], current systems commonly combine SM2 and SM4. Existing work includes optimization and application studies of SM2 [6,7], as well as investigations into the KDF used in ECC-based encryption and decryption [8,9]. Li et al. [10] proposed an SM2-based offline/online integrity verification scheme for IoT data storage, which concentrates on upper-layer data authentication rather than low-level KDF/HASH acceleration based on SIMD instruction sets. May et al. [11] constructed smooth auxiliary curves for 13 standardized elliptic curves including SM2, focusing on the mathematical derivation of elliptic curves rather than practical runtime optimization via CPU vector instructions. Han et al. [12] developed a two-party SM2 protocol using Beaver’s multiplication to reduce computational overhead. Zhu et al. [13] optimized SM2 implementations for IoT environments. Bhati et al. [14] introduced Skye, a secure and efficient KDF based on the extract-then-expand paradigm, consisting of a deterministic randomness extractor and an expansion function.
Although recent studies have advanced point-operation optimization in SM2, efficiency improvements for ciphertext-generation components remain fragmented. Existing HASH optimizations do not resolve the computational resource contention between KDF and HASH. Through an in-depth examination of SM2 encryption bottlenecks, we observe that prior works overlooked the intrinsic structural and data-dependency linkage between KDF and HASH—both fundamentally rely on the HASH compression function (CF) as their core computation unit. However, traditional serial execution breaks this natural coupling into separate tasks and limits overall efficiency. This paper systematically optimizes the KDF and HASH components within SM2 and introduces a parallel acceleration framework that provides both theoretical advancements and practical engineering value.
  • We mathematically reveal the universal rule of first-block reuse in KDF computation and globally reuse the initial compression output to eliminate redundant operations.
  • We design a dual-path acceleration scheme: SIMD-based multi-message compression parallelism, and a pre-computed sparse-matrix representation of the second-block message expansion, reducing storage overhead by 39.7%.
  • We identify a natural 2:1 computational matching pattern between KDF and HASH and propose a dynamic task-bundling scheduler that groups two KDF blocks with one HASH block into atomic units, enabling cross-component parallel execution via SIMD and significantly improving resource utilization.
The rest of the paper is organized as follows. Section 2 describes the SM2 algorithm together with KDF, HASH, and SIMD. Section 3 presents the core contribution, a triple parallel acceleration framework. Section 4 establishes a theoretical evaluation model to assess the execution efficiency of the proposed modes. Section 5 reports the experiments and analyzes the performance results. Lastly, Section 6 summarizes the entire work.

2. Background

SM2 Algorithm. The SM2 public key encryption algorithm [15] integrates the mathematical characteristics of elliptic curves, a key derivation mechanism, and a cryptographic hash function, forming an asymmetric encryption scheme that combines high security and high performance. The detailed process is shown in Figure 1 and Figure 2. The full input set of SM2 encryption consists of the standardized domain parameters $(p, a, b, G, n, h)$, the recipient’s public key $P_B$, the plaintext payload $M$, and a cryptographically secure random scalar $k$. In this notation, $G$ denotes the prescribed base point of the elliptic curve, $P_B = d_B \cdot G$ is the receiver’s public key generated from the private key $d_B$, $h$ is the curve cofactor, and $\cdot$ represents standard elliptic curve scalar multiplication. The cryptographic security of SM2 fundamentally relies on the computational hardness of the Elliptic Curve Discrete Logarithm Problem (ECDLP). The standard recommends a 256-bit prime field curve defined by the equation $y^2 \equiv x^3 + ax + b \pmod{p}$, where $p$ is a 256-bit generalized Mersenne prime, $p = 2^{256} - 2^{224} - 2^{96} + 2^{64} - 1$. SM2 encryption adopts a randomized security mechanism. The random number $k$ ensures that repeated encryptions of the same plaintext yield different ciphertexts, and $C_1$ changes with $k$, which resists chosen-plaintext attacks. In the SM2 ciphertext, $C_1$ carries the temporary public key, $C_2$ protects the confidentiality of the plaintext, and $C_3$ provides message integrity. This three-component structure balances efficiency and security and supports independent verification of the ciphertext components.
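As a quick arithmetic check of the prime stated above, p = 2^256 − 2^224 − 2^96 + 2^64 − 1, the following sketch evaluates the expression and compares it with the hexadecimal form in which the recommended SM2 prime is usually printed. The name `P_HEX` is ours and is used only for illustration.

```python
# Generalized Mersenne prime of the recommended 256-bit SM2 curve,
# evaluated directly from the expression given in the text.
p = 2**256 - 2**224 - 2**96 + 2**64 - 1

# Hexadecimal form, written as eight 32-bit words from most to least significant.
# (P_HEX is an illustrative name, not an identifier from the paper.)
P_HEX = (
    "FFFFFFFE" "FFFFFFFF" "FFFFFFFF" "FFFFFFFF"
    "FFFFFFFF" "00000000" "FFFFFFFF" "FFFFFFFF"
)

matches_hex = (p == int(P_HEX, 16))
is_256_bit = (p.bit_length() == 256)
```

Both flags evaluate to `True`, which confirms the signs of the terms in the expression.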
Key Derivation Function (KDF). As a core cryptographic component of SM2, the KDF is responsible for extending the finite-length shared key into a key stream of the required length. Its design directly determines confidentiality strength and engineering efficiency. The KDF takes as input a bit string $Z$ and the expected bit length $klen$ of the derived key stream (not exceeding $(2^{32} - 1) \cdot hlen$, where $hlen$ is the output length of the hash function). The essence of the KDF is the iterative extension of a Pseudorandom Function (PRF), which builds a bit stream of variable output length from the hash function. The underlying hash algorithm provides collision resistance at the $2^{128}$ security level, ensuring that $Z$ cannot be inferred from the output. The counter mode makes each hash output independent of the others, meeting the cryptographic requirements for a PRF. The number of iterations $\lceil klen/hlen \rceil$ expands the output dynamically, which overcomes the fixed output length of the hash function and ensures length adaptability. As a cryptographic bridge connecting elliptic curve operations and symmetric encryption, the efficient implementation of the KDF directly affects the practical performance of SM2. As verified by Bhati et al. [14], restructuring the KDF with expandable PRFs removes the sequential hash bottleneck of traditional key derivation and reduces computational overhead, offering a feasible route to optimize the standard SM2 implementation.
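The counter-mode construction described above can be sketched in a few lines of Python. This is a minimal illustration under two assumptions that are ours, not the paper's: SHA-256 from the standard library stands in for the HASH used by SM2 (both produce 256-bit digests), and klen is a positive multiple of 8 bits. The function name `kdf` is hypothetical.

```python
import hashlib

HLEN_BITS = 256  # digest length of the stand-in hash, matching hlen in the text


def kdf(z: bytes, klen_bits: int) -> bytes:
    """Counter-mode KDF sketch: leftmost klen bits of H(Z||1) || H(Z||2) || ..."""
    if klen_bits <= 0 or klen_bits % 8 != 0:
        raise ValueError("this sketch only handles positive, byte-aligned klen")
    n_iter = -(-klen_bits // HLEN_BITS)  # ceil(klen / hlen) iterations
    stream = bytearray()
    for i in range(1, n_iter + 1):
        counter = i.to_bytes(4, "big")  # 32-bit big-endian counter
        # ASSUMPTION: SHA-256 is only a stand-in for the HASH used by SM2.
        stream += hashlib.sha256(z + counter).digest()
    return bytes(stream[: klen_bits // 8])  # keep the leftmost klen bits


# Example: derive a 100-byte key stream from a 64-byte input Z.
key_stream = kdf(bytes(range(64)), 800)
```

Each iteration hashes the same Z with a different counter, which is why the iterations are independent of one another and why the output length is not bounded by a single digest.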
HASH Algorithm. The HASH algorithm takes a message $m$ of length $l$ bits ($l < 2^{64}$) and generates a 256-bit hash value through padding and iterative compression [16]. As a core cryptographic component of SM2, it combines the Merkle–Damgård iterative framework with a dedicated Boolean function design, offering distinctive engineering optimization characteristics while ensuring 128-bit security strength.
Single Instruction Multiple Data (SIMD) is an important parallel computing architecture. It significantly enhances the performance of computer systems by allowing a single instruction to operate on multiple data elements simultaneously. In practical applications, the development of SIMD technology is closely related to the progress of processor architecture. Intel launched the Streaming SIMD Extensions (SSE) instruction set in 1999. In 2008, Intel introduced the Advanced Vector Extensions (AVX) instruction set, extending the width of SIMD registers to 256 bits and further enhancing the parallel processing capability. In 2017, the release of the AVX512 instruction set further expanded the register width to 512 bits, supporting more complex vector operations and more efficient data processing. In recent years, as hardware vector instruction sets keep evolving, SIMD vectorization has been widely adopted to speed up heavy mathematical computations of cryptographic algorithms, which builds a feasible technical basis for vectorized optimization targeting practical cipher implementations such as [17,18,19,20].

3. Method

This paper proposes a parallel acceleration framework, with the aim of systematically addressing the performance bottlenecks of the KDF and HASH in the SM2 encryption and decryption process.

3.1. Encryption Optimization of ParaSM2 (A5&A7: Component-Level Parallel Scheduling of KDF and HASH)

To alleviate the computing bottleneck, the computing stream fusion technology is proposed in the encryption process, as shown in Figure 3. By deconstructing the intrinsic correlation between KDF and HASH, ParaSM2 realizes cross-component task parallelism scheduling and resource reuse. The core breakthrough lies in discovering the computational isomorphism between the counter block processing of KDF and the message block partitioning of HASH and using the dynamic load balancing mechanism to eliminate the idle resources in existing implementation. Moreover, ParaSM2 allows for implementation even when only basic SSE instructions are available.
Theorem 1.
In the SM2 encryption process, the computational loads of the KDF and the $C_3$ generation module exhibit a natural ratio. The number $T_1$ of executions of the HASH block processing function CF by the KDF on the second message block and the number $T_2$ of executions of CF when calculating $C_3$ stand in a ratio of approximately 2:1.
Proof. 
While generating the key stream with the KDF, HASH performs the CF iteration $V_i^{(2)} = CF(V^{(1)}, B_i^{(1)})$, $i = 1, 2, \ldots, \lceil klen/hlen \rceil$, on the second message block, so the number of CF executions is $T_1 = \lceil klen/hlen \rceil = \lceil klen/256 \rceil$. Writing $klen = 256a + r$ with $0 \le r < 256$, we have $\lceil klen/256 \rceil = a + 1$ when $r > 0$ and $\lceil klen/256 \rceil = a$ when $r = 0$. When calculating $C_3 = hash(x_2 \| M \| y_2)$ for a plaintext $M$ of $klen$ bits, the total number of CF executions is $T_2 = 1 + \lceil (klen + 65)/512 \rceil$. Writing $klen = 512c + r$ with $0 \le r \le 511$, we have $\lceil (klen + 65)/512 \rceil = c + 1$ when $0 \le r < 448$ and $\lceil (klen + 65)/512 \rceil = c + 2$ when $448 \le r \le 511$. Comparing the two expressions for $klen = 512t + r$, $0 \le r \le 511$, gives the relationship shown in Table 1, namely $\lceil \lceil klen/256 \rceil / 2 \rceil - \lceil (klen + 65)/512 \rceil \in \{-1, 0\}$. Substituting $T_2 = 1 + \lceil (klen + 65)/512 \rceil$ yields $\lceil T_1/2 \rceil - T_2 \in \{-2, -1\}$, so $T_1$ is approximately $2T_2$. □
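The relationship in Theorem 1 can also be checked numerically. The sketch below evaluates T1 = ⌈klen/256⌉ and T2 = 1 + ⌈(klen + 65)/512⌉ over a range of plaintext lengths and collects the values of ⌈T1/2⌉ − T2. The helper names (`ceil_div`, `cf_calls_kdf`, `cf_calls_c3`) and the tested range are our own choices.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def cf_calls_kdf(klen: int) -> int:
    """T1: CF executions of the KDF on the second message block."""
    return ceil_div(klen, 256)


def cf_calls_c3(klen: int) -> int:
    """T2: CF executions needed to hash x2 || M || y2 for a klen-bit M."""
    return 1 + ceil_div(klen + 65, 512)


# Collect ceil(T1/2) - T2 for every plaintext length from 1 bit up to 64 Kbit.
differences = {
    ceil_div(cf_calls_kdf(klen), 2) - cf_calls_c3(klen)
    for klen in range(1, 1 << 16)
}

# Ratio T1 / T2 for a 64 KB plaintext (524288 bits).
ratio_64kb = cf_calls_kdf(524288) / cf_calls_c3(524288)
```

Over this range the set `differences` contains only −2 and −1, and the ratio for a 64 KB plaintext is just under 2, consistent with the approximately 2:1 load ratio.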
Based on the natural 2:1 ratio of computing load between the KDF and HASH components established above, ParaSM2 adopts a dynamic task-bundling strategy: two KDF computing tasks (CF operations on the second message block) are bundled with one CF operation of HASH into an integrated execution unit. SIMD instructions execute the CF function across modules in parallel, and the data of these three tasks are processed simultaneously in the registers of SSE instructions. In the unified computing core, three CF functions are executed in parallel, and the multi-issue capability of the superscalar pipeline is used to avoid resource contention. When the CF tasks of the KDF are complete, only a few message blocks remain at the end of the hash input for $C_3$, and the CF function is executed on them independently. In summary, this scheme is possible because both the counter block processing of the KDF and the calculation of $C_3$ call the CF function of HASH.
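The bundling strategy can be illustrated with a small planning routine that only decides which CF calls go into which execution unit; it does not perform any hashing. This is a sketch of the idea under our own naming (`bundle_schedule` and its return layout are hypothetical), not the scheduler used in ParaSM2.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def bundle_schedule(klen_bits: int):
    """Plan the CF calls for encrypting a klen-bit plaintext.

    Returns (bundles, tail):
      bundles: list of (kdf_counters, c3_block_index); each entry is one
               execution unit holding up to two KDF counter blocks and one
               block of the C3 hash input.
      tail:    indices of the C3 blocks left over once the KDF is finished;
               these are processed by the ordinary CF one after another.
    """
    n_kdf = ceil_div(klen_bits, 256)             # KDF second-block CF calls
    n_c3 = 1 + ceil_div(klen_bits + 65, 512)     # CF calls for C3
    bundles = []
    for b in range(ceil_div(n_kdf, 2)):
        counters = [c for c in (2 * b + 1, 2 * b + 2) if c <= n_kdf]
        # C3 blocks are consumed in order, so the hash chain stays sequential.
        bundles.append((counters, b))
    tail = list(range(len(bundles), n_c3))
    return bundles, tail


# Example: a 1024-bit plaintext needs 4 KDF blocks and 4 C3 blocks.
example_bundles, example_tail = bundle_schedule(1024)
```

For the 1024-bit example, the plan contains two bundles (counters 1 and 2 with C3 block 0, counters 3 and 4 with C3 block 1) and two leftover C3 blocks, matching the one-or-two-block tail implied by Theorem 1.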

3.2. Decryption Optimization of ParaSM2 (B4: Hierarchical Parallel Optimization of KDF)

In the decryption process of the SM2, as shown in Figure 4, the performance bottleneck of the KDF primarily arises from its serial iterative structure and redundant computational operations. ParaSM2 centers on reconstructing the computational logic of KDF, which achieves a significant efficiency improvement through the adoption of fixed prefix reuse and dynamic parallelization techniques. Such a design strategy not only ensures compatibility with existing cryptographic hardware but also specifically targets and mitigates the computing power bottleneck inherent in KDF.

3.2.1. KDF Computational Bottleneck and Optimization Principle

The core task of the KDF is to expand the 256-bit coordinates of the shared point generated by SM2 into a key stream of the same length as the plaintext: $KDF(Z, klen) = MSB(klen, hash(Z \| [1]_4) \| \ldots \| hash(Z \| [\lceil klen/hlen \rceil]_4))$. Its execution requires iterative invocation of the hash algorithm to calculate $H_i = hash(Z \| [i]_4)$, $i = 1, 2, \ldots, \lceil klen/hlen \rceil$, where $Z = x_2 \| y_2$ is the concatenation of the shared point coordinates and $[i]_4$ is the 32-bit big-endian encoding of the counter $i$. The hash algorithm first pads the message $m_i = x_2 \| y_2 \| [i]_4$ to an integer multiple of 512 bits and then splits it into two complete message blocks $B_i^{(0)} \| B_i^{(1)}$. The first message block is $B_i^{(0)} = x_2 \| y_2$, and the second is $B_i^{(1)} = [i]_4 \| [\text{0x80}]_1 \| [\text{0x00}]_{51} \| [l]_8$, where $l = 544$ is the bit length of the hashed data $m_i$. Next, the CF function is executed on the two message blocks $B_i^{(0)}$ and $B_i^{(1)}$: $V_i^{(j+1)} = CF(V_i^{(j)}, B_i^{(j)}) = COMP(V_i^{(j)}, EXT(B_i^{(j)}))$, $j = 0, 1$. Finally, the result $V_i^{(2)}$ of the second iteration is output.
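The layout of the second message block can be made concrete with a short sketch. It builds the 64-byte block (4-byte counter, one 0x80 byte, 51 zero bytes, 8-byte length field with l = 544) and compares it with the output of a generic Merkle–Damgård padding routine applied to x2 || y2 || counter. The function names and the dummy all-zero coordinates are our own illustrative choices.

```python
def second_block(i: int) -> bytes:
    """Second message block: counter, 0x80, 51 zero bytes, 64-bit length."""
    length_bits = 512 + 32  # x2 || y2 (512 bits) plus the 32-bit counter = 544
    return (
        i.to_bytes(4, "big")
        + b"\x80"
        + b"\x00" * 51
        + length_bits.to_bytes(8, "big")
    )


def md_pad(msg: bytes) -> bytes:
    """Generic padding: 0x80, zero bytes, then the 64-bit message bit length."""
    zeros = (-(len(msg) + 1 + 8)) % 64
    return msg + b"\x80" + b"\x00" * zeros + (len(msg) * 8).to_bytes(8, "big")


# Dummy 64-byte prefix standing in for x2 || y2 (illustrative values only).
prefix = bytes(64)
padded = md_pad(prefix + (7).to_bytes(4, "big"))
first, second = padded[:64], padded[64:]
```

The padded message is exactly two blocks long: `first` equals the fixed prefix, and `second` equals `second_block(7)`, whose last two bytes are 0x02 0x20 (544 in hexadecimal).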

3.2.2. KDF Optimizes the Iteration of the First Message Block

The existing scheme has a significant drawback when performing the CF iteration of the first message block according to the above process: redundant computation. During the entire KDF execution, $Z = x_2 \| y_2$ is a fixed prefix. In the existing process, this same prefix (i.e., the first message block) is processed again by the HASH-CF function in each iteration. Because $B^{(0)} = B_i^{(0)} = x_2 \| y_2$, the result $V^{(1)} = CF(IV, B_i^{(0)})$ is identical for all $i = 1, 2, \ldots, \lceil klen/hlen \rceil$, so recomputing it leads to a large number of redundant CF operations. In response, this mode proposes a fixed-prefix pre-computation strategy for the first message block. As mentioned earlier, the first message block is fixed at 64 bytes, $B_i^{(0)} = x_2 \| y_2$, throughout the entire KDF calculation. The CF function of the first message block therefore needs to be calculated only once: $V^{(1)} = CF(V^{(0)}, B^{(0)})$ is computed and cached in the first iteration, where $V^{(0)}$ is the initial value constant IV of the hash algorithm. In subsequent iterations, $V^{(1)}$ is reused directly, and the CF function is executed only on the changing counter block $B_i^{(1)}$, giving $V_i^{(2)} = CF(V^{(1)}, B_i^{(1)})$.
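The effect of fixed-prefix reuse can be demonstrated with the `copy()` method of Python's `hashlib` objects, which clones the internal hash state. The sketch below absorbs the 64-byte prefix once and then derives every counter output from a copy of that state. As before, SHA-256 is only a stand-in for the HASH used by SM2, and the function names are hypothetical; how `hashlib` buffers data internally is an implementation detail, so the sketch illustrates the equivalence of results rather than the exact saving in CF calls.

```python
import hashlib


def kdf_blocks_naive(z: bytes, n_iter: int) -> bytes:
    """Hash the full message Z || counter from scratch in every iteration."""
    return b"".join(
        hashlib.sha256(z + i.to_bytes(4, "big")).digest()
        for i in range(1, n_iter + 1)
    )


def kdf_blocks_prefix_reuse(z: bytes, n_iter: int) -> bytes:
    """Absorb the fixed prefix Z once, then reuse that state per counter."""
    # ASSUMPTION: SHA-256 stands in for the HASH used by SM2.
    prefix_state = hashlib.sha256(z)       # plays the role of the cached V(1)
    out = []
    for i in range(1, n_iter + 1):
        h = prefix_state.copy()            # reuse instead of recomputing
        h.update(i.to_bytes(4, "big"))     # only the counter block is new
        out.append(h.digest())
    return b"".join(out)


z_example = bytes(range(64))               # dummy x2 || y2, illustrative only
same_output = (
    kdf_blocks_naive(z_example, 16) == kdf_blocks_prefix_reuse(z_example, 16)
)
```

The two functions return identical key streams, which is the correctness property the pre-computation strategy relies on.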

3.2.3. KDF Optimizes the Iteration of the Second Message Block

The traditional implementation also has a serial dependency issue when processing the second message block according to the above process. The existing scheme processes the second message blocks $B_i^{(1)} = [i]_4 \| [\text{0x80}]_1 \| [\text{0x00}]_{51} \| [l]_8$, which contain the counter $i$, strictly one after another, without taking advantage of the parallel computing capabilities of modern CPUs.
In response to the above issues, this mode parallelizes the second-block computation $V_i^{(2)} = CF(V^{(1)}, B_i^{(1)})$ and designs a dual-path acceleration scheme. On the one hand, because SM2-KDF uses a counter iteration mode without feedback dependencies, the HASH-CF tasks $V_i^{(2)} = CF(V^{(1)}, B_i^{(1)})$ for the second message blocks $B_i^{(1)}$ with different counters $i$ are split into independent sub-tasks, completely decoupling the hash calculations of different counter blocks. Executing the CF functions for these message blocks in vectorized form with a SIMD instruction set lays the foundation for large-scale parallelization. SIMD instructions with a register width of $m$ bits ($m$ is usually a power of 2) can hold $m/32$ 32-bit values per register and thereby execute $m/32$ CF computations in parallel. For example, SSE with 128-bit registers executes four CF computing tasks in parallel, AVX2 with 256-bit registers executes eight, and AVX512 with 512-bit registers executes 16. On the other hand, since the counter $i$ is known in advance, the result of the message extension function EXT for the second message block $B_i^{(1)} = [i]_4 \| [\text{0x80}]_1 \| [\text{0x00}]_{51} \| [l]_8$ is pre-computed and stored as a sparse matrix. The CF function of the second message block is thus reduced to executing the compression function COMP on the pre-stored EXT values and the reused $V^{(1)}$. This solution trades space for time. From the hash extended-word calculation, it follows that not all of the 68 extended words change with the counter $i$.
Theorem 2.
For the message block $B_i^{(1)} = [i]_4 \| [\text{0x80}]_1 \| [\text{0x00}]_{51} \| [l]_8$ with $l = 544$ and $i = 1, 2, \ldots, 2^{32} - 1$, the hash message extension computes 68 extended words $(W_0, W_1, \ldots, W_{67})$, of which only 41 change with the value of $i$.
Proof. 
According to the calculation formula of the message extension words, the message block $B_i^{(1)}$ is first divided into the first 16 message extension words $W_0, W_1, \ldots, W_{15}$, expressed in hexadecimal big-endian order. Their values are $W_0 = [i]_4$, $W_1 = \text{80000000}$, $W_2 = \cdots = W_{14} = \text{00000000}$, and $W_{15} = \text{00000220}$. Among these first 16 message extension words, only $W_0$ depends on the value of $i$. The last 52 message extension words are calculated as $W_j = P_1(W_{j-16} \oplus W_{j-9} \oplus (W_{j-3} \lll 15)) \oplus (W_{j-13} \lll 7) \oplus W_{j-6}$, $16 \le j < 68$, so a table can be established that lists the dependencies required for calculating each $W_j$. The specific approach is as follows.
  • Step 1. Initialize j = 16. Since W0 is related to the value of i, when W0 appears in the subsequent steps, it will be marked in bold.
  • Step 2. Perform the following statistical and coloring operations on j. List the dependent words required for calculating Wj based on the message expansion word calculation formula. For example, in W16, it depends on W0, W3, W7, W10, and W13. Then, verify whether there are bold-marked dependent words in the dependent words of Wj. If there is, mark Wj as bold. For instance, W16 depends on the bold W0, meaning that W16 is related to the value of i, so W16 is also marked as bold.
  • Step 3. If j has reached 67, terminate the statistics and coloring operations. Otherwise, after j accumulates by 1, it jumps to step 2.
Table 2 shows the results of the above statistics and coloring operations, where bold entries depend on the counter value $i$ and regular entries do not. Of the 68 extended words $(W_0, W_1, \ldots, W_{67})$, 27 are independent of $i$, namely $W_1, \ldots, W_{15}$, $W_{17}$, $W_{18}$, $W_{20}$, $W_{21}$, $W_{23}$, $W_{24}$, $W_{26}$, $W_{27}$, $W_{30}$, $W_{33}$, $W_{36}$, and $W_{39}$. In other words, the counter block yields 68 message extension words, but only 41 change dynamically with the value of $i$, while the remaining 27 words form a fixed padding template. By constructing a sparse matrix storage structure, only the 41 dynamically changing message extension words are stored, and the static template is pre-computed. Compared with the full-word storage scheme, the storage footprint is reduced by 39.7% (27 of 68 words). □
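The marking procedure in the proof of Theorem 2 is a simple dependency propagation and can be reproduced in a few lines. The sketch below marks W0 as counter-dependent and then marks every later word that reads at least one marked word through the offsets 16, 9, 3, 13, and 6 of the extension formula. It tracks only which words depend on i, not their values; the function name is our own.

```python
def counter_dependent_words() -> list:
    """Return a 68-entry list; entry j is True if W_j depends on counter i."""
    depends = [False] * 68
    depends[0] = True                      # W0 holds the counter itself
    for j in range(16, 68):
        # W_j reads W_{j-16}, W_{j-9}, W_{j-3}, W_{j-13}, and W_{j-6}.
        depends[j] = any(depends[j - k] for k in (16, 9, 3, 13, 6))
    return depends


depends = counter_dependent_words()
dynamic_count = sum(depends)
fixed_words = [j for j, d in enumerate(depends) if not d]
saving_percent = round(100 * len(fixed_words) / 68, 1)
```

Running the propagation gives 41 counter-dependent words and 27 fixed words, and the fixed indices coincide with the list stated in the proof, so the 39.7% figure is 27/68.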

3.3. Decryption Optimization of ParaSM2 (B6: Vectorized Reconstruction of Hash Message Extension)

Furthermore, ParaSM2 applies a specialized SSE instruction-level optimization to the message extension function (EXT) in hash computing and builds a hierarchical acceleration architecture. It is used for the hash when SM2 computes $C_3$. This design does not conflict with the core advantages of KDF precomputation and parallelization. By restructuring the computing path of EXT, the performance bottleneck of HASH is relieved, forming a complete solution for coordinated module optimization. The message expansion function EXT of the hash expands the 512-bit input block into 68 32-bit words ($W_0$ to $W_{67}$). Its core bottleneck lies in the strong data dependency $W_j = P_1(W_{j-16} \oplus W_{j-9} \oplus (W_{j-3} \lll 15)) \oplus (W_{j-13} \lll 7) \oplus W_{j-6}$, $16 \le j < 68$. The existing serial implementation requires 68 rounds of sequential iteration.
Because $W_j$ depends on $W_{j-3}$, four-word parallel operation is not feasible, so the design falls back to three-word parallel operation. Based on the SSE instruction set, a three-word parallel architecture is adopted to divide the last 52 words ($16 \le j < 68$) into 22 three-word groups. The three words within a group do not depend on one another, so a 128-bit SSE register can be used directly to expand three 32-bit words simultaneously. ParaSM2 adopts this method for the hash when calculating $C_3$ with SM2. Accelerating the EXT of HASH reduces part of the time consumption, and the overall performance bottleneck shifts to the function COMP.
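The reason three consecutive words can be expanded together can be shown with a scalar emulation of the three-word scheme. In the sketch below, each group computes its three new words from the words that existed before the group started and only then writes them back, which mimics three SIMD lanes. We take P1(X) = X ⊕ (X <<< 15) ⊕ (X <<< 23), the permutation used in the SM3 message expansion; the function names are ours, and the code is plain Python rather than SSE.

```python
MASK32 = 0xFFFFFFFF


def rotl(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def p1(x: int) -> int:
    """Permutation P1 of the message expansion."""
    return x ^ rotl(x, 15) ^ rotl(x, 23)


def expand_word(w, j: int) -> int:
    """W_j computed from the earlier words of w."""
    return p1(w[j - 16] ^ w[j - 9] ^ rotl(w[j - 3], 15)) ^ rotl(w[j - 13], 7) ^ w[j - 6]


def ext_serial(block_words):
    """Reference expansion: one word per iteration."""
    w = list(block_words)
    for j in range(16, 68):
        w.append(expand_word(w, j))
    return w


def ext_three_lane(block_words):
    """Expansion in groups of three words, emulating three SIMD lanes."""
    w = list(block_words) + [0] * 52
    for base in range(16, 68, 3):
        lanes = [j for j in (base, base + 1, base + 2) if j < 68]
        # Every lane reads only indices below `base` (the nearest is j - 3),
        # so all lanes of a group can be evaluated before any is written.
        new_words = [expand_word(w, j) for j in lanes]
        for j, value in zip(lanes, new_words):
            w[j] = value
    return w


# Second KDF block for counter 1: W0 = 1, W1 = 0x80000000, W15 = 0x220.
counter_block = [1, 0x80000000] + [0] * 13 + [0x220]
```

Both functions produce the same 68 words for any input block. A fourth lane would need W at index `base`, which belongs to the same group, which is why the grouping stops at three.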

4. Theoretical Evaluation

This section systematically demonstrates the feasibility of achieving a performance leap through formal methods and theoretical models and reveals the applicable boundaries in different application scenarios.

4.1. Correctness Analysis

The optimization mode only modifies the execution sequence without changing the execution data or main steps. The optimization plan is equivalent to the original one, and its correctness can be guaranteed. The specific description is as follows.
In the encryption process, acceleration is achieved by executing the CF functions of the KDF and of $C_3$ in parallel. Any parallel implementation of CF across multiple message blocks must produce the same results as executing CF on each message block individually. Under this condition, the parallel implementation of CF does not alter the computed results, and the correctness of ParaSM2 is guaranteed.
In the decryption process, B4 adjusts only the execution order and manner of the KDF. Specifically, the computation of the first message block in the KDF remains unchanged; its result is simply reused, which does not affect the result for the first message block. For the parallel execution of the second message block, the prerequisite is that any parallel implementation of CF across multiple message blocks must produce correct results, i.e., results consistent with executing CF on each message block individually. Similarly, when the EXT results for the second message block are encoded into a sparse matrix for storage, the correctly computed EXT results are stored, so correctness is not compromised. In summary, the KDF result computed in ParaSM2 is consistent with that of the original scheme. Moreover, B6 adjusts only the hash computation of $C_3$: the ordinary C implementation of the message expansion function EXT is replaced with an SSE instruction-level parallel implementation, without compromising correctness. Therefore, the $C_3$ computed in ParaSM2 is consistent with that of the original scheme.

4.2. Theoretical Analysis

The theoretical model is based on the number of executions of the iterative function CF: $T_{total} = N_{CF} \times T_{CF}$, where $T_{CF}$ is the time consumption of a single CF execution and $N_{CF}$ is the number of CF executions. The symbols $\{T_1, T_4, T_8, T_{16}, T'_3\}$ denote, respectively, the time consumption of executing the HASH-CF function once in the ordinary scheme, of executing CF on a group of four messages using SSE instructions, on a group of eight messages using AVX2 instructions, on a group of 16 messages using AVX512 instructions, and of executing CF once with three-word parallel SSE instruction-level optimization of EXT and without COMP optimization.
Original implementation scheme. To calculate the key stream $t$ using the KDF, the hash values $H_i = hash(x_2 \| y_2 \| [i]_4)$, $i = 1, 2, \ldots, \lceil klen/256 \rceil$, are computed, which requires a total of $2\lceil klen/256 \rceil$ executions of the CF function. To calculate $C_3 = hash(x_2 \| M \| y_2)$, the CF function is executed $1 + \lceil (klen + 65)/512 \rceil$ times. The original scheme therefore executes the CF function $1 + \lceil (klen + 65)/512 \rceil + 2\lceil klen/256 \rceil$ times, and the total time consumed for calculating $C_2$ and $C_3$ is $(1 + \lceil (klen + 65)/512 \rceil + 2\lceil klen/256 \rceil) T_1$.
In the ParaSM2 encryption process, the ordinary CF function is executed once to process the first message block of the KDF, which takes $T_1$. The parallel CF function is then executed $\lceil \lceil klen/256 \rceil / 2 \rceil$ times to process the second message blocks of the KDF together with the hash iteration of $C_3$, which takes $\lceil \lceil klen/256 \rceil / 2 \rceil T_4$ (with the SSE implementation). Finally, after the CF calculations required by the KDF are complete, the hash iteration of $C_3$ still has $1 + \lceil (klen + 65)/512 \rceil - \lceil \lceil klen/256 \rceil / 2 \rceil$ blocks on which the CF function must be executed separately. By Theorem 1, $1 + \lceil (klen + 65)/512 \rceil - \lceil \lceil klen/256 \rceil / 2 \rceil \in \{1, 2\}$, that is, the hash iteration of $C_3$ requires one or two additional executions of the ordinary CF function. In total, the time consumption is $T_1 + \lceil \lceil klen/256 \rceil / 2 \rceil T_4 + (1 + \lceil (klen + 65)/512 \rceil - \lceil \lceil klen/256 \rceil / 2 \rceil) T_1 \le 3T_1 + \lceil \lceil klen/256 \rceil / 2 \rceil T_4$.
In the ParaSM2 decryption process, the KDF optimization targets only the calculation of the key stream $t$. The first message block uses the fixed-prefix precomputation mechanism, which reduces the $\lceil klen/256 \rceil$ CF operations required for the first message block to a single one. When the second message block of the KDF is processed by SIMD-parallel CF, the $\lceil klen/256 \rceil$ CF executions on the second message block are reduced to $\lceil \lceil klen/256 \rceil / p \rceil$ parallel CF executions over $p$ message blocks each, where each parallel execution takes $T_p$. In total, the time consumption of the KDF is $T_1 + \lceil \lceil klen/256 \rceil / p \rceil T_p$. For the HASH function, calculating $C_3$ still requires $1 + \lceil (klen + 65)/512 \rceil$ executions of the CF function. However, because the message extension function (EXT) of each CF execution undergoes the SSE instruction-level optimization, the time consumption of a single CF execution is $T'_3$. In total, the time consumed by the hash algorithm in calculating $C_3$ is $(1 + \lceil (klen + 65)/512 \rceil) T'_3$.
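The cost expressions of this subsection translate directly into a small calculator, which can help when comparing schemes for a given plaintext length. The sketch below follows the CF counts stated above: the original scheme uses 1 + ⌈(klen + 65)/512⌉ + 2⌈klen/256⌉ ordinary CF calls; the encryption path uses one ordinary call, ⌈⌈klen/256⌉/2⌉ bundled calls, and the one or two leftover C3 blocks; the decryption path uses the KDF and C3 terms given in the last paragraph. Function and parameter names are our own, and the timing values passed in the example are arbitrary units chosen for illustration, not measurements.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def cost_original(klen: int, t1):
    """Serial scheme: every CF call costs t1."""
    n_cf = 1 + ceil_div(klen + 65, 512) + 2 * ceil_div(klen, 256)
    return n_cf * t1


def cost_parasm2_enc(klen: int, t1, t4):
    """Encryption path: first KDF block, bundled CF calls, leftover C3 blocks."""
    bundles = ceil_div(ceil_div(klen, 256), 2)
    leftover = 1 + ceil_div(klen + 65, 512) - bundles   # 1 or 2 by Theorem 1
    return t1 + bundles * t4 + leftover * t1


def cost_parasm2_dec(klen: int, t1, tp, p: int, t3_ext):
    """Decryption path: prefix reuse + p-way parallel KDF, EXT-optimized C3."""
    kdf = t1 + ceil_div(ceil_div(klen, 256), p) * tp
    c3 = (1 + ceil_div(klen + 65, 512)) * t3_ext
    return kdf + c3


# Illustrative units only: t1 = 10, t4 = 12, EXT-optimized CF = 8, p = 4 lanes.
example_original = cost_original(4096, 10)
example_enc = cost_parasm2_enc(4096, 10, 12)
example_dec = cost_parasm2_dec(4096, 10, 12, 4, 8)
```

For a 4096-bit plaintext and these illustrative units, the model gives 420 for the original scheme, 126 for the encryption path, and 138 for the decryption path.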

5. Experimental Evaluation

This part evaluates the effectiveness of ParaSM2 through systematic testing on the x86 architecture and demonstrates that the solution is also applicable to the ARM architecture. The optimization mechanism is then analyzed in depth in combination with the microarchitecture performance counters. Finally, the impact of AVX* instructions on the performance improvement of ParaSM2 is tested. To ensure the accuracy of the performance results, all experiments were conducted in a strictly isolated environment, and each reported value is the median of 21 runs.

5.1. Efficiency Analysis

The processor of our testbed is an Intel Core i7-11700 (Rocket Lake-S architecture) with eight physical cores and 16 threads. The base frequency is 2.5 GHz, and the single-core turbo frequency is up to 4.9 GHz. Each core has its own 512 KB L2 cache and 64 KB L1 cache. Its instruction set covers SSE, AVX2, and AVX512, which significantly improves the parallel processing ability of cryptographic operations. The platform is equipped with 32 GB of DDR4 memory running at a 3200 MHz reference frequency in dual-channel mode. The operating system is Windows 10 Professional, the development environment is Visual Studio 2019, and the key compilation options are maximum speed optimization (/O2) and favoring fast code (/Ot).

5.2. Experimental Results and Discussion

The performance of the basic components under different implementation methods, namely encrypting one message block with SM4 and iteratively compressing one message block with the hash, was tested using OpenSSL v3.5.1. The time consumption $T_{SM4}$ for one 128-bit block encryption of the SM4 algorithm was 98.467 milliseconds (ms). For the iterative function CF of the hash algorithm, the time consumption $T_1$ for executing the HASH-CF function once on a 512-bit message block is 318.371 ms. The parallel computing time of the CF function for four message blocks implemented with SSE instructions is $T_4 = 367.922$ ms, for eight message blocks implemented with AVX2 instructions is $T_8 = 389.953$ ms, and for 16 message blocks implemented with AVX512 instructions is $T_{16} = 414.797$ ms. In addition, when only the EXT function is optimized with three-word parallel SSE instructions (COMP is implemented in the ordinary way), one CF execution takes $T'_3 = 238.531$ ms. The SSE implementation of the CF function for four blocks ($T_4$) is only slightly slower than a single ordinary CF execution ($T_1$). Therefore, when more than one block is to be processed, the parallel implementation is clearly better. The same is true for the other SIMD implementations.
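The reported timings can be turned into a per-block cost to make the comparison explicit. The sketch below divides each parallel timing by its number of lanes and relates it to T1. It assumes, as an idealization on our part, that every SIMD lane carries a useful message block; partially filled groups would lower the gain.

```python
# Timings reported above, in the units used by the text.
T1, T4, T8, T16 = 318.371, 367.922, 389.953, 414.797

# Cost per message block when all lanes of a parallel CF call are used.
per_block = {1: T1, 4: T4 / 4, 8: T8 / 8, 16: T16 / 16}

# Idealized speedup over the ordinary single-block CF.
speedup = {lanes: T1 / cost for lanes, cost in per_block.items()}
```

Under this assumption, the per-block speedup over the ordinary CF is roughly 3.5 for SSE, 6.5 for AVX2, and 12.3 for AVX512.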
Encryption Performance. Given the universality of SSE instructions, we measured the encryption performance of the different modes under SSE instructions on the x86 architecture, including direct SM4 encryption (SM4-GCM [21] and SM4-CBC [22]) and our ParaSM2, as shown in Figure 5A. Note that, unlike SM4-GCM, which already includes a HASH process, SM4-CBC was supplemented with a HASH process in our experiments. In addition, we tested the impact of three instruction sets (SSE, AVX2, and AVX512) on encryption performance, as shown in Figure 6A. For example, “ParaSM2-SSE” denotes ParaSM2 implemented with SSE instructions.
Decryption Performance. Similarly, we conducted decryption performance tests in the different modes, as shown in Figure 5B; the impact of the different instruction sets on decryption performance is shown in Figure 6B.
Architectural Adaptability. To illustrate the adaptability of ParaSM2, we also conducted encryption and decryption performance tests on the ARM architecture (Kunpeng 920), as shown in Figure 7. The Kunpeng server was chosen because of its growing importance in the cloud and end-side service markets. Because the AVX2 and AVX512 instruction sets are not available on ARM, a comparison of ParaSM2 under different instruction sets on ARM is beyond the scope of this paper.
End-to-End Performance. The previous experiments evaluated the pure encryption and decryption latency of ParaSM2 on both x86 and ARM architectures. We further compare the end-to-end latency of ParaSM2 with that of the industrial mainstream SM2 digital envelope, as reported in Tables 3 and 4. The results indicate that the performance improvement of ParaSM2 is limited for small data payloads because of the overhead of task scheduling initialization. In contrast, when the data size reaches 64 KB or above, ParaSM2 fully exploits the advantages of parallel optimization and achieves prominent and stable end-to-end acceleration.
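As a worked example of how the end-to-end figures translate into acceleration ratios, the sketch below divides the SM2 digital envelope delay by the ParaSM2 delay for each data size. The numbers are transcribed from the x86 rows of Table 3 (encryption); the identifiers are illustrative only.

```python
# x86 encryption delays from Table 3, in ms. Identifiers are illustrative.
SIZES = ["64 KB", "256 KB", "1 MB", "4 MB", "16 MB", "64 MB"]
ENVELOPE_X86 = [1.26, 2.923, 9.136, 33.649, 131.882, 518.54]
PARASM2_X86 = [0.946, 1.844, 5.403, 19.779, 77.049, 301.14]


def speedups(baseline, optimized):
    """Element-wise ratio baseline / optimized (values above 1 mean faster)."""
    return [b / o for b, o in zip(baseline, optimized)]


if __name__ == "__main__":
    for size, ratio in zip(SIZES, speedups(ENVELOPE_X86, PARASM2_X86)):
        print(f"{size:>6s}: {ratio:.2f}x")
```

For these rows the ratio rises from about 1.33× at 64 KB to about 1.72× at 64 MB, which matches the trend described above: the fixed scheduling overhead weighs less as the payload grows.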
Result Discussion. SM2 and SM4 differ in cryptographic mechanism, design goals, and application modes, so their performance cannot be compared directly. To evaluate ParaSM2 under practical working conditions, we set up two groups of comparative experiments. First, we compare the pure encryption and decryption latency of ParaSM2 with that of the industrial mainstream SM4 to verify its performance at the level of basic cryptographic operations. Second, we conduct end-to-end tests against the SM2 digital envelope scheme to evaluate its practical service performance. The results show that ParaSM2 exhibits a similar performance trend on both ARM and x86, which indicates that it adapts well to the two currently dominant architectures. When the data volume is small (≤1 MB), the three modes have similar overhead. When the data volume reaches 4 MB, the efficiency improvement of ParaSM2 approaches its optimization limit, and ParaSM2 clearly outperforms both SM4 modes. Moreover, although the higher-order AVX512 instructions do perform better, the encryption and decryption time with ordinary SSE instructions still does not exceed 360 ms. The advantages of ParaSM2 are therefore most pronounced at larger data volumes. Depending on the hardware configuration and application requirements, users can choose the ParaSM2 variant with the appropriate instruction set.

6. Conclusions

This research achieves two key advancements in SM2 optimization. First, it provides a new paradigm for cross-component task integration and scheduling at the cryptographic system level. By bundling KDF and HASH tasks, binding two KDF sub-blocks and one HASH sub-block into an atomic unit, and using SIMD to parallelize the compression of these heterogeneous tasks, the technique breaks the performance ceiling of traditional isolated module optimization, boosting encryption/decryption efficiency by 4.3 times and providing a novel methodology for cryptographic system engineering. Second, it notably improves KDF efficiency. Building on the fixed-prefix reuse theorem in KDF (i.e., the first compression function input of all counter blocks is the shared point coordinates (x2, y2)), a dual acceleration mechanism combining pre-computation and SIMD parallelism is proposed. This mechanism globally reuses the first block’s compression result, eliminates half of the redundant computations, and uses the AVX512 instruction set to compute the message expansion function EXT and the compression function COMP of 16 counter blocks in parallel. Measurements show that this optimization increases KDF throughput by 5.1 times (in 16 MB encryption and 4 MB decryption scenarios) and reduces end-to-end latency significantly. ParaSM2 establishes a viable alternative to the prevalent SM2-SM4 hybrid cryptographic paradigm, facilitating standalone end-to-end encryption for data payloads exceeding 64 KB and delivering a novel implementation route for cryptographic engineering in industrial scenarios. Finally, experimental results show that, compared with SM4-GCM and SM4-CBC, ParaSM2 performs better on both x86 and ARM architectures and can be further improved by selecting an appropriate instruction set.
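The fixed-prefix reuse idea can be illustrated with a short sketch. It is not the authors' implementation: SHA-256 from Python's `hashlib` stands in for the paper's hash function (an assumption made only because SHA-256 is always available there and also uses 512-bit blocks and 256-bit digests), and `klen` is assumed to be a multiple of 8 bits. The function names are hypothetical.

```python
import hashlib
import math

# Stand-in hash (assumption): SHA-256 replaces the hash used in the paper.
HASH = hashlib.sha256


def kdf_naive(z: bytes, klen_bits: int) -> bytes:
    """Counter-mode KDF: hash z || ct from scratch for every counter ct."""
    n = math.ceil(klen_bits / 256)
    out = b"".join(HASH(z + ct.to_bytes(4, "big")).digest() for ct in range(1, n + 1))
    return out[: klen_bits // 8]  # assumes klen_bits is a multiple of 8


def kdf_prefix_reuse(z: bytes, klen_bits: int) -> bytes:
    """Same output, but the shared prefix z is absorbed only once."""
    n = math.ceil(klen_bits / 256)
    base = HASH(z)  # state after the shared prefix (x2 || y2 in SM2)
    blocks = []
    for ct in range(1, n + 1):
        h = base.copy()  # reuse the saved state instead of rehashing z
        h.update(ct.to_bytes(4, "big"))
        blocks.append(h.digest())
    return b"".join(blocks)[: klen_bits // 8]


if __name__ == "__main__":
    z = bytes(range(64))  # placeholder for the 64-byte x2 || y2
    assert kdf_naive(z, 4096) == kdf_prefix_reuse(z, 4096)
    print("outputs match")
```

When `z` is exactly 64 bytes it fills the first 512-bit block, so each counter only contributes the second (padding) block. Whether a given library compresses that first block eagerly is an implementation detail; the sketch only shows that reusing the saved state leaves the derived key unchanged.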
Several directions remain to be explored. First, because the current solution relies on general-purpose SIMD (e.g., AVX512) and underutilizes dedicated cryptographic instructions, future work will consider HASH-specific instruction extensions. Second, resistance to side-channel attacks should be strengthened. While existing schemes resist basic timing attacks via constant-time design, protection against high-order side channels (e.g., cache timing attacks, fault injection) is insufficient. Integrating multiple mechanisms will be critical, such as reconstructing EXT function branches and table lookups into branch-free masked selections, eliminating microarchitecture dependencies, and dynamically perturbing precomputed base addresses to disrupt cache template attacks. Our work enables a leap in SM2 performance and establishes a new paradigm for cross-module task integration at the cryptographic system level. The framework can be extended to other public-key encryption schemes, such as ECIES, or to the joint optimization of RSA-OAEP and AES-GCM.

Author Contributions

Conceptualization, H.K. and B.G.; methodology, H.K.; software, H.K.; validation, H.K., Y.S. and M.Z.; formal analysis, X.C.; investigation, K.Y.; resources, H.K.; data curation, H.K.; writing—original draft preparation, H.K.; writing—review and editing, B.G.; visualization, H.K.; supervision, B.G.; project administration, H.K.; funding acquisition, B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China under Grant Nos. U2268204 and 62172061; the National Key R&D Program of China under Grant No. 2023YFB3308300; and the Science and Technology Project of Sichuan Province under Grant Nos. 2024ZDZX0012, 2023ZHCG0011, and 2021YFG0152.

Data Availability Statement

The data generated and analyzed during this study are not publicly available due to confidentiality restrictions. Any reasonable request for the relevant data can be made by contacting the corresponding author via email.

Conflicts of Interest

Author Hongjuan Kang was employed by the company Sichuan Changhong Electronic Holdings Limited. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Chang, Q.; Ma, T.; Yang, W. Low power IoT device communication through hybrid AES-RSA encryption in MRA mode. Sci. Rep. 2025, 15, 14485.
  2. Lilhore, U.K.; Simaiya, S.; Dalal, S.; Sharma, Y.K.; Tomar, S.; Hashmi, A. Secure WSN Architecture Utilizing Hybrid Encryption with DKM to Ensure Consistent IoV Communication. Wirel. Pers. Commun. 2024.
  3. Karmous, N.; Hizem, M.; Dhiab, Y.B.; Aoueileyine, M.O.-E.; Bouallegue, R.; Youssef, N. Hybrid Cryptographic End-to-End Encryption Method for Protecting IoT Devices Against MitM Attacks. Radio Eng. 2024, 33, 583–592.
  4. Zheng, X.; Xu, C.Y.; Hu, X.H.; Zhang, Y.; Xiong, X. The software/hardware co-design and implementation of SM2/3/4 encryption/decryption and digital signature system. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 2055–2066.
  5. Li, P.; Ou, W.; Liang, H.; Han, W.; Zhang, Q.; Zeng, G. A zero trust and blockchain-based defense model for smart electric vehicle chargers. J. Netw. Comput. Appl. 2023, 213, 103599.
  6. Liu, Z.; Liang, T.; Lyu, J.; Lang, D. A security-enhanced scheme for MQTT protocol based on domestic cryptographic algorithm. Comput. Commun. 2024, 221, 1–9.
  7. Hu, A.; Wu, H.; Liu, C. A Novel Weakness of SM2 Algorithm. In Proceedings of the 2024 14th International Conference on Information Technology in Medicine and Education (ITME), Guiyang, China, 13–15 September 2024; pp. 878–880.
  8. Backendal, M.; Clermont, S.; Fischlin, M.; Günther, F. Key derivation functions without a grain of salt. In Proceedings of the Annual International Conference on the Theory and Applications of Cryptographic Techniques, Madrid, Spain, 4–8 May 2025; pp. 393–426.
  9. Nair, V.; Song, D. Multi-Factor Key Derivation Function (MFKDF) for Fast, Flexible, Secure, & Practical Key Management. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23), Anaheim, CA, USA, 9–11 August 2023; pp. 2097–2114.
  10. Li, X.; Yi, Z.; Li, R.; Wang, X.-A.; Li, H.; Yang, X. SM2-based offline/online efficient data integrity verification scheme for multiple application scenarios. Sensors 2023, 23, 4307.
  11. May, A.; Schneider, C. Dlog is Practically as Hard (or Easy) as DH-Solving Dlogs via DH Oracles on EC Standards. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2023, 2023, 146–166.
  12. Han, G.; Bai, X.; Geng, S.; Qin, B. Efficient two-party SM2 signing protocol based on secret sharing. J. Syst. Archit. 2022, 132, 102738.
  13. Zhu, H.; Li, D.; Sun, Y.; Chen, Q.; Tian, Z.; Song, Y. Optimization of SM2 algorithm based on polynomial segmentation and parallel computing. Electronics 2024, 13, 4661.
  14. Bhati, A.S.; Dufka, A.; Andreeva, E.; Roy, A.; Preneel, B. Skye: An Expanding PRF based Fast KDF and its Applications. In Proceedings of the 19th ACM Asia Conference on Computer and Communications Security; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1082–1098.
  15. GB/T 32918-2016; Information Security Technology—Public Key Cryptographic Algorithm SM2 Based on Elliptic Curves. General Administration of Quality Supervision, Inspection and Quarantine of P.R. China; Standardization Administration of China. China Standards Press: Beijing, China, 2016.
  16. GB/T 32905-2016; Information Security Techniques—SM3 Cryptographic Hash Algorithm. General Administration of Quality Supervision, Inspection and Quarantine of P.R. China; Standardization Administration of China. China Standards Press: Beijing, China, 2016.
  17. Cheng, Y. Study on the Encryption and Decryption of a Hybrid Domestic Cryptographic Algorithm in Secure Transmission of Data Communication. Int. J. Netw. Secur. 2022, 24, 947–952.
  18. Ye, Z.; Song, R.; Zhang, H.; Chen, D.; Cheung, R.C.-C.; Huang, K. A Highly-efficient Lattice-based Post-Quantum Cryptography Processor for IoT Applications. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2024, 2024, 130–153.
  19. Chen, L.; Tang, Y.; Zhao, L.; Gong, Z. SIMD Optimizations of White-Box Block Cipher Implementations with the Self-equivalence Framework. In Proceedings of the International Conference on Information Security and Cryptology, Seoul, Republic of Korea, 20–22 November 2024; pp. 129–149.
  20. Polubelova, M.; Bhargavan, K.; Protzenko, J.; Beurdouche, B.; Fromherz, A.; Kulatova, N.; Zanella-Béguelin, S. HACLxN: Verified Generic SIMD Crypto (for all your favourite platforms). In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security; Association for Computing Machinery: New York, NY, USA, 2020; pp. 899–918.
  21. NIST SP 800-38D; Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC. National Institute of Standards and Technology: Gaithersburg, MD, USA, 2007.
  22. NIST SP 800-38A; Recommendation for Block Cipher Modes of Operation: Methods and Techniques. National Institute of Standards and Technology: Gaithersburg, MD, USA, 2001.
Figure 1. SM2 encryption process.
Figure 2. SM2 decryption process.
Figure 3. Areas of the encryption process targeted by the ParaSM2 optimization. Exploiting the natural 2:1 ratio between A5 and A7, the block processing is computed in parallel to improve execution efficiency.
Figure 4. Areas of the decryption process targeted by the ParaSM2 optimization. Execution efficiency is improved by optimizing the KDF and HASH functions in B4 and B6.
Figure 5. Encryption and decryption performance of ParaSM2 on x86: (A) encryption performance; (B) decryption performance.
Figure 6. Encryption and decryption performance of ParaSM2 with the SSE, AVX2, and AVX512 instruction sets: (A) encryption performance; (B) decryption performance.
Figure 7. Encryption and decryption performance of ParaSM2 on ARM: (A) encryption performance; (B) decryption performance.
Table 1. Values of ⌈klen/256⌉ and ⌈(klen + 65)/512⌉.

| r | ⌈klen/256⌉ | ⌈(klen + 65)/512⌉ |
|---|---|---|
| r = 0 | 2t | t + 1 |
| 1 ≤ r ≤ 256 | 2t + 1 | t + 1 |
| 257 ≤ r ≤ 447 | 2t + 2 | t + 1 |
| 448 ≤ r ≤ 511 | 2t + 2 | t + 2 |
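Table 2. Dependencies for Calculating Wj.

| Wj | Characters Required for Calculating Wj |
|---|---|
| W16 | W0, W3, W7, W10, W13 |
| W17 | W1, W4, W8, W11, W14 |
| W18 | W2, W5, W9, W12, W15 |
| W19 | W3, W6, W10, W13, W16 |
| W20 | W4, W7, W11, W14, W17 |
| W21 | W5, W8, W12, W15, W18 |
| W22 | W6, W9, W13, W16, W19 |
| W23 | W7, W10, W14, W17, W20 |
| W24 | W8, W11, W15, W18, W21 |
| W25 | W9, W12, W16, W19, W22 |
| W26 | W10, W13, W17, W20, W23 |
| W27 | W11, W14, W18, W21, W24 |
| W28 | W12, W15, W19, W22, W25 |
| W29 | W13, W16, W20, W23, W26 |
| W30 | W14, W17, W21, W24, W27 |
| W31 | W15, W18, W22, W25, W28 |
| W32 | W16, W19, W23, W26, W29 |
| W33 | W17, W20, W24, W27, W30 |
| W34 | W18, W21, W25, W28, W31 |
| W35 | W19, W22, W26, W29, W32 |
| W36 | W20, W23, W27, W30, W33 |
| W37 | W21, W24, W28, W31, W34 |
| W38 | W22, W25, W29, W32, W35 |
| W39 | W23, W26, W30, W33, W36 |
| W40 | W24, W27, W31, W34, W37 |
| W41 | W25, W28, W32, W35, W38 |
| W42 | W26, W29, W33, W36, W39 |
| W43 | W27, W30, W34, W37, W40 |
| W44 | W28, W31, W35, W38, W41 |
| W45 | W29, W32, W36, W39, W42 |
| W46 | W30, W33, W37, W40, W43 |
| W47 | W31, W34, W38, W41, W44 |
| W48 | W32, W35, W39, W42, W45 |
| W49 | W33, W36, W40, W43, W46 |
| W50 | W34, W37, W41, W44, W47 |
| W51 | W35, W38, W42, W45, W48 |
| W52 | W36, W39, W43, W46, W49 |
| W53 | W37, W40, W44, W47, W50 |
| W54 | W38, W41, W45, W48, W51 |
| W55 | W39, W42, W46, W49, W52 |
| W56 | W40, W43, W47, W50, W53 |
| W57 | W41, W44, W48, W51, W54 |
| W58 | W42, W45, W49, W52, W55 |
| W59 | W43, W46, W50, W53, W56 |
| W60 | W44, W47, W51, W54, W57 |
| W61 | W45, W48, W52, W55, W58 |
| W62 | W46, W49, W53, W56, W59 |
| W63 | W47, W50, W54, W57, W60 |
| W64 | W48, W51, W55, W58, W61 |
| W65 | W49, W52, W56, W59, W62 |
| W66 | W50, W53, W57, W60, W63 |
| W67 | W51, W54, W58, W61, W64 |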
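The rows of Table 2 follow directly from the SM3 message-expansion recurrence in GB/T 32905-2016 [16], Wj = P1(Wj−16 ⊕ Wj−9 ⊕ (Wj−3 ⋘ 15)) ⊕ (Wj−13 ⋘ 7) ⊕ Wj−6 for 16 ≤ j ≤ 67. The sketch below is a plain scalar reference, not the authors' SIMD code; it regenerates the dependency list for any j and performs the expansion itself. The function names are illustrative.

```python
MASK = 0xFFFFFFFF  # SM3 operates on 32-bit words


def rotl(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def p1(x: int) -> int:
    """SM3 permutation P1 used in the message expansion."""
    return x ^ rotl(x, 15) ^ rotl(x, 23)


def deps(j: int) -> list:
    """Indices of the words needed to compute W_j (one row of Table 2)."""
    return sorted(j - k for k in (16, 13, 9, 6, 3))


def expand(block: bytes) -> list:
    """Expand one 64-byte block into the 68 words W_0 .. W_67."""
    if len(block) != 64:
        raise ValueError("SM3 blocks are 64 bytes")
    w = [int.from_bytes(block[4 * i:4 * i + 4], "big") for i in range(16)]
    for j in range(16, 68):
        w.append(p1(w[j - 16] ^ w[j - 9] ^ rotl(w[j - 3], 15))
                 ^ rotl(w[j - 13], 7) ^ w[j - 6])
    return w


if __name__ == "__main__":
    for j in (16, 17, 67):
        print(f"W{j} <- " + ", ".join(f"W{i}" for i in deps(j)))
```

Because the nearest dependency of Wj is Wj−3, any three consecutive words Wj, Wj+1, Wj+2 do not depend on one another, which is what leaves room for computing them together in a vectorized EXT.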
Table 3. The end-to-end encryption performance of schemes under x86 and ARM architecture: end-to-end delay (ms) by data size.

| Schemes | 64 KB | 256 KB | 1 MB | 4 MB | 16 MB | 64 MB |
|---|---|---|---|---|---|---|
| SM2 Digital Envelope@x86 | 1.26 | 2.923 | 9.136 | 33.649 | 131.882 | 518.54 |
| ParaSM2@x86 | 0.946 | 1.844 | 5.403 | 19.779 | 77.049 | 301.14 |
| SM2 Digital Envelope@ARM | 1.141 | 2.476 | 8.857 | 31.623 | 123.959 | 482.85 |
| ParaSM2@ARM | 0.869 | 1.631 | 4.892 | 17.596 | 69.551 | 271.08 |
Table 4. The end-to-end decryption performance of schemes under x86 and ARM architecture: end-to-end delay (ms) by data size.

| Schemes | 64 KB | 256 KB | 1 MB | 4 MB | 16 MB | 64 MB |
|---|---|---|---|---|---|---|
| SM2 Digital Envelope@x86 | 0.962 | 2.715 | 8.023 | 30.776 | 124.829 | 511.93 |
| ParaSM2@x86 | 0.676 | 1.689 | 5.87 | 22.55 | 88.42 | 353.68 |
| SM2 Digital Envelope@ARM | 0.916 | 2.501 | 7.365 | 28.621 | 118.019 | 472.09 |
| ParaSM2@ARM | 0.695 | 1.818 | 5.898 | 22.532 | 88.953 | 351.21 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kang, H.; Guo, B.; Sun, Y.; Zhao, M.; Chen, X.; Ye, K. ParaSM2: Enhancing SM2 Cryptographic Performance via Parallel Restructuring of KDF and HASH. Cryptography 2026, 10, 42. https://doi.org/10.3390/cryptography10030042

Chicago/Turabian Style

Kang, Hongjuan, Bing Guo, Yufang Sun, Mingjie Zhao, Xin Chen, and Kui Ye. 2026. "ParaSM2: Enhancing SM2 Cryptographic Performance via Parallel Restructuring of KDF and HASH" Cryptography 10, no. 3: 42. https://doi.org/10.3390/cryptography10030042
