Article

A Secure and Efficient White-Box Implementation of SM4

1
Beijing Smart-Chip Microelectronics Technology Co., Ltd., Beijing 102299, China
2
School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
*
Authors to whom correspondence should be addressed.
Entropy 2025, 27(1), 1; https://doi.org/10.3390/e27010001
Submission received: 19 October 2024 / Revised: 4 December 2024 / Accepted: 12 December 2024 / Published: 24 December 2024
(This article belongs to the Special Issue Information-Theoretic Cryptography and Security)

Abstract

Differential Computation Analysis (DCA) leverages memory traces to extract secret keys, bypassing countermeasures employed in white-box designs, such as encodings. Although researchers have made great efforts to enhance security against DCA, most solutions considerably decrease algorithmic efficiency. In our approach, the Feistel cipher SM4 is implemented by a series of table-lookup operations, and the input and output of each table are protected by affine transformations and nonlinear encodings generated randomly. We employ fourth-order non-linear encoding to reduce the loss of efficiency while utilizing a random sequence to shuffle lookup table access, thereby severing the potential link between memory data and the intermediate values of SM4. Experimental results indicate that the DCA procedure fails to retrieve the correct key. Furthermore, theoretical analysis shows that the techniques employed in our scheme effectively prevent existing algebraic attacks. Finally, our design requires only 1.44 MB of memory, significantly less than that of the known DCA-resistant schemes—Zhang et al.’s scheme (24.3 MB), Yuan et al.’s scheme (34.5 MB) and Zhao et al.’s scheme (7.8 MB). Thus, our SM4 white-box design effectively ensures security while maintaining a low memory cost.

1. Introduction

In traditional cryptographic analysis, it is assumed that attackers can only access the input and output of the cryptographic procedure, which is executed in a secure environment, known as the black-box attack model. However, due to the diverse deployment environments of digital products, cryptographic algorithms are often executed in untrusted settings, resulting in the potential for secure information leakage. In 2002, Chow et al. introduced the concept of a white-box attack environment [1], in which an attacker has full access to memory data during the software’s execution. To mitigate the risks posed by white-box attackers, they developed white-box implementations of the Data Encryption Standard (DES) [1] and the Advanced Encryption Standard (AES) [2]. In their white-box implementation of AES, the operations within the round function were implemented using table lookup operations. Invertible linear transformations and nonlinear encodings were employed to obfuscate the input and output of each table, thereby preventing the leakage of intermediate data during the encryption process.

1.1. Related Work

(1) The following research has been conducted on AES white-box schemes. In 2004, Billet et al. introduced the BGE attack method [3], successfully extracting the key from the AES white-box scheme proposed by Chow et al. The BGE method is an algebraic analysis, necessitating the attacker to understand the detailed implementation steps through complex reverse engineering.
In 2016, Bos et al. introduced Differential Computation Analysis (DCA) to extract keys from white-box schemes [4]. DCA utilizes tools to capture traces of memory information during software execution, significantly reducing the workload of reverse engineering. Currently, many white-box schemes, including those submitted to the WhibOx 2016 white-box cryptography competition, have been successfully compromised using DCA. Consequently, DCA poses a significant challenge to white-box implementations.
In response to DCA attacks, several schemes have been proposed. In 2018, Bock et al. evaluated common protection methods in white-box implementations, such as linear transformations and 4-bit nonlinear encodings [5], concluding that neither was effective in resisting DCA. In 2020, Lee et al. improved their AES white-box implementation [6] by using linear Boolean masking to obfuscate the results of all lookup tables [7], successfully thwarting DCA. However, this scheme required significant memory to mask and unmask the genuine internal data. In the same year, Biryukov et al. applied nonlinear masking combined with Boolean masking to obfuscate intermediate values [8], which also effectively prevented DCA. Despite that, this approach significantly increased memory consumption, and its security against algebraic attacks remains unverified.
(2) The following research has been conducted on SM4 white-box schemes. SM4 is a Feistel cipher that was published in 2006 as a Chinese National Standard. In 2021, it was officially published as an ISO/IEC international standard. It has been integrated into the ARMv8.4-A, and support for the RISC-V architecture was ratified in 2021.
In 2009, Xiao et al. introduced the first white-box implementation of SM4 [9], referred to as the Xiao–Lai scheme. This approach utilized external encoding to protect both plaintext and ciphertext, as well as affine transformations to secure intermediate data during the encryption process. Although the scheme is designed to thwart BGE attacks, Lin et al. demonstrated in 2013 that they could successfully extract the key [10] from the Xiao–Lai scheme using a combination of differential analysis and BGE attacks, with a time complexity of $2^{47}$; this attack is referred to as Lin–Lai analysis.
In 2015, Bai et al. introduced the Bai–Wu scheme, which employed complex internal encodings to enhance security [11] against Lin–Lai analysis. However, generating a white-box instance of this scheme required 32.5 MB of memory. In 2018, Pan et al. introduced a new analysis technique [12], denoted as Pan analysis, which demonstrated that the complex internal encodings of the Bai–Wu scheme provided only limited security benefits. Also, in 2015, Shi et al. proposed an SM4 white-box scheme utilizing dual ciphers and random obfuscation to protect lookup tables [13], claiming it could resist Lin–Lai analysis. However, since the author of this scheme did not provide an open-source procedure, no analytical results regarding its security are presently accessible. In 2020, Yao introduced a new SM4 white-box scheme that employed internal state expansion in combination with random numbers to obfuscate the keys [14]. This approach significantly increased the difficulty of key extraction through algebraic analysis methods. So far, no analytical results regarding its security have been published.
In addition to algebraic analysis methods, DCA also poses a significant threat to SM4 white-box designs. In 2022, Zhang et al. [15] introduced intermediate-value mean differential analysis (IVMDA), a technique based on DCA, which successfully extracted the key from the Xiao–Lai scheme. In 2023, Yuan et al. [16] proposed an enhanced DCA technique that successfully compromised the Bai–Wu scheme.
To counter DCA, several SM4 white-box schemes have been proposed in recent years. In 2022, Zhang et al. introduced an SM4 white-box implementation that enhanced the Xiao–Lai scheme by incorporating 8-bit nonlinear encodings [15], referred to as Zhang’s scheme. Experimental results from IVMDA demonstrated that the scheme could resist DCA. However, the use of 8-bit nonlinear encodings significantly increased the memory consumption to 24.3 MB. In 2023, Yuan et al. [17] proposed improvements to the Bai–Wu scheme, referred to as Yuan’s scheme. They applied protection only to the first and last rounds of the algorithm to reduce memory usage, as DCA primarily targets key-related lookup tables in these iteration rounds. Despite this optimization, this scheme still required 34.5 MB of memory. In 2024, Zhao et al. introduced an SM4 white-box scheme based on the Xiao–Lai approach, utilizing Boolean masking techniques [18], referred to as Zhao’s scheme. This scheme employs nonlinear permutations to reuse random mask values, reducing memory consumption. Two versions were proposed: a simplified version that applies masking to only the first and last four rounds, requiring 1.62 MB of memory, and an enhanced version that applies masking to all rounds, requiring 7.8 MB. The latter was shown to resist both DCA and existing algebraic attacks.
At present, three schemes are capable of resisting both DCA and algebraic analysis, i.e., the schemes proposed by Zhang et al. [15], Yuan et al. [17], and Zhao et al. [18], respectively. However, when these schemes implement full-round defense, their memory usage is high, resulting in high implementation costs.

1.2. Our Contribution

This paper introduces an SM4 white-box scheme that is implemented through a series of table lookup operations. The scheme employs affine transformations and fourth-order nonlinear encodings to protect the input and output of each table. To further enhance security, random sequences are used to shuffle the execution order of the table lookups during the encryption process.
As shown in Table 1, the scheme requires a total of 1.44 MB of memory, which is significantly less than other DCA-resistant methods: about one-seventeenth of the memory required by Zhang’s scheme [15] (24.3 MB), about one-twenty-fourth of that required by Yuan’s scheme [17] (34.5 MB), and less than one-fifth of that required by Zhao’s scheme [18] (7.8 MB). Furthermore, it takes 44 ms to generate a white-box encryption instance and only 2 ms to encrypt a plaintext block on a personal computer.
Experimental results using the open-source tool Deadpool confirm that the proposed scheme is resistant to DCA. Additionally, theoretical analysis demonstrates that it withstands known algebraic attacks, such as BGE analysis [3], Lin–Lai analysis [10], and Pan analysis [12]. As shown in Table 1, while Zhang’s, Yuan’s, Zhao’s, and our schemes are all secure against both algebraic attacks and DCA, our scheme achieves this with the lowest memory consumption.

1.3. Organization

The rest of this paper is organized as follows. The preliminaries are introduced in Section 2. Section 3 explains the basic idea and detailed steps of our SM4 white-box algorithm. Section 4 evaluates the performance of the scheme and compares it with other SM4 white-box algorithms. Algebraic analysis and DCA analysis are conducted in Section 5. Finally, we conclude the paper in Section 6.

2. Preliminaries

We modify the Xiao–Lai solution to resist DCA. Here, we briefly introduce the SM4 algorithm and the Xiao–Lai SM4 white-box algorithm.

2.1. SM4 Algorithm

SM4 is a Feistel cipher in which the block size and the key length are 128 bits. The encryption process consists of 32 rounds of iterations, and each round requires a 32-bit round key.
The 128-bit plaintext is divided into four 32-bit words $(X_0, X_1, X_2, X_3)$. The round function $F$ takes four intermediate state words and the round key and returns a new word. The round function for the $i$th round iteration is computed as follows:
$$X_{i+4} = F(X_i, X_{i+1}, X_{i+2}, X_{i+3}, rk_i) = X_i \oplus T(X_{i+1} \oplus X_{i+2} \oplus X_{i+3} \oplus rk_i).$$
Here, $T$ consists of a nonlinear transformation $\tau$ and a linear transformation $L$. Let $A = X_{i+1} \oplus X_{i+2} \oplus X_{i+3} \oplus rk_i$ represent the input word of $\tau$, and let $B$ be the output. $\tau$ involves four independent S-box substitutions, i.e.:
$$B = (b_0, b_1, b_2, b_3) = \tau(A) = (S(a_0), S(a_1), S(a_2), S(a_3)).$$
Here, each $a_j$ or $b_j$ ($j \in \{0, 1, 2, 3\}$) is a byte.
Both the input and output of the linear transformation $L$ are 32-bit values, and the transformation is defined by the following formula, after which the new state word is obtained by adding $X_i$:
$$L(B) = B \oplus (B \lll 2) \oplus (B \lll 10) \oplus (B \lll 18) \oplus (B \lll 24), \qquad X_{i+4} = X_i \oplus L(B).$$
Here, $\lll$ represents a cyclic left shift. The flowchart for the $i$th round iteration is shown in Figure 1.
After the final round, the result undergoes a simple reverse transformation to produce the final ciphertext $(X_{35}, X_{34}, X_{33}, X_{32})$.
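The round structure described above can be sketched in a few lines of Python. This is only an illustration of the data flow: the helper names are hypothetical, and `TOY_SBOX` is a placeholder byte permutation, not the real SM4 S-box, so the outputs are not valid SM4 values.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Cyclic left shift of a 32-bit word by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def linear_L(b):
    """L(B) = B ^ (B <<< 2) ^ (B <<< 10) ^ (B <<< 18) ^ (B <<< 24)."""
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)

def tau(a, sbox):
    """Apply the byte-wise S-box to each of the four bytes of a 32-bit word."""
    out = 0
    for shift in (24, 16, 8, 0):
        out |= sbox[(a >> shift) & 0xFF] << shift
    return out

def round_function(x0, x1, x2, x3, rk, sbox):
    """X_{i+4} = X_i ^ T(X_{i+1} ^ X_{i+2} ^ X_{i+3} ^ rk_i), with T = L after tau."""
    return x0 ^ linear_L(tau(x1 ^ x2 ^ x3 ^ rk, sbox))

# Placeholder S-box (NOT the SM4 S-box): v -> 7v + 3 mod 256 is a byte permutation.
TOY_SBOX = [(7 * v + 3) % 256 for v in range(256)]

x4 = round_function(0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210,
                    0xA5A5A5A5, TOY_SBOX)
```

Because $X_i$ enters only through the final XOR, the round can be undone by XORing the same $T$ output back onto $X_{i+4}$, whatever S-box is used.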

2.2. Xiao–Lai Scheme

In the Xiao–Lai scheme, the standard SM4 round function is divided into three parts. As shown in Figure 2, the first part computes the XOR of three state words. The second part includes the addition of the round key and the $T$ transformation. Finally, the third part calculates the sum of the current intermediate state word and the state word $X_i$.
Unlike the original SM4, each intermediate state word is protected by a reversible affine transformation $P$ defined as follows:
$$P(x) = l_P \times x \oplus c_P.$$
Here, $l_P$ is a $32 \times 32$ invertible matrix over $GF(2)$, and $c_P$ is a 32-dimensional constant vector over $GF(2)$. Consequently, each part of the process also involves removing the previous affine transformation and applying a new one.
Part 1: Computing $Y_i = X_{i+1} \oplus X_{i+2} \oplus X_{i+3}$.
Part 1 consists of three affine transformations and two XOR operations. As mentioned before, the state words are protected by affine transformations; the protected words are denoted by $X'_{i+1}, X'_{i+2}, X'_{i+3}$, so the inverse transformation $P_{i+j}^{-1}$ ($j = 1, 2, 3$) should be applied first. To avoid leaking $P_{i+j}^{-1}$ ($j = 1, 2, 3$), the same affine transformation $E_i^{-1}$ is merged into each of $P_{i+1}^{-1}$, $P_{i+2}^{-1}$, and $P_{i+3}^{-1}$. Thus, a compound transformation $E_i^{-1} \circ P_{i+j}^{-1}$ is applied to the state word $X'_{i+j}$ ($j \in \{1, 2, 3\}$); this is called the encoding unification operation. Note that $E_i = \mathrm{diag}(E_{i,0}, E_{i,1}, E_{i,2}, E_{i,3})$, where each $E_{i,j}$ is an invertible 8-bit affine transformation over $GF(2)$, and $E_i^{-1}$ is the inverse transformation of $E_i$.
Now that the three words are secured by the same transformation $E_i^{-1}$, the XOR can be computed directly, and the result of Part 1 is protected by the affine transformation $E_i^{-1}$. The overall computation process is as follows:
$$Y_i = E_i^{-1}(X_{i+1} \oplus X_{i+2} \oplus X_{i+3}) = (E_i^{-1} \circ P_{i+1}^{-1})(X'_{i+1}) \oplus (E_i^{-1} \circ P_{i+2}^{-1})(X'_{i+2}) \oplus (E_i^{-1} \circ P_{i+3}^{-1})(X'_{i+3}).$$
Part 2: The round key addition and T transformation.
All the operations included in this part are implemented using four table lookups and three XOR operations. Since T is a 32-bit-to-32-bit transformation, creating a single lookup table would consume too much memory. Therefore, it is split into four 8-bit-to-32-bit lookup tables. Each table is created according to the following equation:
$$Z_{i,j} = Q_i \circ L \circ \mathrm{Sbox}((E_{i,j} \cdot y_{i,j}) \oplus rk_{i,j}).$$
Here, $y_{i,j}$ is the $j$th byte of the output of Part 1, and $rk_{i,j}$ is the $j$th byte of the $i$th round key. As before, the protection $E_i^{-1}$ must be removed before the round operations, and the new protection $Q_i$ is added after them.
Also, the output values of the four tables are protected by the same affine transformation $Q_i$. Thus, the results of the four table lookups are XORed directly to obtain the output $Z_i$ of Part 2:
$$Z_i = Q_i(T(X_{i+1} \oplus X_{i+2} \oplus X_{i+3} \oplus rk_i)) = Z_{i,0} \oplus Z_{i,1} \oplus Z_{i,2} \oplus Z_{i,3}.$$
Part 3: Adding $X_i$ to obtain $X_{i+4}$.
This part consists of two affine transformations and one XOR operation. The protected word $X'_i$ and $Z_i$ are protected by different affine transformations, so the encoding unification operation should be executed before the XOR operation. However, if two values are protected by the same affine transformation, the constant components cancel and the XOR sum is protected only by the linear component of the affine transformation. Therefore, two affine transformations $P'_{i+4}(x) = l_{P'_{i+4}} \times x \oplus c_{P'_{i+4}}$ and $P''_{i+4}(x) = l_{P''_{i+4}} \times x \oplus c_{P''_{i+4}}$ are chosen. The linear components of $P'_{i+4}$ and $P''_{i+4}$ are the same as that of $P_{i+4}$, but the constant components differ and satisfy $c_{P'_{i+4}} + c_{P''_{i+4}} = c_{P_{i+4}}$.
Consequently, the transformations $P'_{i+4} \circ P_i^{-1}$ and $P''_{i+4} \circ Q_i^{-1}$ are applied to $X'_i$ and $Z_i$, respectively. The output of Part 3 is protected by $P_{i+4}$, i.e.,
$$X'_{i+4} = P_{i+4}(X_i \oplus T(X_{i+1} \oplus X_{i+2} \oplus X_{i+3} \oplus rk_i)).$$
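The role of the constant components in Parts 1 and 3 can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: matrices are stored as lists of row bitmasks, all helper names are hypothetical, and Python's `random` module with a fixed seed stands in for a secure generator. It shows that XORing three words under the same affine map keeps the constant, XORing two words loses it, and splitting the constant across two maps with a shared linear part restores it.

```python
import random

def rank_gf2(rows):
    """Rank of a GF(2) matrix given as row bitmasks (Gaussian elimination)."""
    rows, rank = list(rows), 0
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot          # lowest set bit of the pivot row
            rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def random_invertible_matrix(n, rng):
    """Sample n x n GF(2) matrices until one has full rank."""
    while True:
        rows = [rng.getrandbits(n) for _ in range(n)]
        if rank_gf2(rows) == n:
            return rows

def mat_vec(rows, x):
    """Matrix-vector product over GF(2): bit i of the result is parity(row_i & x)."""
    y = 0
    for i, r in enumerate(rows):
        y |= (bin(r & x).count("1") & 1) << i
    return y

def affine(rows, const, x):
    """P(x) = l_P * x XOR c_P."""
    return mat_vec(rows, x) ^ const

rng = random.Random(2025)                  # fixed seed for reproducibility only
l_P = random_invertible_matrix(32, rng)    # shared linear component
c_P = rng.getrandbits(32)                  # constant of the target map
c_P1 = rng.getrandbits(32)                 # constant of the first split map
c_P2 = c_P ^ c_P1                          # constant of the second split map

w0, w1, w2 = (rng.getrandbits(32) for _ in range(3))
three_same = affine(l_P, c_P, w0) ^ affine(l_P, c_P, w1) ^ affine(l_P, c_P, w2)
two_same = affine(l_P, c_P, w0) ^ affine(l_P, c_P, w1)
two_split = affine(l_P, c_P1, w0) ^ affine(l_P, c_P2, w1)
```

Here `three_same` equals the affine image of `w0 ^ w1 ^ w2`, `two_same` equals only the linear image of `w0 ^ w1`, and `two_split` equals the full affine image of `w0 ^ w1`.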

3. Improved SM4 White-Box Scheme

Our proposal builds upon the design of the Xiao–Lai scheme, with the round function similarly divided into three parts, as illustrated in Figure 3. We begin by defining the notations used throughout the paper (see Table 2), followed by a brief introduction to our design concept. Lastly, we provide a detailed explanation of the construction of each part.

3.1. Design Ideas

According to the algebraic analyses presented separately by Pan et al. [12] and Lin et al. [10], the combination of the last two parts and Part 1 of the next round of iteration in the Xiao–Lai scheme may expose intermediate affine transformations, allowing an attacker to recover the key. Hence, we add nonlinear encodings to the input and output of intermediate state words to reduce the potential correlations between the genuine and observed values of state words, thus mitigating the previous algebraic attacks. In this context, the nonlinear encoding is a randomly generated table representing a permutation of the set $\{0, 1, \ldots, 15\}$.
However, nonlinear encoding makes directly computing the XOR of two state words infeasible, even when the affine transformations protecting the two words are unified. Therefore, an XOR table is used to perform XOR operations between the two words. If an 8-to-8-bit nonlinear encoding is used, the XOR table consumes $2^8 \times 2^8 \times 8$ bits $= 2^6$ KB of memory, while a 4-to-4-bit nonlinear encoding requires only $2^4 \times 2^4 \times 4$ bits $= 0.125$ KB. As a result, we adopt eight independent 4-to-4-bit nonlinear encodings to secure each intermediate state word throughout the encryption process, minimizing memory consumption.
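A minimal sketch of such an encoded XOR table is given here, assuming each 4-bit encoding is a random permutation stored as a lookup list. The function and variable names are hypothetical, and the seeded `random` generator is for reproducibility only.

```python
import random

rng = random.Random(7)  # fixed seed for reproducibility; not a secure RNG

def random_nibble_encoding(rng):
    """A random permutation of {0, ..., 15} and its inverse, as lookup lists."""
    enc = list(range(16))
    rng.shuffle(enc)
    dec = [0] * 16
    for plain, coded in enumerate(enc):
        dec[coded] = plain
    return enc, dec

enc_a, dec_a = random_nibble_encoding(rng)      # protects the first input
enc_b, dec_b = random_nibble_encoding(rng)      # protects the second input
enc_out, dec_out = random_nibble_encoding(rng)  # protects the output

# xor_table[u][v] = enc_out(dec_a(u) XOR dec_b(v)): 16 x 16 entries of 4 bits.
xor_table = [[enc_out[dec_a[u] ^ dec_b[v]] for v in range(16)]
             for u in range(16)]

def table_bytes(in_bits, out_bits):
    """Bytes used by a two-input table with in_bits per input, out_bits per entry."""
    return (2 ** in_bits) * (2 ** in_bits) * out_bits // 8
```

With these definitions, `table_bytes(4, 4)` gives 128 bytes (0.125 KB) and `table_bytes(8, 8)` gives 65,536 bytes (64 KB), matching the figures above.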
Furthermore, the success of DCA relies on aligned memory traces, so deliberately perturbing the trace alignment is a natural countermeasure. The four lookup tables within Part 2 of the Xiao–Lai scheme are the critical target because the round key is hidden in them. Since these four lookups are independent of one another, their calculation order can be changed freely. Hence, this scheme introduces a random sequence to shuffle the access order of those tables, dynamically varying the data-flow processing order across encryptions.

3.2. Construction of Our Scheme

Part 1: Computing $Y_i = X_{i+1} \oplus X_{i+2} \oplus X_{i+3}$.
As previously noted, each state word is protected by an affine transformation and eight nonlinear encodings, so the XOR operation is executed through table lookups. However, adding two 32-bit words together would require a table occupying $2^{69}$ ($= 2^{32} \times 2^{32} \times 32$) bits of memory, which is impractical. To address this, as illustrated in Figure 3, we divide the word-level computation into four independent byte-level computations during the encoding unification process. Then, the addition of two nibbles is computed using an XOR table.
(1) Encoding unification.
In the Xiao–Lai scheme, the encoding unification operation applies an affine transformation $E_i^{-1} \circ P_{i+j}^{-1}$ to the protected input $X_{i+j}$. Because of the nonlinear encodings, the affine transformation combined with the inverse of the nonlinear encodings is moved into table lookup operations. To save memory, each 32-bit input $X_{i+j}$ (where $j \in \{1, 2, 3\}$) is split into four concatenated bytes $(x_{i+j,0}, x_{i+j,1}, x_{i+j,2}, x_{i+j,3})$. Each byte is processed through an 8-to-32-bit table lookup operation, and then the resulting four words are added together as follows:
$$X_{i+j} = \bigoplus_{k=0}^{3} \left(out^{4}_{i+j,k,2k},\, out^{4}_{i+j,k,2k+1}\right) \circ E_i^{-1}[k] \circ P_{i+j}^{-1}[k] \circ \left(in^{0}_{i+j,k,0},\, in^{0}_{i+j,k,1}\right)(x_{i+j,k}).$$
Here, $P_{i+j}[k](\cdot)$ ($k \in \{0, 1, 2, 3\}$) refers to the partial computation of the affine transformation $P_{i+j}$. Specifically, the $k$th eight columns of the linear component $l_P$ of $P_{i+j}$ are multiplied by the byte vector $x_{i+j,k}$, followed by the addition of the $k$th eight rows of the constant component $c_P$ of $P_{i+j}$. The partial computation of the affine transformation $E_i^{-1} \circ P_{i+j}^{-1}$, along with the associated nonlinear encodings and decodings, is consolidated into a table, referred to as TableM. The process for creating this table is shown in Figure 4.
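The column-wise splitting of a 32-bit affine map into four 8-to-32-bit tables can be sketched as follows. The sketch omits the nonlinear encodings that a real TableM would also absorb, stores the matrix as a list of column bitmasks, and takes byte $k$ to be bits $8k$ to $8k+7$ of the word. These conventions and all names are assumptions made for illustration.

```python
import random

rng = random.Random(11)  # fixed seed; illustrative only, not a secure RNG

# cols[j] is the image of the j-th unit vector under the linear component.
# Invertibility is not needed to illustrate the splitting, so it is not enforced.
cols = [rng.getrandbits(32) for _ in range(32)]
const = rng.getrandbits(32)

def affine32(x):
    """Reference computation: full 32-bit affine map applied in one step."""
    y = const
    for j in range(32):
        if (x >> j) & 1:
            y ^= cols[j]
    return y

def build_partial_table(k):
    """8-to-32-bit table for byte k: columns 8k..8k+7 plus constant bits 8k..8k+7."""
    table = []
    for byte in range(256):
        y = const & (0xFF << (8 * k))
        for j in range(8):
            if (byte >> j) & 1:
                y ^= cols[8 * k + j]
        table.append(y)
    return table

tables = [build_partial_table(k) for k in range(4)]

def affine32_via_tables(x):
    """Four byte-indexed lookups whose outputs are XORed together."""
    y = 0
    for k in range(4):
        y ^= tables[k][(x >> (8 * k)) & 0xFF]
    return y
```

Each table has 256 entries of 4 bytes, i.e. 1 KB, instead of the $2^{32}$ entries a single word-indexed table would need. In the scheme itself the final XOR is carried out with encoded XOR tables rather than a plain `^`.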
The subsequent word-level XOR operations, nine in total (three for each of the three input words), are performed using XOR tables, each of which takes two 4-bit inputs and produces a 4-bit output, as illustrated in Figure 5. After removing the nonlinear encoding, the two nibbles are protected by the same affine transformation, ensuring that the XOR sum remains protected by this transformation. Finally, a new nonlinear encoding $out^{2}_{i+1,k,t}$ is applied. Let $X_{i+1,k,t}$ represent the $t$th nibble of word $X_{i+1,k}$. Taking $X_{i+1,k,t} \oplus X_{i+1,k+1,t}$ as an example, the process for generating the XOR table includes the following operations:
$$out^{2}_{i+1,k,t}\left( in^{1}_{i+1,k,t}(X_{i+1,k,t}) \oplus in^{1}_{i+1,k+1,t}(X_{i+1,k+1,t}) \right).$$
(2) XOR operation.
The XOR operation of Part 1 after the encoding unification also uses lookup tables. Taking $X_{i+1}$ and $X_{i+2}$ as an example, let $X_{i+1,t}$ and $X_{i+2,t}$ represent the two $t$th 4-bit inputs to the XOR table. The table is created according to the following equation:
$$X_{i+1,t} = out^{2}_{i+1,t}\left( in^{1}_{i+1,t}(X_{i+1,t}) \oplus in^{1}_{i+2,t}(X_{i+2,t}) \right).$$
(3) Space complexity.
Part 1 involves two types of lookup tables: the TableM table and the XOR table. Each $X_{i+j}$ ($j = 1, 2, 3$) can be represented as four concatenated bytes, with each byte serving as the input of a TableM table. Therefore, each round requires $3 \times 4 = 12$ TableM tables, and for 32 rounds of iterations, a total of $12 \times 32 = 384$ tables are needed. A TableM table takes an 8-bit input and returns a 32-bit value. Thus, each TableM table occupies 1 KB ($= 2^8 \times 32$ bits) of memory, so all TableM tables require 384 KB of memory in a white-box instance.
Each 32-bit word $X_{i+j,k}$ is split into eight concatenated 4-bit segments $X_{i+j,k,t}$ ($t = 0, 1, \ldots, 7$). Two corresponding 4-bit segments from two words are the input to one XOR table. Thus, the addition of two 32-bit words requires eight XOR tables. There are a total of twelve words that require eleven XOR operations. Therefore, Part 1 of each round requires $11 \times 8 = 88$ XOR tables, and for 32 rounds of iterations, a total of $88 \times 32 = 2816$ XOR tables are needed. Each XOR table occupies $2^4 \times 2^4 \times 4$ bits $= 128$ B, so in a white-box instance, the XOR tables in Part 1 occupy $32 \times 11 \times 8 \times 128$ B $= 352$ KB of memory.
Part 2: The round key addition and T transformation.
The $T$ transformation $T(X_{i+1} \oplus X_{i+2} \oplus X_{i+3} \oplus rk_i)$ is implemented by four table lookups followed by XOR operations in the Xiao–Lai scheme. The round key is embedded during the process of generating the lookup tables. We inherit this implementation of the $T$ transformation but add nonlinear encodings to further protect intermediate data. Moreover, the order of access to the four tables is randomly shuffled. Because of the nonlinear encodings, the addition of the four output words from the four table lookups is conducted using XOR tables.
(1) Tables embedded with the round key.
The output $Y_i$ from Part 1 is divided into four 8-bit segments: $y_{i,0}$, $y_{i,1}$, $y_{i,2}$, and $y_{i,3}$. Each byte is then used as input for a lookup table, referred to as a TableT table. As shown in Figure 6, in addition to the operations required to create a table in the Xiao–Lai scheme, the nonlinear decoding $(in^{5}_{i,k,0}, in^{5}_{i,k,1})^T$ and the nonlinear encoding $(out^{6}_{i,k,0}, \ldots, out^{6}_{i,k,7})^T$ are applied before and after these operations, respectively. The process for generating a TableT table involves the following operations:
$$Y_{i,k} = (out^{6}_{i,k,0}, \ldots, out^{6}_{i,k,7})^T \circ Q_i \circ L \circ \mathrm{Sbox}\left( \left(E_{i,k} \circ (in^{5}_{i,k,0}, in^{5}_{i,k,1})^T (y_{i,k})\right) \oplus rk_{i,k} \right).$$
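The composition order in this construction can be sketched as below. Everything here is a stand-in chosen for illustration: `TOY_SBOX` is a random byte permutation rather than the SM4 S-box, `rk_byte` and the byte position `k` are arbitrary, the affine maps and nibble encodings are drawn from a seeded (non-secure) generator, and all helper names are hypothetical.

```python
import random

rng = random.Random(5)  # fixed seed; illustrative only, not a secure RNG
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK32

def linear_L(b):
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)

def random_perm(n):
    p = list(range(n))
    rng.shuffle(p)
    return p

def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return inv

def apply_nibbles(perms, x):
    """Apply perms[t] to the t-th nibble of x (t = 0 is the least significant)."""
    y = 0
    for t, p in enumerate(perms):
        y |= p[(x >> (4 * t)) & 0xF] << (4 * t)
    return y

def rank_gf2(vectors):
    vectors, rank = list(vectors), 0
    while vectors:
        pivot = vectors.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot
            vectors = [v ^ pivot if v & low else v for v in vectors]
    return rank

def random_affine(n):
    """Random invertible affine map on n bits, as (column bitmasks, constant)."""
    while True:
        cols = [rng.getrandbits(n) for _ in range(n)]
        if rank_gf2(cols) == n:
            return cols, rng.getrandbits(n)

def apply_affine(aff, x):
    cols, y = aff                       # start from the constant component
    for j, col in enumerate(cols):
        if (x >> j) & 1:
            y ^= col
    return y

TOY_SBOX = random_perm(256)             # stand-in, NOT the SM4 S-box
k, rk_byte = 2, 0x3C                    # hypothetical byte position and key byte
E_k = random_affine(8)                  # plays the role of E_{i,k}
Q = random_affine(32)                   # plays the role of Q_i
in_dec = [random_perm(16) for _ in range(2)]    # input nibble decodings
out_enc = [random_perm(16) for _ in range(8)]   # output nibble encodings

def build_table_t():
    table = []
    for y in range(256):
        a = apply_affine(E_k, apply_nibbles(in_dec, y)) ^ rk_byte
        word = linear_L(TOY_SBOX[a] << (8 * k))
        table.append(apply_nibbles(out_enc, apply_affine(Q, word)))
    return table

table_t = build_table_t()
```

The key byte never appears in the table as a separate value; it is folded into all 256 entries, which is why an attacker has to strip the surrounding encodings before it can be read off.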
(2) Shuffling.
To prevent DCA, the query order of the four TableT lookups is randomized using a randomly generated sequence. Typically, the computation order follows $y_{i,0}, y_{i,1}, y_{i,2}, y_{i,3}$. However, after shuffling based on the random sequence $j_0, j_1, j_2, j_3$, the access order becomes $y_{i,j_0}, y_{i,j_1}, y_{i,j_2}, y_{i,j_3}$, and the access order of the corresponding TableT tables is adjusted accordingly.
When generating a white-box encryption instance, we use a TableR table that stores all 24 permutations of $\{0, 1, 2, 3\}$ (since $4! = 24$). During encryption, a random number $temp$ is generated, and $temp \bmod 24$ is used as an index to select a permutation from the TableR table. The bytes of $Y_i$ are then reordered based on the selected permutation, and the four TableT tables are queried in the updated order.
For instance, in the $i$th round, if the permutation $\{0, 3, 2, 1\}$ is selected by the random number $temp \bmod 24$, the byte order of $Y_i$ is rearranged to $y_{i,0}, y_{i,3}, y_{i,2}, y_{i,1}$, and the TableT tables are queried in that new order: first, fourth, third, and second.
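The shuffling step can be sketched as follows, under simplifying assumptions: the four tables are arbitrary toy tables, the outputs are combined with a plain XOR instead of encoded XOR tables, and `shuffled_lookup` and its optional `trace` argument are hypothetical names introduced for illustration.

```python
import itertools
import secrets

# TableR: all 4! = 24 orderings of the byte indices {0, 1, 2, 3}.
TABLE_R = list(itertools.permutations(range(4)))

def shuffled_lookup(y_bytes, tables, trace=None):
    """Query the four tables in a randomly selected order and XOR the results."""
    temp = secrets.randbits(32)          # fresh random number for every call
    order = TABLE_R[temp % 24]           # temp mod 24 selects one permutation
    acc = 0
    for k in order:
        if trace is not None:
            trace.append(k)              # record the access order for inspection
        acc ^= tables[k][y_bytes[k]]
    return acc

# Toy 8-to-32-bit tables (placeholders for the four TableT tables of one round).
toy_tables = [[(v * 0x9E3779B1 + k) & 0xFFFFFFFF for v in range(256)]
              for k in range(4)]
y = [0x12, 0x34, 0x56, 0x78]
order_seen = []
z = shuffled_lookup(y, toy_tables, order_seen)
```

Since XOR is commutative, `z` is the same for every selected order; only the sequence of memory accesses recorded in `order_seen` changes from run to run, which is what disturbs the alignment of traces.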
(3) XOR operations.
Each TableT lookup returns a 32-bit word. By performing lookups on four tables, four words are obtained. These words are then XORed using XOR tables to produce the output Z i for Part 2.
(4) Space complexity.
Part 2 involves three types of lookup tables: TableT tables, XOR tables, and a TableR table for permutations. The input of Part 2 is split into four bytes, and each byte $y_{i,k}$ corresponds to a TableT table. Therefore, each round requires 4 TableT tables, and for 32 rounds, a total of $4 \times 32 = 128$ tables are needed. Each TableT table takes an 8-bit input and returns a 32-bit output, so it occupies $2^8 \times 32$ bits $= 1$ KB. In total, in a white-box instance, the TableT tables occupy $128 \times 1$ KB $= 128$ KB of memory.
The four TableT lookups generate four 32-bit words, which require three XOR operations. Thus, Part 2 of each round needs $3 \times 8 = 24$ XOR tables. For 32 rounds of iterations, $24 \times 32 = 768$ XOR tables are needed. Each XOR table occupies $2^4 \times 2^4 \times 4$ bits $= 128$ B of memory, so in a white-box instance, the XOR tables in Part 2 require $768 \times 128$ B $= 96$ KB of memory.
The TableR table stores all 24 permutations of the set $\{0, 1, 2, 3\}$. Each permutation requires 1 byte, resulting in a total storage requirement of 24 bytes for the TableR table.
Part 3: Adding $X_i$ to obtain $X_{i+4}$.
(1) Encoding unification and XOR operation.
Part 3 computes the XOR of $X_i$ and $Z_i$. Similar to Part 1, affine transformation unification is first applied to both words. As a result, four 8-to-32-bit tables are generated for $X_i$, called TableC tables, and four for $Z_i$, called TableD tables. The process for generating a TableC table involves the steps shown in Figure 7, and the relationship between the input and output of a TableC table is defined as follows:
$$X_{i,k} = (out^{1}_{i,k,0}, \ldots, out^{1}_{i,k,7})^T \circ (P'_{i+4} \circ P_i^{-1}) \circ (in^{0}_{i,k,0}, in^{0}_{i,k,1})^T (x_{i,k}).$$
Similarly, the process of generating a TableD table is demonstrated in Figure 8, and the relationship between the input and output of the table is represented as follows:
$$Z_{i,k} = (out^{9}_{i,k,0}, \ldots, out^{9}_{i,k,7})^T \circ (P''_{i+4} \circ Q_i^{-1}) \circ (in^{8}_{i,k,0}, in^{8}_{i,k,1})^T (z_{i,k}).$$
As in the Xiao–Lai scheme, the affine transformations $P'_{i+4}$ and $P''_{i+4}$ share the same linear component, while the sum of their constant components equals the constant component of $P_{i+4}$.
Finally, the eight words $X_{i,k}$, $Z_{i,k}$ ($k = 0, 1, 2, 3$) are added up with the assistance of XOR tables.
(2) Space complexity.
Part 3 consists of three types of tables: TableC tables, TableD tables, and XOR tables. Both TableC and TableD tables take an 8-bit input and return a 32-bit output. Therefore, each table occupies $2^8 \times 32$ bits $= 1$ KB. Each round requires four TableC tables and four TableD tables. Over 32 rounds, this totals $4 \times 32 + 4 \times 32 = 256$ tables. Consequently, in a white-box instance, the TableC and TableD tables together consume $256 \times 1$ KB $= 256$ KB of memory.
Adding up the eight words $X_{i,k}$, $Z_{i,k}$ ($k = 0, 1, 2, 3$) requires seven XOR operations, so $8 \times 7 = 56$ XOR tables are used per round. For 32 rounds of iterations, a total of $32 \times 56 = 1792$ XOR tables are needed. The memory required for the XOR tables of Part 3 in a white-box instance is $1792 \times 128$ B $= 224$ KB.

4. Performance

The SM4 white-box scheme proposed in this paper utilizes five types of lookup tables: TableM, TableT, TableC, TableD, and XOR tables, along with a permutation table, TableR. The memory usage and quantity of each type of table were calculated in the previous section. As summarized in Table 3, the total memory required to generate an instance of our SM4 white-box scheme is 1.44 MB.
$$32\ \text{KB} + 384\ \text{KB} + 128\ \text{KB} + 128\ \text{KB} + 128\ \text{KB} + 672\ \text{KB} = 1472\ \text{KB} \approx 1.44\ \text{MB}.$$
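The per-part counts from Section 3 can be tallied with a short script as a sanity check. The variable names are hypothetical. The 32 KB term is copied from the sum above as a given, since Section 3 does not derive it; the remaining terms are recomputed from the table counts.

```python
KB = 1024
ROUNDS = 32
LUT_BYTES = 2**8 * 32 // 8          # 8-bit input, 32-bit output: 1 KB per table
XOR_BYTES = 2**4 * 2**4 * 4 // 8    # two 4-bit inputs, 4-bit output: 128 B

table_m = 3 * 4 * ROUNDS * LUT_BYTES    # Part 1: 12 TableM tables per round
table_t = 4 * ROUNDS * LUT_BYTES        # Part 2: 4 TableT tables per round
table_c = 4 * ROUNDS * LUT_BYTES        # Part 3: 4 TableC tables per round
table_d = 4 * ROUNDS * LUT_BYTES        # Part 3: 4 TableD tables per round

# Word-level XOR operations per round: 11 (Part 1) + 3 (Part 2) + 7 (Part 3),
# each realised by eight nibble-level XOR tables.
xor_tables = (11 + 3 + 7) * 8 * ROUNDS * XOR_BYTES

derived_kb = (table_m + table_t + table_c + table_d + xor_tables) // KB
total_kb = 32 + derived_kb              # 32 KB term taken from the text as given
total_mb = total_kb / 1024
```

The derived part comes to 1440 KB (384 + 128 + 128 + 128 + 672), and the total of 1472 KB is 1.4375 MB, which rounds to the reported 1.44 MB. The 24-byte TableR table is negligible at this scale.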
The round functions in our scheme are executed through a series of table lookup operations, resulting in high-speed encryption. We tested the runtime of our proposed white-box scheme on a personal computer with a 12th-generation AMD Ryzen 7 5700U processor (1.80 GHz) with Radeon Graphics and 24 GB of RAM and compared it to other publicly available white-box schemes. The results, presented in Table 4, show that the average runtime to generate a white-box instance of our scheme is approximately 44 ms, while encrypting a single block of plaintext takes about 2 ms.
Compared to other SM4 white-box schemes resistant to DCA attacks, our scheme consumes significantly less memory. The schemes proposed by Yuan et al. [17] and Zhang et al. [15] use 8th-order nonlinear encodings, which results in large XOR tables. Similarly, the scheme by Zhao et al. [18] employs a masking technique that requires memory for storing many randomly generated masking values. As a result, their memory consumption is 23.96, 16.86, and 5.42 times that of our scheme, respectively.

5. Security Analysis: White-Box Obfuscation

5.1. Algebraic Attacks

5.1.1. BGE Attack

The BGE analysis first combines the lookup tables in the white-box scheme [3]. Therefore, the encodings protected the output of the previous table and the input of the current table are the inverse of each other, and the compound of them are the identity function. The input and output encodings of the non-linear transformation are converted into affine transformations, and then algebraic methods are used to solve the affine transformation. Since the output encoding of each round is the inverse of the input encoding of the next round, the input encoding of all rounds except the first can be obtained by calculating the inverse of the previous round’s output encoding. Therefore, the attacker can solve the hidden key after deducing the encodings protecting the input and output of the combined table and already knowing the round function of the algorithm.
Similar to the BGE analysis of the Xiao–Lai scheme, the lookup table TableT and XOR tables from Part 2, the lookup table TableC and XOR tables from Part 3, and the lookup table TableM from Part 1 of the subsequent round iteration are combined, as illustrated in Figure 9. In an ideal case, the affine transformation and non-linear encoding protecting the output of one part, such as Q i 1 and ( o u t i , k , 2 k 8 , o u t i , k , 2 k + 1 8 ) T , are the inverses of the non-linear encoding and affine transformation safeguarding the input of the next part, such as Q i and ( i n i , k , 0 8 , i n i , k , 1 8 ) T . Consequently, only known operations like the S box and L functions remain within the combined table. The attacker then attempts to convert the non-linear encodings protecting the table’s input and output, specifically ( i n i , k , 0 0 , i n i , k , 1 0 ) T and ( o u t i , k , 2 k 11 , o u t i , k , 2 k + 1 11 ) T , k { 0 , 1 , 2 , 3 } , into affine transformations and solve them.
In our white-box scheme, however, the affine transformation and nonlinear encoding that protect the outputs of Part 3 are not the inverses of those that protect the inputs of Part 1 in the next round. Specifically,
$$\left(P_{i+4}^{-1}[k]\right) \circ \begin{pmatrix} \mathrm{in}_{i,k,0}^{0} \\ \mathrm{in}_{i,k,1}^{0} \end{pmatrix} \circ \begin{pmatrix} \mathrm{out}_{i,k,2k}^{11} \\ \mathrm{out}_{i,k,2k+1}^{11} \end{pmatrix} \circ \left(P_{i+4}[k]\right) \neq \mathrm{id}, \quad k = 0, 1, 2, 3.$$
This means the nonlinear encodings do not cancel each other out, and the combined operation $\left(P_{i+4}[k] \circ (\mathrm{out}_{i,k,2k}^{11}, \mathrm{out}_{i,k,2k+1}^{11})^{T}\right) \circ \left((\mathrm{in}_{i,k,0}^{0}, \mathrm{in}_{i,k,1}^{0})^{T} \circ P_{i+4}^{-1}[k]\right)$, $k \in \{0, 1, 2, 3\}$, remains unknown to the attacker. As a result, the BGE attack is thwarted.
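The cancellation argument can be illustrated with a toy example. The sketch below is not the authors' table generator: it models only a single 4-bit nonlinear encoding as a random permutation of 0..15, omits the affine layers, and uses hypothetical names throughout. It shows that a matched decoding composes with the encoding to the identity, whereas an independently generated decoding leaves an unknown permutation behind.

```python
import random


def random_nibble_bijection(rng):
    """Return a random permutation of 0..15 (a toy 4-bit nonlinear encoding)."""
    table = list(range(16))
    rng.shuffle(table)
    return table


def invert(table):
    """Return the inverse permutation of `table`."""
    inverse = [0] * len(table)
    for x, y in enumerate(table):
        inverse[y] = x
    return inverse


def compose(outer, inner):
    """Return the table of the map x -> outer[inner[x]]."""
    return [outer[v] for v in inner]


rng = random.Random(2024)
identity = list(range(16))

out_enc = random_nibble_bijection(rng)   # protects the output of one table
matched_in = invert(out_enc)             # ideal case: exact inverse on the next input

# Mismatched case: a decoding drawn independently of out_enc.
independent_in = random_nibble_bijection(rng)
while independent_in == matched_in:      # rule out the (very unlikely) coincidence
    independent_in = random_nibble_bijection(rng)

matched = compose(matched_in, out_enc)         # encodings cancel
mismatched = compose(independent_in, out_enc)  # an unknown permutation remains

print("matched pair cancels:   ", matched == identity)
print("mismatched pair cancels:", mismatched == identity)
```

In the matched case an attacker who merges the two tables sees only the known cipher operations; in the mismatched case the residual permutation stays inside the merged table.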

5.1.2. Lin–Lai Analysis

The Lin–Lai analysis improves on the BGE attack by using differential analysis to eliminate the unknown constant. In the Lin–Lai analysis of the Xiao–Lai scheme, the operations from Part 2, the encoding unification for $Z_i$ from Part 3, and the encoding unification for $X_{i+4}$ from Part 1 of the subsequent round are combined. As shown in Figure 2, $Q_i$ and $Q_i^{-1}$ cancel out, while $P_{i+4}$ and $P_{i+4}^{-1}$ are compounded, leaving only the unknown constant $A_{i+4}$ after the combination. Thus, $X_{i+4} = E_{i+1}^{-1}\left(\bigoplus_{k=0}^{3}\left[L \circ S \circ E_{i,k}(y_{i,k})\right] \oplus A_{i+4}\right)$. Furthermore, $E_{i+1}$ decomposes into operations on individual bytes, so the equation can be split into four separate equations:
$$\begin{aligned}
X_{i+4,0} &= lE_{i+1,0}^{-1}\left(\bigoplus_{k=0}^{3}\left[L \circ S \circ E_{i,k}(y_{i,k})\right] \oplus g_{i+4,0}\right),\\
X_{i+4,1} &= lE_{i+1,1}^{-1}\left(\bigoplus_{k=0}^{3}\left[L \circ S \circ E_{i,k}(y_{i,k})\right] \oplus g_{i+4,1}\right),\\
X_{i+4,2} &= lE_{i+1,2}^{-1}\left(\bigoplus_{k=0}^{3}\left[L \circ S \circ E_{i,k}(y_{i,k})\right] \oplus g_{i+4,2}\right),\\
X_{i+4,3} &= lE_{i+1,3}^{-1}\left(\bigoplus_{k=0}^{3}\left[L \circ S \circ E_{i,k}(y_{i,k})\right] \oplus g_{i+4,3}\right).
\end{aligned} \tag{17}$$
Here, $g_{i+4,t}$ ($t \in [0, 3]$) is the sum of $cE_{i+1,t}$ (the constant component of $E_{i+1,t}$) and $A_{i+4,t}$.
Since $X_{i+4,s}$ and $X_{i+4,t}$ ($s, t \in [0, 3]$) are affine-related, the affine transformation between them can be determined first, which enables the recovery of the linear components of $E_{i+1,s}^{-1}$ and $E_{i+1,t}^{-1}$, i.e., $lE_{i+1,s}^{-1}$ and $lE_{i+1,t}^{-1}$. Similarly, the linear component of $E_{i,k}$ can be recovered, followed by the linear component of $Q_i$. Finally, differential analysis is used to determine the constant components of $E_{i,k}$ and $Q_i$, ultimately recovering the key byte.
In our scheme, however, nonlinear encoding is introduced for protection. As previously mentioned, the unknown operation $\left(P_{i+4}[k] \circ (\mathrm{out}_{i,k,2k}^{11}, \mathrm{out}_{i,k,2k+1}^{11})^{T}\right) \circ \left((\mathrm{in}_{i,k,0}^{0}, \mathrm{in}_{i,k,1}^{0})^{T} \circ P_{i+4}^{-1}[k]\right)$ that remains after the combination is not a constant but involves nonlinear and affine computations. This prevents the decomposition of the mapping from $y_{i,k}$ to $X_{i+4}$ into four affine-related equations of the form of Equation (17). As a result, our scheme resists the Lin–Lai analysis.
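The role of the differential step can be seen in a small experiment. For an affine byte map $f(x) = Mx \oplus c$, the output difference $f(x) \oplus f(x \oplus \delta) = M\delta$ depends on neither $x$ nor the unknown constant $c$, which is what lets a differential approach remove the constant. Once a nonlinear encoding is composed on top, the difference varies with $x$. The sketch below is a toy under stated assumptions: `ROWS`, `CONST` and `NONLINEAR` are hypothetical 8-bit stand-ins (a random invertible binary matrix, a random constant, and a random byte permutation), not the encodings of the scheme.

```python
import random


def parity(x):
    """Parity of the set bits of x."""
    return bin(x).count("1") & 1


def apply_linear(rows, x):
    """Multiply an 8x8 GF(2) matrix (one bitmask per output bit) by the byte x."""
    y = 0
    for bit, row in enumerate(rows):
        y |= parity(row & x) << bit
    return y


rng = random.Random(7)
# Unit-triangular rows guarantee that the toy matrix is invertible.
ROWS = [(1 << i) | (rng.randrange(256) & (0xFF ^ ((1 << (i + 1)) - 1)))
        for i in range(8)]
CONST = rng.randrange(256)      # the "unknown" constant of the affine map
NONLINEAR = list(range(256))    # random byte permutation as a nonlinear encoding
rng.shuffle(NONLINEAR)


def affine(x):
    return apply_linear(ROWS, x) ^ CONST


def encoded(x):
    return NONLINEAR[affine(x)]


def output_differences(f, delta):
    """Set of f(x) ^ f(x ^ delta) over all byte inputs x."""
    return {f(x) ^ f(x ^ delta) for x in range(256)}


DELTA = 0x35
affine_diffs = output_differences(affine, DELTA)
encoded_diffs = output_differences(encoded, DELTA)

print("distinct differences, affine map:     ", len(affine_diffs))
print("distinct differences, with nonlinear: ", len(encoded_diffs))
```

The affine map yields a single output difference, independent of the constant, while the nonlinearly encoded map yields many, so the constant can no longer be cancelled in this way.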

5.1.3. Pan et al.’s Analysis

Pan et al.’s analysis [12] reduces the complexity of the Lin–Lai analysis by rearranging the order in which the unknowns are recovered. In the Xiao–Lai scheme, Lin et al. [10] first recover the linear component $lE_{i,k}$ and then determine the constant $cE_{i,k}$ using differential analysis, ultimately forming key-related equations to recover the key. Pan et al. [12], by contrast, begin by recovering the constant of each affine transformation and then deduce the linear components from the known information.
Whichever analysis method is applied, the three-part combination must first yield four affine-related maps such as those in Equation (17). As with the Lin–Lai analysis, because we add the protection of nonlinear encoding, the compound operations do not cancel out to a constant, and the mapping from $y_{i,k}$ to $X_{i+4}$ cannot be reduced to four affine-related equations. Consequently, Pan et al.’s analysis [12] is thwarted.

5.2. DCA Experiments

Our proposed scheme employs a shuffling strategy to randomize the execution order of the lookup tables. To evaluate its security against DCA, we performed a DCA analysis using the publicly available tool Deadpool [19], once with 500 traces and once with 1000 traces. The experimental results, shown in Figure 10, indicate that no significant peaks appear in the differential traces for any candidate value of each byte of the first-round key. Furthermore, Deadpool failed to return the correct key byte values. These results indicate that our scheme resists DCA under these experimental settings.
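To give some intuition for why shuffling flattens the differential traces, the following toy simulation applies a difference-of-means test to synthetic one-bit traces. It is only a sketch under strong assumptions and does not reproduce the Deadpool experiment: the S-box is a random byte permutation rather than the SM4 S-box, each trace has eight one-bit samples, the leaking sample is the least significant bit of the S-box output, and all names (`collect_traces`, `dca_peaks`, `KEY`) are hypothetical.

```python
import random


def make_sbox(rng):
    """Random byte permutation used as a stand-in S-box (not the SM4 S-box)."""
    table = list(range(256))
    rng.shuffle(table)
    return table


def collect_traces(sbox, key, n_traces, n_slots, shuffle, rng):
    """Simulate traces of one-bit samples; slot 0 holds the leaking bit
    unless the slot order is shuffled per trace."""
    plaintexts, traces = [], []
    for _ in range(n_traces):
        p = rng.randrange(256)
        leak = sbox[p ^ key] & 1
        trace = [leak] + [rng.randrange(2) for _ in range(n_slots - 1)]
        if shuffle:
            rng.shuffle(trace)
        plaintexts.append(p)
        traces.append(trace)
    return plaintexts, traces


def dca_peaks(sbox, plaintexts, traces):
    """For each key guess, return the largest absolute difference of means
    over all slots, splitting traces by the predicted bit."""
    n_slots = len(traces[0])
    peaks = []
    for guess in range(256):
        sums = ([0] * n_slots, [0] * n_slots)
        counts = [0, 0]
        for p, trace in zip(plaintexts, traces):
            bit = sbox[p ^ guess] & 1
            counts[bit] += 1
            for slot, value in enumerate(trace):
                sums[bit][slot] += value
        if 0 in counts:  # degenerate split, no difference of means defined
            peaks.append(0.0)
            continue
        peaks.append(max(
            abs(sums[1][s] / counts[1] - sums[0][s] / counts[0])
            for s in range(n_slots)
        ))
    return peaks


rng = random.Random(1)
SBOX = make_sbox(rng)
KEY = 0x3C  # toy key byte

plain_pts, plain_trs = collect_traces(SBOX, KEY, 400, 8, False, rng)
shuf_pts, shuf_trs = collect_traces(SBOX, KEY, 400, 8, True, rng)
plain_peaks = dca_peaks(SBOX, plain_pts, plain_trs)
shuffled_peaks = dca_peaks(SBOX, shuf_pts, shuf_trs)

best_plain = max(range(256), key=lambda g: plain_peaks[g])
print("unshuffled: best guess", hex(best_plain), "peak", plain_peaks[KEY])
print("shuffled:   peak for the correct key", round(shuffled_peaks[KEY], 3))
```

Without shuffling, the correct key guess produces a full-height peak at the leaking slot. With shuffling, the leaking bit is spread over all slots, so the peak for the correct key shrinks towards the level of the wrong guesses.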

6. Conclusions

This paper presents an improved SM4 white-box implementation that reduces the high memory cost typically required to resist known attacks. The proposed scheme combines affine and nonlinear encodings to protect intermediate data, and a shuffling strategy prevents the alignment of memory traces across block encryptions. We evaluated the security of the scheme against existing algebraic attacks and conducted DCA experiments. The results confirm that the scheme is secure against both algebraic attacks and DCA. Notably, our scheme requires only 1.44 MB of memory, significantly less than other DCA-resistant schemes.
Given ongoing advancements in side-channel analysis techniques, improving the classical DCA approach poses an interesting problem for future research. Additionally, further optimizing the SM4 white-box algorithm for stronger security and greater efficiency remains an open challenge.

Author Contributions

Conceptualization, S.Z. and S.C.; data curation, X.H.; investigation, Y.T. and Y.Y.; methodology, S.Z. and J.W.; project administration, X.H.; software, Y.B. and T.Z.; validation, Y.Y., Y.T. and J.W.; formal analysis, S.Z. and Y.B.; resources, X.H.; writing—original draft preparation, T.Z. and Y.X.; writing—review and editing, S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Laboratory Specialized Scientific Research Projects of Beijing Smart-chip Microelectronics Technology Co., Ltd, grant number SGSC0000AQQT2400701.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request to the corresponding authors.

Conflicts of Interest

Authors Xiaobo Hu, Yanyan Yu, Yinzi Tu and Jing Wang were employed by the company Beijing Smart-Chip Microelectronics Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Chow, S.; Eisen, P.; Johnson, H.; Van Oorschot, P.C. A white-box DES implementation for DRM applications. In Proceedings of the ACM Workshop on Digital Rights Management, Washington, DC, USA, 18 November 2002; pp. 1–15.
  2. Chow, S.; Eisen, P.; Johnson, H.; Van Oorschot, P.C. White-box cryptography and an AES implementation. Sel. Areas Cryptogr. 2003, 9, 250–270.
  3. Billet, O.; Gilbert, H.; Ech-Chatbi, C. Cryptanalysis of a White Box AES Implementation. In Proceedings of the International Workshop on Selected Areas in Cryptography, Waterloo, ON, Canada, 9–10 August 2004.
  4. Alpirez Bock, E.; Bos, J.W.; Brzuska, C.; Hubain, C.; Michiels, W.; Mune, C.; Sanfelix Gonzalez, E.; Teuwen, P.; Treff, A. White-box cryptography: Don’t forget about grey-box attacks. J. Cryptol. 2019, 32, 1095–1143.
  5. Alpirez Bock, E.; Brzuska, C.; Michiels, W.; Treff, A. On the Ineffectiveness of Internal Encodings—Revisiting the DCA Attack on White-Box Cryptography. In Proceedings of the International Conference on Applied Cryptography and Network Security, Leuven, Belgium, 2–4 July 2018.
  6. Lee, S.; Kim, T.; Kang, Y. A masked white-box cryptographic implementation for protecting against Differential Computation Analysis. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2602–2615.
  7. Lee, S.; Kim, M. Improvement on a masked white-box cryptographic implementation. IEEE Access 2020, 8, 90992–91004.
  8. Biryukov, A.; Udovenko, A. Attacks and Countermeasures for White-box Designs. In Proceedings of the 24th International Conference on the Theory and Application of Cryptology and Information Security, Brisbane, Australia, 2–6 December 2018.
  9. Xiao, Y.Y.; Lai, X.J. White-Box cryptography and implementations of SMS4. In Proceedings of the 2009 CACR Annual Meeting, Denver, CO, USA, 18–22 April 2009.
  10. Lin, T.T.; Lai, X.J. Efficient Attack to White-Box SMS4 Implementation. J. Softw. 2013, 24, 2238–2249.
  11. Bai, K.; Wu, C. A secure white-box SM4 implementation. Secur. Commun. Netw. 2015, 9, 996–1006.
  12. Pan, W.L.; Qin, T.H.; Jia, Y.; Zhang, L.T. Cryptanalysis of two white-box SM4 implementations. J. Cryptologic Res. 2018, 5, 651–670.
  13. Shi, Y.; Wei, W.; He, Z. A Lightweight White-Box Symmetric Encryption Algorithm against Node Capture for WSNs. Sensors 2015, 15, 11928–11952.
  14. Yao, S.; Chen, J. A new method for white-box implementation of SM4 algorithm. J. Cryptologic Res. 2020, 7, 358–374.
  15. Zhang, Y.Y.; Xu, D.; Chen, J. Analysis and Improvement of White-box SM4 Implementation. J. Electron. Inf. Technol. 2022, 44, 2903–2913.
  16. Yuan, Z.Q.; Chen, J. Differential Computation Analysis of White-box SM4 Scheme. J. Softw. 2022, 34, 3891–3904.
  17. Yuan, Z.Q.; Chen, J. A white-box SM4 scheme against Differential Computation Analysis. J. Cryptologic Res. 2023, 10, 386–396.
  18. Zhao, D.Y.; Wang, Y.B.; Li, Y.; Hu, X.B.; Yu, Y.Y.; Chen, S.; Zheng, S.H. An Efficient Masked White-Box Implementation of SM4. Electronics 2024, 13, 2326.
  19. SideChannelMarvels/Deadpool. Available online: https://github.com/SideChannelMarvels (accessed on 1 October 2024).
Figure 1. The flowchart of the $i$-th round iteration of SM4.
Figure 2. The $i$-th round function of the Xiao–Lai scheme.
Figure 3. Round function of our SM4 white-box scheme.
Figure 4. Generating a TableM table.
Figure 5. Generating an XOR table.
Figure 6. Generating a TableT table.
Figure 7. Generating a TableC table.
Figure 8. Generating a TableD table.
Figure 9. BGE analysis in our scheme.
Figure 10. The differential traces related to the first round key.
Table 1. Comparison of white-box implementations.
| Scheme | BGE Analysis | Lin–Lai Analysis | Pan Analysis | DCA | Memory |
|---|---|---|---|---|---|
| Xiao–Lai scheme [9] | Yes | No | No | No | 148.625 KB |
| Bai–Wu scheme [11] | Yes | Yes | No | No | 32.5 MB |
| Yuan’s scheme [17] | Yes | Yes | Yes | Yes | 34.5 MB |
| Zhang’s scheme [15] | Yes | Yes | Yes | Yes | 24.3 MB |
| Zhao’s scheme [18] | Yes | Yes | Yes | Yes | 7.8 MB |
| Our scheme | Yes | Yes | Yes | Yes | 1.44 MB |

“Yes” indicates the scheme can resist this kind of attack; “No” means it cannot.
Table 2. Symbols.
| Symbol | Description |
|---|---|
| $i$ | The index of the current round of iteration, $i = 0, \ldots, 31$. |
| $j$ | The index of a 32-bit word within a 128-bit state input to the round function, $j = 0, \ldots, 3$. |
| $k$ | The index of a byte within a state word, $k = 0, \ldots, 3$. |
| $t$ | The index of a nibble within a state word, $t = 0, \ldots, 7$. |
| $X_{i+j}$ | The $j$-th word input into the $i$-th round of iteration. |
| $X'_{i+j}$ | $X_{i+j}$ protected by encodings. |
| $x_{i+j,k}$ | The $k$-th byte of the word $X_{i+j}$. |
| $rk_{i,k}$ | The $k$-th byte of the $i$-th round key. |
| $X_{i+4}$ | The output word of the $i$-th round of iteration. |
| $Y_i$ | The output word of Part 1 during the $i$-th round of iteration. |
| $y_{i,k}$ | The $k$-th byte of the word $Y_i$. |
| $Z_i$ | The output word of Part 2 during the $i$-th round of iteration. |
| $z_{i,k}$ | The $k$-th byte of the word $Z_i$. |
| $P_{i+j}$ | A 32-dimensional invertible affine transformation to protect word $X_{i+j}$. |
| $lP_{i+j}$ | The linear component of the affine transformation $P_{i+j}$. |
| $cP_{i+j}$ | The constant component of the affine transformation $P_{i+j}$. |
| $P_{i+j}^{-1}$ | The inverse of the affine transformation $P_{i+j}$. |
| $E_i$ | A 32-dimensional affine transformation generated by $\mathrm{diag}(E_{i,0}, E_{i,1}, E_{i,2}, E_{i,3})$. |
| $E_{i,k}$ | An 8-dimensional invertible affine transformation. |
| $E_i^{-1} \circ P_{i+k}^{-1}$ | The compound affine transformation combining $E_i^{-1}$ and $P_{i+k}^{-1}$. |
| $Q_i$ | A 32-dimensional invertible affine transformation. |
| $m$ | The index of the nonlinear encoding, $m = 0, \ldots, 11$. |
| $\mathrm{out}_{i,k,t}^{m}$ | The $t$-th fourth-order nonlinear encoding to protect the output word of the current table lookup operation. |
| $\mathrm{in}_{i,k,t}^{m}$ | The $t$-th fourth-order nonlinear decoding to offset the protection of the previous table lookup operation. |
Table 3. Memory required for tables in our scheme.
| Table | Memory (Single) | Number of Tables | Memory (Total) |
|---|---|---|---|
| TableSE | 1 KB | 16 × 1 | 16 KB |
| TableFE | 1 KB | 16 × 1 | 16 KB |
| TableM | 1 KB | 4 × 3 × 32 | 384 KB |
| TableT | 1 KB | 4 × 32 | 128 KB |
| TableC | 1 KB | 4 × 32 | 128 KB |
| TableD | 1 KB | 4 × 32 | 128 KB |
| XOR | 0.125 KB | 32 × (8 × 11 + 8 × 3 + 8 × 7) | 672 KB |
| TableR | 0.375 KB | 1 | 0.375 KB |
| Total | N/A | N/A | 1472 KB |
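The totals in Table 3 can be checked with a few lines of arithmetic. The sketch below simply multiplies the per-table size by the table count for each row and sums the results; the list `TABLES` restates the table's entries and the helper name is illustrative.

```python
# (name, size of one table in KB, number of tables), as listed in Table 3.
TABLES = [
    ("TableSE", 1.0, 16 * 1),
    ("TableFE", 1.0, 16 * 1),
    ("TableM", 1.0, 4 * 3 * 32),
    ("TableT", 1.0, 4 * 32),
    ("TableC", 1.0, 4 * 32),
    ("TableD", 1.0, 4 * 32),
    ("XOR", 0.125, 32 * (8 * 11 + 8 * 3 + 8 * 7)),
    ("TableR", 0.375, 1),
]


def total_kb(tables):
    """Sum of (single-table size x table count) over all rows, in KB."""
    return sum(size * count for _, size, count in tables)


xor_count = dict((name, count) for name, _, count in TABLES)["XOR"]
total = total_kb(TABLES)

print("XOR tables:", xor_count)
print("total memory:", total, "KB =", round(total / 1024, 2), "MB")
```

The exact sum is 1472.375 KB, which the table reports as 1472 KB, i.e., about 1.44 MB.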
Table 4. Performance comparison of various SM4 white-box schemes.
| Scheme | Memory (One WB Instance) | Generation Time (s) | Total Tables (8-to-32-bit) | Total XORs (32-bit) | Affine Transformation | Encryption Time (ms) |
|---|---|---|---|---|---|---|
| Xiao–Lai Scheme [9] | 148.625 KB | 0.021 | 128 | 192 | 160 | 0.06 [18] |
| Bai–Wu Scheme [11] | 32.5 MB | 3.97 | 640 | 640 | 0 | 0.001 [18] |
| Yao’s Scheme [14] | 276.625 KB | 0.092 | 128 | 96 + 96 (64-bit) | 160 | 0.06 [18] |
| Zhang’s Scheme [15] | 24.3 MB | | 640 | 192 | 128 | |
| Yuan’s Scheme [17] | 34.5 MB | | 672 | 536 | 0 | |
| Zhao’s Scheme [18] | 7.8 MB | 2.66 | 192 | 208 | 216 | 0.08 [18] |
| Our Scheme | 1.44 MB | 0.044 | 800 | 672 | 0 | 2 |
