The Design of Compact SM4 Encryption and Decryption Circuits That Are Resistant to Bypass Attack

Abstract: To defend against side channel attacks, a compact SM4 circuit was designed based on masking and random delay techniques, and the linear transformation module was designed with randomly inserted pseudo operations. By analyzing the glitch data generated by the S-box of SM4 with different inputs, the security against glitch attacks was confirmed. Then, DPA (Differential Power Analysis) was performed on the designed circuit. The key could not be obtained even with 100,000 power curves, which verifies the security of the SM4 circuit against DPA. Finally, the designed circuit was synthesized using Synopsys DC (Design Compiler, Mountain View, CA 94043, USA). The results show that the area of the designed circuit in the SMIC 0.18 µm process is 82,734 µm², which is 48% smaller than results reported in other papers.


Introduction
The SM4 algorithm is a symmetric block cipher announced by the Chinese National Cipher Management Committee Office in January 2006. It has been widely used in various fields of information security in China, such as wireless local area networks (WLAN), the WLAN Authentication and Privacy Infrastructure (WAPI), storage devices, and smart card systems. As the SM4 algorithm is mostly used in high-speed and resource-constrained applications, compact circuit implementations of SM4 are needed. As a standard cipher algorithm, SM4 has been widely used in the field of information security for its short build time and low memory requirements [1][2][3]. However, Differential Power Analysis (DPA), which has been developed in recent years, poses great challenges to the security of SM4 circuits. DPA is a typical side channel attack (SCA) method that collects the power consumption of an operation and performs a correlation analysis on it. By exploiting the correlation between sensitive information in the operation and the instantaneous power consumption of the CMOS circuit, DPA can quickly recover the key of SM4. It is simple to implement, efficient, and fast, and therefore poses a serious threat to the security of integrated circuits. The goal of this work is to study compact SM4 circuits resistant to SCA for resource-constrained applications.
Since Kocher proposed Power Analysis (PA) in 1998, research on PA and on defense measures for cryptographic circuits has increased [4]. In recent years, machine learning (ML) and principal component analysis (PCA) have also been applied to PA because of the large amount of data that must be statistically analyzed [5,6]. Based on the principle of PA, the correlation between

Randomization Method
The randomization method aims to destroy the correlation between the intermediate value and the power consumption by adding redundant power consumption or random noise. Randomization methods usually include the insertion of random pseudo operations and random delay. Herbst et al. [12] designed an AES cryptographic circuit based on the randomization method. Since the AES algorithm has a width of 128 bits and the bytes can be processed in any order, the order of operations can be randomized during execution. Inserting pseudo operations disturbs the power consumption information of the circuit and thereby resists power consumption attacks. Kocher et al. [16] applied random delay insertion to the clock of the cryptographic circuit, which removed the fixed time point of the intermediate value operation and prevented the attacker from finding the position in the power consumption data that corresponds to the attack point. This randomization method has the advantages of small circuit area and simple implementation, and is commonly used for resisting power consumption attacks.
Electronics 2020, 9, 1102

In general, the masking method is strong in security and easy to implement, but it has the disadvantage of being unable to resist glitch attacks. Although the randomization method has a weak ability to defend against attacks, its resource overhead is small. In order to achieve a circuit resistant to DPA for resource-constrained applications, this work designs and implements a compact SM4 encryption and decryption circuit based on the masking and randomization methods.
The rest of this manuscript is organized as follows: Section 2 introduces the overall structure design of SM4. Section 3 details the sub-module design of SM4. Section 4 analyzes the security of the proposed SM4 to resist glitch attacks and DPA. In Section 5, we study the synthesized results of the SM4 circuit. Section 6 summarizes the conclusions of this work.

Overall Structure Design
Because the attacker can use the power consumption information of the cryptographic circuit at any time to recover the original key, it is not safe to apply defensive measures to only some modules of the SM4 encryption and decryption circuit. The security design should protect the SM4 circuit during every operation of the algorithm. According to [1], the positions of SM4 that can be attacked are the input and output of the S-box and the output of the linear transformation. Attackers therefore usually choose the input or output of the S-box as the target of Power Analysis. References [1,10] selected the input and the output of the S-box, respectively, to attack the SM4 encryption circuit, and both successfully obtained the key. This paper uses a masking method to protect the T-box. Because the composite field mask S-box is easily broken by glitch attacks, this paper further proposes a random delay method that changes the time delay of the input data of the composite field mask S-box. The linear transformation performs different cyclic shifts on the output of the S-box to achieve diffusion. At this position, a byte-wise distinguishing function cannot be used, and only a word-wise (32-bit) attack is possible. In this case, the attacker needs to try at least 2^32 guesses, so the attack is very difficult. We therefore use the parallel structures in the circuit that are idle at a given time and insert random operations into them, generating randomized power consumption that disturbs the overall power consumption of the circuit. Because the key expansion and the round transformation of the SM4 encryption and decryption circuit designed in this paper reuse the T-box, there is no need to protect the key expansion separately. Using the above design ideas, the SM4 encryption and decryption circuit based on the random mask and randomization method is designed, as shown in Figure 1.

Figure 1 includes a mask T-box, a linear transformation L1, a linear transformation L2, and a random mask generation module. Among them, the mask T-box occupies more than 40% of the combinational logic overhead of the total circuit, and is an important part of the round transformation and key expansion in the SM4 encryption and decryption circuit. The composite field mask S-box is also an important part of the T-box module. Based on the above design ideas, the following section designs a compact SM4 encryption and decryption circuit based on the mask and randomization method.

Sub-Module Design
This section will study the design and optimization of the mask T-box, random linear transformation L1, random linear transformation L2, and random mask generation module based on random delay, mask, and random insertion pseudo-operation method.

Design of the Mask T-Box
The designed circuit structure of the mask T-box is shown in Figure 2. It is divided into three main parts: the XOR operation module, the random delay module, and the mask S-box. At the beginning of the T-box operation, the mask M_I is combined with the key rk or the fixed parameter CK to mask the intermediate value. Then the mask is fed into the random delay module along with the round data D produced by the XOR operation, and the data path is delayed. The delayed data enters the mask S-box for calculation. Finally, the mask S-box outputs the non-linearly transformed data S and the updated mask M_S.

Since the composite field mask S-box cannot resist glitch attacks, we use a composite field mask method based on random delay to improve the security of the S-box. From the principle of the glitch attack, it is known that the number of glitches in the circuit is related to the power consumption of the circuit. When different values are input to the S-box, the number of glitches in the S-box operation circuit is different. Thus, the attacker can establish a connection between the input value of the S-box and the power consumption. The key of the circuit was successfully attacked in this way through a differential Power Analysis [13]. Therefore, as long as the correlation between the number of glitches and the input of the S-box is destroyed, glitch attacks can be resisted. Based on the above analysis, the structure of the designed random delay module is shown in Figure 3.
In Figure 3, each triangle represents a buffer unit, which is made up of four identical NOT gates connected in series. According to the mask input of the random delay module, the selector determines by how many buffer units the data path is delayed and outputs the delayed data. Because the round data D entering the S-box is 32 bits wide, 32 random delay modules of the kind shown in Figure 3 are needed. Since each selector requires a 3-bit control signal, the 32-bit mask M_I can control at most 10 selectors independently. However, the glitch attack targets a single S-box, so even if the delays of different S-boxes are the same, the security of each S-box against the glitch attack is not affected. We therefore use the lower 24 bits of the mask M_I to control the outputs of the 32 selectors of the data path.

[Figure 3. Random delay module. Inputs: mask, round data D (1 bit); output: random delay module output (1 bit).]

The most complicated part of the T-box is the mask S-box. Based on the composite field mask method, the mask S-box module structure is shown in Figure 4. Because the round data entering the S-box is 32 bits wide, 4 mask S-box modules of the kind shown in Figure 4 are needed.

In Figure 4, M_A indicates the 8-bit mask input, A indicates the 8-bit input data, and E indicates the 8-bit output data. After the mask affine transformation module, mask field mapping module, composite field mask inversion module, and mask composite affine transformation module of the SM4 S-box, the masks are M_B, M_C, M_D, and M_E, respectively. With the prime polynomials over GF((2^4)^2) and GF(2^4) fixed, the mask design and optimization of each module in Figure 4 are studied next.
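The selector control described for the random delay module can be sketched as follows. This is a minimal illustration, not the authors' circuit: the function names, the assignment of mask bits to selectors, and the reuse of the same eight 3-bit control values for each of the four S-box bytes are assumptions made here to reconcile "lower 24 bits" with "32 selectors".

```python
import secrets


def selector_controls(mask_i: int) -> list[int]:
    """Split the lower 24 bits of the 32-bit mask into eight 3-bit control values."""
    low24 = mask_i & 0xFFFFFF
    return [(low24 >> (3 * i)) & 0b111 for i in range(8)]


def delays_for_word(mask_i: int) -> list[int]:
    """Return the delay tap (0..7 buffer units) for each of the 32 data bits.

    Assumed mapping: bit position within a byte picks the control value,
    so all four S-box bytes share the same eight controls.
    """
    controls = selector_controls(mask_i)
    return [controls[bit % 8] for bit in range(32)]


mask_i = secrets.randbits(32)  # stands in for the hardware mask M_I
print(delays_for_word(mask_i))
```

Because the delay pattern depends only on the fresh mask, two evaluations of the same S-box input generally see different arrival times on each input bit, which is what decorrelates the glitch count from the data.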

Optimization Design of Mask Affine Transformation Module
Suppose A is the input of the mask affine transformation module, B is the output of the mask affine transformation module, and M_A and M_B are the mask input and mask output of the mask affine transformation module, respectively. B and M_B can be expressed in the forms shown in Equation (1),
where X is the input data without mask operation.
Equation (2) is substituted into Equation (1) and the result is optimized with the DACSE (delay-aware common sub-expression elimination) method. The resulting circuit logic expressions of the mask output and the output data of the optimized mask affine transformation module are shown in (3) and (4), respectively,
where the symbol ⊕ represents the XOR operation and the symbol ⊙ represents the XNOR operation.
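The reason an affine module can process data and mask on two separate paths is that the linear part distributes over XOR. The sketch below checks this property exhaustively. The matrix `ROWS` and constant `CONST` are arbitrary placeholders chosen for illustration and are not the SM4 affine parameters; the function names are also hypothetical.

```python
# Placeholder 8x8 binary matrix (one integer per row) and constant.
ROWS = [0b11100101, 0b11110010, 0b01111001, 0b10111100,
        0b01011110, 0b00101111, 0b10010111, 0b11001011]
CONST = 0x5A


def linear(x: int) -> int:
    """Multiply the bit vector x by the binary matrix ROWS over GF(2)."""
    y = 0
    for i, row in enumerate(ROWS):
        parity = bin(row & x).count("1") & 1
        y |= parity << i
    return y


def affine(x: int) -> int:
    """Unmasked affine transformation: linear part plus constant."""
    return linear(x) ^ CONST


def masked_affine(a: int, m_a: int) -> tuple[int, int]:
    """Transform masked data A and mask M_A without ever unmasking.

    The data path applies the full affine map; the mask path applies
    only the linear part, so the constant is added exactly once.
    """
    b = affine(a)
    m_b = linear(m_a)
    return b, m_b
```

For any unmasked input X and mask M_A, calling `masked_affine(X ^ M_A, M_A)` returns a pair whose XOR equals `affine(X)`, so the output stays correctly masked by M_B.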

Optimized Design of Mask Field Mapping Module
Assume that B is the input of the mask field mapping module, C is the output of the mask field mapping module, and M_B and M_C are the mask input and mask output of the mask field mapping module, respectively. According to (5), C can be expressed as Equation (6).
The mask field mapping module performs the same field mapping operation on the masked data B and the mask M_B, so it can be implemented directly using two field mapping modules.

The mask output M_E and output data E of the mask composite affine transformation module can be deduced, as shown in Equation (7). Substituting (8) into (7) and optimizing with the DACSE method, the circuit logic expressions of the mask operation and data operation of the composite affine transformation module are shown in (9) and (10), respectively.

(a) Mask GF(2^4) Addition
Since the mask operation is consistent with the masked data operation, the masked GF(2^4) add operation can be implemented with just two GF(2^4) add operations. The output data and mask of the masked GF(2^4) add operation are shown in (11) and (12), respectively.
Among them, X and Y represent the input data of the mask GF(2^4) add operation, and M_X and M_Y represent the input masks of the mask GF(2^4) add operation. Z and M_Z represent the output data and output mask of the masked GF(2^4) add operation.
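Addition in GF(2^4) is a bitwise XOR, so the statement that two plain adders suffice can be sketched directly. The function name is hypothetical, and the two returned expressions are the natural reading of the description above rather than a copy of Equations (11) and (12).

```python
def masked_add(x: int, m_x: int, y: int, m_y: int) -> tuple[int, int]:
    """Masked GF(2^4) addition using one adder for data and one for masks.

    x and y are already masked (x = x' ^ m_x, y = y' ^ m_y).
    """
    z = x ^ y          # data path
    m_z = m_x ^ m_y    # mask path
    return z, m_z
```

The unmasked sum X' ⊕ Y' never appears as an intermediate value: it is only recoverable as `z ^ m_z`.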

(b) Mask GF(2^4) Multiplication
Suppose that X' and Y' represent the input data of the mask GF(2^4) multiplication operation after demasking, and Z' is the output data of the mask GF(2^4) multiplication operation after demasking. Then Equation (13) can be obtained.
If the output mask of the module is M_Z and the output data is Z, then M_Z = Z ⊕ Z'. In order to avoid exposing intermediate results through a direct operation between the input data and the mask, the mask operation equation and data operation equation of the mask GF(2^4) multiplication operation are obtained, as shown in (14).
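A common way to realize a masked multiplication with M_Z = Z ⊕ Z' is to expand the product of the two demasked operands and accumulate the partial products onto a fresh output mask. The sketch below follows that textbook construction; it is not necessarily identical to Equation (14). The reduction polynomial x^4 + x + 1 and all function names are assumptions, since the prime polynomial used by the paper is not stated in this section.

```python
def gf16_mul(a: int, b: int, poly: int = 0b10011) -> int:
    """Multiply two GF(2^4) elements modulo x^4 + x + 1 (assumed polynomial)."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:       # degree 4 term appeared: reduce
            a ^= poly
    return result


def masked_mul(x: int, m_x: int, y: int, m_y: int, m_z: int) -> int:
    """Return Z such that Z ^ m_z == X' * Y', with X' = x ^ m_x and Y' = y ^ m_y.

    The accumulation starts from the fresh mask m_z so that no partial
    sum equals the unmasked product.
    """
    z = m_z ^ gf16_mul(x, y)
    z ^= gf16_mul(x, m_y)
    z ^= gf16_mul(m_x, y)
    z ^= gf16_mul(m_x, m_y)
    return z
```

The four partial products are exactly the expansion of (x ⊕ m_x)(y ⊕ m_y), which is why four multipliers appear in straightforward masked multiplier designs.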

(c) Mask GF(2^4) Squared Constant Operation
Suppose that X and X' represent the masked and unmasked input data of the masked GF(2^4) squared constant operation, and M_X is the input mask. The calculation of Z', which is the output data of the mask GF(2^4) squared constant operation, is shown in (15).
It can be seen from Equation (15) that this operation is composed of two GF(2^4) squared constant operations. When a constant matrix is used, the GF(2^4) squared constant operation has no area overhead, so the mask GF(2^4) squared constant operation also requires no resource overhead.

(d) Mask GF(2^4) Inversion Operation
Suppose that X and X' represent the masked and unmasked input data of the masked GF(2^4) inversion operation, and M_X is the input mask. The calculation of Z', which is the output data of the mask GF(2^4) inversion operation, is shown in (16).
If the operation is implemented directly in the manner of Equation (16), multiple GF(2^4) multiplication and addition modules are consumed. This not only causes a huge area overhead, but also increases the critical path delay of the circuit. If the mask output of the module is M_Z and the data output is Z, then M_Z = Z ⊕ Z'. In order to avoid exposing intermediate results through a direct operation between the input data and the input mask, the mask operation formula and data operation formula of the mask GF(2^4) inversion operation are obtained, as shown in Equation (17).
Since the expression in (17) still requires multiple GF(2^4) multiplication operations, it can be simplified further to obtain the expression shown in (18).
Let T = X^8 and M_T = M_X^8. The optimized output Q is then obtained, as shown in Equation (19).
It can be seen from Equation (19) that the mask GF(2^4) multiplication operation requires two GF(2^4) multiplication operations in addition to the above operations.
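Two facts underlie the substitution T = X^8, M_T = M_X^8: in GF(2^4) the inverse of a nonzero element is its 14th power (X^14 = X^8 · X^4 · X^2), and raising to a power of two is linear, so the mask can be raised to the same power on a separate path. The sketch below checks both. The reduction polynomial x^4 + x + 1 and the function names are assumptions for illustration.

```python
def gf16_mul(a: int, b: int, poly: int = 0b10011) -> int:
    """Multiply two GF(2^4) elements modulo x^4 + x + 1 (assumed polynomial)."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= poly
    return result


def gf16_square(a: int) -> int:
    return gf16_mul(a, a)


def gf16_pow8(x: int) -> int:
    """X^8 via three squarings; linear over GF(2), so it commutes with XOR."""
    return gf16_square(gf16_square(gf16_square(x)))


def gf16_inv(x: int) -> int:
    """X^-1 = X^14 = X^8 * X^4 * X^2 (maps 0 to 0)."""
    x2 = gf16_square(x)
    x4 = gf16_square(x2)
    x8 = gf16_square(x4)
    return gf16_mul(gf16_mul(x8, x4), x2)
```

Because `gf16_pow8(x ^ m)` equals `gf16_pow8(x) ^ gf16_pow8(m)`, the masked value and its mask can each be passed through the same cost-free squaring network, leaving only the multiplications to be masked explicitly.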
After the optimized design, the area overhead of the mask GF((2^4)^2) inversion circuit implemented in this section is 231 A_XOR + 159 A_AND + 1 A_XNOR, which is 8298.81 µm², and its critical path delay is 19 T_XOR + 4 T_AND. Before optimization, 414 XOR gates and 322 AND gates are required, with an area of 15,302.36 µm² and a critical path delay of 21 T_XOR + 4 T_AND. Compared with the situation before optimization, the critical path delay of the circuit is reduced by 8.3%, and the circuit area is reduced by about 45.83%.
Based on the above analysis and the SMIC 0.18 µm process, the mask S-box implemented in this section is synthesized. The comparison between before and after the optimization is shown in Table 2. The SM4 S-box based on the composite field mask implemented in this paper consumes 322 AND gates and 580 XOR gates before optimization, with an area of 19,719.62 µm² and a critical path delay of 30 T_XOR + 4 T_AND. After optimization, the critical path delay is reduced by 12.1%, and the circuit area is 7 A_XNOR + 159 A_AND + 322 A_XOR, which equals 10,870.98 µm², a reduction of 44.87%.

Design of Random Linear Transformation
According to the analysis above, the attack point of the SM4 cryptographic circuit is generally the input or output of the S-box. Since the linear transformation part can only be attacked word by word, an attack there is difficult. Therefore, we use the method of randomly inserting pseudo operations to defend against power consumption attacks on this module. Since the round transformation and key expansion are performed serially, the linear transformation L2 of the round transformation is idle while the key expansion is being calculated, and the linear transformation L1 of the key expansion is idle while the round transformation is being performed. Making use of the idle modules, the structure of the designed random linear transformation module is shown in Figure 5.
In Figure 5, (a) shows the random linear transformation L1 of the key expansion, and (b) shows the random linear transformation L2 of the round transformation. C1 and C2 come from the random mask generation module. When the round transformation operation is being performed, C1 is 0 and C2 is 32'hffffffff. At the same time, a linear transformation on mask data is performed in the random linear transformation L1, which disturbs the power consumption generated by the random linear transformation L2 of the round transformation being performed.
Similarly, when the key expansion operation is performed to the random linear transformation operation, the random linear transformation module of the round transformation will also generate corresponding random power consumption, thereby increasing the difficulty of power consumption attack.
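The idea of keeping the idle linear unit busy with random data can be sketched behaviourally. The rotation amounts are those of the SM4 standard (round transformation: 2, 10, 18, 24; key expansion: 13, 23). The function `linear_step`, its arguments, and the way the dummy operand is supplied are simplifications assumed here; the C1/C2 gating of Figure 5 is not modeled.

```python
import secrets

MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def l2_round(b: int) -> int:
    """Linear transformation of the SM4 round function (L2 in this paper)."""
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)


def l1_key(b: int) -> int:
    """Linear transformation of the SM4 key expansion (L1 in this paper)."""
    return b ^ rotl32(b, 13) ^ rotl32(b, 23)


def linear_step(data: int, key_phase: bool, rand_word: int) -> tuple[int, int]:
    """Return (useful result, dummy result).

    The unit that is idle in the current phase processes a random word,
    so its switching activity adds data-independent power consumption.
    """
    if key_phase:
        return l1_key(data), l2_round(rand_word)
    return l2_round(data), l1_key(rand_word)


useful, dummy = linear_step(0x01234567, key_phase=False, rand_word=secrets.randbits(32))
print(hex(useful))
```

Only the first element of the returned pair is used by the cipher; the second exists purely for its side effect on the power trace.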

Design of Random Mask Generation Module
The random mask generation module is used to generate the random mask required by the mask T-box and the operands C1 and C2 required by the mask linear transformation module. The random mask generation module designed in this section is shown in Figure 6.

Because an LFSR (linear feedback shift register) is a common method for generating pseudo-random numbers, a 32-bit LFSR is used in the random mask generation module. Figure 6 contains the ring oscillator operation unit, metastable processing unit, 32-bit LFSR, and selection unit. Among them, the ring oscillator operation unit is used to generate random numbers based on circuit characteristics. The metastable processing unit is made up of two D flip-flops connected in series. It synchronizes the data generated by the ring oscillator operation unit to the clock domain of the SM4 encryption and decryption circuit and eliminates the metastable state. The LFSR is fed with this single-bit random data derived from the characteristics of the circuit, so that its output tends toward true randomness, which ensures the security of the mask output. When the Key_valid signal is 0, C1 outputs the mask and C2 outputs 0; otherwise, C1 outputs 0 and C2 outputs the mask.
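A behavioural sketch of the LFSR and selection unit described above follows. The tap positions (32, 22, 2, 1), taken from commonly published LFSR tap tables, are an assumption, as are the function names and the choice to XOR the synchronized ring oscillator bit into the feedback; the paper does not give these details in this section.

```python
MASK32 = 0xFFFFFFFF


def lfsr32_step(state: int, entropy_bit: int = 0) -> int:
    """Advance a 32-bit Fibonacci LFSR by one clock.

    Feedback taps at bit positions 32, 22, 2, 1 (assumed), with one
    external entropy bit mixed into the feedback.
    """
    feedback = ((state >> 31) ^ (state >> 21) ^ (state >> 1) ^ state) & 1
    feedback ^= entropy_bit & 1
    return ((state << 1) | feedback) & MASK32


def select_outputs(mask: int, key_valid: int) -> tuple[int, int]:
    """Selection unit: return (C1, C2) as described for the Key_valid signal."""
    if key_valid == 0:
        return mask, 0
    return 0, mask


state = 0x1
for bit in (1, 0, 1, 1):          # stand-in for synchronized ring oscillator bits
    state = lfsr32_step(state, bit)
print(select_outputs(state, key_valid=0))
```

Without the entropy input the sequence would be fully determined by the seed; mixing in one physical random bit per clock is what prevents an attacker from predicting future masks from past ones.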

Security Analysis
This section first performs a security analysis on the random delay mask S-box to verify its ability to resist glitch attacks, and then verifies the anti-DPA attack performance of the compact SM4 encryption and decryption circuit designed in this paper.

Security Analysis of Random Delay Mask S-box
The principle of the glitch attack is to attack the key based on the correlation between the number of glitches in the S-box operation and the input of the S-box. Therefore, as long as the number of glitches in the S-box operation is random, the glitch attack can be resisted. Table 3 lists the relationship between the input value of the random delay mask S-box and the number of glitches for some mask inputs.

Table 3. Number of glitches in the random delay mask S-box for different SboxIn (columns) and MaskIn (rows) values.

MaskIn      SboxIn: 0xff   0xaa   0x44   0x56
0xedcba9            35     33     30     26
0xff24dc            33     39     31     32
0x156894            29     32     21     31
0xffffff            22     24     26     29
0x002456            34     29     23     26
0x380cd9            42     32     28     48
0x6ad0c3            31     24     28     29

In Table 3, SboxIn represents the input of the random delay mask S-box. MaskIn is used to control the random delay module to generate different delays in different data paths. It can be seen from Table 3 that, under different MaskIn values, the number of glitches generated by the S-box operation for the same SboxIn is different and tends to be random. The random delay mask S-box can therefore effectively resist glitch attacks.

Security Analysis of SM4 Encryption and Decryption Circuit
To better evaluate the security of SM4 encryption and decryption circuit, there are two sets of tests performed on FPGA and ASIC, separately.
Firstly, the circuit designed in this paper is implemented on an FPGA. The differential Power Analysis platform designed in [17] is used for data collection and Power Analysis. During the attack, the selected intermediate value is the high 8-bit output of the first-round byte replacement operation, and the corresponding round key rk0 is 32'h15263748. After analyzing the collected 100,000 power consumption curves with Matlab, the output Power Analysis results are shown in Figure 7.

At the time of the intermediate value calculation output, the guessed key K_guess corresponding to the DPA curve with the highest peak value is 8'hf7, which is wrong (the high byte of rk0 is 8'h15). Therefore, with 100,000 collected power consumption curves, differential Power Analysis cannot recover the key of the circuit.
Secondly, Synopsys VCS, DC, Prime Time-PX and the other EDA software are used to simulate the operation process of the cryptographic circuit, and calculate the corresponding power consumption data according to the turnover rate of the cryptographic circuit in the simulation process. We take the low 8-bit output of the second round of byte replacement operation as an example. The second round key rk1 is 32 h2937ac24. After analyzing the collected 100,000 power consumption curves, the output Power Analysis results are shown in Figure 8.
Electronics 2020, 9, x FOR PEER REVIEW 15 of 17 process. We take the low 8-bit output of the second round of byte replacement operation as an example. The second round key rk1 is 32′h2937ac24. After analyzing the collected 100,000 power consumption curves, the output Power Analysis results are shown in Figure 8. At the time of the intermediate value calculation output, the key Kguess corresponding to the DPA curve with the highest peak value is 8′h43, and the guess key is also wrong. Therefore, it is proved again that the random mask and randomization scheme proposed in this paper ensured the security of the compact SM4 encryption and decryption circuit against DPA attacks.
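The attack procedure used in both tests (guess a key byte, predict an S-box output bit, split the traces by that bit, and look for the guess with the highest differential peak) can be illustrated with a minimal sketch. It is not the platform of [17]. The S-box is a hypothetical random permutation rather than the SM4 S-box, the traces are synthetic with Hamming-weight leakage plus Gaussian noise, and the masking model is an idealized fresh Boolean mask per trace. The function names, trace count, and noise level are assumptions chosen for illustration only.

```python
import random

def hamming_weight(x):
    return bin(x).count("1")

def dpa_difference_of_means(traces, plaintext_bytes, sbox, bit=0):
    """Single-bit difference-of-means DPA.

    Returns (best_guess, peaks), where peaks[k] is the largest absolute
    difference between the two mean traces obtained for key guess k.
    """
    n_samples = len(traces[0])
    total = [sum(column) for column in zip(*traces)]
    peaks = []
    for guess in range(256):
        ones_sum = [0.0] * n_samples
        ones = 0
        for trace, p in zip(traces, plaintext_bytes):
            # Selection function: predicted S-box output bit for this guess.
            if (sbox[p ^ guess] >> bit) & 1:
                ones += 1
                for t in range(n_samples):
                    ones_sum[t] += trace[t]
        zeros = len(traces) - ones
        if ones == 0 or zeros == 0:
            peaks.append(0.0)
            continue
        peaks.append(max(
            abs(ones_sum[t] / ones - (total[t] - ones_sum[t]) / zeros)
            for t in range(n_samples)))
    best_guess = max(range(256), key=lambda k: peaks[k])
    return best_guess, peaks

def simulate_traces(key_byte, sbox, n_traces, masked, rng,
                    n_samples=4, leak_at=2, noise=0.5):
    """Synthetic traces: one sample leaks the Hamming weight of the
    (optionally masked) S-box output; all samples carry Gaussian noise."""
    plaintexts, traces = [], []
    for _ in range(n_traces):
        p = rng.randrange(256)
        value = sbox[p ^ key_byte]
        if masked:
            value ^= rng.randrange(256)  # idealized fresh Boolean mask
        trace = [rng.gauss(0.0, noise) for _ in range(n_samples)]
        trace[leak_at] += hamming_weight(value)
        plaintexts.append(p)
        traces.append(trace)
    return plaintexts, traces

rng = random.Random(2020)
toy_sbox = list(range(256))
rng.shuffle(toy_sbox)  # hypothetical stand-in, NOT the SM4 S-box
KEY = 0x15             # illustrative key byte

pt_u, tr_u = simulate_traces(KEY, toy_sbox, 2000, False, rng)
guess_u, peaks_u = dpa_difference_of_means(tr_u, pt_u, toy_sbox)

pt_m, tr_m = simulate_traces(KEY, toy_sbox, 2000, True, rng)
guess_m, peaks_m = dpa_difference_of_means(tr_m, pt_m, toy_sbox)

print("unprotected: guess", hex(guess_u), "peak", round(peaks_u[guess_u], 3))
print("masked:      guess", hex(guess_m), "peak", round(peaks_m[guess_m], 3))
```

In this toy model the unprotected traces show a clear peak at the correct key byte. With the mask applied, the peaks of all guesses stay at the noise level, so the guess with the highest peak carries no information about the key. The sketch does not model the random delay or pseudo operation insertion of the actual circuit and is not a security evaluation.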

Synthesis Results
The circuit designed in this paper was synthesized for the Xilinx Zynq-7000 (XC7Z020CLG484) FPGA platform in Vivado 2017.4 and implemented after constraints were added. Table 4 shows its resource consumption and performance on the FPGA. The SM4 encryption and decryption circuit based on the random mask and randomization method achieved a throughput of 99.56 Mbps with a resource overhead of 968 LUTs and 536 FFs. In addition, Synopsys DC and PrimeTime PX were used to synthesize the compact SM4 encryption and decryption circuit and analyze its timing. In the SMIC 0.18 µm process, the circuit area is 82,734 µm². Because the area of the same circuit differs across SMIC processes, we take a NAND gate (9.9792 µm²) as the standard gate and convert the circuit area to an equivalent number of NAND gates. The circuit occupies 8290 NAND gate equivalents, and the critical path delay is 8.93 ns. Table 5 compares the characteristic parameters and resource overhead of the circuit designed in this paper with those of other circuits. As can be seen from the table, the SM4 circuit designed in this paper has a better throughput-per-gate figure. Moreover, the circuit in [1] cannot resist glitch attacks and supports only SM4 encryption, so it is inferior to the circuit designed in this paper in both security and practicality.
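The area normalization above is simple arithmetic and can be checked with a short sketch. The function and variable names are illustrative assumptions. The clock bound at the end is only the reciprocal of the reported critical path delay, not a figure reported in the paper.

```python
def gate_equivalents(area_um2, nand_area_um2):
    """Convert a synthesized area to an equivalent number of NAND gates."""
    return area_um2 / nand_area_um2

AREA_UM2 = 82_734.0   # reported circuit area, SMIC 0.18 um process
NAND_UM2 = 9.9792     # reported NAND gate area used as the unit
CRITICAL_PATH_NS = 8.93

ge = gate_equivalents(AREA_UM2, NAND_UM2)
# Upper bound on the clock frequency implied by the critical path delay.
max_clock_mhz = 1e3 / CRITICAL_PATH_NS

print(f"{ge:.1f} gate equivalents (truncated: {int(ge)})")
print(f"clock upper bound: {max_clock_mhz:.1f} MHz")
```

The quotient is about 8290.6, so the reported figure of 8290 corresponds to truncating rather than rounding the result.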

Conclusions
This paper focuses on the compact implementation of SM4 encryption and decryption circuits that are resistant to bypass (side-channel) attacks. Because SM4 circuits without countermeasures cannot resist differential power analysis, an SM4 encryption and decryption circuit based on masking and randomization is proposed. A masked S-box is designed using a composite field masking technique, so that the composite field inversion inside the S-box is truly masked. A random delay method controls the delay of each bit of the input signal to resist glitch attacks. The random linear transformation module is implemented by randomly inserting pseudo operations, which makes DPA on this module more difficult. The security of the SM4 S-box against glitch attacks is then analyzed, and two bypass attack verifications of the designed circuit are performed using FPGA-based and ASIC-based power analysis platforms. Neither attack succeeds with 100,000 power traces. Finally, Synopsys DC is used to synthesize the designed circuit in the SMIC 0.18 µm process. The area is 82,734 µm², which is 48% smaller than results reported in other papers. The compact SM4 encryption and decryption circuit based on the inverse operation comparison mechanism implemented in this paper has lower circuit resource overhead and higher security, and is a better implementation solution.