Symmetric Cryptography on RISC-V: Performance Evaluation of Standardized Algorithms

Abstract: The ever-increasing need for securing computing systems using cryptographic algorithms is spurring interest in the efficient implementation of common algorithms. While the algorithms can be implemented in software using base instruction sets, there is considerable potential to reduce memory cost and improve speed using specialized instructions and associated hardware. However, there is a need to assess the benefits and costs of software implementations and new instructions that implement key cryptographic algorithms in fewer cycles. The primary aim of this paper is to improve the understanding of the performance and cost of implementing cryptographic algorithms for the RISC-V instruction set architecture (ISA) in two cases: software implementations of the algorithms using the rv32i instruction set and using cryptographic instructions supported by dedicated hardware in additional functional units. For both cases, we describe a RISC-V processor with cryptography hardware extensions and hand-optimized RISC-V assembly language implementations of eleven cryptographic algorithms. Compared to implementations with only the rv32i instruction set, implementations with the cryptography set extension provide 1.5× to 8.6× faster execution and require 1.2× to 5.8× less program memory for five of the eleven algorithms. Based on our performance analyses, a new instruction is proposed to increase the implementation efficiency of the algorithms.


Introduction
The ever-increasing need for securing computing systems using cryptographic algorithms is spurring interest in the efficient implementation of common algorithms. While the algorithms can be implemented in software using the base instruction set of processors, there is considerable potential to reduce memory cost and improve speed using specialized instructions and associated hardware. However, there is a need to assess the relative benefits and costs of software implementations and new instructions that implement key cryptographic algorithms in fewer cycles. With the growing popularity of the extensible RISC-V ISA, there is a need to better understand the cost of implementing cryptographic algorithms in two cases: in software using only the rv32i instruction set, and with cryptographic instructions implemented in hardware as additional functional units.
RISC-V is an open-source RISC ISA, which was developed starting in 2010 at UC-Berkeley [1]. In 2011, the team published Volume 1 of the RISC-V ISA manual [2]. In 2015, the RISC-V International Foundation [3] was founded to build a RISC-V ISA community. RISC-V has 32-bit and 64-bit versions, and currently, it has eight ratified instruction set extensions (-M, -A, -F, -D, -Q, -C, -Zicsr, -Zifencei) and the base instruction set (-I) [4]. However, there are several upcoming extensions.
One of these upcoming extensions is the RISC-V cryptography extension (-crypto), which has two sets of instructions: Scalar & Entropy Source [5] and Vector [6]. The scalar extension has six groups of instructions. The first group includes instructions that are borrowed from the upcoming bitmanip extension [7], and five other groups are designed to accelerate particular cryptography algorithms: AES [8], SHA-256 [9], SHA-512 [9], SM3 [10], and SM4 [11]. The scalar extension provides both 32-bit and 64-bit versions of the instructions. Similarly, the vector cryptography extension provides vector versions of some of the bitmanip instructions and vector instructions to accelerate the AES and the SHA2 [9] algorithms. The RISC-V cryptography extension task group published version v1.0.0-rc6 of the RISC-V Cryptography Extensions Volume 1: Scalar & Entropy Source Instructions [5]. There is a slight difference between the instructions explained in version v1.0.0-rc6 and the instructions that are evaluated in this work. In this work, we implemented grev[i], shfl[i], and unshfl[i], which were removed from the cryptography extension in version 0.7.2 [12]. These instructions were replaced with the brev8, rev8, zip, and unzip instructions, which are simplified versions of grev[i], shfl[i], and unshfl[i].
This work presents software-only implementations of eleven cryptographic algorithms using the RISC-V rv32i ISA and compares their performance to that obtained on a RISC-V processor with additional hardware modules that implement specialized instructions for single-cycle execution of cryptographic primitives.
In our study, we have not implemented any of the cryptographic algorithms fully in hardware. Instead, we wanted to observe their software performance without and with the aid of hardware acceleration via application-specific instruction set extensions. The only hardware modules we implemented are those added into the processor datapath in order to realize those additional instructions.
On the other hand, our software implementations reflect a balanced approach between execution speed and code size, with a greater emphasis on execution speed. To this end, we used loop unrolling where it was most effective and did not increase the program memory significantly. Furthermore, all SBOX tables were pre-computed and stored in memory rather than being calculated on the fly unless there is a specialized instruction that calculates the SBOX value. We obtained the clock cycle count, program memory, and static memory requirements of the software implementations and performed the comparison.
We grouped the cryptography instructions based on their structure and designed each group as a hardware module. The modular approach allows integration of any subset of modules with the RISC-V processor that we also implemented. Since each algorithm uses only a subset of the 32-bit scalar cryptography instructions, to ensure a valid performance comparison, we determined the module usage of each of the algorithms and evaluated the implementation cost accordingly.

Previous and Related Work
RISC-V ISA cryptography extensions have been the subject of several recent studies. In [13], two new cryptography instructions are proposed for the RISC-V ISA. The instructions accelerate the AES and SM4 block ciphers. Using the AES instructions, one round of AES can be implemented using only 16 instructions instead of 80. The instructions lead to a similar reduction for the SM4 algorithm. An efficient way to implement these instructions in hardware is proposed in [14]. An ISE [15] for RISC-V is proposed to accelerate a stream-cipher called ChaCha [16]. Compared to OpenSSL baseline and ISA-based optimized implementations, ISE-assisted ChaCha speeds up at least 5.4× and 3.4×, respectively.
The core building block of ASCON is implemented as an instruction extension for RISC-V ISA [17]. Comparing the results with efficient C implementations, the accelerator sped up the implementations by about a factor of 50 for ASCON and 80 for ASCON-HASH. Furthermore, the extensions lead to significant binary size reduction.
A set of hardware accelerators and 29 new instructions for lattice-based cryptography are proposed in [18], with implementations of a RISC-V-based processor with the proposed accelerators and instructions in ASIC and FPGA. Compared to pure software implementations, accelerators and the instructions lead to a speed-up factor of 11.4 for NewHope, 9.6 for Kyber, and 2.7 for Saber. On the other hand, the cell count of the CPU is increased by a factor of 1.6 compared to the original RISC-V design due to additional instructions and accelerators.
An energy-efficient crypto-coprocessor is designed for the AES, ECC, and SHA-256 algorithms and integrated with an open-source RISC-V core in [19], using a conditionally charged flip-flop to implement the crypto-coprocessor. The pipelined design achieved a 10.3% power reduction on average for cryptography tasks. Reference [20] developed optimized RISC-V assembly implementations of the table-based AES, bitsliced AES, ChaCha, and Keccak-f[1600] algorithms.
The RISC-V cryptography extensions task group published RISC-V Cryptography Extensions Volume 1: Scalar & Entropy Source Instructions version v1.0.0-rc6 [5]. The extension provides details about the cryptography instructions; however, it does not analyze software implementations using these instructions. Although the extension is expected to increase software implementation efficiency, the achievable gain is not quantified.
A versatile RISC-V Galois Field ISA extension is proposed in [21]. The researchers achieved 5× acceleration for AES, Reed-Solomon codes, and Classic McEliece while increasing the logic utilization by 1.27%.
The performance of the AES, RC6, Twofish, SPECK128, and ChaCha20-Poly1305 algorithms is presented in [22]. The researchers included execution time, throughput, and power consumption as performance parameters. Based on the obtained performance parameters, the researchers concluded that ChaCha20-Poly1305 is a very good option for resource-constrained devices, along with SPECK128 and LEA.

Objectives
In this paper, we implement 11 cryptographic algorithms in both RISC-V assembly code using the 32-bit base RISC-V instructions (rv32i) and using the 32-bit scalar cryptography instruction set in addition to base instructions (rv32i+crypto). We analyze the performance of the two approaches using clock cycle count, program memory, and static memory requirements and extract area requirements of the hardware implementations. We then use these results to analyze the benefit vs. cost in terms of the acceleration of execution times as a function of the additional hardware cost relative to that for the rv32i datapath implementation. Finally, based on our results, we propose a new instruction to increase the software implementation efficiency of the cryptographic algorithms and analyze its benefits and costs.
We implemented a simple assembly language simulator for RISC-V, which allowed us to easily modify the simulator to display the required outputs (clock cycle count, program memory, static memory). We also built a simple assembler in the Python programming language to convert RISC-V assembly language code to machine code. We developed de novo a 32-bit RISC-V processor and cryptography instruction modules using Verilog HDL. Designs were simulated on Xilinx's Vivado Design Suite. The Yosys [31] open-source framework for Verilog RTL synthesis was used to extract the hardware area requirements for the different implementations.

Cryptographic Algorithms
This section summarizes the eleven implemented cryptographic algorithms: eight symmetric block ciphers and three hash functions. All of the algorithms are standardized algorithms. Tables 1 and 2 list the block ciphers and hash functions, respectively. To limit the scope while retaining the ability to develop insights, we implemented only one key size of the listed block ciphers.

Software Implementations
This section summarizes the software implementation process. In order to have a flexible development environment, we implemented a RISC-V assembly language simulator using MATLAB and implemented the algorithms using this simulator. First, the algorithms were implemented by only using the 32-bit RISC-V integer instructions, and then, the cryptography instructions were used to accelerate them. The following sections will give more details about how the cryptography instructions are used to accelerate the initial implementations.

Software Implementation of AES
In this study, we implemented the AES-128 encryption algorithm using the T-table approach [34]. We used the T-table approach because the AES instructions are designed according to this approach [13]. The T-table approach accelerates the software implementation by pre-calculating the "SubBytes" and "MixColumns" layers. The approach is explained in detail below. Figure 1 shows the state matrices of the AES-128 encryption algorithm after the "SubBytes" (S(i)), "ShiftRows", and "MixColumns" layers. The "MixColumns" layer multiplies each column of matrix B by a fixed matrix to calculate matrix C. Figure 2 shows the formulas used in the "MixColumns" layer to calculate matrix C. If we re-arrange the formulas, we can use the 32-bit $a_k$ ($k = 1, 2, \ldots, 15$) values to calculate matrix C as shown in Figure 2.
The T-table implementation pre-calculates and stores the 32-bit $a_k$ values for every possible 8-bit $b$ value. Therefore, it is possible to calculate matrix C directly using matrix A. Figure 3 shows the four T-tables used for the implementation. It is also possible to store only one T-table and calculate the other three T-tables using logical rotate operations. We can use the aes32esmi instruction to calculate the 32-bit $a_k$ values and XOR them with rs1 to implement the "MixColumns" layer [13]. The aes32esi instruction is used to perform the AES SBOX transformation for key scheduling and the last round of the algorithm.
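To illustrate the remark that one stored T-table suffices, the following Python sketch derives a T-table from the AES SBOX and obtains the other three by rotation. The function names and the byte order inside each table word (the common big-endian convention with the doubled byte on top) are assumptions made for this illustration and need not match Figure 3.

```python
MASK32 = 0xFFFFFFFF

def xtime(a):
    """Multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a

def gf_mul(a, b):
    """Shift-and-add multiplication in the AES field."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r

def aes_sbox(x):
    """AES SBOX: field inverse (x^254, with 0 -> 0) followed by the affine map."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    s = inv
    for k in range(1, 5):
        s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
    return s ^ 0x63

def rotr32(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK32

SBOX = [aes_sbox(x) for x in range(256)]

# T0 packs (2*S, S, S, 3*S); this byte order is an assumption of the sketch.
T0 = [(gf_mul(s, 2) << 24) | (s << 16) | (s << 8) | gf_mul(s, 3) for s in SBOX]

# The remaining tables are byte rotations of T0, so they need not be stored.
T1 = [rotr32(t, 8) for t in T0]
T2 = [rotr32(t, 16) for t in T0]
T3 = [rotr32(t, 24) for t in T0]
```

Storing only T0 trades three 1 KB tables for one extra rotate per lookup, which is the memory/speed choice mentioned above.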

Software Implementation of Seed
Function F and function G use addition and subtraction modulo $2^{32}$. These functions are explained in the SEED documentation [29]. These operations can be implemented using the ADD[I] and SUB instructions, respectively. The RISC-V ISA does not have a carry flag, so the carry bit should be calculated when needed.
SEED provides two versions of the algorithm. Both versions generate the same output when the same input is given. The only difference is that Version 1 is slower than Version 2. However, Version 1 uses less data memory space for SBOX table storage than Version 2. Version 1 and Version 2 use two 8 × 8 SBOX tables and four 8×32 SBOX tables, respectively. Version 2 of the algorithm accelerates the G function implementation by sacrificing data memory space (explained in [29]).
One of the most time-consuming operations for the block ciphers is to read the SBOX table from memory. For simplicity, the SBOX tables are stored into the memory such that the SBOX value of the input x is stored into the memory location (x + offset), where the offset is zero in our implementation. Assembly code for memory address calculation of the Version 1 SBOX table is shown in Figure 4. The state values are stored in 32-bit registers, but the SBOX transformation is applied for each individual byte for the state value. Therefore, we shift the input byte to the least significant position and mask it. Then, we add the offset address to the masked value. The RD3 register has the 32-bit memory address of the SBOX table value.
To accelerate this operation, we can use one of the crossbar permutation instructions (xperm4 or xperm8). The RS1 register holds the required value to shift the desired byte to the least significant byte position.
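The base address calculation for an 8 × 8 SBOX table (shift, mask, add offset) can be modeled as follows. This is a Python sketch of the data flow only; the function name and the mapping of each step to a particular rv32i instruction are assumptions, since the assembly of Figure 4 is not reproduced here.

```python
def sbox8_address(state, byte_index, offset=0):
    """Address of the SBOX entry for one byte of a 32-bit state word.

    byte_index 0 selects the least significant byte.
    """
    shifted = state >> (8 * byte_index)  # move the byte to the low position (e.g. SRLI)
    masked = shifted & 0xFF              # keep only that byte (e.g. ANDI)
    return masked + offset               # add the table base address (e.g. ADD/ADDI)
```

For example, with the state word 0x12345678 and a table stored at offset zero, byte 2 yields address 0x34.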

Software Implementation of CAMELLIA
In this work, we implemented CAMELLIA-128. The CAMELLIA key scheduling uses 128-bit logical left rotation (≪). A 128-bit value is stored in four 32-bit registers. Therefore, to perform the 128-bit rotate operation, we need to modify all four registers (RS1, RS2, RS3, RS4), as shown by the assembly code in Figure 5. The imm values are calculated based on the rotation amount, and the new results are stored in the registers T1, T2, T3, and T4. For each 128-bit rotate operation, we need 12 instructions. The algorithm has two versions according to the SBOX tables. Version 1 stores four 8 × 8 SBOX tables in the memory. Version 2 calculates three of these tables using Equations (1)-(3). Therefore, Version 1 of the algorithm computes the result faster by sacrificing data memory space.
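The 128-bit rotation over four 32-bit registers can be sketched in Python as below. The word order (most significant word first) and the restriction to rotation amounts below 32 are assumptions of this sketch; larger amounts can be handled by first renaming the registers. Each output word needs one left shift, one right shift, and one OR, which matches the count of 12 instructions for four words.

```python
MASK32 = 0xFFFFFFFF

def rotl128_words(words, n):
    """Rotate a 128-bit value left by n bits (0 < n < 32).

    words = [w0, w1, w2, w3], with w0 the most significant 32-bit word.
    """
    assert 0 < n < 32
    out = []
    for i in range(4):
        nxt = words[(i + 1) % 4]              # word supplying the bits shifted in
        hi_part = (words[i] << n) & MASK32    # shift left by imm
        lo_part = nxt >> (32 - n)             # shift right by 32 - imm
        out.append(hi_part | lo_part)         # combine
    return out
```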
Version 2 of the algorithm uses 8-bit rotation to calculate the SBOX tables using the equations above. The 8-bit left rotation is performed as shown by the assembly code in Figure 6. Suppose RS1 stores a zero-extended 8-bit value. The imm1 value is the rotate amount, and imm2 is equal to 8 minus imm1. The FL and invFL functions contain 32-bit left rotate operations, which can be accelerated using the rotate instructions. The FL and invFL functions are described in the CAMELLIA documentation [28].
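The 8-bit rotation and its use for deriving SBOX values can be sketched in Python as follows. The derivation rules in `camellia_sbox_variants` follow the Camellia specification as given in RFC 3713; their correspondence to Equations (1)-(3) is an assumption, and `sbox1` is a caller-supplied 256-entry table (a hypothetical parameter, not filled in here).

```python
def rotl8(x, imm1):
    """Rotate a zero-extended 8-bit value left by imm1 (0 < imm1 < 8)."""
    imm2 = 8 - imm1                           # complementary shift amount
    return ((x << imm1) | (x >> imm2)) & 0xFF  # shift, shift, OR, mask

def camellia_sbox_variants(sbox1, x):
    """Return (SBOX2[x], SBOX3[x], SBOX4[x]) computed from SBOX1 only."""
    s2 = rotl8(sbox1[x], 1)       # SBOX2[x] = SBOX1[x] <<< 1
    s3 = rotl8(sbox1[x], 7)       # SBOX3[x] = SBOX1[x] <<< 7
    s4 = sbox1[rotl8(x, 1)]       # SBOX4[x] = SBOX1[x <<< 1]
    return s2, s3, s4
```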

Software Implementation of CAST
Compared to the other block ciphers in this work, CAST-128 has the largest SBOX tables, which occupy 8 KB of memory. The algorithm has eight 8 × 32 SBOX tables. The rv32i ISA provides a byte-addressed 32-bit address space. Therefore, the two least significant bits of the address must be zero when we read a 32-bit value with the rv32i LW instruction. Figure 7 shows the assembly code of the address calculation for the 8 × 32 SBOX table.
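For an 8 × 32 table, each entry is four bytes wide, so the byte index must be scaled by four to obtain a word-aligned address. The Python sketch below shows a straightforward version and a shorter variant that folds the scaling into the shift and mask; both function names and the folding trick are illustrative assumptions rather than the code of Figure 7.

```python
def sbox32_address_naive(state, byte_index, base=0):
    """Extract a byte, scale it by 4, and add the table base address."""
    index = (state >> (8 * byte_index)) & 0xFF
    return base + (index << 2)

def sbox32_address_folded(state, byte_index, base=0):
    """Same result using one shift, one mask, and one add."""
    if byte_index == 0:
        scaled = (state << 2) & 0x3FC                      # low byte: shift left by 2
    else:
        scaled = (state >> (8 * byte_index - 2)) & 0x3FC   # shift right by 2 fewer bits
    return base + scaled
```

With a word-aligned base, both variants return an address whose two low bits are zero, as required for a 32-bit load.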

Software Implementation of SHA-256 and SHA-512
The RISC-V cryptography instruction set extension has specific instructions that are designed to accelerate SHA-256 and SHA-512. The instructions implement the SIGMA and SUM transformation functions (explained in [9]) of these algorithms. The following section will describe the benefits of using the cryptography extension instructions.
Each transformation function of SHA-256 is implemented as a single instruction in the cryptography extension. Figure 8 shows the three different software implementations of the SHA-256 σ0 (SIGMA0) function. All three implementations receive the register x as the input and store the 32-bit result into the register RD. The SHA-512 transformation functions receive a 64-bit input and generate a 64-bit result. Therefore, each transformation function takes two SHA-512 instructions to calculate in a 32-bit RISC-V processor, and it is not possible to use 32-bit rotate instructions to perform a 64-bit rotate operation. Figure 9 shows the software implementations of the SHA-512 σ1 (SIGMA1) function.
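As a reference for what the dedicated instructions compute, the Python sketch below evaluates the SHA-256 σ0 function and the SHA-512 σ1 function as defined in FIPS 180-4, the latter on two 32-bit halves as a 32-bit processor must. The function names are illustrative and the code does not correspond line by line to Figures 8 and 9.

```python
M32 = 0xFFFFFFFF
M64 = 0xFFFFFFFFFFFFFFFF

def rotr32(x, n):
    return ((x >> n) | (x << (32 - n))) & M32

def sha256_sigma0(x):
    """sigma0(x) = ROTR7(x) ^ ROTR18(x) ^ SHR3(x) on a 32-bit word."""
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3)

def sha512_sigma1_64(x):
    """sigma1(x) = ROTR19(x) ^ ROTR61(x) ^ SHR6(x) on a 64-bit word."""
    rotr = lambda v, n: ((v >> n) | (v << (64 - n))) & M64
    return rotr(x, 19) ^ rotr(x, 61) ^ (x >> 6)

def sha512_sigma1_split(hi, lo):
    """The same function using only 32-bit shifts on the (hi, lo) halves."""
    # ROTR19: each half receives 13 bits from the other half.
    r19_lo = (lo >> 19) | ((hi << 13) & M32)
    r19_hi = (hi >> 19) | ((lo << 13) & M32)
    # ROTR61 equals a left rotation by 3.
    r61_lo = ((lo << 3) & M32) | (hi >> 29)
    r61_hi = ((hi << 3) & M32) | (lo >> 29)
    # SHR6: the high half is not refilled.
    s6_lo = (lo >> 6) | ((hi << 26) & M32)
    s6_hi = hi >> 6
    return r19_hi ^ r61_hi ^ s6_hi, r19_lo ^ r61_lo ^ s6_lo
```

The split version makes visible why a 64-bit rotation cannot be done with a single 32-bit rotate: every output half mixes bits from both input halves.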

Software Implementation of TDEA
The algorithm has several bit permutations. The assembly code, shown in Figure 10, shows a basic way to perform single-bit permutation using rv32i. First, we shift a bit of the RS1 to the least significant bit position and mask it. Then, we shift the bit to its new position. It is possible to perform any bit permutation just by repeating these four assembly instructions.
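The four-step pattern described above (shift right, mask, shift left, combine) can be written as the following Python sketch. The `mapping` argument, a list of (source bit, destination bit) pairs, is a hypothetical representation chosen for this illustration.

```python
def permute_bits(value, mapping):
    """Apply a bit permutation given as (source_bit, destination_bit) pairs."""
    result = 0
    for src, dst in mapping:
        bit = (value >> src) & 1   # shift the source bit to position 0 and mask it
        result |= bit << dst       # shift it to its new position and merge
    return result
```

Because every moved bit costs four operations, a full 32-bit permutation needs on the order of 128 base instructions, which is consistent with permutations dominating the TDEA cycle count.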

Software Implementation of MISTY1
MISTY1 [24,32] is a 64-bit block cipher with a 128-bit key. The algorithm has a variable number of rounds, but the eight-round version of the algorithm is recommended. The algorithm uses two SBOX tables named S7 and S9, which are 7 × 7 and 9 × 9 SBOX tables, respectively. The MISTY1 key scheduling algorithm (Algorithm 1) uses the primary key (K) to generate 16-bit round keys (EK). Each element of K (K[i]) holds a byte of the key. Figure 11 shows the assembly implementation of the MISTY1 FI and FL functions. The functions are described in [24].

Software Implementation of HIGHT
HIGHT [26,32] is a 64-bit block cipher with a 128-bit key size. The algorithm uses 32 rounds with no SBOX table. Algorithm 2 describes the HIGHT key scheduling algorithm. The 128-bit subkeys are generated at each round.

Software Implementation of PRESENT
PRESENT is a 64-bit block cipher that uses a substitution-permutation network of 31 rounds. "sBoxLayer" and "pLayer" are the two key parts of this algorithm. The assembly code in Figure 13 shows the "sBoxLayer" implementation with crossbar permutation instructions. The code performs the PRESENT SBOX transformation on each nibble of the registers S0 and S1.
We used the lower part of the SBOX (B1 register) to transform each nibble using the xperm4 instruction. If a nibble is greater than seven, the corresponding nibble will be zero in register T0. Then, we used a masking value (B2 = 0x88888888) to flip the most significant bit of each nibble, transformed the nibbles using the upper part of the PRESENT SBOX table, and saved the result to T1. Then, we XORed T0 and T1 to obtain the transformed nibbles of register S0. The same operation is applied to S1, and the result is saved to Y1. The "pLayer" of the algorithm performs the bit permutation defined by the "pLayer" table in [27]. Figure 14 shows the assembly code that implements the PRESENT "pLayer" using shuffle instructions.
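The two-lookup sBoxLayer technique can be modeled in Python as shown below. The `xperm4` function is a behavioral model for a 32-bit register (indices of 8 or more yield a zero nibble); its operand order, the name `B_UPPER`, and the nibble packing are assumptions of this sketch and do not reproduce the code of Figure 13.

```python
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def pack_nibbles(values):
    """Pack eight 4-bit values into a 32-bit word, value i in nibble i."""
    return sum(v << (4 * i) for i, v in enumerate(values))

def xperm4(table, indices):
    """Model of a 32-bit nibble crossbar: out[i] = table[indices[i]], or 0 if index > 7."""
    out = 0
    for i in range(8):
        idx = (indices >> (4 * i)) & 0xF
        if idx < 8:
            out |= ((table >> (4 * idx)) & 0xF) << (4 * i)
    return out

B1 = pack_nibbles(PRESENT_SBOX[:8])       # lower half of the SBOX
B_UPPER = pack_nibbles(PRESENT_SBOX[8:])  # upper half of the SBOX
B2 = 0x88888888                           # flips the top bit of every nibble

def sbox_layer_word(s):
    """Apply the PRESENT SBOX to all eight nibbles of a 32-bit word."""
    t0 = xperm4(B1, s)            # nibbles 0..7 are translated, 8..15 give zero
    t1 = xperm4(B_UPPER, s ^ B2)  # nibbles 8..15 become 0..7 and are translated
    return t0 ^ t1                # exactly one of the two lookups is nonzero-capable
```

Since each nibble is handled by exactly one of the two lookups, the XOR merges the results without interference.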

Hardware Implementations
This section describes the hardware module designs that are implemented in this project. The hardware design contains eight modules, among them the Bit Re-positioning Instructions, Logic With Negate Instructions, and AES and SM4 Instructions modules. The first module is a 32-bit 5-stage pipelined RISC-V processor that only supports base integer instructions. The remaining seven modules implement cryptography extension instructions. The instruction extension modules share logic to optimize the hardware area. Any combination of the seven extension modules can be used to extend the processor. The following sections will explain some of the hardware design highlights we used to reduce the area of the modules. All the modules are designed as fully combinational to allow one clock cycle execution.
Figure 15 shows the hardware block diagram of the bit re-positioning instructions.
We used matrix multiplication for the hardware design of these instructions. To clarify the bit re-positioning with matrix multiplication, see Figure 16, which demonstrates a 4-bit matrix multiplication. It is possible to choose a matrix B such that when we multiply it with matrix A, matrix C will have the bits of A in the desired position. For example, if we set b03, b12, b21, and b30 to 1 and the rest of the B matrix elements to 0, we reverse the A matrix elements' order and save the result to the C matrix.
The architecture contains five multiplexers and five 32-bit matrix multipliers. Each multiplexer chooses one of the six 32×32 B matrices based on the instruction and rs2. Then multipliers perform the matrix multiplication to re-position the bits of rs1. Finally, the output of the fifth matrix multiplier gives the result.
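The 4-bit reversal example can be checked with a small Python sketch of vector-matrix multiplication over GF(2), where AND replaces multiplication and XOR replaces addition. Treating A as a row vector so that $c_j = \bigoplus_i a_i b_{ij}$ is an assumption about the convention of Figure 16.

```python
def gf2_vecmat(a_bits, b_matrix):
    """Compute c = a * B over GF(2): c[j] is the XOR over i of a[i] AND B[i][j]."""
    n = len(a_bits)
    c_bits = []
    for j in range(n):
        acc = 0
        for i in range(n):
            acc ^= a_bits[i] & b_matrix[i][j]
        c_bits.append(acc)
    return c_bits

# Setting b03, b12, b21, and b30 to 1 gives the anti-diagonal matrix,
# which reverses the order of the bits of A.
REVERSE_4 = [[0, 0, 0, 1],
             [0, 0, 1, 0],
             [0, 1, 0, 0],
             [1, 0, 0, 0]]
```

Any permutation matrix (a single 1 per row and column) re-positions bits in the same way, which is why a multiplexer selecting among fixed B matrices can implement several instructions with one multiplier.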

Hardware Architecture of Carry-Less Multiply Instructions
Carry-less multiplication is polynomial multiplication over GF(2). It can be implemented with logical AND and XOR gates. Figure 17a demonstrates 4-bit carry-less multiplication.
The method of performing the multiplication in a fully combinational circuit is to calculate the result directly as shown in Figure 17a. In this case, we will have a logic circuit (circuit0, ..., circuit7) for each of the eight result bits, where circuit7 is hardwired to zero. Instead of using all eight circuits, we used the similarity between the lower (circuit0, ..., circuit3) and upper (circuit4, ..., circuit6) parts of the logic circuits to reduce the circuit area. It is possible to use only the lower part of the circuits to calculate both the lower (c0, ..., c3) and higher (c4, ..., c7) halves of the result. It is necessary to change the inputs of the circuits as shown in Figure 17b in order to calculate the higher-order half of the result. In this study, we implemented 32-bit carry-less multiplication. The hardware architecture takes advantage of the similarity between the upper and lower parts of the full carry-less multiplication circuit to reduce the area cost of the circuit, as explained above. Figure 18 shows the block diagram of the hardware architecture of the carry-less multiplication circuit. The multiplexers are used to select the correct input for CLMULH and CLMUL. The "Shift Left" block performs a logical 1-bit left shift. The "Multiplication Unit" contains the lower part of the full 32-bit carry-less multiplication circuit. The "Reverse Order" blocks reverse the order of the 32-bit values.
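The sharing of the lower-half circuit between CLMUL and CLMULH can be illustrated with the Python sketch below. It computes the high half by bit-reversing both operands, shifting one of them left by one bit, reusing the low-half multiplier, and reversing the result. How these steps map onto the blocks of Figure 18 is an assumption of this sketch; only the arithmetic identity itself is checked.

```python
M32 = 0xFFFFFFFF

def clmul_full(a, b):
    """Reference 64-bit carry-less product of two 32-bit values."""
    r = 0
    for i in range(32):
        if (b >> i) & 1:
            r ^= a << i
    return r

def clmul(a, b):
    """Low 32 bits of the carry-less product (the shared multiplication unit)."""
    return clmul_full(a, b) & M32

def rev32(x):
    """Reverse the bit order of a 32-bit value."""
    return int(f"{x:032b}"[::-1], 2)

def clmulh_shared(a, b):
    """High 32 bits of the product, computed with the low-half unit only."""
    ra = rev32(a)
    rb = (rev32(b) << 1) & M32   # logical 1-bit left shift of one reversed operand
    return rev32(clmul(ra, rb))
```

The identity holds because reversing both operands mirrors the product, so the upper result bits appear in the lower half of the mirrored product.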

Hardware Architecture of 32-bit Algorithm Specific Cryptography Instructions
This section describes the hardware architecture of the algorithm-specific cryptography instructions. The instructions are described in the RISC-V cryptography extension [5].

Hardware Architecture of Hash Instructions
The SHA-256 and SM3 instructions consist of 32-bit bitwise rotate and logical XOR operations. Each bit of the result is obtained by XORing a subset of the bits of rs1. This operation can be implemented by a matrix multiplication, as described in Figure 16. In the case of the SHA-512 instructions, each bit is calculated by XORing a subset of the bits of rs1 and rs2. In this case, two matrix multiplications and a 32-bit XOR operation are needed for the calculation. Figure 19 describes the hardware architecture of the hash instructions. The multiplexers choose two 32 × 32 matrices to be multiplied with rs1 and rs2, and the results are XORed to generate the output.
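Because rotations, shifts, and XOR are all linear over GF(2), such a function is fully described by a 32 × 32 binary matrix. The Python sketch below derives that matrix for the SHA-256 σ0 function by probing it with single-bit inputs and then evaluates it as a matrix-vector product; storing each matrix row as a 32-bit mask is an implementation choice of this sketch, not of the hardware.

```python
M32 = 0xFFFFFFFF

def rotr32(x, n):
    return ((x >> n) | (x << (32 - n))) & M32

def sha256_sigma0(x):
    """sigma0 from FIPS 180-4: ROTR7 ^ ROTR18 ^ SHR3."""
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3)

def linear_map_rows(f, width=32):
    """rows[j] is a mask of the input bits that are XORed into output bit j."""
    rows = [0] * width
    for i in range(width):
        column = f(1 << i)             # response to input bit i alone
        for j in range(width):
            if (column >> j) & 1:
                rows[j] |= 1 << i
    return rows

def apply_rows(rows, x):
    """Matrix-vector product over GF(2): output bit j is the parity of rows[j] & x."""
    out = 0
    for j, mask in enumerate(rows):
        out |= (bin(mask & x).count("1") & 1) << j
    return out

SIGMA0_ROWS = linear_map_rows(sha256_sigma0)
```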

Hardware Implementation of AES and SM4 Instructions
This section describes the hardware architecture of the AES and SM4 instructions. The instructions are described in the RISC-V cryptography extension [5]. Figure 20 demonstrates the overall architecture of the AES and SM4 instructions. Multiplexer 1 chooses one of the four bytes of rs1 as the SBOX module input. The selection value bs is a 2-bit value specified in the instruction encoding. The SBOX module, described in detail later on in this section, performs a one-to-one transformation. The module performs the AES encryption (AES), AES decryption (AES$^{-1}$), or SM4 algorithm SBOX transformation based on the selection bits. The following multiplier performs the forward or inverse MixColumns multiplication of the AES. The SM4 operations module, described in Figure 21, calculates the sm4_ed and sm4_ks values. After that, multiplexers three, four, and five choose one of the values and send the result to the rotate modules. After the value is rotated according to the bs value, the result is XORed with rs1 to generate rd. Figure 21 shows the hardware architecture of the "SM4 operations module".
The SBOX transformations of the algorithms can be performed using finite field arithmetic. The SBOX transformations of the AES, AES$^{-1}$, and SM4 algorithms are defined by (4), (5), and (6), respectively. The SBOX module chooses one of these three equations to calculate its output. The equations use affine transformations and require the inverse operation in GF($2^8$), denoted $I(x)$. The affine transformations for the AES, AES$^{-1}$, and SM4 algorithms are shown in Figures 22 and 23. In the affine transformation, the B matrix is multiplied with an 8 × 8 matrix, and the result is added to the C matrix. The AES and SM4 algorithms use the irreducible polynomials (7) and (8) for the inverse operation, respectively.
Calculating the inverse of a seventh-degree polynomial modulo an eighth-degree polynomial in GF($2^8$) is costly. Therefore, we used the composite field approach to simplify the operations.
Two fields {GF($2^n$), $Q(y) = y^n + \sum_i q_i y^i$} and {GF($(2^n)^m$), $P(x) = x^m + \sum_i p_i x^i$} are called a composite field if GF($2^n$) is constructed from GF(2) by $Q(y)$ and GF($(2^n)^m$) is constructed from GF($2^n$) by $P(x)$. A field GF($2^{nm}$) is isomorphic to the field GF($(2^n)^m$); therefore, mapping between the fields is possible using an isomorphic transformation [35]. In our case, we mapped the field GF($2^8$) to the field GF($(2^4)^2$), where $P(x) = x^2 + x + 9$ and $Q(y) = y^4 + y + 1$. After the mapping, it is possible to perform the operations in GF($2^4$) and re-map the composite field to the field GF($2^8$).
The isomorphic transformation is a single matrix multiplication. In our case, the 8-bit element in GF($2^8$) is multiplied by an 8 × 8 matrix, and the result is an 8-bit element in GF($(2^4)^2$).
Although the same mapping, GF($2^8$) to GF($(2^4)^2$) with $P(x) = x^2 + x + 9$ and $Q(y) = y^4 + y + 1$, is performed for both algorithms, we use different ISO matrices for AES and SM4 because the two algorithms use different irreducible polynomials. Since no precomputed ISO matrices were available for this operation, we used an algorithm to determine the binary ISO matrices. Figures 24 and 25 show the binary ISO matrices of the AES and the SM4 algorithms, respectively. To find the ISO matrices, we used the algorithm described in Section 2.2 of [35].
Figure 27 demonstrates the inverse in the GF($2^8$) module. As explained before, the inverse operation is performed in the composite field GF($(2^4)^2$). First, the elements in GF($2^8$) are mapped to the composite field, then the inverse operation is performed in GF($(2^4)^2$), and finally, the elements are mapped back to GF($2^8$). Equation (9) [36] shows the inversion in GF($(2^4)^2$), where $\hat{A}_0$ and $\hat{A}_1$ are the lower and upper 4 bits of the 8-bit input $\hat{A}$, respectively. $w_0$ is the constant nine due to $P(x) = x^2 + x + 9$.

Multiplication in GF($2^4$):
Multiplication is implemented as polynomial multiplication mod $Q(y)$. We implement the modulus operation by replacing every occurrence of $y^4$ with $y + 1$.

Square in GF($2^4$):
To square an element of GF($2^4$), we multiply its polynomial, which has degree at most three, by itself and apply the mod $Q(y)$ operation. When we take the square of the polynomial $A$ (Equation (10)), we find the polynomial $A^2$ (Equation (11)).
Addition in GF($2^4$): The addition operation is implemented as a logical XOR operation.
Inverse in GF($2^4$): The inverse of each element in GF($2^4$) is calculated by applying modulus $Q(y)$ to $y^i$ ($i = 0$ to $15$). By doing this, we find the 16 elements listed in Table 3. The inverse of each element $y^i$ is $y^j$, where $i + j = 15$, because $y^{15} = 1$. If we multiply a polynomial with its multiplicative inverse, the product is 1.
Table 3. Inverse of Elements in GF($2^4$).
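The GF($2^4$) operations above, and their use for inversion in GF($(2^4)^2$) with $P(x) = x^2 + x + 9$, can be sketched in Python as follows. The composite inversion uses a standard norm-based derivation, $\Delta = A_1^2 w_0 + A_0 (A_0 + A_1)$ with $A^{-1} = (A_1 \Delta^{-1}) x + (A_0 + A_1) \Delta^{-1}$; its form may differ from Equation (9), and all function names are assumptions of this illustration.

```python
W0 = 9  # constant term of P(x) = x^2 + x + 9

def gf16_mul(a, b):
    """Multiply in GF(2^4) modulo Q(y) = y^4 + y + 1."""
    p = 0
    for i in range(4):
        if (b >> i) & 1:
            p ^= a << i
    for k in (6, 5, 4):                 # replace y^k by y^(k-4) * (y + 1)
        if (p >> k) & 1:
            p ^= (1 << k) | (0b11 << (k - 4))
    return p

# Powers of y: POWERS[i] = y^i mod Q(y), for i = 0..14 (y^15 = 1).
POWERS = [1]
for _ in range(14):
    POWERS.append(gf16_mul(POWERS[-1], 0b0010))

# The inverse of y^i is y^(15 - i).
GF16_INV = {POWERS[i]: POWERS[(15 - i) % 15] for i in range(15)}

def gf256c_mul(a, b):
    """Multiply in GF((2^4)^2); the upper nibble is the coefficient of x."""
    a1, a0, b1, b0 = a >> 4, a & 0xF, b >> 4, b & 0xF
    a1b1 = gf16_mul(a1, b1)
    hi = a1b1 ^ gf16_mul(a1, b0) ^ gf16_mul(a0, b1)   # uses x^2 = x + W0
    lo = gf16_mul(a0, b0) ^ gf16_mul(a1b1, W0)
    return (hi << 4) | lo

def gf256c_inv(a):
    """Invert in GF((2^4)^2) using one GF(2^4) inversion (0 maps to 0)."""
    if a == 0:
        return 0
    a1, a0 = a >> 4, a & 0xF
    norm = gf16_mul(gf16_mul(a1, a1), W0) ^ gf16_mul(a0, a0 ^ a1)
    d = GF16_INV[norm]
    return (gf16_mul(a1, d) << 4) | gf16_mul(a0 ^ a1, d)
```

Only one 4-bit inversion, realizable as a 16-entry lookup, is needed per 8-bit inversion; the remaining work is GF($2^4$) multiplications, squarings, and XORs.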

Results
In this section, we analyze the software and the hardware implementation results and propose an instruction to improve performance. We evaluated the performance of the eleven cryptography algorithms described in Section 3. The algorithms are implemented first using the 32-bit base integer instruction set (rv32i) and then the extended base instruction set (rv32i+crypto). From now on, the "rv32i" and "rv32i+crypto" implementations will be called base and crypto software implementations, respectively. The base software implementations use only the base instruction set [4]. The crypto software implementations use the base instruction set and also a subset of the cryptography instructions. The crypto instructions used for each of the crypto software implementations are specified. The analysis includes clock cycle count (CC), program memory, and static memory requirements.
Using these results, we analyze the gain in performance as a function of the hardware cost for the rv32i+crypto hardware, measured as the acceleration vs. hardware cost for the processor with the cryptography instructions and the new proposed instruction relative to the rv32i hardware. Finally, based on the performance of the software and hardware implementations, we propose a new instruction.

Clock Cycle Count
The clock cycle count is the total number of clock cycles required to process one block of data with each algorithm on the 5-stage RISC-V processor. The processor is a singleissue, in-order processor and the execution stage requires one cycle for all instructions. Therefore, the clock cycle count is almost equal to the total number of executed instructions. Tables 4-6 show the clock cycle count count and the acceleration for the software implementations. The acceleration is computed as the ratio of CC for the base and crypto software implementations. Figures 29 and 30 graphically compare the CC for the block ciphers and hash algorithms, respectively. The results show that the TDEA algorithm requires the highest, and MISTY1 requires the lowest CC among block ciphers. PRESENT has the highest acceleration. MISTY1, CAST-128, HIGHT, and CAMELLIA V1 have almost no acceleration. Some of the algorithms do not have a crypto implementation because the cryptography instructions could not be used to accelerate the algorithms. Clock cycle count and acceleration for these algorithms are stated as NC (Not Calculated).  Table 6 shows that SHA3-256 has the highest CC among hash functions.   Table 7 shows the CC of the SBOX address calculation for the SEED implementations. Version-1 and Version-2 of the algorithm use 8×8 and 8×32 SBOX tables, respectively. Section 4.2 shows how we used the xperm4 instruction to accelerate the address calculation for 8×8 SBOX tables. We did not implement the crypto implementation for the 8×32 SBOX address calculation. The CC of the address calculation occupies 38% and 18% of the CC of the Version-1 and Version-2 base software implementation of the algorithm, respectively, with a corresponding significant reduction in executed instructions. The CAMELLIA block cipher contains several rotate operations and address calculation for the 8×8 SBOX table. Section 4.3 explains the software implementations of the rotate operations. 
Table 8 shows the CC for the rotate and address calculation operations for the CAMELLIA-128 encryption algorithm. The address calculation accounts for 24% and 20% of the CC of the Version-1 and Version-2 base software implementations of the algorithm, respectively.
The CAST-128 block cipher reads the SBOX table 384 times to encrypt one block of data. Address calculation for each of the SBOX read operations requires three instructions in the base software implementation. Section 4.4 explains the SBOX table address calculation for CAST-128. The algorithm uses an 8×32 SBOX table, so the address calculation is not accelerated using crypto instructions. Table 9 shows the instruction counts for 32-bit rotate and SBOX address calculations for CAST-128.
One of the key operations for the HIGHT block cipher is the 8-bit rotation. Section 4.3 explains the software implementation of 8-bit rotation. The rv32i+crypto implementation requires 9% fewer instructions (Table 10).
Table 11 displays the CC of key operations in TDEA. The CC of the SBOX table read operation includes address calculation and the load instructions that are used to read the data memory. The SBOX table read operation accounts for 22% of the CC of the TDEA base software implementation. It is costly to implement the bit permutation operations with a 32-bit architecture; the permutation operations account for 73% of the CC of the TDEA base software implementation.
Finally, as shown in Table 12, the rv32i+crypto instruction set greatly reduces the CC for the sBoxLayer and pLayer of the PRESENT block cipher.
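To illustrate why bit permutations are expensive with only base integer instructions, the following is a minimal sketch of the PRESENT pLayer as defined in the published PRESENT specification (bit i of the 64-bit state moves to position 16·i mod 63, with bit 63 fixed). It is written in Python for readability and is not the assembly implementation evaluated here; the function name is hypothetical. Each loop iteration stands for the shift, mask, shift, and OR that a bit-at-a-time base implementation would need for every state bit.

```python
def present_player(state: int) -> int:
    """Apply the PRESENT bit permutation to a 64-bit state, one bit at a time."""
    out = 0
    for i in range(64):
        # Bit i moves to position 16*i mod 63; bit 63 stays in place.
        dest = 63 if i == 63 else (16 * i) % 63
        # Extract bit i (shift + mask), then place it at dest (shift + OR).
        out |= ((state >> i) & 1) << dest
    return out


# Bit 1 of the input lands at bit 16 of the output.
print(hex(present_player(0x2)))  # 0x10000
```

With 64 state bits and several base instructions per bit, a straightforward implementation of this layer costs a few hundred instructions per round, which is the kind of overhead that dedicated permutation instructions are intended to remove.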

Program Memory
Tables 13 and 14 provide the program memory requirements of the software implementations of the block ciphers, where the ratio of the program memory requirement of the base software implementation to that of the rv32i+crypto software implementation is shown as Reduction. The program memory is calculated by multiplying the number of instructions in the assembly language program by four (since each instruction is four bytes), with verification by examining the output of the RISC-V GNU Compiler Toolchain. CAMELLIA and SEED-V1 have the highest and lowest program memory requirements, respectively. With the rv32i+crypto implementations, PRESENT and AES have low program memory requirements and hence a high reduction ratio compared to the other block ciphers. SHA3-256 requires 6.3 times more program memory than SHA-256, although they generate the same size message digest, as shown in Table 15.
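The calculation described above can be sketched as follows. The function names and the instruction counts in the example are hypothetical placeholders chosen only for illustration; the measured values are those in Tables 13-15.

```python
INSTRUCTION_BYTES = 4  # each rv32i instruction occupies four bytes


def program_memory_bytes(instruction_count: int) -> int:
    """Program memory as the static instruction count times four bytes."""
    return instruction_count * INSTRUCTION_BYTES


def reduction(base_instructions: int, crypto_instructions: int) -> float:
    """Ratio of base program memory to rv32i+crypto program memory."""
    return program_memory_bytes(base_instructions) / program_memory_bytes(crypto_instructions)


# Hypothetical instruction counts, not taken from the tables.
print(program_memory_bytes(500))  # 2000
print(reduction(500, 250))        # 2.0
```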

Static Memory
The static memory required by an implementation is another important factor, especially for embedded applications where memory size is limited. A program's static memory is allocated at compile time, unlike dynamic memory, which is allocated during execution. In our analyses, the static memory used is calculated by adding up the memory required to store constant values, including, but not limited to, SBOX tables and initialization vectors. The static memory requirement for an algorithm can change based on the implementation. In this study, AES-128 is implemented with T-tables as described in Section 4.1. This approach provides significant acceleration for the algorithm by sacrificing static memory. In general, we attempted to minimize the CC of the software implementation. Note that the rv32i+crypto implementations do not provide a reduction in static memory requirements: the same constant values are required regardless of the ISA used. Tables 16-18 provide the static memory requirements of the algorithms. Due to its 8 KB SBOX table, CAST-128 has the largest static memory requirement, while HIGHT and PRESENT have the lowest static memory requirements.
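As a worked check of the 8 KB figure, the sketch below sums the size of lookup tables from their dimensions. It assumes the standard CAST-128 layout of eight SBOX tables, each with 256 entries of 32 bits; the helper name is hypothetical, and other constants (such as initialization vectors) would be added to the total in the same way.

```python
def table_bytes(entries: int, entry_bits: int, tables: int = 1) -> int:
    """Bytes needed to store `tables` lookup tables of `entries` entries each."""
    return tables * entries * entry_bits // 8


# CAST-128: eight 8x32 SBOX tables (assumed standard layout).
print(table_bytes(256, 32, tables=8))  # 8192 bytes = 8 KB

# A single 8x8 SBOX table, for comparison.
print(table_bytes(256, 8))  # 256 bytes
```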

Analysis for Cryptography Instructions
In Section 5, we grouped the cryptography instructions and described the hardware architecture of each group. Each group of instructions is designed as a separate hardware module, and modules can be added to the RISC-V core in any combination. Table 19 lists the hardware modules with their gate equivalent (GE) area. Table 20 shows the cryptography instructions used for each crypto implementation. The AES, SHA-256, and SHA-512 crypto implementations only use the algorithm-specific instructions. The rest of the implementations use only a subset of the 32-bit cryptography bitmanip instructions. The rightmost column of the table shows the cryptography instruction modules that are used by the crypto implementations. The table shows that some of the cryptography instructions are not used by any crypto implementation. This does not mean that these instructions are not used in other cryptography applications; for example, Shay and Micheal [37] show the usage of a carry-less multiply instruction for computing AES in GCM mode.
The benefit vs. cost of the additional specialized hardware for rv32i+crypto can be visualized by plotting the acceleration (reduction in clock cycle count) against the relative hardware cost. The relative hardware cost is the area of the extended RISC-V core divided by the area of the RISC-V core. The extended RISC-V core includes the modules that are needed by the crypto implementation, as specified in Table 20. To calculate the hardware cost of an implementation, we summed the gate equivalent area of each module it uses. For example, to calculate the hardware cost of the AES crypto implementation, we added the gate equivalent area of the RISC-V core and the modules for AES and SM4 Instructions. The hardware cost of the base software implementations includes only the area of the RISC-V core module. Figure 31 shows the acceleration vs. the relative hardware cost of the crypto implementations.
Compared to implementations with only the base rv32i instruction set, implementations with the cryptography set extension provide 1.5× to 8.6× faster execution speed, at additional hardware cost of less than 9%.
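The relative hardware cost defined above can be sketched as a short calculation. The gate equivalent values in the example are hypothetical placeholders, not the values from Table 19, and the function name is an assumption made for illustration.

```python
def relative_hardware_cost(core_ge: float, module_ge: list[float]) -> float:
    """Area of the extended core (core plus used modules) divided by the core area."""
    return (core_ge + sum(module_ge)) / core_ge


# Hypothetical areas: a 40,000 GE core extended with one 2,000 GE module.
cost = relative_hardware_cost(40_000, [2_000])
print(f"{cost:.3f}")                  # 1.050
print(f"{(cost - 1) * 100:.1f}%")     # 5.0% additional hardware

# A base software implementation uses no extra modules.
print(relative_hardware_cost(40_000, []))  # 1.0
```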

Proposed New Instruction for SBOX Address Calculation
Section 7.1 shows that address calculation for data memory to read SBOX values is one of the most time-consuming operations. Therefore, we developed an instruction to accelerate the address calculation for 8-bit input SBOX tables. Figure 32 shows the assembly code for the address calculation of 8-bit input SBOX tables using rv32i instructions, where one of the 4 bytes of the 32-bit RS1 register is the SBOX input byte, and the SBOX entry for the input byte bx is stored at the memory location (bx + offset). The new instruction performs the address calculation operation. Therefore, the implementation with the new instruction uses one instruction instead of three. The implementation with the new instruction (rv32i+PI) uses the 32-bit base integer instructions and the new instruction only. Figure 33 shows the hardware architecture of the address calculation instruction. Similar to the assembly language implementation, the architecture selects one of the four bytes, zero-extends the byte, and shifts the result left before adding the offset. Table 21 shows the GE area of the architecture. Figure 34 compares the clock cycle count for the five software implementations that are implemented with the three different instruction sets. Figure 35 shows the cost-benefit analysis of five software implementations that use the new address calculation instruction. The instruction provides a 1.2× to 1.6× faster execution time for the implementations with an incremental hardware cost of <1.5%.
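A behavioral sketch of the address calculation may help make the data path concrete. It is a Python model under stated assumptions, not the instruction's actual encoding or the code of Figure 32: the function names, the byte_index and entry_shift parameters, and the particular three-step sequence are hypothetical. The fused version follows the description above (select one of four bytes, zero-extend, shift left, add the offset).

```python
MASK32 = 0xFFFFFFFF  # model 32-bit registers


def sbox_address_three_step(rs1: int, byte_index: int, offset: int) -> int:
    """Base-ISA style sequence for an 8x8 table: shift, mask, add (assumed)."""
    t = (rs1 & MASK32) >> (8 * byte_index)  # 1) move the selected byte to bits 7..0
    t &= 0xFF                               # 2) zero-extend the byte
    return (t + offset) & MASK32            # 3) add the table offset


def sbox_address_fused(rs1: int, byte_index: int, offset: int, entry_shift: int = 0) -> int:
    """Single-operation model: select byte, zero-extend, shift left, add offset."""
    byte = ((rs1 & MASK32) >> (8 * byte_index)) & 0xFF
    return ((byte << entry_shift) + offset) & MASK32


# Byte 1 of 0x12345678 is 0x56; the table is assumed to start at 0x1000.
print(hex(sbox_address_three_step(0x12345678, 1, 0x1000)))          # 0x1056
print(hex(sbox_address_fused(0x12345678, 1, 0x1000)))               # 0x1056
# With 4-byte entries (entry_shift=2), byte 3 (0x12) maps to offset 0x48.
print(hex(sbox_address_fused(0x12345678, 3, 0x2000, entry_shift=2)))  # 0x2048
```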
In summary, our results show that additional, application-specific hardware for cryptographic primitives can significantly improve execution time performance at a small additional area cost. We also grouped the cryptographic instructions into modules that enable precise matching of capability to specific sets of algorithms. Moreover, we have demonstrated that just one additional instruction can provide significant improvement based on a careful analysis of algorithmic requirements.
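A back-of-the-envelope way to see why accelerating a single kernel can matter is an Amdahl's-law style estimate: if a kernel accounts for a fraction f of the clock cycle count and is sped up by a factor s, the overall acceleration is 1 / ((1 − f) + f / s). The sketch below is only such an estimate under simplifying assumptions (the whole kernel shrinks from three instructions to one, and everything else is unchanged); it is not a measurement, and the measured accelerations are those reported in Figures 34 and 35. The function name is hypothetical.

```python
def overall_acceleration(fraction: float, kernel_speedup: float) -> float:
    """Amdahl-style estimate of total speedup when only one kernel is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Example: a kernel at 38% of the cycle count (the share reported in Section 7.1
# for the SEED Version-1 address calculation), assumed to go from 3 instructions to 1.
print(round(overall_acceleration(0.38, 3.0), 2))  # 1.34
```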

Table 21. Gate equivalent (GE) area of the address calculation instruction.
Hardware Module | Area (GE)
Address Calculation Instruction | 220

Figure 35. Acceleration vs. hardware cost of crypto implementations with new address calculation instruction.

Conclusion
In this work, we developed software-only implementations of eleven key cryptographic algorithms using the RISC-V rv32i ISA, and compared their performance to that of implementations running on a RISC-V processor with additional hardware modules that implement specialized instructions for single-cycle execution of cryptographic primitives. Our software implementations reflect a balanced approach between execution speed and code size, with a greater emphasis on execution speed. To this end, we used loop unrolling where it was most effective and did not increase the program memory significantly. Furthermore, all SBOX tables were pre-computed and stored in memory rather than calculated on the fly, unless there is a specialized instruction that calculates the SBOX value.
For the RISC-V processor augmented with cryptographic hardware, the cryptography instructions were grouped based on their structure, with each group designed as a hardware module. The modular approach allows integration of any subset of modules with the processor. Since each algorithm uses only a subset of the 32-bit scalar cryptography instructions, to assure a valid performance comparison, we determined the module usage of each of the algorithms and evaluated the implementation cost accordingly.
The key technical contributions of this work are:
• Compared to implementations using only the base rv32i instruction set, implementations with the cryptography set extension provide 1.5× to 8.6× faster execution speed and 1.2× to 5.8× less program memory for five of the eleven algorithms. For the remaining six algorithms, the increase in execution speed and reduction in program memory requirement is less than 6%.
• The hardware crypto implementations have an additional hardware complexity of 0.3% to 7.7% over the software implementations using the rv32i ISA.
• The benefit-cost analysis in Figure 31 graphically shows the acceleration of execution time as a function of the relative hardware cost, summarizing the gains in execution time as a function of the costs in terms of hardware complexity for each algorithm. As one illustration of the benefit vs. cost, we see that for the SHA algorithms, we achieve an acceleration of approximately 1.7× at a hardware cost increase of less than 7.5%.
• Based on our analysis of execution times, we proposed a new instruction to accelerate the memory address calculation operations for the 8-bit input SBOX table, which is dominant in the execution time for four of the eleven algorithms. This new instruction provided a 1.2× to 1.6× faster execution time for the four algorithms with only a 1.1% additional hardware cost, as shown in Figure 35.
Our work differs from previous studies on cryptographic hardware acceleration support for RISC-V mainly in its completeness. Instead of implementing support only for cryptographic algorithms as in [38], we have also implemented support for permutation instructions, thereby providing a more complete solution for the implementation of any cryptographic algorithm. This approach allows software support and hardware acceleration not only for any symmetric cryptographic algorithm, including future standards, but also for asymmetric algorithms. The additional cost for permutation instructions is only 4K GE, which is only 8% higher than the area cost of 3.7K GE required for cryptographic instruction support (see Table 19). In comparison, a synthesis of the solution proposed in [38] would require close to 5K GE for cryptographic instructions alone.
The eleven cryptographic algorithms evaluated here have similar structures. For example, none of the algorithms use multiply/divide instructions or floating-point instructions. Some of the cryptography instructions are not used for any of the implementations. However, this does not mean that these instructions are not helpful for other cryptographic algorithms and applications. Therefore, future research should investigate the usability of these instructions for cryptographic algorithms not considered here. Future efforts could also examine implementations of different ISAs for the same set of algorithms, revealing the pros and cons of the RISC-V ISA compared to other ISAs.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: