Exploiting Structural Symmetry of SM4 for an Asymmetric Hardware Architecture: Design and Open-Source Verification on the RISC-V LicheePi 4A Platform
Abstract
1. Introduction
1.1. Background
1.2. Problem Statement
- Q1. The encryption path and the key-expansion path of SM4 exhibit highly asymmetric invocation frequencies and area sensitivities. Existing open RTL designs adopt either a uniform iterative structure, which sacrifices throughput, or a uniform fully unrolled structure, which wastes area on the low-frequency key expansion. How can one achieve a tighter throughput–area trade-off while preserving algorithmic symmetry?
- Q2. The TH1520 SoC on LicheePi 4A does not expose the C910’s RoCC interface, rendering physical insertion of a hardware coprocessor infeasible at the silicon level. Under this constraint, how can one complete end-to-end verification of custom instruction semantics, software API, and hardware protocol on the board without sacrificing engineering rigour?
- Q3. Current public SM4 implementations are generally tied to proprietary EDA toolchains, which makes their throughput and area numbers difficult for an independent reader to reproduce. How can one build a one-command reproducible verification environment using only open-source tools, so that every reported result is bit-for-bit reproducible on commodity hardware?
1.3. Contributions
- C1 (Verified). Building on an open-source asymmetric dual-channel SM4 RTL implementation as our reference baseline datapath, we contribute an extended verification campaign and the methodological infrastructure needed to evaluate it end-to-end on a commodity RISC-V platform. On LicheePi 4A, a deterministic 1040-vector testbench (16 hand-picked edge cases plus 1024 pseudo-random vectors produced by an xorshift64 PRNG with the fixed seed 0xdeadbeefcafebabe) drives both the standalone datapath (Experiment 1A) and the full RoCC-wrapped system (Experiment 1B); both report 1040/1040 pass. T2 on the real C910 passes 10/10 GB/T 32907-2016 standard vectors with a measured software baseline throughput of 291.9 Mbps at 1.85 GHz.
- C2 (Implemented, not yet silicon-measured). We design the sm4_rocc RoCC wrapper, five custom RISC-V instructions (SM4.LDKEY/ENC.HI/ENC.LO/DEC.HI/DEC.LO), and their inline-assembly C API. The 1040-vector BFM verification (C1) confirms funct7 decoding, cmd/resp handshake, and HI/LO write-back ordering. Using the open-source Yosys + Sky130 flow we additionally report a measured post-synthesis area (133 kGE for sm4_top) and a switching-dominated OpenSTA power/energy-efficiency estimate (≈0.28 W, ≈22 pJ/bit, ≈46 Gbps/W at 100 MHz); the on-board acceleration ratio, by contrast, awaits soft-core FPGA prototyping because the TH1520 RoCC port is closed.
- C3 (Verified). We provide a three-tier reproducible verification flow (RTL co-simulation/software reference model/illegal-instruction trap-and-emulate) orchestrated by a single shell driver run_all_experiments.sh, depending solely on open-source software (Icarus Verilog 12.0, Python 3.11, optionally Yosys + Sky130). The entire flow has been verified to run end-to-end on the LicheePi 4A target platform itself, with no x86 host required in the measurement loop.
- C4 (Conceptual). We abstract the above into an algorithmic symmetry → workload asymmetry → hardware asymmetry design pattern. The pattern is independent of SM4: it applies to any block cipher whose forward and inverse round functions share structure and, more loosely, to other primitives in which one structural unit is reused by paths of very different invocation frequency (e.g., SM3 compression, SM2 modular arithmetic, and the AES key-schedule reuse pattern).
1.4. Paper Organisation
2. Preliminaries and Structural Symmetry
2.1. Notation
2.2. SM4 Algorithm Specification
2.3. Two Structural Symmetries
2.4. Workload Asymmetry
3. Related Work
3.1. SM4 Hardware Implementations
3.2. Symmetric-Cipher Acceleration on RISC-V
3.3. Open-Source Verification of Block Cipher Hardware
4. Asymmetric Dual-Channel Hardware Architecture
4.1. Design Space and Design Choice
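- (A) Single-round iterative reuse. Instantiate one round function; encryption requires N clock cycles per block. Steady-state throughput: $T_A = 128\,f_{\mathrm{clk}}/N$ bit/s. Area: $A_A \approx A_{\mathrm{rnd}}$, where $f_{\mathrm{clk}}$ is the clock frequency and $A_{\mathrm{rnd}}$ the area of one round function.
- (B) Fully unrolled pipeline. Instantiate N round functions in cascade. Steady-state throughput: $T_B = 128\,f_{\mathrm{clk}}$ bit/s (one block per cycle). Area: $A_B \approx N\,A_{\mathrm{rnd}}$, plus the inter-stage pipeline registers.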
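Because the two design points differ only in how many round-function instances are laid down, their trade-off can be tabulated with a few lines of arithmetic. The Python sketch below is a calculator, not part of the authors' flow, and its helper names are hypothetical. It assumes a 128-bit block, N = 32 rounds and one block accepted per cycle in the unrolled case. It also uses the number of round-function instances as a crude area proxy that ignores pipeline registers and control logic.

```python
BLOCK_BITS = 128  # SM4 block size in bits
N_ROUNDS = 32     # SM4 round count


def throughput_gbps(f_clk_hz: float, cycles_per_block: int) -> float:
    """Steady-state throughput when one block completes every `cycles_per_block` cycles."""
    return BLOCK_BITS * f_clk_hz / cycles_per_block / 1e9


def design_point(kind: str, f_clk_hz: float, n: int = N_ROUNDS) -> tuple:
    """Return (throughput in Gbps, round-function instances) for design (A) or (B)."""
    if kind == "iterative":  # (A): one round instance reused for n cycles per block
        return throughput_gbps(f_clk_hz, n), 1
    if kind == "unrolled":   # (B): n round instances in cascade, one block per cycle
        return throughput_gbps(f_clk_hz, 1), n
    raise ValueError(f"unknown design point: {kind!r}")


if __name__ == "__main__":
    for kind in ("iterative", "unrolled"):
        tput, rounds = design_point(kind, 100e6)
        print(f"{kind:>9}: {tput:5.2f} Gbps with {rounds} round instance(s)")
```

At 100 MHz this gives 0.40 Gbps with one round instance for (A) and 12.80 Gbps with 32 instances for (B). Design (B) therefore buys a 32× throughput gain for roughly 32× the round-function logic. That cost is only worth paying on a path that is invoked at streaming frequency.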
4.2. Top-Level Interface
Listing 1. Top-level module sm4_top interface (signal list, schematic).
4.3. Encryption Path: 32-Stage Fully Unrolled Pipeline
4.4. Key-Expansion Path: 32-Cycle Iterative FSM
- State machine.
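- S0 (IDLE). The module awaits a one-cycle key_load pulse. While key_load is de-asserted, the FSM remains in S0 (self-loop).
- S1 (INIT). On rising key_load, the state is initialised to $(K_0, K_1, K_2, K_3) = (MK_0 \oplus FK_0, MK_1 \oplus FK_1, MK_2 \oplus FK_2, MK_3 \oplus FK_3)$ (Equation (5)), and the round counter is reset to $i = 0$.
- S2 (RUN). For 32 successive clocks, the combinational key_expand_round produces $rk_i = K_{i+4} = K_i \oplus T'(K_{i+1} \oplus K_{i+2} \oplus K_{i+3} \oplus CK_i)$. This word is written into the round-key shift register and into the new state slot via $(K_i, K_{i+1}, K_{i+2}, K_{i+3}) \leftarrow (K_{i+1}, K_{i+2}, K_{i+3}, K_{i+4})$. The counter $i$ increments each cycle.
- S3 (DONE). When $i = 32$, key expansion is complete and the FSM falls back to S0 on the next rising clock edge. All 32 round keys are now statically present on the 1024-bit round_key_bus and remain valid until the next key_load pulse.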
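The S0 to S3 sequence can be checked at cycle level with a small behavioural model. The Python sketch below is hypothetical: the class and method names are ours, not the RTL's. `round_fn` stands in for key_expand_round, and each call to `clock()` models one rising edge that first performs the current state's action and then transitions. The model is meant to make the cycle accounting explicit (one cycle sampling key_load, one INIT cycle, 32 RUN cycles, one DONE cycle). It does not reproduce the exact timing of the busy signal.

```python
from enum import Enum
from typing import Callable, List, Sequence, Tuple

MASK32 = 0xFFFFFFFF


class State(Enum):
    IDLE = 0  # S0: wait for a one-cycle key_load pulse
    INIT = 1  # S1: load MK_j XOR FK_j into the 4-word window
    RUN = 2   # S2: one key_expand_round per clock
    DONE = 3  # S3: all 32 round keys valid; return to S0


class KeyExpandFSM:
    """Behavioural model of an iterative 32-round key-expansion FSM.

    round_fn(window, i) must return K_{i+4} given
    window = (K_i, K_{i+1}, K_{i+2}, K_{i+3}) and round index i.
    """

    def __init__(self, round_fn: Callable[[Tuple[int, ...], int], int],
                 fk: Sequence[int]) -> None:
        self.round_fn = round_fn
        self.fk = tuple(fk)
        self.state = State.IDLE
        self.window: Tuple[int, ...] = (0, 0, 0, 0)
        self.i = 0
        self.round_keys: List[int] = []
        self.run_cycles = 0
        self._mk: Tuple[int, ...] = (0, 0, 0, 0)

    def clock(self, key_load: bool = False, mk: Sequence[int] = (0, 0, 0, 0)) -> None:
        """One rising clock edge: perform the current state's action, then transition."""
        if self.state is State.IDLE:
            if key_load:                            # otherwise self-loop in S0
                self._mk = tuple(mk)
                self.state = State.INIT
        elif self.state is State.INIT:
            self.window = tuple((m ^ f) & MASK32 for m, f in zip(self._mk, self.fk))
            self.i, self.round_keys, self.run_cycles = 0, [], 0
            self.state = State.RUN
        elif self.state is State.RUN:
            new = self.round_fn(self.window, self.i) & MASK32
            self.round_keys.append(new)             # rk_i enters the shift register
            self.window = self.window[1:] + (new,)  # slide the 4-word window
            self.i += 1
            self.run_cycles += 1
            if self.i == 32:
                self.state = State.DONE
        else:                                       # State.DONE
            self.state = State.IDLE


if __name__ == "__main__":
    # Placeholder round function (NOT SM4): returns the oldest window word.
    fsm = KeyExpandFSM(lambda w, i: w[0], fk=(1, 1, 1, 1))
    fsm.clock(key_load=True, mk=(1, 2, 3, 4))
    for _ in range(34):                             # INIT + 32 x RUN + DONE
        fsm.clock()
    print(fsm.state, fsm.run_cycles, fsm.round_keys[:4])
```

Plugging the real SM4 key-schedule step ($K_i \oplus T'(\cdot)$) in as `round_fn` turns this into a reference for the 32 round keys. The placeholder keeps the example focused on state sequencing.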

- Operator sharing.
4.5. Encryption–Decryption Unification
4.6. Resource–Timing Analysis
- A1.
- 28 nm general-purpose standard-cell library.
- A2.
- A single sm4_sbox, after technology mapping, occupies ≈220 GE based on a LUT-based estimate; this matches typical 8-bit S-box implementations reported for AES [31].
- A3.
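The per-module entries of the analytical area table can be rolled up mechanically. The sketch below only re-derives quantities that follow from that table and from Section 4.4: 32 rounds, four S-boxes per round, and a key-expansion register file assumed to be the 128-bit state window plus the 1024-bit round_key_bus. The constant names are hypothetical. The split of a round into S-box and non-S-box logic is our inference from A2, not a reported figure.

```python
N_ROUNDS = 32
SBOXES_PER_ROUND = 4   # tau applies one 8-bit S-box to each byte of a 32-bit word
SBOX_KGE = 0.22        # assumption A2: one sm4_sbox
ROUND_KGE = 1.28       # encrypt_round / key_expand_round (analytical table)
ENC_REG_BITS = 4224    # encrypt (32 stg.) register bits (analytical table)


def encrypt_comb_kge() -> float:
    """Fully unrolled path: one round instance per pipeline stage."""
    return N_ROUNDS * ROUND_KGE


def sbox_instances() -> dict:
    """S-box count of the unrolled data path vs. the single shared key-expansion round."""
    return {"encrypt": N_ROUNDS * SBOXES_PER_ROUND, "key_expand": SBOXES_PER_ROUND}


def key_expand_reg_bits() -> int:
    """Assumed: 4-word (128-bit) state window plus 32 x 32-bit round_key_bus."""
    return 4 * 32 + N_ROUNDS * 32


def non_sbox_round_kge() -> float:
    """Round logic outside the S-boxes (linear transform and XORs), inferred."""
    return ROUND_KGE - SBOXES_PER_ROUND * SBOX_KGE


if __name__ == "__main__":
    print(f"encrypt comb.   ~ {encrypt_comb_kge():.2f} kGE")
    print(f"S-box count     : {sbox_instances()}")
    print(f"register bits   : {ENC_REG_BITS + key_expand_reg_bits()}")
    print(f"non-S-box/round ~ {non_sbox_round_kge():.2f} kGE")
```

The roll-up gives 40.96 kGE (the table's 41.0), 128 versus 4 S-box instances, and 5376 register bits, which matches the sm4_top row. The ≈58 kGE total is not reproduced here because the table does not itemise register area.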
5. RISC-V Processor Integration
5.1. Integration Paths and Constraints
5.2. Three-Tier Integration Scheme
5.3. T1: sm4_rocc Wrapper and BFM
5.4. Custom Instruction Encoding and C API
Listing 2. C inline-assembly API (excerpt from sm4_intrinsics.h).
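A RoCC response returns a single 64-bit register, so each 128-bit operation is split across two instructions. In the encoding table, ENC.HI carries both plaintext halves in rs1/rs2 and returns ct[127:64], while ENC.LO takes no source operands and returns ct[63:0]. The Python model below is a hypothetical illustration of that HI/LO contract from the software side, with a pluggable `cipher` callable standing in for sm4_top. It is not the C API in sm4_intrinsics.h.

```python
from typing import Callable, Optional, Tuple

MASK64 = (1 << 64) - 1


def split128(block: int) -> Tuple[int, int]:
    """128-bit block -> (bits [127:64], bits [63:0]), i.e. (rs1, rs2)."""
    return (block >> 64) & MASK64, block & MASK64


def join128(hi: int, lo: int) -> int:
    """(bits [127:64], bits [63:0]) -> 128-bit block."""
    return ((hi & MASK64) << 64) | (lo & MASK64)


class HiLoModel:
    """Software-side model of the HI/LO write-back ordering (hypothetical)."""

    def __init__(self, cipher: Callable[[int], int]) -> None:
        self.cipher = cipher
        self._pending_lo: Optional[int] = None

    def enc_hi(self, rs1: int, rs2: int) -> int:
        """Run the cipher on rs1||rs2, return the high half, buffer the low half."""
        hi, self._pending_lo = split128(self.cipher(join128(rs1, rs2)))
        return hi

    def enc_lo(self) -> int:
        """Return the buffered low half; must follow an enc_hi."""
        if self._pending_lo is None:
            raise RuntimeError("ENC.LO issued without a preceding ENC.HI")
        lo, self._pending_lo = self._pending_lo, None
        return lo


def encrypt_block(dev: HiLoModel, pt: int) -> int:
    """What an intrinsic wrapper would do: one HI/LO pair per 128-bit block."""
    hi = dev.enc_hi(*split128(pt))
    return join128(hi, dev.enc_lo())
```

Buffering the low half inside the device, rather than returning both halves at once, keeps each instruction within the one-destination-register limit of the R-type format.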
5.5. T3: Illegal-Instruction Trap-and-Emulate
- 1. Read the 32-bit instruction word from instruction_pointer(regs) via copy_from_user.
- 2. Check that the word carries the expected custom opcode; parse funct7/rd/rs1/rs2; read the register values from regs.
- 3. Forward the parameters to the SM4 RTL (or the software-model fallback) via ioremap MMIO; block until the STATUS register’s busy bit clears.
- 4. Write back the ciphertext/plaintext to the destination register; advance the PC by 4; return.
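Steps 2 to 4 can be prototyped in user space before being written as kernel C. The Python sketch below is a hypothetical model, not the sm4_trap.ko source. For the opcode check in step 2 it assumes the custom-0 major opcode (0x0B), and it uses the Rocket RoCC field layout, with funct7 in bits [31:25] and the xd/xs1/xs2 flags in bits [14:12]. `engine` is a placeholder for the MMIO path or the software-model fallback of step 3.

```python
from dataclasses import dataclass
from typing import Callable, List

CUSTOM0 = 0x0B          # assumption: SM4.* use the custom-0 major opcode
MASK64 = (1 << 64) - 1


@dataclass(frozen=True)
class RoccInsn:
    funct7: int
    rs2: int
    rs1: int
    xd: bool   # bit 14: instruction writes rd
    xs1: bool  # bit 13: instruction reads rs1
    xs2: bool  # bit 12: instruction reads rs2
    rd: int
    opcode: int


def decode(word: int) -> RoccInsn:
    """Step 2: split a 32-bit instruction word into RoCC fields."""
    return RoccInsn(
        funct7=(word >> 25) & 0x7F,
        rs2=(word >> 20) & 0x1F,
        rs1=(word >> 15) & 0x1F,
        xd=bool((word >> 14) & 1),
        xs1=bool((word >> 13) & 1),
        xs2=bool((word >> 12) & 1),
        rd=(word >> 7) & 0x1F,
        opcode=word & 0x7F,
    )


def emulate(word: int, regs: List[int], pc: int,
            engine: Callable[[int, int, int], int]) -> int:
    """Steps 2-4: decode, call the engine, write rd, return the next PC."""
    insn = decode(word)
    if insn.opcode != CUSTOM0:
        raise ValueError("not an SM4 custom instruction")  # leave to the default handler
    a = regs[insn.rs1] if insn.xs1 else 0
    b = regs[insn.rs2] if insn.xs2 else 0
    result = engine(insn.funct7, a, b)
    if insn.xd and insn.rd != 0:        # x0 is hard-wired to zero
        regs[insn.rd] = result & MASK64
    return pc + 4                        # skip the emulated instruction
```

Rejecting foreign opcodes with an error, instead of silently emulating them, mirrors the requirement that a trap handler must hand genuinely illegal instructions back to the kernel's default path.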
- Stability, compatibility, and overhead (qualitative).
6. Experimental Methodology
6.1. Platform and Toolchain
6.2. Three-Tier Reproducible Experimental Flow
- $ make rtl_sim # legacy: 5 vectors via iverilog + vvp
- $ ./run_all_experiments.sh # extended: 1040 vectors, Exp 1A + 1B
- $ make wave # gtkwave sm4_rocc.vcd
- $ make sw_emu # gcc -O3 -DSM4_SOFT_EMU sm4_rocc_demo.c …
- $ make -C /lib/modules/$(uname -r)/build M=$(pwd) modules
- $ sudo insmod sm4_trap.ko
- $ make sw_real
6.3. Test Stimuli
- Standard vector (S1). The GB/T 32907-2016 standard test vector with plaintext and master key both set to 0123456789abcdeffedcba9876543210, expected ciphertext 681edf34d206965e86b3e94f536e4246 [1].
- Burst vectors (S2). Nine pseudo-random 128-bit plaintext blocks injected in single-shot, 4-shot, and 5-shot burst modes, covering cold start, half-fill, and steady-state full-pipeline occupancy.
- Extended random vectors (S3). 1024 pseudo-random plaintexts generated by an xorshift64 PRNG seeded with the fixed constant 0xdeadbeefcafebabe, plus 16 hand-picked edge cases (all-zero, all-one, alternating bit patterns, single-bit MSB/LSB, the GB standard PT). The PRNG seed was fixed to guarantee bit-for-bit reproducibility.
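A deterministic generator of this kind takes only a few lines. The sketch below uses Marsaglia's xorshift64 with the common (13, 7, 17) shift triple and packs two consecutive outputs into one 128-bit plaintext, with the first output in bits [127:64]. Both choices are assumptions, since the exact variant and packing are not specified. Its output is therefore not claimed to reproduce the PRNG#0 to PRNG#5 rows of the test-vector table.

```python
MASK64 = (1 << 64) - 1
SEED = 0xDEADBEEFCAFEBABE  # fixed seed quoted in the text


def xorshift64(state: int) -> int:
    """One step of Marsaglia's xorshift64 (shift triple 13, 7, 17 assumed)."""
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state


def random_plaintexts(count: int, seed: int = SEED) -> list:
    """`count` 128-bit plaintexts; the first 64-bit output of each pair fills
    bits [127:64] and the second fills bits [63:0] (packing order assumed)."""
    state = seed & MASK64
    if state == 0:
        raise ValueError("xorshift64 needs a non-zero seed")
    blocks = []
    for _ in range(count):
        state = xorshift64(state)
        hi = state
        state = xorshift64(state)
        blocks.append((hi << 64) | state)
    return blocks


if __name__ == "__main__":
    for i, pt in enumerate(random_plaintexts(4)):
        print(f"PRNG#{i}: {pt:032x}")
```

Calling `random_plaintexts(1024)` twice returns identical lists, which is the reproducibility property the fixed seed provides. In an RTL flow the same list could be written to a hex file and loaded by the testbench with `$readmemh`.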
7. Experimental Results and Analysis
7.1. Functional Correctness (Measured)
- T1a—Baseline 5-vector smoke test.
- (a) cmd_funct7 showed 0x01 followed by five 0x02/0x03 pairs (11 events: 1 LDKEY + 5 ENC.HI/LO pairs);
- (b) busy held high for 33 clock cycles during LDKEY, matching 32 iterations plus a one-cycle state-hold;
- (c) the ciphertext bus showed five values matching GB/T 32907-2016 bit-for-bit;
- (d) resp_valid rose twice per ENC pair, once for each of the high and low 64-bit write-backs.


- T1b—Extended 1040-vector standalone verification (Experiment 1A).
- T1c—extended 1040-vector RoCC-wrapped verification (Experiment 1B).
- T2—Real C910 software reference model.
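A software reference model is small enough to re-implement independently, which gives a second oracle for cross-checking the testbench. The Python sketch below follows GB/T 32907-2016 directly: the S-box, FK, CK, the L and L' linear transforms, 32 rounds, and the final reverse transform R. It is a stand-in written for illustration, not the authors' C model, and it makes no attempt at constant-time execution. Decryption reuses the same round function with the round keys reversed, which is the encryption–decryption symmetry the hardware exploits.

```python
MASK32 = 0xFFFFFFFF

# GB/T 32907-2016 S-box, 16 bytes per row
SBOX = bytes.fromhex(
    "d690e9fecce13db716b614c228fb2c05" "2b679a762abe04c3aa44132649860699"
    "9c4250f491ef987a33540b43edcfac62" "e4b31ca9c908e89580df94fa758f3fa6"
    "4707a7fcf37317ba83593c19e6854fa8" "686b81b27164da8bf8eb0f4b70569d35"
    "1e240e5e6358d1a225227c3b01217887" "d40046579fd327524c3602e7a0c4c89e"
    "eabf8ad240c738b5a3f7f2cef96115a1" "e0ae5da49b341a55ad933230f58cb1e3"
    "1df6e22e8266ca60c02923ab0d534e6f" "d5db3745defd8e2f03ff6a726d6c5b51"
    "8d1baf92bbddbc7f11d95c411f105ad8" "0ac13188a5cd7bbd2d74d012b8e5b4b0"
    "8969974a0c96777e65b9f109c56ec684" "18f07dec3adc4d2079ee5f3ed7cb3948"
)

# System parameter FK; fixed parameters CK_i with bytes ck_ij = (4i + j) * 7 mod 256
FK = (0xA3B1BAC6, 0x56AA3350, 0x677D9197, 0xB27022DC)
CK = tuple(
    int.from_bytes(bytes((4 * i + j) * 7 % 256 for j in range(4)), "big")
    for i in range(32)
)


def _rotl(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & MASK32


def _tau(x: int) -> int:
    """Non-linear layer: S-box applied to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in x.to_bytes(4, "big")), "big")


def _t(x: int) -> int:
    """Data-path transform T = L(tau(x))."""
    b = _tau(x)
    return b ^ _rotl(b, 2) ^ _rotl(b, 10) ^ _rotl(b, 18) ^ _rotl(b, 24)


def _t_key(x: int) -> int:
    """Key-schedule transform T' = L'(tau(x))."""
    b = _tau(x)
    return b ^ _rotl(b, 13) ^ _rotl(b, 23)


def _words(block: int) -> list:
    """128-bit integer -> four big-endian 32-bit words."""
    return [(block >> (96 - 32 * i)) & MASK32 for i in range(4)]


def expand_key(mk: int) -> list:
    """Return the 32 round keys rk_0 .. rk_31."""
    k = [w ^ f for w, f in zip(_words(mk), FK)]
    for i in range(32):
        k.append(k[i] ^ _t_key(k[i + 1] ^ k[i + 2] ^ k[i + 3] ^ CK[i]))
    return k[4:]


def _crypt(block: int, rk: list) -> int:
    x = _words(block)
    for r in rk:
        x.append(x[-4] ^ _t(x[-3] ^ x[-2] ^ x[-1] ^ r))
    out = 0
    for w in reversed(x[-4:]):  # reverse transform R: (X35, X34, X33, X32)
        out = (out << 32) | w
    return out


def encrypt(pt: int, key: int) -> int:
    return _crypt(pt, expand_key(key))


def decrypt(ct: int, key: int) -> int:
    return _crypt(ct, expand_key(key)[::-1])  # same datapath, round keys reversed


if __name__ == "__main__":
    k = 0x0123456789ABCDEFFEDCBA9876543210  # S1: plaintext and key are identical
    print(f"{encrypt(k, k):032x}")
```

Run as a script, it should print the S1 ciphertext 681edf34d206965e86b3e94f536e4246. The same `encrypt` function can serve as the expected-value generator for any stimulus list.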
7.2. Throughput and Latency
7.3. Post-Synthesis Area, Power and Energy Efficiency (F0)
- Area (measured).
- Quantitative resource sharing (measured).
- Power and energy efficiency (post-synthesis estimate).
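The measured area and the derived efficiency figures are related by simple unit conversions. In the sketch below, the 3.75 μm² per gate equivalent is an assumption: it reproduces the kGE column of the area table and is close to the 3.7536 μm² footprint of the sky130_fd_sc_hd__nand2_1 cell. The 12.8 Gbps throughput is the steady-state value at 100 MHz, and the function names are hypothetical.

```python
GE_UM2 = 3.75  # assumption: area of one gate equivalent in um^2


def kge(area_um2: float, ge_um2: float = GE_UM2) -> float:
    """Convert a Yosys cell area in um^2 to kilo-gate-equivalents."""
    return area_um2 / ge_um2 / 1e3


def energy_pj_per_bit(power_w: float, tput_gbps: float) -> float:
    """Energy per processed bit: power divided by throughput."""
    return power_w / (tput_gbps * 1e9) * 1e12


def gbps_per_watt(power_w: float, tput_gbps: float) -> float:
    """Throughput-per-watt figure of merit."""
    return tput_gbps / power_w


if __name__ == "__main__":
    for name, area, power in (("sm4_top", 499_241, 0.276), ("sm4_rocc", 513_315, 0.277)):
        print(f"{name:8s} {kge(area):6.1f} kGE  "
              f"{energy_pj_per_bit(power, 12.8):5.1f} pJ/bit  "
              f"{gbps_per_watt(power, 12.8):5.1f} Gbps/W")
```

With the tabulated powers, this reproduces 133.1/136.9 kGE, 21.6 pJ/bit and 46.4/46.2 Gbps/W. The energy figures inherit the 0.2 switching-activity assumption of the OpenSTA estimate.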
7.4. Comparison with Related Work
7.5. Evidence Levels
7.6. Threats to Validity
8. Discussion and Future Work
- F1a. Digilent Arty A7-100T (Artix-7 XC7A100T, 101 K logic cells). Use the LiteX framework [48] with a VexRiscv RV32IM soft core; mount sm4_top as a Wishbone CSR peripheral or VexRiscv CFU (Custom Function Unit). One-click Vivado synthesis at 100 MHz is expected to suffice for validating the asymmetric methodology with real hardware acceleration ratios.
- F1b. Digilent Genesys 2 (Kintex-7 K325T, 326 K logic cells). Use Chipyard [49] with Rocket Chip RV64GC and the genuine RoCC interface. This path directly exercises the sm4_rocc wrapper with no protocol adaptation.
- F1c. An open-source RISC-V FPGA soft-core kit such as OpenC910 or ICEX, which makes the genuine RoCC port available without a commercial license; this path preserves the open-tool reproducibility of the present paper while replacing the projected synthesis numbers with measured ones.
9. Conclusions
- T1 RTL co-simulation: an extended testbench of 1040 vectors (16 hand-picked edge cases + 1024 fixed-seed random vectors) drove both the standalone sm4_top and the full RoCC-wrapped sm4_rocc; both reported a 1040/1040 pass. GTKWave waveforms confirmed the 32-cycle pipeline latency and steady-state 1 block/cycle throughput, consistent with the analytical formulae.
- T2 real Xuantie C910 at 1.85 GHz: a pure-C software reference model encrypted blocks, measured throughput 34.81 MiB/s ≈ 291.9 Mbps; 10/10 GB/T 32907-2016 test vectors passed.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| SM4 | Chinese national block cipher (GB/T 32907-2016) |
| RoCC | Rocket Custom Coprocessor interface |
| BFM | Bus Functional Model |
| FSM | finite-state machine |
| GE | Gate Equivalent |
| FO4 | Fan-Out-of-4 inverter delay |
| EDA | Electronic Design Automation |
| ISA | Instruction Set Architecture |
References
- GB/T 32907-2016; Information Security Technology—SM4 Block Cipher Algorithm. Standardization Administration of China: Beijing, China, 2016.
- Yang, P. ShangMi (SM) Cipher Suites for TLS 1.3; RFC 8998; Internet Engineering Task Force (IETF): Wilmington, DE, USA, 2021; Available online: https://www.rfc-editor.org/info/rfc8998 (accessed on 29 May 2026).
- Trusted Computing Group. TPM 2.0 Library Specification. 2019. Available online: https://trustedcomputinggroup.org/resource/tpm-library-specification/ (accessed on 29 May 2026).
- Dworkin, M. Recommendation for Block Cipher Modes of Operation: The XTS-AES Mode for Confidentiality on Storage Devices; NIST Special Publication 800-38E; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2010.
- Abed, S.; Jaffal, R.; Mohd, B.J.; Alshayeji, M. Performance Evaluation of the SM4 Cipher Based on Field-Programmable Gate Array Implementation. IET Circuits Devices Syst. 2021, 15, 121–135.
- T-Head Semiconductor. OpenC910: An Open-Source 12-Stage Out-of-Order RISC-V Processor. 2022. Available online: https://github.com/T-head-Semi/openc910 (accessed on 29 May 2026).
- Sipeed Ltd. LicheePi 4A Hardware Reference Manual (TH1520, T-Head Xuantie C910). 2023. Available online: https://wiki.sipeed.com/hardware/zh/lichee/th1520/lpi4a.html (accessed on 29 May 2026).
- Wang, J.; Wang, Z.; Xiao, C.; Zhang, L. AES Algorithm Design and Full-Flow Domestic Verification on the LicheePi 4A Platform. J. Beijing Electron. Sci. Technol. Inst. 2026, 34, 14–27. (In Chinese)
- Diffie, W.; Ledin, G. SMS4 Encryption Algorithm for Wireless Networks. Cryptology ePrint Archive, Report 2008/329. 2008. Available online: https://eprint.iacr.org/2008/329 (accessed on 29 May 2026).
- Su, B.-Z.; Wu, W.-L.; Zhang, W.-T. Security of the SMS4 Block Cipher Against Differential Cryptanalysis. J. Comput. Sci. Technol. 2011, 26, 130–138.
- McGrew, D.A.; Viega, J. The Security and Performance of the Galois/Counter Mode (GCM) of Operation. In Progress in Cryptology—INDOCRYPT 2004; LNCS 3348; Springer: Berlin/Heidelberg, Germany, 2004; pp. 343–355.
- Zhou, F.; Zhang, B.; Wu, N.; Bu, X. The Design of Compact SM4 Encryption and Decryption Circuits That Are Resistant to Bypass Attack. Electronics 2020, 9, 1102.
- Chen, R.; Li, B. Exploration of the High-Efficiency Hardware Architecture of SM4-CCM for IoT Applications. Electronics 2022, 11, 935.
- Bai, X.; Xu, Y.; Guo, L. A Compact S-Box Design for SMS4 Block Cipher. In Proceedings of the International Conference on Information Technology and Software Engineering; Lecture Notes in Electrical Engineering; Springer: Dordrecht, The Netherlands, 2013; Volume 211, pp. 641–648.
- Shao, T.; Wei, B.; Ou, Y.; Wei, Y.; Wu, X. New Second-order Threshold Implementation of SM4 Block Cipher. J. Electron. Test. 2023, 39, 695–710.
- Schneider, T.; Moradi, A. Leakage Assessment Methodology—A Clear Roadmap for Side-Channel Evaluations. In Cryptographic Hardware and Embedded Systems—CHES 2015; LNCS 9293; Springer: Berlin/Heidelberg, Germany, 2015; pp. 495–513.
- Lin, H.; Deng, X.; Yu, F.; Sun, Y. Grid Multi-Butterfly Memristive Neural Network with Three Memristive Systems: Modeling, Dynamic Analysis, and Application in Police IoT. IEEE Internet Things J. 2024, 11, 29878–29889.
- Ding, S.; Lin, H.; Deng, X.; Yao, W.; Jin, J. A Hidden Multiwing Memristive Neural Network and Its Application in Remote Sensing Data Security. Expert Syst. Appl. 2025, 277, 127168.
- Lin, H.; Deng, X.; Zhang, S.; Chen, X.; Min, G.; Xue, K. Securing Image Privacy in Internet-of-Vehicles With a Multiwing Hyperchaotic Memristive Neural Network. IEEE Internet Things J. 2025. early access.
- Min, G.; Chen, X.; Deng, X.; Zhang, Y.; Li, Z.; Lin, H. Memristive CNN with Multi-Butterfly Attractors: Mathematical Modeling, Dynamics Analysis and Application in Secure Communication. Chaos Solitons Fractals 2026, 206, 117905.
- Stoffelen, K. Efficient Cryptography on the RISC-V Architecture. In Progress in Cryptology—LATINCRYPT 2019; LNCS 11774; Springer: Cham, Switzerland, 2019; pp. 323–340.
- National Institute of Standards and Technology. Advanced Encryption Standard (AES); FIPS Publication 197 (Updated); National Institute of Standards and Technology: Gaithersburg, MD, USA, 2023.
- Tehrani, E.; Graba, T.; Si Merabet, A.; Danger, J.-L. RISC-V Extension for Lightweight Cryptography. In Proceedings of the 23rd Euromicro Conference on Digital System Design (DSD), Kranj, Slovenia, 26–28 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 222–228.
- Marshall, B.; Newell, G.R.; Page, D.; Sherwood, T.; Wolf, C. The Design of Scalar AES Instruction Set Extensions for RISC-V. IACR Cryptogr. Hardw. Embed. Syst. 2021, 2021, 109–136.
- RISC-V International. RISC-V Cryptography Extensions Volume I: Scalar & Entropy Source Instructions, Version 1.0.1. 2023. Available online: https://github.com/riscv/riscv-crypto (accessed on 29 May 2026).
- RISC-V International. RISC-V Vector Cryptography Extensions (Zvkns/Zvkg/Zvksed/Zvksh) Specification, Version 1.0.0. 2023. Available online: https://github.com/riscv/riscv-crypto (accessed on 29 May 2026).
- Gomes, T.; Sousa, P.; Silva, M.; Ekpanyapong, M.; Pinto, S. FAC-V: An FPGA-Based AES Coprocessor for RISC-V. J. Low Power Electron. Appl. 2022, 12, 50.
- Asanović, K.; Avizienis, R.; Bachrach, J.; Beamer, S.; Biancolin, D.; Celio, C.; Cook, H.; Dabbelt, D.; Hauser, J.; Izraelevitz, A.; et al. The Rocket Chip Generator; Technical Report No. UCB/EECS-2016-17; EECS Department, University of California: Berkeley, CA, USA, 2016; Available online: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-17.html (accessed on 29 May 2026).
- Lee, D.; Kohlbrenner, D.; Shinde, S.; Asanović, K.; Song, D. Keystone: An Open Framework for Architecting Trusted Execution Environments. In Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys ’20), Heraklion, Greece, 27–30 April 2020; ACM: New York, NY, USA, 2020. Article 38.
- Paar, C.; Pelzl, J. Understanding Cryptography: A Textbook for Students and Practitioners; Springer: Berlin/Heidelberg, Germany, 2010.
- Good, T.; Benaissa, M. AES on FPGA from the Fastest to the Smallest. In Cryptographic Hardware and Embedded Systems—CHES 2005; LNCS 3659; Springer: Berlin/Heidelberg, Germany, 2005; pp. 427–440.
- Weste, N.H.E.; Harris, D.M. CMOS VLSI Design: A Circuits and Systems Perspective, 4th ed.; Addison-Wesley: Boston, MA, USA, 2011.
- Rabaey, J.M.; Chandrakasan, A.P.; Nikolić, B. Digital Integrated Circuits: A Design Perspective, 2nd ed.; Prentice Hall: Upper Saddle River, NJ, USA, 2003.
- Williams, S. Icarus Verilog. 2023. Available online: https://github.com/steveicarus/iverilog (accessed on 29 May 2026).
- Bybell, A. GTKWave Electronic Waveform Viewer. 2023. Available online: https://gtkwave.sourceforge.net/ (accessed on 29 May 2026).
- The Linux Kernel Documentation. Kernel Probes (Kprobes). 2023. Available online: https://www.kernel.org/doc/html/latest/trace/kprobes.html (accessed on 29 May 2026).
- Waterman, A.; Asanović, K. (Eds.) The RISC-V Instruction Set Manual, Volume I: Unprivileged ISA; Document Version 20191213; RISC-V International: Zürich, Switzerland, 2019; Available online: https://riscv.org/specifications/ (accessed on 29 May 2026).
- Wolf, C.; Glaser, J.; Kepler, J. Yosys—A Free Verilog Synthesis Suite. In Proceedings of the 21st Austrian Workshop on Microelectronics (Austrochip), Linz, Austria, 10 October 2013; Available online: https://github.com/YosysHQ/yosys (accessed on 29 May 2026).
- SkyWater Technology; Google. SkyWater Open Source PDK (sky130). 2022. Available online: https://github.com/google/skywater-pdk (accessed on 29 May 2026).
- Cherry, J. OpenSTA: Parallax Static Timing Analyzer. 2023. Available online: https://github.com/parallaxsw/OpenSTA (accessed on 29 May 2026).
- Yang, S.; Shao, L.; Huang, J.; Zou, W. Design and Implementation of Low-Power IoT RISC-V Processor with Hybrid Encryption Accelerator. Electronics 2023, 12, 4222.
- Szymkowiak, T.; Isufi, E.; Saarinen, M.-J.O. Marian: An Open Source RISC-V Processor with Zvk Vector Cryptography Extensions (Poster). In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS ’24), Salt Lake City, UT, USA, 14–18 October 2024; ACM: New York, NY, USA, 2024.
- Srivastava, A.; Porwal, M.; Basu, K. CryptRISC: A Secure RISC-V Processor for High-Performance Cryptography with Power Side-Channel Protection. arXiv 2026, arXiv:2602.20285.
- Zhang, R.; Xiang, Z.; Zhang, S.; Song, M. Optimized SM4 Hardware Implementations for Low Area Consumption. IET Inf. Secur. 2024, 2024, 7047055.
- Kwon, H.; Kim, H.; Eum, S.; Sim, M.; Kim, H.; Lee, W.-K.; Hu, Z.; Seo, H. Optimized Implementation of SM4 on AVR Microcontrollers, RISC-V Processors, and ARM Processors. IEEE Access 2022, 10, 80225–80233.
- Käsper, E.; Schwabe, P. Faster and Timing-Attack Resistant AES-GCM. In Cryptographic Hardware and Embedded Systems—CHES 2009; LNCS 5747; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–17.
- Miao, X.; Guo, C.; Wang, M.; Wang, W. Bit-Sliced Implementation of SM4 and New Performance Records. IET Inf. Secur. 2023, 2023, 1821499.
- Kermarrec, F.; Bourdeauducq, S.; Le Lann, J.-C.; Badier, H. LiteX: An Open-Source FPGA-Based SoC Builder. 2023. Available online: https://github.com/enjoy-digital/litex (accessed on 29 May 2026).
- Amid, A.; Biancolin, D.; Gonzalez, A.; Gruber, D.; Karandikar, S.; Liew, H.; Magyar, A.; Mao, H.; Ou, A.; Pemberton, N.; et al. Chipyard: Integrated Design, Simulation, and Implementation Framework for Custom SoCs. IEEE Micro 2020, 40, 10–21.
- Goodwill, G.; Jun, B.; Jaffe, J.; Rohatgi, P. A Testing Methodology for Side-Channel Resistance Validation. In Proceedings of the NIST Non-Invasive Attack Testing Workshop, Nara, Japan, 26–27 September 2011; Available online: https://csrc.nist.gov/CSRC/media/Events/Non-Invasive-Attack-Testing-Workshop/documents/08_Goodwill.pdf (accessed on 29 May 2026).











| Operation | Invocation Freq. | TP Sens. | Area Sens. | HW Preference |
|---|---|---|---|---|
| Key expansion | Low (per session) | Low | High | Iterative reuse |
| Encryption | High (streaming) | High | Medium | Fully unrolled |
| Decryption | High (streaming) | High | Medium | Shared w/encryption |
| Work | Algo. | Architecture | Toolchain | RISC-V | Platform |
|---|---|---|---|---|---|
| Zhou et al. [12] | SM4 | Compact + anti-bypass | Commercial | No | ASIC |
| Abed et al. [5] | SM4 | FPGA perf. evaluation | Commercial | No | FPGA |
| Shao et al. [15] | SM4 | 2nd-order TI (SCA) | Commercial | No | ASIC |
| Stoffelen [21] | AES/SM4 | ISA SW opt. | GCC | Yes (SW) | SiFive |
| FAC-V [27] | AES | RoCC coprocessor | Chipyard | Yes (RoCC) | FPGA |
| Wang et al. [8] | AES | 9-stage pipeline | Icarus + GTK | No | LicheePi 4A |
| This work | SM4 | 32-st. + iter. + RoCC | Icarus + GCC | Yes (3-tier) | LicheePi 4A |
| Module | Comb. (kGE) | Regs (bit) | Crit. Path |
|---|---|---|---|
| sm4_sbox | 0.22 | 0 | 4 FO4 |
| encrypt_round (1 stg.) | 1.28 | 0 | 8 FO4 |
| encrypt (32 stg.) | 41.0 | 4224 | 8 FO4/stg. |
| key_expand_round | 1.28 | 0 | 8 FO4 |
| key_expand | 2.3 | 1152 | 8 FO4 |
| sm4_top total | ≈58 | 5376 | 8 FO4 |
| Module | Area (μm²) | Area (kGE) |
|---|---|---|
| sm4_top (datapath) | 499,241 | 133.1 |
| sm4_rocc (full system) | 513,315 | 136.9 |
| Path | Call Overhead | Prog. Complexity | Core Mod.? | TH1520/C910 Status |
|---|---|---|---|---|
| MMIO peripheral | High | Low (driver) | No | Natively supported |
| RoCC coprocessor | Low (1 instr.) | Medium (asm) | Yes | Open-source core avail.; 4A not exposed |
| Zvksed vector | Very low | High (vector) | Yes (major) | Not yet implemented |
| Mnemonic | funct7 | rs1/rs2 | rd | Function |
|---|---|---|---|---|
| SM4.LDKEY | 0x01 | MK[127:64], MK[63:0] | x0 | Trigger key exp. |
| SM4.ENC.HI | 0x02 | pt[127:64], pt[63:0] | ct[127:64] | Encrypt, hi 64b |
| SM4.ENC.LO | 0x03 | — | ct[63:0] | Read lo 64b |
| SM4.DEC.HI | 0x04 | ct[127:64], ct[63:0] | pt[127:64] | Decrypt, hi 64b |
| SM4.DEC.LO | 0x05 | — | pt[63:0] | Read lo 64b |
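The encoding table fixes funct7 and which of rd/rs1/rs2 each instruction uses. Under the Rocket RoCC convention, those register-use columns map onto the xd/xs1/xs2 bits of funct3. The sketch below assembles the 32-bit words and, for use in inline assembly, a GNU assembler `.insn r` directive. The custom-0 opcode (0x0B) is an assumption, and the helper names are hypothetical.

```python
CUSTOM0 = 0x0B  # assumption: custom-0 major opcode

# mnemonic: (funct7, xd, xs1, xs2), read off the funct7, rs1/rs2 and rd columns
SM4_ISA = {
    "SM4.LDKEY":  (0x01, 0, 1, 1),  # reads MK halves, rd = x0
    "SM4.ENC.HI": (0x02, 1, 1, 1),  # reads pt halves, writes ct[127:64]
    "SM4.ENC.LO": (0x03, 1, 0, 0),  # no sources, writes ct[63:0]
    "SM4.DEC.HI": (0x04, 1, 1, 1),  # reads ct halves, writes pt[127:64]
    "SM4.DEC.LO": (0x05, 1, 0, 0),  # no sources, writes pt[63:0]
}


def _funct3(mnemonic: str) -> int:
    """Pack the (xd, xs1, xs2) flags into funct3 bits [2:0]."""
    _, xd, xs1, xs2 = SM4_ISA[mnemonic]
    return (xd << 2) | (xs1 << 1) | xs2


def encode(mnemonic: str, rd: int = 0, rs1: int = 0, rs2: int = 0) -> int:
    """Assemble one SM4.* instruction word (R-type / RoCC layout)."""
    for reg in (rd, rs1, rs2):
        if not 0 <= reg < 32:
            raise ValueError(f"register index out of range: {reg}")
    funct7 = SM4_ISA[mnemonic][0]
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (_funct3(mnemonic) << 12) | (rd << 7) | CUSTOM0)


def insn_directive(mnemonic: str, rd: int = 0, rs1: int = 0, rs2: int = 0) -> str:
    """Equivalent `.insn r opcode, funct3, funct7, rd, rs1, rs2` line for GNU as."""
    funct7 = SM4_ISA[mnemonic][0]
    return (f".insn r 0x{CUSTOM0:02x}, {_funct3(mnemonic)}, 0x{funct7:02x}, "
            f"x{rd}, x{rs1}, x{rs2}")


if __name__ == "__main__":
    print(hex(encode("SM4.ENC.HI", rd=10, rs1=11, rs2=12)))
    print(insn_directive("SM4.ENC.HI", rd=10, rs1=11, rs2=12))
```

Emitting `.insn` rather than raw `.word` lets the assembler check register operands, and keeps the intrinsic header readable if the opcode assignment ever changes.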
| Item | Configuration |
|---|---|
| Board | LicheePi 4A (Sipeed) |
| SoC | T-Head TH1520 (4× Xuantie C910, RV64GCV, up to 1.85 GHz) |
| Memory/Storage | 8 GB LPDDR4/32 GB eMMC |
| OS | OpenKylin 1.0 RISC-V (Linux 6.6, riscv64) |
| Simulator | Icarus Verilog 12.0 (open-source) |
| Waveform viewer | GTKWave 3.3 (open-source) |
| Compiler | GCC 13.2 riscv64-linux-gnu |
| # | Class | Plaintext (128-bit hex) | Expected Ciphertext (128-bit hex) |
|---|---|---|---|
| 0 | GB std. | 0123456789abcdeffedcba9876543210 | 681edf34d206965e86b3e94f536e4246 |
| 1 | all-zero | 00000000000000000000000000000000 | 2677f46b09c122cc975533105bd4a22a |
| 2 | all-one | ffffffffffffffffffffffffffffffff | 6811af7e097364e786fb45ce5d9a60f0 |
| 3 | alt 0xAA | aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | a36a94f62c567437ba79b366144d4f3e |
| 4 | PRNG#0 | 27dc5c1b2d04284ba29639727aeb52db | 3bd1b4651e03b2be9329d3981daf0622 |
| 5 | PRNG#1 | 68f59be1ebed52beb19c47a2b60fe79b | ccecd88ab0ccbf768eb126b801a7295e |
| 6 | PRNG#2 | bb99d99371417e944002b0899afcd969 | f97151a0fb9b4d3831d7b0598da8a5e6 |
| 7 | PRNG#3 | 5f9cff7518e45a9bb05039f1e94c54ee | 3fe0d8633963bf9d18ee9bc247513782 |
| 8 | PRNG#4 | 07a37efdbc9837c70954fed44bc41668 | 62c10e61c0318fcb1cdf1d7c1fe9129d |
| 9 | PRNG#5 | 41244a759813044490f0025071f2b34c | 2e9c3ce0e19f328faf73a462da45e90c |
| Module | Dynamic (mW) | Static (μW) | Total (mW) | Energy (pJ/bit) | Tput. (Gbps/W) |
|---|---|---|---|---|---|
| sm4_top | 276 | 0.16 | 276 | 21.6 | 46.4 |
| sm4_rocc | 277 | 0.17 | 277 | 21.6 | 46.2 |
| Claim | Evidence Level | Basis | Section |
|---|---|---|---|
| T1a 5/5 ENC pairs pass | Measured | VVP pass=5, fail=0 | Section 7.1 |
| T1b 1040/1040 sm4_top pass (Exp 1A) | Measured | VVP pass=1040, fail=0 | Section 7.1 |
| T1c 1040/1040 sm4_rocc pass (Exp 1B) | Measured | VVP pass=1040, fail=0 | Section 7.1 |
| 1B avg. latency 400 ns/pair | Measured | VVP Avg time per encryption: 400 ns | Section 7.1 |
| T2 SW throughput 34.81 MiB/s | Measured | LicheePi 4A C910, GCC -O3 | Section 7.1 |
| T2 10/10 vectors pass | Measured | check_all on C910 | Section 7.1 |
| HW 12.8 Gbps @100 MHz steady | Sim. equivalent | RTL: 1 block/cycle steady state | Section 7.2 |
| HW 0.32 Gbps BFM seq. issue | Sim. equivalent | Equation (15) | Section 7.2 |
| HW ≤44.8 Gbps @350 MHz | Projected | 28 nm FO4 + Equation (13) | Section 7.2 |
| Area 133/137 kGE (top/rocc) | Measured | Yosys stat + sky130_fd_sc_hd | Section 7.3 |
| Area ≈ 58 kGE (lower bound) | Analytical est. | Equations (9)–(11) | Section 4.6 |
| Power ≈ 0.28 W @100 MHz | Post-synth. est. | OpenSTA report_power (0.2 act.) | Section 7.3 |
| Energy ≈ 22 pJ/bit, ≈46 Gbps/W | Post-synth. est. | Derived from power and 12.8 Gbps | Section 7.3 |
| S-box: 128 enc./4 key-exp. | Measured | Yosys hierarchical stat | Section 7.3 |
| On-board acceleration ratio | Not yet measured | TH1520 RoCC port unexposed | Section 7.6 |
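Several rows of the evidence table are derived rather than measured, and the derivations are short enough to re-run. The sketch below recomputes them from block size, clock frequency and cycle counts. Reading the 330 ns fill latency of the comparison table as 33 cycles at 100 MHz, and reusing that cycle count at the projected 350 MHz, is our interpretation. The function names are hypothetical.

```python
BLOCK_BITS = 128


def gbps_from_clock(f_mhz: float, blocks_per_cycle: float = 1.0) -> float:
    """Steady-state throughput of a pipeline accepting blocks_per_cycle blocks per clock."""
    return BLOCK_BITS * f_mhz * blocks_per_cycle / 1e3


def gbps_from_block_time(ns_per_block: float) -> float:
    """Throughput when blocks are issued one after another (bits per ns == Gbps)."""
    return BLOCK_BITS / ns_per_block


def cycles_to_ns(cycles: int, f_mhz: float) -> float:
    """Latency of a fixed cycle count at a given clock."""
    return cycles * 1e3 / f_mhz


if __name__ == "__main__":
    print(f"steady state @100 MHz : {gbps_from_clock(100):.2f} Gbps")
    print(f"BFM, 400 ns per pair  : {gbps_from_block_time(400):.2f} Gbps")
    print(f"projected @350 MHz    : {gbps_from_clock(350):.2f} Gbps")
    print(f"33-cycle fill @100/350: {cycles_to_ns(33, 100):.0f} / {cycles_to_ns(33, 350):.1f} ns")
```

The 0.32 Gbps BFM figure corresponds to 40 clock cycles per HI/LO pair at 100 MHz. That is a 40× gap relative to the 1 block/cycle steady state, which is why the two are reported as separate rows.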
| Design | Platform | Freq. (MHz) | Area (kGE) | Power | Latency | Tput. (Gbps) | Open |
|---|---|---|---|---|---|---|---|
| RISC-V AES crypto-hardware baselines | | | | | | | |
| Wang et al. [8] (AES-128) | LicheePi 4A | 100 | N/R | N/R | N/R | 1.28 | Yes |
| FAC-V [27] (AES coproc.) | SiFive E31/Arty A7 | 65 | N/R | N/R | N/R | N/R | Yes |
| Recent SM4-on-RISC-V acceleration (2023–2026) | | | | | | | |
| Yang et al. [41] (SM3/SM4 accel., 2023) | RISC-V SoC (FPGA) | N/R | N/R | N/R | 22 cyc d | N/R | Yes |
| Marian [42] (Zvksed, 2024) | VCU118/22 nm | 75/1000 | ∼100 kGE e | N/R | N/R | N/R | Yes |
| CryptRISC [43] (scalar+mask, 2026) | CVA6/Kintex-7 | N/R | N/R f | N/R | N/R | N/R | Yes |
| This work (annotated by evidence level) | | | | | | | |
| Ours (SW meas., T2) | LicheePi 4A C910 | 1850 | — | — | — | 0.29 | Yes |
| Ours (BFM meas., 1B) | LicheePi 4A | 100 | — | — | 400 ns/blk | 0.32 | Yes |
| Ours (synth., F0) | Sky130 130 nm | 100 | 133 a | 0.28 W b | 330 ns fill | 12.8 c | Yes |
| Ours (28 nm proj.) | 28 nm (est.) | ≤350 | — | — | ≥94 ns | ≤44.8 | Yes |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Wang, J.; Wang, Z.; Zhou, R.; Xiao, C.; Zhang, L. Exploiting Structural Symmetry of SM4 for an Asymmetric Hardware Architecture: Design and Open-Source Verification on the RISC-V LicheePi 4A Platform. Symmetry 2026, 18, 1083. https://doi.org/10.3390/sym18071083
