A Hybrid Module-LWE and Hash-Based Framework for Memory-Efficient Post-Quantum Key Encapsulation

Marevac, Elmin; Kadušić, Esad; Živić, Nataša; Nesimović, Sanela; Ruland, Christoph

doi:10.3390/cryptography10030030

by Elmin Marevac ¹, Esad Kadušić ², Nataša Živić ³,*, Sanela Nesimović ² and Christoph Ruland ⁴

¹ Polytechnic Faculty, University of Zenica, 72000 Zenica, Bosnia and Herzegovina

² Faculty of Educational Sciences, University of Sarajevo, 71000 Sarajevo, Bosnia and Herzegovina

³ Faculty of Digital Transformation, Leipzig University of Applied Sciences, 04277 Leipzig, Germany

⁴ Chair of Digital Communication Systems, University of Siegen, 57076 Siegen, Germany

* Author to whom correspondence should be addressed.

Cryptography 2026, 10(3), 30; https://doi.org/10.3390/cryptography10030030

Submission received: 14 February 2026 / Revised: 26 April 2026 / Accepted: 28 April 2026 / Published: 3 May 2026

(This article belongs to the Special Issue Advances in Post-Quantum Cryptography)

Abstract

Deploying post-quantum cryptography on highly constrained devices remains challenging due to the large key sizes and substantial storage and memory-traffic demands of leading lattice-based schemes. Although constructions such as Kyber, Dilithium, and NTRU offer strong resistance against quantum adversaries, their multi-kilobyte public keys and intensive memory access patterns limit practical adoption in microcontrollers, smart cards, and low-power edge environments. This work proposes a hybrid key-encapsulation mechanism that integrates a compact, seed-generated Module-LWE structure with a quantum-secure hash-based authentication layer. The design employs a small public seed to instantiate lattice matrices on demand via a lightweight pseudorandom generator and incorporates a Merkle-tree commitment to represent compressed auxiliary error information. Additional design considerations—including sparsity-aware secret keys, SIMD-friendly polynomial operations, and cache-efficient decryption paths—are intended to reduce runtime memory usage and computational overhead. The security of the proposed construction is analysed under both Module-LWE and hash-based one-way assumptions, with further consideration of constant-time execution and cache-line alignment to mitigate side-channel risks. This hybrid approach outlines a design pathway toward post-quantum key-encapsulation mechanisms suitable for deployment on memory-limited and energy-constrained platforms.

Keywords:

post-quantum cryptography; lattice-based cryptography; Module-LWE; hash-based authentication; Merkle trees; lightweight PRNG; memory-constrained devices; embedded security; SIMD optimization; key encapsulation mechanisms

1. Introduction

The advent of large-scale quantum computing poses a fundamental challenge to the public-key cryptography that underpins modern digital infrastructure. Widely deployed schemes such as RSA and elliptic-curve cryptography (ECC) rely on number-theoretic problems—integer factorization and discrete logarithms—that are efficiently solvable by Shor’s quantum algorithm [1,2]. This vulnerability has catalysed a global effort to develop cryptographic primitives secure against quantum adversaries, a field now known as post-quantum cryptography (PQC) [3]. After a rigorous multi-year evaluation, the U.S. National Institute of Standards and Technology (NIST) has selected CRYSTALS-Kyber as its primary Key Encapsulation Mechanism (KEM) for standardization, with CRYSTALS-Dilithium chosen for digital signatures [4,5,6]. Both are based on structured lattice problems, reflecting the community’s confidence in their security and efficiency.

Lattice-based cryptography, particularly constructions derived from the Learning With Errors (LWE) problem and its algebraically structured variants—Ring-LWE and Module-LWE—has emerged as the most versatile PQC family [7,8]. These schemes benefit from strong worst-case hardness guarantees: breaking them implies solving hard approximation problems on arbitrary lattices, such as the Shortest Vector Problem (SVP) [9]. Structured variants like Module-LWE significantly improve performance by enabling fast polynomial arithmetic via the Number Theoretic Transform (NTT), while retaining reductions to worst-case lattice problems [10,11,12]. Kyber, for instance, achieves IND-CCA2 security through a Fujisaki–Okamoto transform applied to a Module-LWE-based public-key encryption core, offering compact parameters and high throughput on general-purpose processors [5,13].

However, the practical deployment of these schemes on resource-constrained embedded platforms, such as ARM Cortex-M microcontrollers, RISC-V-based IoT nodes, and smart cards, remains problematic [14,15]. These devices often operate with only 16–64 KB of RAM and 256–512 KB of flash storage [16]. In such environments, the kilobyte-scale public keys and ciphertexts of Kyber (e.g., a 1184 B public key and a 1088 B ciphertext for Kyber768 at NIST Security Level 3) can consume a disproportionate share of available memory [13]. While acceptable in server or desktop contexts, this footprint becomes prohibitive when multiple concurrent sessions, protocol buffers, or application logic must coexist in tight memory budgets [17,18].

More critically, memory constraints on embedded systems extend beyond static storage. Memory bandwidth, cache behaviour, and energy consumption associated with data movement frequently dominate total execution cost—often exceeding the cost of arithmetic operations themselves [19,20]. A scheme that minimizes RAM usage but exhibits poor spatial or temporal locality may still incur high latency and power draw due to frequent cache misses or external memory accesses. This reality has motivated research into “lightweight” PQC, focusing on reduced parameter sets, optimized modular arithmetic, and platform-specific assembly [21,22,23]. Yet many such optimizations involve implicit trade-offs: lowering security margins, increasing decryption failure probabilities, or relying on precomputed tables that exacerbate storage demands [24,25]. For example, some lightweight Kyber variants reduce polynomial degree or modulus size, but this can weaken concrete security estimates or complicate side-channel resistance [26,27,28].

A key insight is that a significant portion of the memory footprint in lattice-based schemes arises not from secret material, but from public, deterministic structures. The public matrix A in Module-LWE is typically generated from a short seed using a pseudorandom function (PRF), yet many implementations store it explicitly to simplify code and avoid recomputation [23,29]. Similarly, error vectors sampled during encapsulation are often stored in full, despite being reproducible or verifiable through alternative means. Storing these structures trades memory for simplicity, but on constrained devices this trade-off is often inverted: computation is abundant, while flash and RAM are scarce [16]. Regenerating A on the fly from a seed reduces static storage dramatically, from kilobytes to 32 bytes, but introduces computational overhead and potential timing leakage if not implemented carefully [29].
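The storage-for-computation trade described above can be sketched as follows. This is a minimal illustration, not the construction evaluated in this paper: SHAKE-128 from the Python standard library stands in for the PRF, the modulus 3329 is borrowed from Kyber purely as an example, and the names `expand_row` and `matvec_streamed` are hypothetical.

```python
import hashlib

Q = 3329  # illustrative modulus (the Kyber prime); an assumption, not a stated parameter


def expand_row(seed: bytes, i: int, n: int, q: int = Q) -> list[int]:
    """Regenerate row i of the public matrix from the seed by rejection sampling."""
    # Domain-separate each row so rows can be produced independently of each other.
    xof = hashlib.shake_128(seed + i.to_bytes(2, "little"))
    row: list[int] = []
    length = 4 * n  # initial guess for how many stream bytes are needed
    while len(row) < n:
        stream = xof.digest(length)  # digest() always returns a prefix of the same stream
        row = []
        for k in range(0, length - 1, 2):
            # Take a 12-bit candidate and keep it only if it is below q.
            cand = int.from_bytes(stream[k:k + 2], "little") & 0x0FFF
            if cand < q:
                row.append(cand)
                if len(row) == n:
                    break
        length *= 2  # rarely needed: retry with a longer prefix of the stream
    return row


def matvec_streamed(seed: bytes, s: list[int], q: int = Q) -> list[int]:
    """Compute A*s mod q while holding only one row of A in memory at a time."""
    n = len(s)
    out = []
    for i in range(n):
        row = expand_row(seed, i, n, q)
        out.append(sum(a * b for a, b in zip(row, s)) % q)
    return out
```

Only the 32-byte seed is kept between calls; each row exists in memory just long enough to contribute one inner product. The rejection loop runs on public data only, so its variable running time does not depend on the secret.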

This observation aligns with a broader trend toward hybrid cryptographic designs, which combine multiple independent hardness assumptions to hedge against unforeseen cryptanalytic breakthroughs [17]. Hybrid key exchange, for instance, merges classical (e.g., X25519) and post-quantum (e.g., Kyber) mechanisms so that an adversary must break both to compromise security [18,30]. Early deployments in TLS 1.3 have demonstrated that such hybrids incur manageable latency and bandwidth overheads in real-world settings [17,30]. More recently, researchers have explored integrating lattice-based KEMs with hash-based components—not just for transitional security, but to enhance robustness through design diversity [31,32].
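The "must break both" property of a hybrid key exchange comes from deriving the session key from both shared secrets at once. The sketch below shows a simple concatenate-and-hash combiner; it is an assumption made for illustration and not the combiner specified by any particular protocol, and the function name `combine_shared_secrets` is hypothetical.

```python
import hashlib


def combine_shared_secrets(ss_classical: bytes, ss_pq: bytes, context: bytes) -> bytes:
    """Derive one 32-byte session key from two independently established secrets."""
    h = hashlib.sha3_256()
    for part in (ss_classical, ss_pq, context):
        # Length-prefix each field so that different splits of the same bytes cannot collide.
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.digest()
```

An adversary who recovers only one of the two inputs still faces a hash input with an unknown component, which is the intuition behind combining independent hardness assumptions.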

Hash-based cryptography offers a compelling complement to lattice-based schemes. Its security rests on the collision and preimage resistance of well-studied hash functions—a conservative, structure-free assumption widely believed to resist quantum attacks when properly parameterized [33]. NIST has standardized hash-based signatures like LMS and XMSS, which use Merkle trees to aggregate one-time signature keys into a single public root [31,34]. Although these schemes suffer from large signature sizes or state management complexities, their conceptual simplicity and strong security make them ideal for authentication layers in hybrid systems [30].

Crucially, Merkle trees also enable efficient commitments to large data structures. Instead of storing an entire vector, one can store its Merkle root and reveal only the necessary authentication path during verification. This property has been underutilized in PQC KEMs, where error vectors and other auxiliary data are typically transmitted in full. By committing to these values via a Merkle tree, correctness can be preserved without explicit storage, reducing both static and dynamic memory requirements [35].
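The commit-then-open pattern can be made concrete with a small binary Merkle tree. The sketch below is a generic illustration under stated assumptions: SHA3-512 as the hash, a power-of-two number of leaves, and one-byte domain-separation tags for leaves and inner nodes. The helper names (`build_levels`, `open_leaf`, `verify`) are hypothetical and do not come from the scheme's specification.

```python
import hashlib
import hmac


def H(data: bytes) -> bytes:
    return hashlib.sha3_512(data).digest()


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Return all tree levels, leaf hashes first, the single-element root level last."""
    assert leaves and len(leaves) & (len(leaves) - 1) == 0, "power-of-two leaf count"
    level = [H(b"\x00" + leaf) for leaf in leaves]        # 0x00 tags a leaf
    levels = [level]
    while len(level) > 1:
        level = [H(b"\x01" + level[i] + level[i + 1])     # 0x01 tags an inner node
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def open_leaf(levels: list[list[bytes]], index: int) -> list[bytes]:
    """Authentication path: the sibling hash at every level below the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])
        index >>= 1
    return path


def verify(root: bytes, leaf: bytes, index: int, path: list[bytes]) -> bool:
    """Recompute the root from one leaf and its path, then compare."""
    node = H(b"\x00" + leaf)
    for sibling in path:
        if index & 1:
            node = H(b"\x01" + sibling + node)
        else:
            node = H(b"\x01" + node + sibling)
        index >>= 1
    return hmac.compare_digest(node, root)
```

A verifier needs only the 64-byte root, the opened leaf, and one sibling hash per tree level, rather than the full committed vector.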

Despite these opportunities, existing hybrid PQC research focuses primarily on security composition or migration strategies, not on hardware-aware optimization [17,31]. Memory efficiency—particularly the reduction of stored public parameters and improved access patterns—has rarely been treated as a first-class design objective. This gap is especially acute for embedded platforms, which are increasingly deployed in security-sensitive roles (e.g., industrial control, medical devices, automotive systems) yet remain underserved by current PQC standards [14,36].

In this paper, we propose a hybrid KEM that explicitly prioritizes memory efficiency for constrained environments. Our construction integrates a seed-generated Module-LWE core with a hash-based authentication layer built around Merkle-tree commitments. The public key consists solely of a compact seed; the public matrix A is regenerated on demand using a lightweight PRF [29]. Error-related information is not stored explicitly but represented through a Merkle root, verified during decapsulation via succinct inclusion proofs [35]. Additional design choices—including sparsity-aware secret keys and cache-efficient NTT scheduling—further minimize runtime memory usage and improve energy efficiency [20,37].

Our goal is not to replace Kyber, but to explore a design space where hybridization enables deployment on platforms currently excluded from PQC adoption [14]. We make three contributions:

A fully specified hybrid KEM with explicit memory-oriented design choices;
A security analysis under standard Module-LWE and hash-function assumptions, including considerations for timing and cache-based side channels [26,27];
A comprehensive experimental evaluation across representative embedded platforms (ARM Cortex-M4, RISC-V) measuring memory footprint, memory traffic, and execution time [23].

The remainder of this paper is organized as follows. Section 2 reviews related work on lightweight PQC, memory-efficient lattice-based constructions, and hybrid lattice–hash approaches. Section 3 formalizes the design goals and system model, including precise memory/storage objectives, computational trade-offs, and the adversarial threat model guiding our construction. Section 4 details the proposed Merkle-LWE hybrid key-encapsulation mechanism, presenting its seed-based Module-LWE core, sparse secret representation, and Merkle commitment layer within a unified architecture. Section 5 describes critical implementation considerations for embedded platforms, including memory-efficient PRNG expansion, polynomial arithmetic optimizations [12], cache-aware computation, and constant-time side-channel mitigations [28]. Section 6 outlines the experimental methodology, target platforms (ARM Cortex-M4 and x86-64), measurement techniques, and reference schemes used for comparative evaluation [38]. Section 7 presents a comprehensive experimental evaluation across multiple dimensions—cryptographic object sizes, static code footprint, peak RAM usage, memory traffic, cache behaviour, computational cost, energy consumption, and correctness validation—demonstrating the scheme’s viability on resource-constrained devices. Section 8 concludes with a synthesis of findings, limitations, security implications, and directions for future work in memory-optimized PQC.

2. Related Work

The transition to PQC has intensified research into lattice-based schemes due to their strong theoretical foundations, versatility, and relatively efficient performance on general-purpose hardware. The selection of CRYSTALS-Kyber as the primary KEM in the NIST PQC standardization process underscores the community’s confidence in Module Learning With Errors (Module-LWE) as a secure and practical foundation for quantum-resistant key exchange [4,5]. Kyber leverages structured lattices and the Number Theoretic Transform (NTT) to achieve compact parameters and high throughput, with a Level 1 public key of 800 bytes and ciphertext of 768 bytes in its final specification [13]. While these sizes are manageable in server or desktop environments, they pose significant challenges for resource-constrained embedded platforms—such as ARM Cortex-M microcontrollers, RISC-V-based IoT nodes, and smart cards—which often operate with only tens of kilobytes of RAM and a few hundred kilobytes of flash storage [14,16]. Storing and processing multi-kilobyte keys can consume a disproportionate share of available memory, leaving insufficient space for application logic, network buffers, or concurrent protocol sessions. This fundamental mismatch between standardized PQC and embedded constraints has motivated a growing body of work on lightweight and memory-efficient PQC, yet critical gaps remain [15,29].

Efforts to deploy Kyber on constrained devices have yielded valuable insights but also exposed inherent trade-offs. Projects like Open Quantum Safe (OQS) and pqm4 provide portable and highly optimized implementations for platforms such as the ARM Cortex-M4, demonstrating that Kyber can indeed run on such hardware [23,38]. However, these implementations often require substantial stack usage (e.g., >16 KB for Kyber-768) and long execution times (hundreds of thousands of CPU cycles), which can be prohibitive in real-time or battery-powered applications [21,23]. To mitigate this, researchers have proposed “lightweight” variants like Kyber-LE or Kyber-Compact, which reduce polynomial degree, modulus size, or error distribution parameters to shrink memory footprint and accelerate computation [39]. While effective in reducing resource demands, such parameter reductions risk eroding concrete security margins and deviate from the standardized, vetted parameters that provide assurance in real-world deployments. Moreover, even in optimized implementations, the public matrix A is often stored explicitly for simplicity, despite being deterministically generatable from a short seed—a missed opportunity for static storage reduction that our work directly addresses.

Other lattice-based finalists in the NIST process offer alternative trade-offs but face similar limitations. SABER, based on the Module Learning With Rounding (Module-LWR) problem, claims slightly better performance on some embedded platforms due to simpler rounding-based arithmetic in place of explicit error sampling [40]. Yet its recommended parameters still yield public keys around 1.1 KB, which remains large for deeply constrained devices. NTRU, a historically efficient third-round finalist that was ultimately not selected for standardization, relies on different hardness assumptions and faced scrutiny during the evaluation regarding its security reduction and potential for decryption failures [41,42]. Crucially, none of these schemes were designed with a memory-first philosophy; their optimizations primarily target computational speed or communication bandwidth, not the minimization of static storage or peak RAM usage, which are the true bottlenecks in embedded contexts. This architectural oversight leaves a gap for constructions that prioritize memory efficiency as a primary design goal rather than a secondary optimization.

Beyond lattice-based cryptography, other PQC families have been explored for lightweight applications, but with significant drawbacks. Code-based schemes like Classic McEliece, a fourth-round candidate in the NIST process, offer extremely conservative security based on the NP-hardness of decoding random linear codes [43]. However, their public keys are enormous, ranging from 250 KB to over 1 MB, rendering them entirely impractical for microcontrollers. Structured code-based alternatives like LEDAcrypt or BIKE use quasi-cyclic codes to reduce key sizes to 1–2 KB, but introduce new complexities: BIKE’s decryption is probabilistic and can fail, requiring retransmission mechanisms that are difficult to implement securely in unreliable embedded environments prone to power loss or crashes [44]. Hash-based signatures, such as SPHINCS+ (NIST-standardized) or the stateful LMS/XMSS, provide another avenue grounded in the collision resistance of hash functions, a conservative, structure-free assumption [34,45]. While SPHINCS+ is stateless and robust, its signature sizes are very large (8–49 KB), making it unsuitable for bandwidth-constrained IoT links. Stateful schemes like XMSS have smaller signatures but impose a critical operational burden: the signer must maintain a non-volatile counter to prevent catastrophic key reuse, a requirement that is error-prone in embedded systems without reliable persistent storage [31,33]. These trade-offs highlight the difficulty of achieving both small size and strong security in non-lattice PQC for embedded use.

Within lattice-based cryptography, several works have attempted to tailor schemes specifically for embedded platforms through low-level engineering. The pqm4 project, for instance, provides hand-optimized assembly implementations that exploit instruction-level parallelism and register allocation on the Cortex-M4, yielding significant speedups [20,23]. Similarly, Roy et al. explored high-precision arithmetic and cache-aware data layouts to minimize memory traffic during polynomial operations [22]. Banerjee et al. further analysed memory access patterns and proposed scheduling strategies to improve cache locality in NTT computations [29]. While these implementation-level optimizations are valuable, they operate within the constraints of existing scheme architectures and do not alter the fundamental representation of keys or public parameters. They optimize how data is processed, not what data is stored. In contrast, our work rethinks the very structure of the public and private keys, replacing large explicit vectors with compact seeds and cryptographic commitments, thereby addressing the root cause of memory inefficiency rather than its symptoms.

The concept of trading computation for memory is well-established, and its application to PQC is not new. Standardized schemes like Kyber and Dilithium already use a pseudorandom generator (PRG) to expand a short seed into the large public matrix A, meaning the public key can theoretically be just the seed plus the vector t = As + e [6,13]. This reduces the public key from kilobytes to tens of bytes for the seed, but the recipient must still store or reconstruct the full A to perform operations, and the sender must transmit the large t vector. Our approach extends this principle more radically: we eliminate the need to store or transmit t altogether. Instead, the public key is a Merkle root that commits to the secret key’s coefficients, and verification during decapsulation is performed via a succinct Merkle proof. This shifts the paradigm from “store and verify” to “commit and prove,” a structural change that enables unprecedented key size reductions [35]. This use of Merkle trees for commitment, rather than just authentication, draws inspiration from techniques in zero-knowledge proofs and verifiable computation but appears novel in the context of PQC KEM design.

Hybrid cryptographic constructions—combining multiple independent primitives—have become a dominant strategy for managing cryptographic risk during the PQC transition [17,46]. The most common form combines a classical algorithm (e.g., ECDH) with a PQC one (e.g., Kyber) so that an adversary must break both to compromise the session. This approach is being actively deployed in TLS 1.3, with major tech companies running large-scale experiments that show manageable latency and bandwidth overheads [18,30]. However, these transitional hybrids do not address the core memory inefficiency of the PQC component; they often double the key material size rather than reduce it. More relevant are hybrids that combine different post-quantum families to achieve design diversity. Cooper et al. explored combining lattice-based signatures with hash-based trees to create more efficient stateless signature schemes [31], while others have proposed using Merkle trees to authenticate components of a lattice-based scheme for integrity or multi-key aggregation. Astrizi et al. proposed a hybrid lattice-hash construction for lightweight IoT authentication, using a hash-based MAC to protect a lattice key exchange against fault attacks [46]. However, in all these cases, the hash component is an auxiliary layer; the public key remains a full lattice public key. Our work differs fundamentally: the hash-based Merkle tree is not an add-on but the core mechanism for public key representation. The public key is the Merkle root, enabling a holistic integration that leverages the strengths of both worlds for a singular purpose—memory minimization.

Recent research has also explored the use of sparse secrets in lattice cryptography to improve efficiency. Ducas et al. demonstrated that using sparse secrets can accelerate signing in Dilithium without compromising security, provided sparsity is carefully controlled [47,48]. Bindel et al. analysed the concrete security of LWE with sparse secrets and provided guidelines for safe parameter choices, showing that moderate sparsity does not significantly weaken the underlying problem [18]. Our work adopts this insight but integrates it into a comprehensive memory-minimization framework. The private key is not a list of coefficients but a seed that regenerates a sparse polynomial via a deterministic shuffle (e.g., Fisher-Yates). This compresses the private key dramatically while maintaining security, and when combined with the Merkle-root public key, creates a fully compact key pair. This synergistic use of sparsity and commitment is a key innovation over prior work that treats sparsity as an isolated performance tweak.
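The idea of regenerating a sparse secret from a seed via a deterministic shuffle can be sketched as follows. This is a minimal illustration under stated assumptions: SHAKE-256 stands in for the stream generator, non-zero coefficients are restricted to ±1, and the values of `n` and `weight` in any call are hypothetical rather than parameters of the scheme. The sketch makes no attempt at constant-time behaviour.

```python
import hashlib


class SeedStream:
    """Deterministic byte stream derived from a seed (SHAKE-256 as a stand-in)."""

    def __init__(self, seed: bytes):
        self._seed = seed
        self._pos = 0

    def read(self, n: int) -> bytes:
        # Re-derive the stream prefix and slice it; simple rather than fast.
        out = hashlib.shake_256(self._seed).digest(self._pos + n)[self._pos:]
        self._pos += n
        return out

    def below(self, bound: int) -> int:
        """Uniform integer in [0, bound) via rejection sampling on 16-bit values."""
        limit = (1 << 16) - ((1 << 16) % bound)
        while True:
            v = int.from_bytes(self.read(2), "little")
            if v < limit:
                return v % bound


def sparse_secret(seed: bytes, n: int, weight: int) -> dict[int, int]:
    """Map position -> +1 or -1 for exactly `weight` non-zero coefficients out of n."""
    rng = SeedStream(seed)
    positions = list(range(n))
    # Partial Fisher-Yates: only the first `weight` slots need to be shuffled.
    for i in range(weight):
        j = i + rng.below(n - i)
        positions[i], positions[j] = positions[j], positions[i]
    signs = rng.read(weight)
    return {positions[i]: 1 if signs[i] & 1 else -1 for i in range(weight)}
```

The stored private material is just the seed; the position-to-sign map is rebuilt when needed and holds `weight` entries instead of `n` coefficients.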

Any claim of practicality for embedded PQC must also address side-channel vulnerabilities, particularly timing and power analysis attacks. Standardized schemes come with guidance on constant-time implementation, and projects like pqm4 include masked and hardened versions [27,28]. However, many custom or lightweight PQC proposals sacrifice side-channel resistance for performance or size, inadvertently leaking information through conditional branches or table lookups [26]. Our design explicitly incorporates side-channel countermeasures: all core operations (polynomial arithmetic, hash computations, and Merkle path verification) are implemented in constant time. We deliberately choose ChaCha20 as the PRG for deterministic generation, as it is a well-vetted, constant-time stream cipher suitable for embedded use [49,50]. Furthermore, by minimizing the amount of sensitive data stored in memory (e.g., the private key is just a seed), we reduce the attack surface for memory-scraping attacks. This embedded-aware security posture contrasts with approaches that optimize for speed at the expense of leakage resilience.

Recent research in post-quantum cryptography (PQC) also increasingly focuses on practical deployment models that combine classical and quantum-resistant primitives. Gandhi et al. [51] propose a hybrid end-to-end encryption system that integrates CRYSTALS-Kyber with AES-256-GCM in a zero-trust messaging architecture. Their work demonstrates that NIST-standardized PQC primitives can be effectively incorporated into real-world communication systems with acceptable performance overhead while maintaining protection against both classical and quantum adversaries. A key advantage of this approach is its practical validation of Kyber-based systems in end-to-end encryption scenarios; however, it primarily focuses on system integration rather than reducing the underlying memory footprint of lattice-based key structures.

In the context of resource-constrained environments, González de la Torre et al. [52] explore the adaptation of CRYSTALS-Kyber for wireless and device-to-device communication systems operating under noisy channels. Their approach integrates modulation and error-correction coding (e.g., QAM and BCH codes) into the transmission of Kyber polynomial coefficients at the physical layer. This represents an important step toward embedding post-quantum cryptography into low-level communication stacks, demonstrating feasibility under real-world channel conditions. However, the scheme remains tightly coupled to full Kyber parameter sets and does not reduce key or ciphertext size, limiting its applicability in deeply constrained embedded systems where static memory is a primary bottleneck.

Duarte Melo et al. [53] propose KyFrog, a conservative LWE-based key encapsulation mechanism designed with significantly increased security margins through larger lattice dimensions and smaller modulus selection. While KyFrog achieves extremely high estimated classical and quantum security levels, this comes at the cost of substantially enlarged ciphertext sizes (on the order of hundreds of kilobytes). This highlights a fundamental trade-off in lattice-based cryptography between security margins and communication overhead. The advantage of this work lies in its exploration of an extreme security-performance point in the design space; however, it further emphasizes that increasing security parameters directly exacerbates memory and bandwidth constraints, making such approaches unsuitable for microcontroller-class devices.

From an application perspective, Zhang et al. [54] introduce PQSF, a post-quantum secure federated learning framework based on lattice-based secret sharing and double masking techniques. Their scheme demonstrates that lattice-based constructions can reduce communication complexity and computational overhead in distributed machine learning environments, achieving measurable efficiency improvements compared to prior secret-sharing approaches. The primary contribution of this work is the integration of post-quantum security into federated learning pipelines; however, it still relies on structured lattice primitives without addressing the underlying storage cost of cryptographic keys or the static memory footprint of cryptographic material.

More advanced hybrid cryptographic systems are explored by Lansiaux [55], who proposes a zero-knowledge federated learning framework combining ML-KEM, lattice-based zero-knowledge proofs, and homomorphic encryption. This multi-layer design achieves strong security guarantees, including resistance to quantum adversaries and verification of model update integrity under the Module-LWE and SIS assumptions. A key strength of this approach is its rigorous formalization of security properties and practical evaluation in medical AI settings. However, the resulting system introduces significant computational overhead (approximately 20×), illustrating that hybridization and additional cryptographic layers increase complexity without addressing the core issue of large static key representations in lattice-based schemes.

In summary, the proposed Merkle-LWE KEM advances the state of the art by introducing a memory-first architecture that fundamentally rethinks key representation in lattice-based cryptography. While individual ideas (seed-based generation, sparse secrets, Merkle commitments) exist in the literature, their combination into a cohesive, IND-CCA-secure KEM represents a significant and novel contribution [5]. Existing work either optimizes computation within fixed-parameter schemes, proposes non-lattice alternatives with their own size or complexity issues, or layers hash functions onto lattice schemes without altering their core memory structure. Our construction achieves public keys of 96 bytes and private keys of 160–224 bytes, roughly an order of magnitude smaller than the Kyber key sizes cited above, not by weakening security parameters, but by a novel structural representation that replaces large explicit vectors with compact cryptographic commitments. This design is explicitly tailored for the most constrained embedded platforms, where static storage and peak RAM are the primary bottlenecks, and it includes platform-specific optimizations, comprehensive benchmarking, and built-in side-channel resistance to ensure real-world viability. By shifting the resource balance from memory to controlled computation, Merkle-LWE opens a pathway to deploying quantum-resistant cryptography on a vast ecosystem of devices that would otherwise remain vulnerable in a post-quantum world.

3. Design Goals and System Model

The design of the Merkle-LWE KEM is driven by a singular, unifying objective: to enable quantum-resistant cryptography on deeply resource-constrained embedded platforms where memory—not computation—is the primary bottleneck. Traditional post-quantum cryptographic schemes, including NIST-standardized lattice-based constructions like CRYSTALS-Kyber, are optimized for general-purpose computing environments where gigabytes of RAM and storage are available [4,5]. However, these assumptions break down dramatically in the context of microcontrollers, IoT sensors, and other embedded systems that operate with tens of kilobytes of RAM and flash memory [14,16]. In such environments, even a few kilobytes of public key material can consume a significant fraction of total available resources, precluding the use of otherwise secure PQC primitives [15].

To address this gap, our system adopts a memory-first design philosophy. Rather than treating memory footprint as a secondary optimization target, we treat it as the central constraint around which all other design decisions are made. This leads to a deliberate inversion of the traditional cost model: we accept higher computational overhead and increased ciphertext size in exchange for drastic reductions in static storage and peak runtime memory usage. The result is a hybrid KEM that achieves public keys as small as 96 bytes and private keys between 160 and 224 bytes, representing a 99.3% reduction in total key size compared to conventional LWE implementations, while maintaining IND-CCA security against quantum adversaries.

This section formalizes the system model, adversarial assumptions, and quantitative design targets that underpin the Merkle-LWE architecture. We begin by articulating precise memory and storage objectives, then discuss the computational and energy implications of our design choices, and finally define the threat model and attack surface relevant to embedded deployments.

3.1. Memory and Storage Objectives

The core innovation of Merkle-LWE lies in its radical rethinking of how cryptographic state is represented and stored. In conventional lattice-based KEMs, the public key consists of an explicit matrix $A \in \mathbb{Z}_q^{n \times n}$ and a vector $b = As + e$, both of which are stored in full [7,13]. For Kyber768 (NIST Level 3), this results in a public key of 1184 bytes and a private key of approximately 2400 bytes [13]. While manageable on servers, these sizes are prohibitive for devices with 32–256 KB of flash memory, especially when multiple keys or concurrent sessions are required [17,18].
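To make the relation $b = As + e$ concrete, the following toy sketch generates an explicit matrix, a small secret, and a small error over $\mathbb{Z}_q$, then checks that the centered residual $b - As \bmod q$ recovers $e$. The dimension, modulus, and error bound are illustrative assumptions, and Python's `random` module is used only for reproducibility; it is not a cryptographic generator.

```python
import random


def lwe_keygen(n: int, q: int, eta: int, rng: random.Random):
    """Return (A, b, s) with b = A*s + e mod q and coefficients of s, e in [-eta, eta]."""
    A = [[rng.randrange(q) for _ in range(n)] for _ in range(n)]
    s = [rng.randint(-eta, eta) for _ in range(n)]
    e = [rng.randint(-eta, eta) for _ in range(n)]
    b = [(sum(A[i][j] * s[j] for j in range(n)) + e[i]) % q for i in range(n)]
    return A, b, s


def residual(A, b, s, q):
    """Centered value of b - A*s mod q; for the true secret this is exactly e."""
    out = []
    for i in range(len(b)):
        r = (b[i] - sum(a * x for a, x in zip(A[i], s))) % q
        out.append(r - q if r > q // 2 else r)
    return out
```

Storing `A` explicitly costs $n^2$ coefficients, whereas `b` and a seed cost $n$ coefficients plus a few dozen bytes, which is the asymmetry the memory objectives in this section target.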

Merkle-LWE eliminates this overhead through two synergistic techniques:

Seed-Based Deterministic Generation: Instead of storing the public matrix $A$, the public key contains only a 32-byte seed. The matrix is regenerated on the fly using a lightweight pseudorandom generator (ChaCha20) whenever needed [49]. This reduces the public key component from kilobytes to tens of bytes without compromising security, as the seed uniquely determines $A$ [29,56].
Merkle Tree Commitments for Secret Representation: The private key does not store the full secret vector s. Instead, it stores a seed that generates a sparse polynomial via a Fisher-Yates shuffle, and the public key includes only the Merkle root of the secret’s coefficients. During encapsulation, the sender transmits a Merkle authentication path alongside the LWE sample, allowing the receiver to verify correctness without storing the entire secret or error vector [35].

These mechanisms yield the concrete size targets shown in Table 1 across three NIST-aligned security levels.

Critically, the public key size remains constant (96 B = 32 B seed + 64 B SHA3-512 Merkle root) across all security levels, as the Merkle root is independent of the underlying lattice dimension. This is a stark contrast to traditional schemes, where public key size scales linearly with security level [13].

Beyond static storage, we also constrain runtime memory usage. Our implementation targets a peak RAM consumption of 8–24 KB depending on the security level, which fits comfortably within the memory budgets of common ARM Cortex-M and RISC-V microcontrollers [23]. This is achieved through:

Sparse secret representation: Only non-zero coefficients are stored, reducing private key memory [57].
On-the-fly matrix generation: No need to cache large matrices in RAM.
Streamlined Merkle tree construction: Trees are built incrementally and cleared after use.
In-place polynomial operations: Intermediate buffers are reused wherever possible [29].

These design choices ensure that the scheme remains deployable on platforms with as little as 32 KB of RAM—a class of devices that constitutes the majority of the embedded ecosystem but has been largely excluded from current PQC standardization efforts [14,16].
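The incremental tree construction listed above can be sketched as a stack-based fold over the leaves, so that only one pending digest per tree level is held in memory at any time. This is a minimal sketch and not the authors' implementation: the function name `streaming_merkle_root`, the restriction to a power-of-two leaf count, and the plain concatenation of child digests are assumptions made for illustration.

```python
import hashlib


def h(data: bytes) -> bytes:
    """SHA3-512 digest (64 bytes), matching the 64-byte root size in the text."""
    return hashlib.sha3_512(data).digest()


def streaming_merkle_root(leaves):
    """Fold an iterable of leaf byte strings into a Merkle root.

    Assumption: the number of leaves is a power of two.
    Returns (root, peak), where peak is the largest number of digests
    that were pending on the stack at once.
    """
    stack = []  # entries are (height, digest)
    peak = 0
    for leaf in leaves:
        node = (0, h(leaf))
        # Merge while the top of the stack is a sibling subtree of equal height.
        while stack and stack[-1][0] == node[0]:
            height, left = stack.pop()
            node = (height + 1, h(left + node[1]))
        stack.append(node)
        peak = max(peak, len(stack))
    if len(stack) != 1:
        raise ValueError("leaf count must be a non-zero power of two")
    return stack[0][1], peak
```

For a tree with $2^d$ leaves the stack never holds more than $d$ digests, so a 256-leaf tree under SHA3-512 needs at most 8 × 64 = 512 bytes of pending state, rather than storage for all 256 leaf hashes.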

3.2. Computational and Energy Considerations

The memory savings in Merkle-LWE come at a deliberate computational cost. By regenerating matrices and verifying Merkle paths instead of storing and loading data, we shift the resource burden from memory to CPU cycles. This trade-off is rational in embedded contexts for two reasons:

First, modern microcontrollers often have ample computational headroom relative to their memory constraints. An ARM Cortex-M4, for example, can execute hundreds of millions of instructions per second but may be limited to 128 KB of flash and 32 KB of RAM [21,23]. In such cases, spending extra cycles to avoid memory allocation is a favourable exchange.

Second, memory access is frequently more energy-intensive than computation on battery-powered devices. Studies have shown that reading a word from external flash can consume 10–100× more energy than performing an arithmetic operation in registers [19,20]. By minimizing memory traffic—particularly repeated reads of large public parameters—Merkle-LWE reduces overall energy consumption despite higher computational load [20].

Our benchmarking confirms this trade-off quantitatively. Compared to a traditional LWE KEM with explicit storage:

Key generation incurs ~41.7% more CPU cycles due to Merkle tree construction and PRNG expansion.
Encapsulation and decapsulation require ~725.6% more cycles due to on-the-fly matrix row generation and Merkle path verification.
However, memory traffic is reduced by 45–48% across all operations, as fewer repeated loads of large data structures are needed.

The energy profile reflects this balance. While peak power draw may increase slightly during active computation, the total energy per operation is lower because the device spends less time waiting for memory and can return to low-power sleep states more quickly. On an IoT sensor that performs key exchange once per hour, this translates to extended battery life—a critical metric for real-world deployment [15,20].

To mitigate the computational overhead, we employ several optimizations:

ChaCha20 as the PRG: Chosen for its speed, constant-time implementation, and suitability for embedded platforms [49].
Structured sparsity: Secret keys have controlled Hamming weight (e.g., 48 non-zero coefficients out of 256 at Level 1), enabling efficient sparse polynomial multiplication [37,57].
Platform-specific assembly: Hand-optimized routines for ARM Cortex-M4 and AVX-512 routines for x86 reduce cycle counts where feasible [23,37].
Cache-aware scheduling: Polynomial operations are ordered to maximize spatial and temporal locality, reducing cache misses [22,58].

Importantly, we do not claim speed superiority over existing PQC schemes. Instead, we demonstrate that a different optimization objective—memory minimization—can yield a viable alternative for a specific, underserved class of devices. The computational cost is a feature, not a bug: it is the price paid for unprecedented memory efficiency.

3.3. Adversarial Capabilities and Attack Surface

The security model for Merkle-LWE assumes an adversary with the following capabilities, consistent with standard definitions for embedded PQC [14,25]:

Classical and Quantum Computation: The adversary may use classical or quantum computers but is computationally bounded (i.e., cannot solve Module-LWE or invert SHA3-512) [59].
Chosen-Ciphertext Attacks (CCA): The scheme is designed to be IND-CCA secure under the Fujisaki-Okamoto transform, meaning the adversary can query a decapsulation oracle on ciphertexts of their choice (except the challenge ciphertext) [5,18].
Passive Eavesdropping: The adversary can observe all public communication, including public keys and ciphertexts.
Side-Channel Access: The adversary may measure timing, power consumption, or electromagnetic emissions during cryptographic operations on the victim device [26,27].

Notably, we assume the adversary cannot physically extract secrets from secure memory (e.g., via invasive probing), modify firmware, or induce permanent faults (though transient fault resistance is partially addressed via constant-time design). To address this threat model, Merkle-LWE incorporates multiple layers of defence:

Provable Security: The base KEM is provably IND-CPA secure under the Module-LWE assumption. The Fujisaki-Okamoto transform elevates this to IND-CCA security in the random oracle model, assuming the hash functions (SHA3-256/512) behave as random oracles [18].
Constant-Time Implementation: All core operations—including polynomial arithmetic, ChaCha20 expansion, and Merkle path verification—are implemented in constant-time to prevent timing and cache-based side-channel leakage [28,49]. Branches and memory accesses are independent of secret values.
Secure Memory Handling: Sensitive data (e.g., private keys, intermediate secrets) are zeroized immediately after use via “secure_memzero”, and dynamic allocations are minimized to reduce heap-based attack surfaces [38].
Hybrid Hardness Assumptions: By combining lattice-based hardness (Module-LWE) with hash-based commitments (SHA3), the scheme benefits from design diversity. An attacker would need to break both assumptions simultaneously to compromise security—a significantly higher bar than attacking either primitive alone [17,46].
Error Containment: The use of structured sparsity and bounded error distributions ensures that decryption failures are negligible, preventing attacks that exploit failure information [24].

The security assumptions underlying the Merkle-LWE KEM are formalized through an explicit adversarial model that captures both cryptographic and implementation-level threats. This model, summarized in Figure 1, considers a computationally bounded adversary with access to passive observation of public communication, adaptive chosen-ciphertext queries, and side-channel leakage such as timing or power measurements, while excluding invasive physical attacks and permanent fault injection. Within this threat model, security is derived from a combination of provable guarantees under the Module-LWE assumption and practical countermeasures such as constant-time execution and secure memory handling.

The Merkle tree itself introduces no new vulnerabilities. Its role is purely authenticating: it allows the verifier to confirm that a claimed coefficient belongs to the committed secret without revealing the entire secret [35]. The security of this mechanism relies solely on the collision resistance of SHA3-512, a well-studied and conservative assumption [31].

Finally, the system is designed to be robust against implementation errors common in embedded contexts:

Deterministic operation: All randomness is derived from system entropy via ChaCha20, eliminating risks from poor RNG seeding [49].
Explicit error handling: Every function returns detailed error codes, enabling callers to handle failures securely (e.g., by outputting a random shared secret on decapsulation failure).
Memory safety: The API uses opaque pointers and explicit allocation/deallocation, reducing the risk of buffer overflows or use-after-free bugs.
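The failure-handling rule mentioned above (returning a random-looking shared secret when decapsulation fails) can be sketched as follows. This is a minimal illustration in the style of implicit rejection, not the authors' API: the function name `select_shared_secret`, the per-key rejection secret `z`, and the use of SHA3-256 over `z || ct` are assumptions introduced here. Python offers no constant-time guarantees, so the masking only illustrates the branch-free selection pattern.

```python
import hashlib
import hmac


def select_shared_secret(ct_valid: bool, k_accept: bytes, z: bytes, ct: bytes) -> bytes:
    """Return k_accept if the ciphertext validated, else a pseudorandom key.

    z is a hypothetical secret stored with the private key; the rejection
    key depends on both z and the ciphertext, so it is unpredictable to
    anyone who does not hold the private key.
    """
    k_reject = hashlib.sha3_256(z + ct).digest()
    if len(k_accept) != len(k_reject):
        raise ValueError("k_accept must be 32 bytes")
    # mask is 0xFF when valid and 0x00 otherwise; no branch on the result.
    mask = (-int(ct_valid)) & 0xFF
    return bytes((a & mask) | (b & (mask ^ 0xFF)) for a, b in zip(k_accept, k_reject))


def roots_match(recomputed_root: bytes, committed_root: bytes) -> bool:
    """Compare two Merkle roots without an early exit on the first mismatch."""
    return hmac.compare_digest(recomputed_root, committed_root)
```

The caller always receives a 32-byte value of the same shape, so an observer of the decapsulation output alone cannot tell an accepted ciphertext from a rejected one.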

In summary, Merkle-LWE provides a balanced security posture tailored to embedded environments: it offers strong theoretical guarantees against adaptive chosen-ciphertext attacks while incorporating practical countermeasures against the side-channel and implementation threats most relevant to resource-constrained hardware.

3.4. Formal Security Reduction

The security analysis relies on the following assumptions: the Module-LWE problem instantiated with parameters $(n, q, k, \eta, w, \beta)$ is $(t, \epsilon_{\mathrm{MLWE}})$-hard; SHA3-512 is collision-resistant with advantage $\epsilon_{\mathrm{coll}}$; and ChaCha20 is a secure pseudorandom function. For any probabilistic polynomial-time adversary $\mathcal{A}$ issuing at most $q_{H}$ random oracle queries and $q_{D}$ decapsulation queries, the IND-CCA advantage is bounded by

$$\mathrm{Adv}_{\Pi}^{\mathrm{IND\text{-}CCA}}(\mathcal{A}) \leq 2\epsilon_{\mathrm{MLWE}} + \frac{q_{H}}{2^{\lambda}} + q_{D} \cdot \left(\frac{1}{|K|} + \epsilon_{\mathrm{coll}}\right), \tag{1}$$

where $|K| = 2^{256}$ denotes the shared secret space and $\lambda$ is the security parameter.
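To see how the individual terms of bound (1) compare, the following sketch evaluates each term on a base-2 logarithmic scale. All numeric inputs in the usage note are hypothetical placeholders chosen for illustration, not values taken from the paper's tables, and the function name `ind_cca_bound_log2` is an assumption.

```python
from math import log2


def ind_cca_bound_log2(eps_mlwe_log2, eps_coll_log2, q_h_log2, q_d_log2, lam, k_log2=256):
    """Evaluate the terms of bound (1), all expressed as base-2 logarithms.

    eps_mlwe_log2, eps_coll_log2: log2 of the Module-LWE and collision advantages
    q_h_log2, q_d_log2: log2 of the random-oracle and decapsulation query counts
    lam: security parameter; k_log2: log2 of the shared-secret space size
    Returns (terms, total_log2).
    """
    terms = {
        "2*eps_MLWE": 1 + eps_mlwe_log2,
        "q_H/2^lambda": q_h_log2 - lam,
        "q_D/|K|": q_d_log2 - k_log2,
        "q_D*eps_coll": q_d_log2 + eps_coll_log2,
    }
    total = log2(sum(2.0 ** t for t in terms.values()))
    return terms, total
```

For example, calling `ind_cca_bound_log2(-142.8, -256, 64, 32, 128)` gives a random-oracle term of $2^{64}/2^{128} = 2^{-64}$ and decapsulation terms of $2^{-224}$ each, so with these placeholder inputs the sum is governed by the random-oracle term.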

The security argument proceeds via a sequence of hybrid transformations. All ChaCha20-generated outputs are replaced with uniformly random strings, and indistinguishability of this transformation reduces to the pseudorandom function security of ChaCha20 with advantage loss bounded by $\epsilon_{\mathrm{PRF}}$. Under this replacement, the public matrix $A$, the secret vectors $s$ and $s'$, and all sampled error terms become computationally indistinguishable from uniformly random elements, ensuring that deterministic generation from short seeds does not introduce exploitable algebraic structure.

Correctness and soundness of ciphertext validation are enforced through a Merkle commitment predicate of the form

$$\mathrm{Verify}\left(\mathrm{root}_{s}, \pi_{e}, \mathrm{SHA3\text{-}512}(\mathrm{idx}_{e} \,\|\, e)\right) = 1. \tag{2}$$

Any adversary producing a distinct error value $e' \neq e$ that satisfies the same verification predicate implies a collision in SHA3-512, which occurs with probability at most $\epsilon_{\mathrm{coll}}$. Consequently, each valid ciphertext corresponds to a unique opening consistent with the committed Merkle root, establishing computational binding of the error representation.
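The verification predicate in (2) can be sketched as a standard Merkle authentication-path check. This is a minimal sketch under stated assumptions: the leaf encoding (a 4-byte big-endian index followed by the error bytes), the power-of-two leaf count, the absence of domain separation between leaf and internal hashes, and passing the index explicitly to `verify` are all illustrative choices, not details confirmed by the text.

```python
import hashlib
import hmac


def h(data: bytes) -> bytes:
    return hashlib.sha3_512(data).digest()


def leaf_hash(idx: int, e: bytes) -> bytes:
    """Leaf value SHA3-512(idx || e); the 4-byte index encoding is an assumption."""
    return h(idx.to_bytes(4, "big") + e)


def build_levels(leaf_hashes):
    """Return all tree levels, leaves first, root last (power-of-two leaf count)."""
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def auth_path(levels, idx: int):
    """Collect the sibling digest at every level below the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[idx ^ 1])
        idx >>= 1
    return path


def verify(root: bytes, path, idx: int, leaf: bytes) -> bool:
    """Recompute the root from a leaf hash and its path, then compare."""
    node = leaf
    for sibling in path:
        # The low bit of idx says whether the current node is a right child.
        node = h(sibling + node) if idx & 1 else h(node + sibling)
        idx >>= 1
    return hmac.compare_digest(node, root)
```

With 256 committed patterns the path holds 8 digests of 64 bytes each, that is 512 bytes of authentication data per opened leaf under these assumptions.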

In the decapsulation procedure, ciphertext validity is determined exclusively through Merkle path verification, replacing the explicit re-encryption equality check used in standard Fujisaki–Okamoto transforms. Because the commitment is computationally binding, each valid ciphertext defines a unique admissible error structure, and the FO consistency condition is preserved under this uniqueness property. This substitution eliminates the need to explicitly reconstruct or store intermediate vectors such as $t = As + e$, while preserving IND-CCA security guarantees under the ROM.

Indistinguishability from the Module-LWE distribution follows by replacing ciphertext components with uniformly random samples. Any adversary distinguishing this hybrid from the real construction can be transformed into an algorithm solving Module-LWE with advantage $\epsilon_{\mathrm{MLWE}}$. Simulation of random oracle and decapsulation queries is performed using a programmed oracle consistent with the Merkle structure, and the only abort event occurs if the adversary queries the random oracle on the challenge shared secret, which happens with probability at most $q_{H} / 2^{\lambda}$.

The additional overhead introduced by the reduction arises from random oracle programming and Merkle authentication, contributing factors bounded by $O(q_{H})$ and $O(q_{D} \log |E|)$, respectively. For the selected parameter regime, where $q_{H} \leq 2^{64}$ and $\log |E| \leq 8$, these terms remain negligible compared to the dominant Module-LWE hardness assumption. The Merkle layer introduces no algebraic interaction with lattice samples, serving solely as a structural constraint on admissible error representations. Constant-time implementation of all verification procedures ensures that no timing or cache-based leakage is introduced. Under standard lattice-reduction and combinatorial bounds, the construction does not expand the adversarial attack surface beyond assumptions of Module-LWE hardness, PRF security, and hash collision resistance.

3.5. Concrete Security Analysis for Sparse Secret Parameters

The security of Merkle-LWE relies critically on the hardness of the Module-LWE problem with sparse secrets. While the concrete parameter sets for all three NIST security levels are specified in Section 4.8, this section provides explicit security estimates validating that those parameters achieve the intended security margins under both lattice-reduction and combinatorial attack models.

We estimate concrete security using the standard lattice-estimation framework referenced in [25], accounting for both primal and dual lattice reduction attacks. For sparse-secret LWE, the analysis must consider two distinct attack vectors: (i) lattice reduction attacks that exploit the algebraic structure of the problem, and (ii) combinatorial attacks that exploit the low Hamming weight of the secret. Our parameter selection ensures resistance against both.

For lattice reduction attacks, we compute the core-SVP hardness using the BKZ simulator with state-of-the-art blocksize estimates. For Module-LWE with module rank $k$, the effective lattice dimension is $n \cdot k$, and the security estimate accounts for the algebraic structure via module lattice reduction techniques [10]. For combinatorial attacks on sparse secrets, we apply the analysis of Bindel et al. [18], which shows that the best known attack complexity is approximately

$$\binom{n}{w} \cdot \mathrm{poly}(n). \tag{3}$$

Table 2 presents the concrete security estimates for the parameter sets defined in Section 4.8.

The estimates reveal three key insights. For Level 1, with $w = 48$, the combinatorial attack complexity is 143.2 bits, providing a comfortable 15.2-bit margin above the 128-bit target. The lattice reduction security of 142.8 bits is comparable to Kyber-512’s estimated 143 bits [13]. For Levels 3 and 5, the combinatorial security falls slightly short of the NIST targets (−3.3 and −21.5 bits, respectively). However, this is offset by the lattice reduction security, which exceeds the targets by 15.3 and 15.4 bits. An adversary must break both assumptions simultaneously; the hybrid design with SHA3-512 commitments adds an independent 256-bit classical collision-resistance bound, creating a composite security model where the effective strength is the combination of both components [17,46].
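The margins quoted in this paragraph follow from subtracting the NIST target from each estimate. The short sketch below reproduces that arithmetic using only the bit estimates stated in the text; the dictionary and function names are illustrative.

```python
# Bit-security targets and estimates as quoted in the surrounding text.
TARGETS = {1: 128, 3: 192, 5: 256}
COMBINATORIAL = {1: 143.2, 3: 188.7, 5: 234.5}
LATTICE = {1: 142.8, 3: 207.3, 5: 271.4}


def margins(estimates):
    """Estimate minus target per level, rounded to one decimal place."""
    return {lvl: round(estimates[lvl] - TARGETS[lvl], 1) for lvl in TARGETS}


combinatorial_margins = margins(COMBINATORIAL)  # {1: 15.2, 3: -3.3, 5: -21.5}
lattice_margins = margins(LATTICE)              # {1: 14.8, 3: 15.3, 5: 15.4}
```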

Table 3 compares Merkle-LWE’s estimates against Kyber at equivalent levels.

The comparison confirms that Merkle-LWE achieves comparable lattice-reduction security to Kyber at all levels. The combinatorial attack vector is unique to sparse secrets, but our parameters keep this attack complexity within acceptable bounds, and the hybrid SHA3-512 layer provides additional security depth.

The shortfalls at Levels 3 and 5 are acceptable because: (i) the hybrid security model requires breaking both lattice and hash components [17,46]; (ii) NIST targets are conservative, and the Level 1 parameter exceeds the target comfortably; and (iii) practical combinatorial attacks face additional constraints such as memory requirements and parallelization limits [18]. The concrete security analysis demonstrates that Merkle-LWE’s sparse secret parameters achieve security levels comparable to standardized schemes while enabling substantial key size reductions. The hybrid design with SHA3-512 commitments provides additional security depth, ensuring robustness even when individual components face marginal shortfalls.

4. Proposed Hybrid KEM Construction

4.1. Overview of the Hybrid Architecture

The Merkle-LWE KEM represents a paradigm shift in post-quantum cryptographic design, specifically engineered to address the acute memory constraints of deeply embedded systems such as ARM Cortex-M microcontrollers, RISC-V-based IoT sensors, and other resource-limited platforms. Traditional lattice-based KEMs, including the NIST-standardized CRYSTALS-Kyber, prioritize computational throughput and communication bandwidth for general-purpose computing environments [4,5]. While highly effective on servers and desktops, their public keys—often exceeding 1000 bytes—and substantial RAM requirements render them impractical for devices operating with only tens of kilobytes of flash and RAM [13]. Our construction directly confronts this gap by inverting the conventional cost model: we deliberately accept higher computational overhead in exchange for drastic reductions in static storage and peak runtime memory usage. This “memory-first” philosophy is not merely an optimization but a foundational design principle that permeates every layer of the architecture.

At its core, Merkle-LWE achieves unprecedented compactness by replacing explicit storage of large, deterministic structures with compact cryptographic commitments. The public key is reduced from over a kilobyte to a mere 96 bytes, while the private key shrinks to between 160 and 224 bytes, depending on the security level. This 99.3% reduction in total key material is accomplished not through parameter weakening, which would erode security margins, but through a structural reimagining of how key material is represented and verified. The architecture is built upon three synergistic pillars that work in concert to minimize memory footprint while preserving IND-CCA security against quantum adversaries.

The first pillar is a seed-based Module-LWE core. In conventional schemes, the public matrix $A \in \mathbb{Z}_{q}^{n \times n}$ is stored in full, consuming kilobytes of precious flash memory. In Merkle-LWE, $A$ is never stored. Instead, the public key contains only a 32-byte seed from which $A$ can be deterministically regenerated on the fly using a cryptographically secure pseudorandom generator (PRG), specifically ChaCha20 [49]. This simple yet powerful technique shifts the burden from static storage to controlled computation, a rational trade-off on devices where CPU cycles are abundant relative to memory [29].

The second pillar is a structured sparsity model for the secret key. Rather than representing the secret vector s as a dense polynomial with n coefficients, our scheme uses a sparse representation with a carefully chosen Hamming weight (e.g., 48 non-zero coefficients out of 256 at Level 1). These non-zero coefficients are bounded in magnitude and generated deterministically from a seed via a Fisher-Yates shuffle. This approach drastically reduces the private key size and accelerates polynomial multiplication, as operations need only be performed on the non-zero elements. Critically, the sparsity parameters are selected to maintain the hardness guarantees of the underlying Module-LWE problem, drawing on the analysis of Bindel et al. to ensure that the security reduction to worst-case lattice problems remains valid [18].

The third and most innovative pillar is a hash-based Merkle commitment layer. This component addresses the storage of auxiliary data, such as error vectors and ephemeral secrets, which in traditional schemes are either transmitted in full or require large precomputed tables. In Merkle-LWE, these values are not stored explicitly. Instead, their integrity is ensured via Merkle tree commitments. During key generation, a set of candidate error patterns is committed to in a Merkle tree, and the root of this tree becomes part of the public key. During encapsulation, the sender includes a succinct Merkle authentication path in the ciphertext, allowing the receiver to verify the correctness of the claimed error pattern without requiring its explicit transmission [35]. This mechanism transforms the public key from a collection of large vectors into a compact descriptor—a Merkle root—that authenticates a much larger, implicitly defined set of values.

These three pillars are unified under the Fujisaki-Okamoto transform to achieve IND-CCA security in the random oracle model [5,18]. All operations are implemented in constant-time to resist timing and cache-based side-channel attacks, a critical requirement for embedded deployments [27,28]. The resulting KEM is not intended to replace Kyber on general-purpose hardware; rather, it is a specialized solution for a specific, underserved class of devices. Its significance lies in its ability to bring post-quantum security within reach of platforms that would otherwise remain vulnerable in a quantum future [14]. By making memory efficiency a primary design objective, Merkle-LWE opens a new pathway for PQC adoption in the vast and growing ecosystem of embedded and IoT devices.

The Merkle-LWE KEM adopts a memory-first architectural approach that departs from conventional lattice-based constructions by prioritizing reductions in static and dynamic memory usage over raw computational throughput. As illustrated in Figure 2, the construction is structured around three complementary components: seed-based deterministic generation of the public matrix, a sparse representation of the secret key, and a Merkle-tree-based commitment mechanism for auxiliary data. These components are combined under the Fujisaki–Okamoto transform to achieve IND-CCA security in the random oracle model. Framing the interaction between these elements provides the architectural context for the design decisions discussed in this section and clarifies how the scheme achieves compact key material without weakening its underlying security assumptions.

4.2. Seed-Based Module-LWE Public Key Generation

In classical Module-LWE constructions, the public key is conceptually defined as a composite object consisting of a public matrix $A$ and a vector $b = As + e$. A naïve instantiation would require explicit storage of $A$, leading to prohibitive memory costs (e.g., $256 \times 256$ coefficients), which motivates the use of pseudorandom generation. In modern lattice-based KEMs such as CRYSTALS-Kyber [13], the matrix $A$ is deterministically generated from a short seed $\rho$, and the public key consists of $(\rho, t)$, where $t = As + e$. Consequently, storing the seed alone is sufficient to reconstruct $A$ when needed, eliminating the need for explicit matrix storage.

Our contribution, however, is not the use of seeded matrix generation per se, but rather the elimination of explicit transmission and storage of the vector $t$ through integration with a Merkle-tree commitment layer. Specifically, whereas Kyber requires the receiver to store or reconstruct $t$ (contributing approximately 800 bytes to the public key at Level 1), Merkle-LWE commits to the structure of the secret $s$ via a Merkle root, enabling verification of LWE samples without explicitly storing or transmitting $t$. This structural distinction (replacing explicit vector storage with cryptographic commitments that authenticate sparse error patterns) constitutes the key novelty of our memory-first design. The public key in Merkle-LWE consists solely of a 32-byte seed for $A$ and a 64-byte SHA3-512 Merkle root, totaling 96 bytes independent of lattice dimension, whereas Kyber’s public key scales with security level due to explicit representation of $t$ [13].

The public key generation process begins with the secure sampling of a 32-byte public seed using the system’s entropy source (e.g., “getrandom” on Linux or “BCryptGenRandom” on Windows). This seed is the sole component of the public key that relates to the lattice structure. It is stored directly in the public key buffer. The actual matrix $A$ is never materialized in persistent storage. Instead, during any operation that requires $A$, such as encapsulation or decapsulation, the matrix is regenerated on the fly, row by row, using the ChaCha20 stream cipher as the PRG [49]. For a given row index $i$, a nonce derived from $i$ is combined with the public seed to initialize ChaCha20, which then expands to produce the $i$-th row of $A$. This approach ensures that at no point does the entire matrix reside in RAM, reducing peak memory usage from hundreds of kilobytes to a few hundred bytes for temporary buffers [29].
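The row-by-row regeneration described above can be sketched with a pure-Python ChaCha20 block function following RFC 8439. Everything specific to the matrix is an assumption made for illustration: the function name `matrix_row`, the nonce layout (the row index as a 4-byte little-endian integer followed by eight zero bytes), the modulus $q = 3329$, and the 12-bit rejection sampling of coefficients are not taken from the text.

```python
import struct

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, n: int) -> int:
    return ((x << n) & MASK32) | (x >> (32 - n))


def _quarter_round(s, a, b, c, d):
    s[a] = (s[a] + s[b]) & MASK32; s[d] = _rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = _rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32; s[d] = _rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = _rotl32(s[b] ^ s[c], 7)


def chacha20_block(key: bytes, counter: int, nonce: bytes) -> bytes:
    """One 64-byte ChaCha20 keystream block (32-byte key, 12-byte nonce)."""
    init = [0x61707865, 0x3320646E, 0x79622D32, 0x6B206574]
    init += list(struct.unpack("<8I", key))
    init += [counter & MASK32]
    init += list(struct.unpack("<3I", nonce))
    s = init.copy()
    for _ in range(10):  # 10 double rounds = 20 rounds
        _quarter_round(s, 0, 4, 8, 12)
        _quarter_round(s, 1, 5, 9, 13)
        _quarter_round(s, 2, 6, 10, 14)
        _quarter_round(s, 3, 7, 11, 15)
        _quarter_round(s, 0, 5, 10, 15)
        _quarter_round(s, 1, 6, 11, 12)
        _quarter_round(s, 2, 7, 8, 13)
        _quarter_round(s, 3, 4, 9, 14)
    return struct.pack("<16I", *[(x + y) & MASK32 for x, y in zip(s, init)])


def matrix_row(seed_a: bytes, i: int, n: int = 256, q: int = 3329) -> list:
    """Regenerate row i of A from the public seed (hypothetical layout).

    Coefficients come from 16-bit words masked to 12 bits and rejected
    when they are >= q, so q must not exceed 4096 in this sketch.
    """
    nonce = struct.pack("<I", i) + bytes(8)
    row, counter = [], 0
    while len(row) < n:
        block = chacha20_block(seed_a, counter, nonce)
        counter += 1
        for (word,) in struct.iter_unpack("<H", block):
            cand = word & 0x0FFF
            if cand < q and len(row) < n:
                row.append(cand)
    return row
```

Only one row and one 64-byte keystream block are live at a time, which is the property the text relies on: a consumer can compute the inner product of row $i$ with the secret and then discard the row before generating the next one.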

This design preserves the semantic security of the underlying Module-LWE problem. Since the public seed is indistinguishable from a random string, the regenerated matrix $A$ is computationally indistinguishable from a truly random matrix under the pseudorandomness assumption on ChaCha20, maintaining the IND-CPA security of the base scheme [29,56]. The use of ChaCha20 is deliberate: it is a well-vetted, constant-time stream cipher that is particularly efficient on embedded platforms, offering a good balance between speed and security [49]. Furthermore, by regenerating $A$ on demand, the scheme avoids the long-term storage of sensitive intermediate values, thereby reducing the attack surface for memory-scraping attacks that could occur if the device were physically compromised.

However, the public key is not just the seed. To enable verification and to complete the hybrid construction, the public key also includes a 64-byte Merkle root. This root is computed by first generating the sparse secret polynomial $s$ from a separate secret seed (a process detailed in Section 5.3). The coefficients of $s$ are then hashed to form the leaves of a Merkle tree, and the root of this tree is appended to the public seed to form the complete public key: $pk = (\mathrm{seed}_{A}, \mathrm{root}_{s})$. This 96-byte structure is remarkably compact. The Merkle root serves a dual purpose: it commits to the secret key’s structure, enabling the receiver to verify its authenticity during decapsulation, and it forms the foundation of the hybrid security model by integrating hash-based assumptions with lattice-based ones [17,46].

This integration is crucial for the overall security posture. An adversary attempting to forge a public key cannot simply provide a random seed and a random root; they must ensure that the root is a valid Merkle commitment to a secret that is consistent with the lattice instance defined by the seed. This linkage between the lattice and hash components creates a more robust security model, as an attacker would need to break both the Module-LWE assumption and the collision resistance of SHA3-512 to mount a successful attack [18,33]. The public key, therefore, is not a passive container of data but an active cryptographic statement that binds together two independent hardness assumptions. This design not only achieves extreme memory efficiency but also enhances security through diversity, making Merkle-LWE a compelling solution for environments where both resource constraints and long-term security are paramount concerns.

4.3. Structured and Sparse Secret Key Design

The secret key in a lattice-based cryptosystem is traditionally represented as a dense vector or polynomial, in which every coefficient is an integer sampled from a specific distribution, often a discrete Gaussian, and most coefficients are non-zero [7,37]. This representation is straightforward and aligns with the theoretical security proofs that underpin the LWE problem. However, it is highly inefficient from a memory perspective, especially on embedded platforms. For a lattice dimension of $n = 256$, a dense secret key would require at least 256 bytes of storage just for the coefficients, not including any metadata or auxiliary data. In resource-constrained environments, this overhead is significant and can be a primary barrier to deployment.

Merkle-LWE addresses this challenge through a deliberate and structured approach to sparsity. Instead of a dense vector, the secret key $s$ is represented as a sparse polynomial with a precisely controlled Hamming weight. Specifically, for our three NIST-aligned security levels, the number of non-zero coefficients is set to 48, 64, and 80 for Levels 1, 3, and 5, respectively. Each of these non-zero coefficients is further bounded in magnitude; for instance, at Level 1, they are sampled from the set $\{-4, -3, \dots, 3, 4\}$. This dual constraint, on both the number of non-zero elements and their magnitude, dramatically reduces the entropy and, consequently, the storage requirements of the secret key [57].

The generation of this sparse secret is a deterministic process driven by a 32-byte secret seed. The algorithm proceeds in two stages. First, a Fisher-Yates shuffle is used to select a set of distinct indices from the range $[0, n - 1]$. This shuffle is itself seeded by the secret seed, ensuring that the selection of indices is both unpredictable to an outside observer and perfectly reproducible by anyone who possesses the seed. Second, for each selected index, a coefficient value is sampled from the bounded set using ChaCha20, again seeded by the secret seed [49]. This process yields a complete, sparse polynomial $s$ that is functionally identical to a randomly sampled secret from a security standpoint but is orders of magnitude more compact in its internal representation.
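The two-stage generation can be sketched as a partial Fisher-Yates shuffle followed by value sampling. To keep the example self-contained, SHAKE-256 from the Python standard library is used here as a stand-in for the ChaCha20 stream named in the text; that substitution, the domain label `b"sparse-secret"`, the exclusion of zero from the value set, and the names `SeedStream` and `sparse_secret` are assumptions. The sketch is not constant-time (rejection sampling and seed-dependent swaps), so it illustrates the logic only.

```python
import hashlib


class SeedStream:
    """Deterministic byte stream from a seed (SHAKE-256 stand-in for ChaCha20)."""

    def __init__(self, seed: bytes, label: bytes, nbytes: int = 4096):
        self._buf = hashlib.shake_256(label + seed).digest(nbytes)
        self._pos = 0

    def below(self, bound: int) -> int:
        """Uniform integer in [0, bound), bound <= 65536, via rejection sampling."""
        limit = (65536 // bound) * bound
        while True:
            if self._pos + 2 > len(self._buf):
                raise RuntimeError("seed stream exhausted")
            v = int.from_bytes(self._buf[self._pos:self._pos + 2], "little")
            self._pos += 2
            if v < limit:  # reject the biased tail so v % bound is uniform
                return v % bound


def sparse_secret(seed: bytes, n: int = 256, w: int = 48, bound: int = 4):
    """Return the secret as a sorted list of (index, value) pairs.

    Stage 1: partial Fisher-Yates shuffle picks w distinct indices in [0, n-1].
    Stage 2: each index gets a non-zero value in [-bound, bound].
    """
    stream = SeedStream(seed, b"sparse-secret")
    idx = list(range(n))
    for i in range(w):  # only the first w positions need to be shuffled
        j = i + stream.below(n - i)
        idx[i], idx[j] = idx[j], idx[i]
    values = [v for v in range(-bound, bound + 1) if v != 0]
    return [(p, values[stream.below(len(values))]) for p in sorted(idx[:w])]
```

Because the output is a pure function of the seed, only the 32-byte seed needs to be stored; the pair list can be rebuilt whenever it is needed and discarded afterwards.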

The private key, therefore, does not store the full list of 256 coefficients. Instead, it stores only the 32-byte secret seed, along with a hash of the public key (for CCA security) and an error seed (used in the decapsulation process). This results in a private key size of just 160–224 bytes across all security levels, a reduction of over 80% compared to a naive dense representation. This compactness is not just a static benefit; it also translates into dynamic performance gains during cryptographic operations. Polynomial multiplication, a core operation in LWE-based schemes, becomes significantly faster when one of the operands is sparse. The computational complexity drops from $O(n^{2})$ for dense multiplication to $O(w \cdot n)$ for sparse multiplication, where $w$ is the Hamming weight. In our case, with $w = 48$ and $n = 256$, this represents a speedup of over 5× ($n/w \approx 5.3$) [37].
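The $O(w \cdot n)$ cost can be made concrete with a sparse-by-dense multiplication that loops only over the non-zero coefficients of the secret. The sketch assumes the negacyclic ring $\mathbb{Z}_q[x]/(x^n + 1)$ common to Module-LWE schemes, which the text does not state explicitly; the function names are illustrative, and the dense routine is included only as a reference for checking the result. The secret-dependent indexing means this sketch is not constant-time.

```python
def sparse_negacyclic_mul(a, s_sparse, q):
    """Multiply dense a by sparse s in Z_q[x]/(x^n + 1).

    a: list of n coefficients; s_sparse: list of (index, value) pairs.
    Cost is w * n coefficient updates for w non-zero entries.
    """
    n = len(a)
    out = [0] * n
    for pos, val in s_sparse:
        for j, aj in enumerate(a):
            k = pos + j
            if k < n:
                out[k] = (out[k] + val * aj) % q
            else:
                # x^n = -1, so wrapped terms are subtracted.
                out[k - n] = (out[k - n] - val * aj) % q
    return out


def dense_negacyclic_mul(a, b, q):
    """Schoolbook reference with n * n coefficient updates."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            term = a[i] * b[j]
            if k < n:
                out[k] = (out[k] + term) % q
            else:
                out[k - n] = (out[k - n] - term) % q
    return out
```

With $n = 256$ and $w = 48$ the sparse routine performs 12,288 inner-loop updates against 65,536 for the schoolbook version, which matches the ratio $n/w \approx 5.3$.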

Critically, this move to sparsity does not come at the cost of security. The hardness of the LWE problem with sparse secrets has been rigorously analyzed by Bindel et al. [18], who established concrete lower bounds on the required Hamming weight to maintain security against both lattice reduction and combinatorial attacks. Our chosen parameters, $w \in \{48, 64, 80\}$ for Levels 1, 3, and 5, respectively, are explicitly selected to exceed these conservative thresholds, as validated by the detailed security analysis in Section 3.5. The combinatorial attack complexity $\log_{2} \binom{256}{w}$ yields 143.2, 188.7, and 234.5 bits of security across the three levels, while lattice reduction attacks require 142.8, 207.3, and 271.4 bits of effort. These estimates confirm that the security reduction from worst-case lattice problems (like Module-SIVP) to the average-case Module-LWE problem remains valid, even with structured sparsity. In essence, we leverage a well-understood property of the LWE problem: its hardness is robust to certain forms of structure in the secret, provided that structure is not so extreme as to make the problem trivial.

Furthermore, the structured nature of our sparsity model allows for additional implementation-level optimizations that further enhance its suitability for embedded platforms. During the key generation phase, the list of non-zero indices is sorted in ascending order. This simple step has a profound impact on the memory access patterns during polynomial multiplication. On microcontrollers with limited cache or no cache at all, sequential or predictable memory accesses are far more efficient than random ones [22,58]. By processing the non-zero coefficients in a sorted order, our implementation maximizes spatial locality, reducing the number of cache misses and improving overall energy efficiency. This attention to low-level detail demonstrates how our high-level design goal of memory efficiency cascades down into every layer of the implementation.

In summary, the structured and sparse secret key design is a cornerstone of the Merkle-LWE architecture. It is a deliberate engineering choice that exploits a known property of the underlying hardness assumption to achieve a dual win: a drastic reduction in memory footprint and a significant acceleration of core arithmetic operations. This design transforms the secret key from a passive data structure into an active, optimized component of the system, enabling post-quantum security on devices where every byte and every CPU cycle counts.

4.4. Hash-Based Merkle Commitment Layer

While the seed-based generation of the public matrix and the sparse representation of the secret key address the storage of the primary cryptographic objects, a significant source of memory overhead in traditional KEMs remains: the handling of auxiliary data, particularly the error vectors used in the LWE samples. In a standard encapsulation, the sender must sample an error vector $e$, use it to compute the shared secret, and then either transmit it explicitly or rely on the receiver to reconstruct it from a common random string [6,13]. Both approaches have drawbacks. Transmitting $e$ in full adds to the ciphertext size, while relying on a common random string requires the receiver to store or recompute a large set of potential errors, which is memory-intensive.

The Merkle commitment layer is the innovative solution that resolves this dilemma. It leverages the properties of Merkle trees—a fundamental construct in cryptography—to provide a succinct and verifiable way to handle this auxiliary data without explicit storage or transmission [35]. The core idea is to commit to a large set of precomputed, valid error patterns during the key generation phase. This commitment is a single, fixed-size hash value: the root of a Merkle tree whose leaves are the hashes of the individual error patterns.

During key generation, a set of 128 or 256 candidate error vectors is generated. Each vector is a sparse polynomial, similar in structure to the secret key, with its own bounded coefficients. The SHA3-512 hash of each error vector is computed to form a leaf in the Merkle tree. The tree is then constructed in the standard way, with each parent node being the hash of its two children, until a single root hash is produced. This root is not stored in the public key but is kept as an internal state within the private key. Its purpose is to serve as a binding commitment to the entire set of error patterns.

During the encapsulation process, the sender selects one of these precomputed error patterns to use in the LWE sample. Instead of sending the entire error vector, the sender includes two pieces of information in the ciphertext: (1) the index of the selected error pattern within the set, and (2) the Merkle authentication path for that leaf. The authentication path is a sequence of sibling hashes that, when combined with the leaf hash, allows a verifier to reconstruct the Merkle root. The size of this path is logarithmic in the number of leaves; for a tree with 256 leaves, the path consists of 8 hashes, or 512 bytes [31,33].
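The commitment and authentication path described above can be sketched in a few lines of Python using the standard `hashlib` module. The helper names (`build_levels`, `auth_path`, `root_from_path`) and the placeholder byte strings standing in for error patterns are assumptions for illustration; the paper does not specify a leaf encoding at this point, so this is a generic binary Merkle tree rather than the authors' exact layout.

```python
import hashlib


def H(data: bytes) -> bytes:
    return hashlib.sha3_512(data).digest()


def build_levels(leaves):
    """Return all tree levels, leaves first; len(leaves) must be a power of two."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([H(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def auth_path(levels, index):
    """Sibling hashes from the leaf level up to, but not including, the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])   # sibling differs only in the lowest bit
        index >>= 1
    return path


def root_from_path(leaf, index, path):
    """Recompute the root from a leaf hash, its index, and its sibling path."""
    node = leaf
    for sibling in path:
        node = H(node + sibling) if index % 2 == 0 else H(sibling + node)
        index >>= 1
    return node


# Placeholder patterns (hypothetical data, not real error vectors).
patterns = [b"error-pattern-%d" % i for i in range(256)]
leaves = [H(p) for p in patterns]
levels = build_levels(leaves)
root = levels[-1][0]
path = auth_path(levels, 37)
```

With 256 leaves and 64-byte SHA3-512 digests, `path` holds 8 sibling hashes, matching the 512-byte figure quoted above.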

On the receiver’s side, during decapsulation, the process is reversed. The receiver, who possesses the private key (and thus the seed used to generate the error set), can regenerate the entire set of error patterns. Using the index from the ciphertext, the receiver selects the claimed error pattern and computes its hash. It then uses the provided authentication path to verify that this hash indeed leads to the committed Merkle root. If the verification succeeds, the receiver is assured that the error pattern is authentic and was part of the original committed set. If the verification fails, it indicates tampering, and the decapsulation algorithm outputs a random shared secret, as required by the IND-CCA security definition [5,18].

This mechanism provides several critical benefits. First, it completely eliminates the need to store the full set of error vectors in the private key. The private key only needs to store the seed for the error set, not the set itself. Second, it prevents an adversary from submitting a malformed or malicious error vector in an attempt to learn information about the secret key, a class of attack known as a decryption failure attack [24]. The Merkle verification acts as a gatekeeper, ensuring that only pre-approved, safe error patterns are processed. Third, it enhances the overall security model by integrating a second, independent hardness assumption—the collision resistance of SHA3-512—into the core protocol. An attacker would need to not only solve the Module-LWE problem but also find a collision in SHA3-512 to forge a valid authentication path for a malicious error [17,46].

The choice of SHA3-512 for the hash function is deliberate. Unlike older hash functions based on the Merkle-Damgård construction (like SHA-256), SHA3 is built on a sponge construction, which offers different and arguably stronger security properties, particularly in the context of quantum adversaries [33]. Its 512-bit output provides a comfortable security margin against both classical and quantum collision-finding attacks. The Merkle commitment layer, therefore, is not a mere add-on but an integral and security-critical component of the Merkle-LWE architecture, enabling its unique combination of extreme compactness and robust security.

4.5. Key Generation Algorithm

The key generation algorithm in Merkle-LWE is designed to produce extremely compact public and private keys while maintaining IND-CCA security under the Module-LWE assumption. The core innovation lies in its representation: rather than storing large explicit matrices or dense secret vectors, the algorithm outputs only cryptographic seeds and commitments that enable on-the-fly reconstruction and verification. This approach directly addresses the memory constraints of embedded platforms by shifting the resource burden from static storage to controlled computation [29].

The algorithm begins by securely sampling two 32-byte seeds using the system’s entropy source (“system_random_bytes”). The first seed, denoted as “seed_A”, serves as the public seed for deterministic regeneration of the Module-LWE public matrix $A$. The second seed, “seed_s”, is used to generate a sparse secret polynomial $s$ with a precisely controlled Hamming weight (e.g., 48 non-zero coefficients for Level 1). This sparsity is not an ad hoc optimization but a deliberate design choice that reduces both storage and arithmetic complexity while preserving the hardness guarantees of the underlying lattice problem, as validated by the analysis of Bindel et al. [18].

The sparse secret polynomial $s$ is generated through a two-stage process. First, a Fisher–Yates shuffle is applied to the index set $\{0, 1, \dots, n - 1\}$ using “seed_s” to select distinct positions for the non-zero coefficients. This ensures an unbiased and unpredictable distribution of support. Second, coefficient values are sampled from a bounded range (e.g., $\{-4, -3, \dots, 3, 4\}$) using ChaCha20, again seeded by “seed_s” [49]. The resulting structure (a list of indices and corresponding small integer values) is never stored in full. Instead, the private key retains only “seed_s”, allowing the secret to be reconstructed deterministically during decapsulation.
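The two-stage process can be sketched as follows. ChaCha20 is not available in the Python standard library, so this sketch substitutes SHAKE-256 as the seeded byte stream; that substitution, the `SeedStream` helper, the domain-separation labels, and the choice to exclude zero from the sampled values (so that the Hamming weight is exactly $w$) are all assumptions for illustration and not part of the paper's specification. The code is also not constant-time.

```python
import hashlib


class SeedStream:
    """Deterministic byte stream. SHAKE-256 is a stand-in for ChaCha20 here."""

    def __init__(self, seed: bytes, label: bytes, nbytes: int = 4096):
        self._buf = hashlib.shake_256(label + seed).digest(nbytes)
        self._pos = 0

    def below(self, bound: int) -> int:
        """Uniform integer in [0, bound) via rejection sampling on 16-bit words."""
        limit = 65536 - (65536 % bound)
        while True:
            if self._pos + 2 > len(self._buf):
                raise RuntimeError("stream exhausted")
            word = int.from_bytes(self._buf[self._pos:self._pos + 2], "little")
            self._pos += 2
            if word < limit:
                return word % bound


def sample_sparse_secret(seed: bytes, n: int = 256, w: int = 48, bound: int = 4):
    idx_stream = SeedStream(seed, b"indices")
    val_stream = SeedStream(seed, b"values")
    perm = list(range(n))
    # Partial Fisher-Yates: after w swaps the first w slots hold a uniform w-subset.
    for i in range(w):
        j = i + idx_stream.below(n - i)
        perm[i], perm[j] = perm[j], perm[i]
    support = sorted(perm[:w])                  # ascending order, as in key generation
    values = []
    for _ in support:
        v = val_stream.below(2 * bound)         # 0 .. 2*bound - 1
        values.append(v - bound if v < bound else v - bound + 1)  # skips zero
    return support, values


support, values = sample_sparse_secret(bytes(32))
```

Because both streams are derived from the same seed, calling `sample_sparse_secret` again with that seed reproduces the same support and values, which is what allows the private key to hold only the seed.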

The public key is formed by combining “seed_A” with a Merkle root that commits to the secret $s$. To construct this commitment, each coefficient of $s$ is hashed using SHA3-512 to form the leaves of a Merkle tree. The tree is then built bottom-up, with each internal node being the hash of its two children. The root of this tree, a 64-byte value, becomes the second component of the public key. Thus, the complete public key is the concatenation $pk = \mathrm{seed}_{A} \,\|\, \mathrm{root}_{s}$, totaling just 96 bytes regardless of the security level. This compactness is achieved without weakening parameters; it is a structural property of the hybrid design.
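As a small illustration of this layout, the following sketch packs and parses the 96-byte public key. The function names and the length checks are hypothetical, and the root used in the example is a hash of a placeholder string rather than a real commitment.

```python
import hashlib

SEED_BYTES = 32
ROOT_BYTES = 64   # SHA3-512 digest length


def pack_public_key(seed_a: bytes, root_s: bytes) -> bytes:
    """Concatenate seed_A and root_s into the public key."""
    if len(seed_a) != SEED_BYTES or len(root_s) != ROOT_BYTES:
        raise ValueError("unexpected component length")
    return seed_a + root_s


def parse_public_key(pk: bytes):
    """Split a public key back into (seed_A, root_s)."""
    if len(pk) != SEED_BYTES + ROOT_BYTES:
        raise ValueError("public key must be 96 bytes")
    return pk[:SEED_BYTES], pk[SEED_BYTES:]


# Placeholder inputs for illustration only.
example_seed = bytes(32)
example_root = hashlib.sha3_512(b"placeholder root").digest()
pk = pack_public_key(example_seed, example_root)
```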

The private key comprises three components: “seed_s” (32 bytes), a hash of the public key (32 bytes, for CCA security via the Fujisaki–Okamoto transform), and metadata about the security level. Its total size ranges from 160 to 224 bytes across security levels, a reduction of over 90% compared to traditional LWE schemes [13]. Critically, all sensitive data is handled in constant-time, and temporary buffers (e.g., for the Merkle tree) are zeroized after use via “secure_memzero” [28,38]. The algorithm is also designed to respect strict memory bounds: peak RAM usage is capped at 8–24 KB depending on the security level, making it suitable for microcontrollers with limited stack space [23].

The end-to-end workflow of the key generation process is illustrated in Figure 3. As depicted, the algorithm initiates by sampling independent seeds from the system entropy source, which then diverge into three parallel processing branches. The first branch retains the public matrix seed ($\mathrm{seed}_{A}$) directly within the public key structure to enable deterministic on-the-fly regeneration. The second branch drives the Fisher–Yates shuffle and bounded coefficient sampling to construct the sparse secret polynomial $s$ without storing it in memory. The third branch commits these coefficients via SHA3-512 hashing to build the Merkle tree, extracting the root hash that binds the secret structure to the public key. These branches converge to assemble the final key pair, demonstrating how the scheme eliminates explicit matrix storage and achieves a compact 96-byte public key while ensuring all sensitive intermediate buffers are securely zeroized.

This design achieves a delicate balance: it provides the verifier with a succinct, verifiable statement about the secret (the Merkle root) while enabling the owner to reconstruct the full secret from a minimal seed. The security of this construction relies on two independent assumptions—the hardness of Module-LWE and the collision resistance of SHA3-512—creating a robust foundation that is resistant to unforeseen breakthroughs in either domain [17,46].

4.6. Encapsulation Algorithm

The encapsulation algorithm in Merkle-LWE transforms the compact public key into a shared secret and an associated ciphertext, adhering to the IND-CCA security model through the Fujisaki–Okamoto transform [5,18]. The sender begins by parsing the public key $pk = \mathrm{seed}_{A} \,\|\, \mathrm{root}_{s}$ and uses “seed_A” to regenerate rows of the public matrix $A$ on the fly via ChaCha20. This avoids the need to store the full matrix, which would consume hundreds of kilobytes, and instead trades memory for computation, a rational exchange in embedded contexts [29].

Next, the sender generates an ephemeral sparse secret $s'$ using a fresh ChaCha20 stream. Like the long-term secret, $s'$ has a controlled Hamming weight and bounded coefficients, ensuring that polynomial multiplication remains efficient [57]. The LWE sample is then computed using the regenerated public structure, where $u = A^{T} s'$ and $v = \langle b, s' \rangle + e_{2}$, with $b$ implicitly defined as $b = A s + e$ and $e_{2}$ sampled from a predefined error set. Critically, $b$ is never explicitly materialized; instead, all required operations are derived from the public seed and the Merkle-committed secret structure, preserving both compactness and verifiability.

The security of the component $u = A^{T} s'$ relies on the hardness of the Module-LWE problem instantiated with sparse secrets. The ephemeral vector $s'$ is sampled with controlled Hamming weight $w \ll n$ (48, 64, or 80 non-zero coefficients for Levels 1, 3, and 5, respectively) and coefficients bounded by $\beta = 1$, ensuring that recovering $s'$ from $u$ reduces to solving a sparse Module-LWE instance. This variant remains computationally infeasible under standard lattice assumptions: for Level 1 parameters ($n = 256$, $q = 3329$, $k = 2$, $w = 48$), the best known primal and dual lattice reduction attacks require approximately $2^{143}$ operations, exceeding the 128-bit quantum security target [18]. Furthermore, the inclusion of the error term $e_{2}$ in the second component $v$ provides additional noise flooding, preventing algebraic recovery of $s'$ even under partial leakage of $A$. All parameters are selected to satisfy the concrete security bounds and validated using the lattice estimator [25].

The error vector e is not transmitted explicitly. Instead, the sender selects an error pattern from a set of 128 or 256 precomputed candidates and includes a Merkle authentication path in the ciphertext. This path proves that the chosen error is a valid member of the committed set without revealing the entire set. The shared secret is derived from e using a key derivation function (KDF), specifically SHA3-256, to ensure uniformity and independence.

The ciphertext is assembled as $ct = \mathrm{enc}(u) \,\|\, \mathrm{enc}(v) \,\|\, \mathrm{auth\_path} \,\|\, \mathrm{leaf\_index}$, where $\mathrm{enc}(\cdot)$ denotes bit-packing to minimize bandwidth. The inclusion of the leaf index allows the receiver to locate the correct error pattern during decapsulation. The total ciphertext size ranges from 128 to 192 bytes, slightly larger than some traditional schemes due to the Merkle path overhead, but this is a deliberate trade-off for the massive reduction in key sizes [35].
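The text does not specify how $\mathrm{enc}(\cdot)$ packs coefficients, so the following is only one plausible sketch: since $q = 3329 < 2^{12}$, each coefficient fits in 12 bits, and two coefficients can share three bytes. The function names `pack12` and `unpack12` and the byte layout are assumptions, and the actual scheme may apply further compression that is not shown here.

```python
def pack12(coeffs):
    """Pack an even number of coefficients in [0, 4096) into 3 bytes per pair."""
    if len(coeffs) % 2:
        raise ValueError("need an even number of coefficients")
    out = bytearray()
    for a, b in zip(coeffs[0::2], coeffs[1::2]):
        if not (0 <= a < 4096 and 0 <= b < 4096):
            raise ValueError("coefficient out of 12-bit range")
        # byte 0: low 8 bits of a; byte 1: high 4 bits of a and low 4 bits of b;
        # byte 2: high 8 bits of b
        out += bytes((a & 0xFF, (a >> 8) | ((b & 0x0F) << 4), b >> 4))
    return bytes(out)


def unpack12(data):
    """Inverse of pack12."""
    if len(data) % 3:
        raise ValueError("length must be a multiple of 3")
    coeffs = []
    for i in range(0, len(data), 3):
        b0, b1, b2 = data[i], data[i + 1], data[i + 2]
        coeffs.append(b0 | ((b1 & 0x0F) << 8))
        coeffs.append((b1 >> 4) | (b2 << 4))
    return coeffs
```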

All operations are implemented in constant time to prevent side-channel leakage. Matrix row generation, sparse multiplication, and hash computations follow strict timing discipline, and no secret-dependent branches or memory accesses are performed [26,27]. The algorithm is also optimized for cache efficiency: intermediate values like $A$ rows and $s'$ are kept in small, aligned buffers that fit within the L1 cache of typical microcontrollers, minimizing expensive memory traffic [20,22]. This focus on locality ensures that the computational overhead of on-the-fly generation does not translate into prohibitive energy costs on battery-powered devices.

4.7. Decapsulation Algorithm

Decapsulation is the inverse process that recovers the shared secret from the ciphertext and private key, with rigorous checks to ensure correctness and security. The receiver begins by parsing the ciphertext to extract the LWE samples $(u, v)$, the Merkle authentication path, and the leaf index. Using the private key, which contains “seed_s”, the receiver reconstructs the sparse secret polynomial $s$ and recomputes the public vector $b = A s + e$. The matrix $A$ is regenerated on the fly from the public seed “seed_A”, which is recoverable from the public key hash stored in the private key.

The core security mechanism in decapsulation is the Merkle path verification, which ensures that the decrypted error pattern belongs to the set committed in the public key. Upon receiving a ciphertext, the receiver identifies the claimed error pattern via its leaf index and verifies it against the public Merkle root using the provided authentication path [35]. If verification fails (e.g., due to ciphertext tampering or substitution of an invalid error), the algorithm outputs a uniformly random shared secret, as required by IND-CCA security, thereby preventing decryption failure or reaction-based leakage attacks [5,24]. Importantly, the decapsulation procedure does not require explicit storage or transmission of the vector $t = A s + e$; instead, consistency is checked implicitly through the committed structure of the secret and the regenerated algebraic relations. Concretely, the receiver deterministically reconstructs the sparse secret $s$ from $\mathrm{seed}_{s}$ and regenerates the matrix $A$ from $\mathrm{seed}_{A}$, then derives the candidate error $e_{\mathrm{cand}}$ using $\mathrm{seed}_{e}$ and the ciphertext index. The leaf hash $h = \mathrm{SHA3\text{-}512}(\mathrm{idx}_{e} \,\|\, e_{\mathrm{cand}})$ is verified against the Merkle root $\mathrm{root}_{s}$ via the authentication path of length $\lceil \log_{2} |E| \rceil$; only if this verification succeeds is the shared secret computed as $K = \mathrm{SHA3\text{-}256}(e_{\mathrm{cand}} \,\|\, u \,\|\, v \,\|\, \mathrm{context})$. This design ensures that the sender is bound to a valid, pre-committed error structure, preventing adaptive chosen-ciphertext attacks that exploit malformed or adversarially chosen error terms, while maintaining correctness under standard Module-LWE assumptions [24]. The authentication path thus replaces the need to transmit or store any full intermediate structure such as $t$, reducing memory overhead while preserving verifiability.
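The verify-then-derive step can be sketched as below, assuming byte-string encodings for $u$, $v$, and $e_{\mathrm{cand}}$ and a two-byte big-endian encoding of the leaf index; these encodings and the function names are hypothetical choices for illustration. The sketch computes both the real and the fallback key before selecting one, to mirror the requirement of no early exit, although Python itself gives no constant-time guarantees.

```python
import hashlib
import hmac
import secrets


def leaf_hash(idx: int, e_cand: bytes) -> bytes:
    """h = SHA3-512(idx_e || e_cand), with idx_e encoded as 2 bytes (assumption)."""
    return hashlib.sha3_512(idx.to_bytes(2, "big") + e_cand).digest()


def path_root(leaf: bytes, idx: int, path) -> bytes:
    """Recompute the Merkle root from a leaf hash and its sibling path."""
    node = leaf
    for sibling in path:
        pair = node + sibling if idx % 2 == 0 else sibling + node
        node = hashlib.sha3_512(pair).digest()
        idx >>= 1
    return node


def derive_key(root: bytes, idx: int, e_cand: bytes, path,
               u: bytes, v: bytes, context: bytes = b"") -> bytes:
    """Return K on a valid path, or a fresh random 32-byte value otherwise."""
    computed = path_root(leaf_hash(idx, e_cand), idx, path)
    ok = hmac.compare_digest(computed, root)   # timing-safe comparison
    real = hashlib.sha3_256(e_cand + u + v + context).digest()
    fake = secrets.token_bytes(32)
    return real if ok else fake
```

A rejected ciphertext therefore yields an output of the same length and type as an accepted one, so the caller cannot distinguish the two cases from the return value alone.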

If verification succeeds, the receiver computes the shared secret as $ss = \mathrm{KDF}(e)$ and compares it to the sender’s value. The correctness of this process is guaranteed by the binding property of the Merkle commitment: only the true owner of the secret can produce a valid authentication path for a given error pattern. All operations are performed in constant time, with no early exits or data-dependent memory accesses that could leak information through timing or power analysis [28].

Memory usage during decapsulation is carefully controlled. The algorithm avoids large intermediate buffers by processing data in small chunks and reusing memory wherever possible. For instance, the reconstructed secret $s$ and the LWE samples are stored in overlapping buffers to minimize peak RAM usage, which remains below 24 KB even at the highest security level. This makes the algorithm viable for deployment on platforms like the ARM Cortex-M4, which often have only 32–64 KB of RAM available for application code [14,23].

In summary, the decapsulation algorithm embodies the security and efficiency principles of the Merkle-LWE design. It leverages the hybrid structure to shift verification from algebraic checks to cryptographic commitments, enabling a high degree of confidence in the authenticity of the ciphertext while maintaining the extreme memory efficiency that defines the scheme.

4.8. Formal Specification and Parameter Sets

To enable independent verification of correctness and security, we provide a complete formal specification of the Merkle-LWE KEM. The scheme operates over the ring $R_{q} = \mathbb{Z}_{q}[X] / (X^{n} + 1)$ with parameters $(n, q, k, \eta, w, \beta)$ defined per NIST security level in Table 4.

The key generation procedure is presented in Algorithm 1. It takes as input a security level $\lambda \in \{1, 3, 5\}$, samples randomness to derive seeds for matrix generation, secret construction, and error reconstruction, and produces a public–private key pair. A sparse secret vector is generated according to the prescribed sparsity parameters, and a Merkle tree commitment is computed over its coefficients to ensure binding. The public key consists of the matrix seed and the Merkle root, while the secret key retains the seeds required for deterministic reconstruction.

Algorithm 1: Key Generation
1:	Sample $\mathrm{seed}_{A}, \mathrm{seed}_{s}, \mathrm{seed}_{e} \leftarrow \{0, 1\}^{256}$ (32 bytes each) using system entropy
2:	Select $w$ distinct indices $I \subset \{0, \dots, n - 1\}$ using Fisher–Yates shuffle seeded by $\mathrm{seed}_{s}$
3:	Initialize $s \in R_{q}^{k}$ with all coefficients set to zero
4:	for each $i \in I$ do
5:	$s_{i} \leftarrow \{-\beta, \dots, \beta\}$ via $\mathrm{ChaCha20}(\mathrm{seed}_{s} \,\|\, i)$
6:	end for
7:	for $j = 0, \dots, n - 1$ do
8:	$l_{j} \leftarrow \mathrm{SHA3\text{-}512}(j \,\|\, s_{j})$
9:	end for
10:	Construct binary Merkle tree over $\{l_{0}, \dots, l_{n - 1}\}$ and obtain root $\mathrm{root}_{s}$
11:	$pk \leftarrow (\mathrm{seed}_{A} \,\|\, \mathrm{root}_{s})$
12:	$sk \leftarrow (\mathrm{seed}_{s} \,\|\, H(pk) \,\|\, \mathrm{seed}_{e})$
13:	return $(pk, sk)$

The encapsulation procedure, formalised in Algorithm 2, takes a public key as input and produces a ciphertext together with a shared secret. It deterministically reconstructs the public matrix, samples an ephemeral sparse secret, and selects an error value from a predefined set along with its Merkle authentication path. These components are used to compute the LWE sample, from which both the ciphertext and shared secret are derived.

Algorithm 2: Encapsulation
1:	Parse $pk = (\mathrm{seed}_{A} \,\|\, \mathrm{root}_{s})$
2:	Regenerate matrix $A \in R_{q}^{k \times k}$ using $\mathrm{ChaCha20}(\mathrm{seed}_{A} \,\|\, \mathrm{row\_idx})$
3:	Sample ephemeral sparse vector $s'$ using fresh ChaCha20 stream
4:	Sample index $\mathrm{idx}_{e}$ and set $e \leftarrow E[\mathrm{idx}_{e}]$
5:	Compute Merkle authentication path $\pi_{e}$ for $\mathrm{SHA3\text{-}512}(\mathrm{idx}_{e} \,\|\, e)$
6:	$u \leftarrow A^{T} s'$
7:	$v \leftarrow \langle t, s' \rangle + e$, where $t = A s + e$
8:	$K \leftarrow \mathrm{SHA3\text{-}256}(e \,\|\, u \,\|\, v \,\|\, \mathrm{context})$
9:	Encode $ct = (u, v, \mathrm{idx}_{e}, \pi_{e})$
10:	return $(ct, K)$

The decapsulation procedure is formalised in Algorithm 3. It takes as input a secret key and a ciphertext, attempts to recover the shared secret, reconstructs the secret vector and matrix deterministically, verifies the integrity of the transmitted error using the Merkle authentication path, and derives the shared secret if verification succeeds.

Algorithm 3: Decapsulation
1:	Parse $ct$ into $(u, v, \mathrm{idx}_{e}, \pi_{e})$
2:	Regenerate $s$ from $\mathrm{seed}_{s}$ and matrix $A$ from $\mathrm{seed}_{A}$
3:	Reconstruct $e_{\mathrm{cand}}$ using $\mathrm{seed}_{e} \,\|\, \mathrm{idx}_{e}$
4:	Compute $h = \mathrm{SHA3\text{-}512}(\mathrm{idx}_{e} \,\|\, e_{\mathrm{cand}})$
5:	if verification of $\pi_{e}$ against $\mathrm{root}_{s}$ fails then
6:	Output random $K$ and terminate
7:	end if
8:	Compute $t_{\mathrm{cand}} = A s + e_{\mathrm{cand}}$
9:	Derive $K' = \mathrm{SHA3\text{-}256}(e_{\mathrm{cand}} \,\|\, u \,\|\, v \,\|\, \mathrm{context})$
10:	return $K'$

4.9. Error Pattern Selection and Distributional Indistinguishability

The Merkle commitment layer requires a well-defined set of candidate error patterns $E = \{e_{0}, \dots, e_{|E| - 1}\}$ to which the public key commits. This section formalizes the selection methodology, characterizes the induced distribution, and establishes that restricting encapsulation to $E$ does not compromise indistinguishability under standard Module-LWE assumptions.

The candidate set $E$ is generated deterministically from the secret error seed $\mathrm{seed}_{e}$ using the ChaCha20 stream cipher, modeled as a pseudorandom function. For each index $i \in \{0, \dots, |E| - 1\}$, the coefficient vector $e_{i} \in \mathbb{Z}_{q}^{n}$ is sampled independently from the same bounded distribution $\chi_{\beta}$ used in standard Module-LWE, typically uniform over $\{-\beta, \dots, \beta\}$ with $\beta = 1$ across all security levels. The set size $|E| \in \{128, 256\}$ is fixed per security level to balance Merkle authentication overhead $\lceil \log_{2} |E| \rceil$ with statistical coverage of the error space. During encapsulation, the sender selects an index $j \leftarrow \{0, \dots, |E| - 1\}$ uniformly at random and uses $e_{j}$ as the error vector in the LWE sample.
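A minimal sketch of per-index error regeneration with $\beta = 1$ is given below. As ChaCha20 is not in the Python standard library, SHAKE-256 is used here as a stand-in pseudorandom stream; that substitution, the two-byte index encoding, and the function name `error_pattern` are assumptions for illustration. Two-bit candidates equal to 3 are rejected so that the remaining values are uniform over $\{-1, 0, 1\}$.

```python
import hashlib


def error_pattern(seed_e: bytes, idx: int, n: int = 256):
    """Regenerate e_idx with coefficients uniform over {-1, 0, 1}."""
    stream = hashlib.shake_256(seed_e + idx.to_bytes(2, "big")).digest(n)
    coeffs = []
    for byte in stream:
        for shift in (0, 2, 4, 6):             # four 2-bit candidates per byte
            t = (byte >> shift) & 3
            if t < 3 and len(coeffs) < n:      # reject t == 3 to stay uniform
                coeffs.append(t - 1)
    if len(coeffs) < n:
        raise RuntimeError("stream too short; request more output")
    return coeffs


e5 = error_pattern(bytes(32), 5)
```

Only the seed and the index are needed to reproduce a given pattern, so no element of $E$ has to be stored.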

The restriction of error sampling to a fixed finite set $E$ induces a distribution that differs from the ideal i.i.d. sampling from $\chi_{\beta}^{n}$. However, since $E$ is generated via a cryptographically secure pseudorandom function, the set is computationally indistinguishable from a collection of independent samples drawn from $\chi_{\beta}^{n}$, provided that ChaCha20 remains secure. Consequently, from the perspective of any polynomial-time adversary without knowledge of $\mathrm{seed}_{e}$, selection from $E$ is indistinguishable from sampling from an honestly generated error distribution.

The only deviation from the ideal distribution arises from conditioning on a finite support of size $|E|$. This introduces a negligible statistical loss of entropy proportional to $\log_{2} |E|$ relative to the full space $(\chi_{\beta}^{n})^{|E|}$. Since $|E| \geq 128$, this loss is negligible in all security levels considered and does not affect asymptotic hardness.

We formalize security preservation via a sequence of hybrid experiments. Let $\Pi_{\mathrm{Merkle}}$ denote the proposed scheme and $\Pi_{\mathrm{std}}$ denote a standard Module-LWE scheme with unrestricted error sampling.

Hybrid 0 (Real Scheme). The adversary interacts with $\Pi_{\mathrm{Merkle}}$, where $E$ is generated via ChaCha20 and a uniformly random index is used for each encapsulation.
Hybrid 1 (PRF Replacement). ChaCha20 is replaced with a truly random function. By PRF security, the adversary’s distinguishing advantage is bounded by $\epsilon_{\mathrm{PRF}}$. In this hybrid, $E$ is indistinguishable from a uniformly random set of valid error vectors drawn from $\chi_{\beta}^{n}$.
Hybrid 2 (Independent Sampling). Selection from $E$ is replaced by direct sampling from $\chi_{\beta}^{n}$ for each encapsulation. The difference between Hybrid 1 and Hybrid 2 is negligible due to the pseudorandom nature of $E$ and the uniform selection mechanism, introducing at most a negligible loss bounded by $O(1 / |E|)$.
Hybrid 3 (Standard Module-LWE). The experiment is replaced with standard Module-LWE sampling. Distinguishing Hybrid 2 from Hybrid 3 reduces to solving the Module-LWE problem with advantage $\epsilon_{\mathrm{M\text{-}LWE}}$.

Combining the transitions yields:

$$\mathrm{Adv}_{\mathrm{dist}}(\mathcal{A}) \leq \epsilon_{\mathrm{PRF}} + \epsilon_{\mathrm{M\text{-}LWE}} + \mathrm{negl}(|E|). \qquad (4)$$
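The bound follows from a triangle inequality over adjacent hybrids. As a sketch, write $p_{i}$ for the probability that the adversary outputs 1 in Hybrid $i$; this notation is introduced here for illustration and does not appear in the original text. The middle term is the Hybrid 1 to Hybrid 2 loss, which Equation (4) denotes $\mathrm{negl}(|E|)$.

```latex
\begin{align*}
\mathrm{Adv}_{\mathrm{dist}}(\mathcal{A})
  &= |p_{0} - p_{3}| \\
  &\leq |p_{0} - p_{1}| + |p_{1} - p_{2}| + |p_{2} - p_{3}| \\
  &\leq \epsilon_{\mathrm{PRF}} + O(1 / |E|) + \epsilon_{\mathrm{M\text{-}LWE}} .
\end{align*}
```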

The Merkle tree does not alter the underlying LWE distribution but enforces consistency between ciphertexts and the committed set $E$. Any deviation from a valid error pattern results in rejection during verification, ensuring that all accepted ciphertexts correspond to a unique and pre-committed element of $E$. This property strengthens binding without introducing bias into the LWE sampling process.

The candidate error set $E$ is a pseudorandomly generated finite ensemble derived from a secure PRF and sampled uniformly during encapsulation. While this induces a restricted sampling space relative to standard Module-LWE, the restriction is computationally hidden from adversaries and introduces only negligible statistical loss. The hybrid reduction confirms that the induced distribution remains indistinguishable from standard LWE sampling under PRF and Module-LWE assumptions, ensuring that the hardness of the underlying problem is preserved and that no exploitable structural bias is introduced by the Merkle-constrained error selection process.

5. Implementation Considerations

The practical viability of any post-quantum cryptographic scheme on resource-constrained embedded platforms hinges not only on its theoretical security but also on the careful engineering of its implementation. Merkle-LWE’s memory-first design philosophy necessitates a suite of low-level optimizations that collectively ensure the scheme operates within the stringent RAM, flash, and energy budgets of microcontrollers while maintaining resistance to side-channel attacks [14,16]. This section details four interlocking implementation strategies that form the backbone of our system: memory-efficient pseudorandom number generation (PRNG) expansion, optimized polynomial arithmetic leveraging both algorithmic sparsity and hardware-specific vectorization, cache-aware in-place computation to minimize memory traffic, and rigorous constant-time discipline to mitigate timing and power analysis vulnerabilities. These considerations are not afterthoughts but foundational elements that enable the hybrid architecture to deliver on its promise of quantum-resistant security in environments where every byte and cycle counts.

5.1. Memory-Efficient PRNG Expansion

At the heart of Merkle-LWE’s memory efficiency is the systematic replacement of large, static data structures with compact seeds that are expanded on the fly using a cryptographically secure pseudorandom number generator (PRNG). In conventional lattice-based KEMs like CRYSTALS-Kyber, the public matrix $A \in \mathbb{Z}_{q}^{n \times n}$ is stored explicitly, consuming kilobytes of precious flash memory [13]. Our implementation eliminates this overhead by storing only a 32-byte seed, from which $A$ is regenerated row-by-row during encapsulation and decapsulation. This design choice shifts the primary cost from static storage to computational load, a rational trade-off on embedded platforms where CPU cycles are often more abundant than non-volatile memory [29].

The selection of ChaCha20 as the underlying PRNG is deliberate and well-justified by its unique properties. Unlike block cipher-based constructions such as AES-CTR, ChaCha20 requires no precomputed S-boxes or key schedules, reducing its static memory footprint to just 136 bytes for internal state [49]. Its ARX (Add–Rotate–XOR) design ensures that all operations execute in constant time on virtually all modern microarchitectures, eliminating a major class of timing side channels that could leak information about secret indices or coefficients during sparse polynomial generation [26]. Furthermore, ChaCha20’s counter-mode operation enables independent, parallelizable block generation, which is exploited during matrix row reconstruction: each row index serves as a nonce, allowing rows to be generated in any order without materializing the entire matrix in RAM.
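Row-by-row regeneration can be sketched as follows. SHAKE-128 from Python's `hashlib` stands in for ChaCha20 (an assumption made because ChaCha20 is not in the standard library), the row index is appended to the seed in place of a nonce, and 12-bit candidates are rejection-sampled below $q = 3329$. The function name `matrix_row` and the defaults $k = 2$, $n = 256$ follow the Level 1 parameters quoted earlier but are otherwise illustrative.

```python
import hashlib

Q = 3329  # modulus from the Level 1 parameter set quoted in the text


def matrix_row(seed_a: bytes, row_idx: int, k: int = 2, n: int = 256):
    """Regenerate one row of A (k polynomials of n coefficients) from the seed."""
    stream = hashlib.shake_128(seed_a + bytes([row_idx])).digest(3 * k * n)
    coeffs = []
    for i in range(0, len(stream), 3):
        b0, b1, b2 = stream[i:i + 3]
        # Each 3-byte group yields two 12-bit candidates.
        for cand in (b0 | ((b1 & 0x0F) << 8), (b1 >> 4) | (b2 << 4)):
            if cand < Q and len(coeffs) < k * n:   # reject values >= q
                coeffs.append(cand)
    if len(coeffs) < k * n:
        raise RuntimeError("XOF output too short")
    return [coeffs[j * n:(j + 1) * n] for j in range(k)]


row0 = matrix_row(bytes(32), 0)
```

Each call depends only on the seed and the row index, so rows can be produced in any order and discarded after use without the full matrix ever residing in RAM.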

This on-the-fly expansion strategy is applied consistently across the system. The sparse secret key is not stored as a list of coefficients but as a seed that drives a Fisher-Yates shuffle to select non-zero positions and a bounded coefficient sampler. Similarly, the set of candidate error vectors used in the LWE samples is not stored in full; instead, a single error seed is kept in the private key, and specific error patterns are regenerated during encapsulation. This approach reduces the private key size from over a kilobyte in dense schemes to just 160–224 bytes, a reduction of over 80%. Critically, all PRNG operations are performed in constant-time, and sensitive seeds are zeroized immediately after use via “secure_memzero”, ensuring that no long-term secrets persist in memory beyond their necessary lifetime [28,38].

The security of this construction relies on the assumption that the output of ChaCha20 is indistinguishable from a truly random string, a property that has been extensively analysed and is widely accepted in the cryptographic community [50]. By tying the regeneration of all large objects to short, high-entropy seeds, we preserve the semantic security of the underlying Module-LWE problem while achieving unprecedented reductions in memory footprint. This design is not merely an optimization but a core enabler of post-quantum security on platforms that would otherwise be excluded from the PQC transition due to memory constraints.

5.2. Polynomial Arithmetic and SIMD Optimization

Polynomial multiplication is the most computationally intensive operation in lattice-based cryptography, and its efficiency directly impacts both performance and energy consumption [12]. Merkle-LWE leverages two complementary strategies to optimize this critical primitive: structured sparsity in the secret key and hardware-specific vectorization for dense operations.

The use of sparse secrets, with Hamming weights of 48, 64, and 80 for security levels 1, 3, and 5, respectively, is not an ad hoc heuristic but a carefully calibrated design choice grounded in the hardness analysis of the LWE problem with sparse secrets. Work by Ducas et al. and Bindel et al. has shown that LWE remains hard even when the secret is sparse, provided the Hamming weight is sufficiently large relative to the lattice dimension [18,47,48]. Our parameters exceed these conservative bounds, ensuring that the security reduction to worst-case lattice problems remains valid. The sparsity translates directly into computational savings: sparse-dense polynomial multiplication requires only $O(w \cdot n)$ operations instead of $O(n^{2})$ for dense multiplication, where $w$ is the Hamming weight. For Level 1 ($n = 256$, $w = 48$), the ratio $n^{2}/(w \cdot n) = n/w = 256/48 \approx 5.3$ corresponds to a 5.3× speedup in the core arithmetic kernel [37].
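To make the operation count concrete, the following minimal sketch multiplies a sparse secret by a dense polynomial using one pass over the dense operand per non-zero term, i.e. w · n multiply-accumulate steps. It assumes the negacyclic ring $\mathbb{Z}_q[x]/(x^{n}+1)$ used by Kyber-style schemes; the type `sparse_term` and the function `sparse_mul` are hypothetical names. The sketch is written for clarity and is not constant-time, since the wrap-around test depends on the secret indices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MLWE_N 256
#define MLWE_Q 3329

typedef struct {
    uint16_t index;  /* position of the non-zero coefficient, 0..N-1 */
    int8_t   value;  /* small signed coefficient */
} sparse_term;

/* out = s * a in Z_q[x]/(x^N + 1), with s given as w sparse terms.
 * Cost: w * N multiply-accumulate steps instead of N * N. */
void sparse_mul(const sparse_term *s, size_t w,
                const uint16_t a[MLWE_N], uint16_t out[MLWE_N]) {
    int32_t acc[MLWE_N] = {0};
    size_t t, j;
    for (t = 0; t < w; t++) {
        assert(s[t].index < MLWE_N);
        for (j = 0; j < MLWE_N; j++) {
            size_t k = (size_t)s[t].index + j;
            int32_t term = (int32_t)s[t].value * (int32_t)a[j];
            if (k >= MLWE_N) {      /* x^N = -1: wrap and negate */
                k -= MLWE_N;
                term = -term;
            }
            acc[k] += term;
        }
    }
    for (j = 0; j < MLWE_N; j++) {  /* reduce into [0, q) */
        int32_t r = acc[j] % MLWE_Q;
        if (r < 0) r += MLWE_Q;
        out[j] = (uint16_t)r;
    }
}
```

With w ≤ 80 and coefficients below q, the 32-bit accumulators cannot overflow, so a single reduction at the end is sufficient.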

To further accelerate the remaining dense operations—such as matrix-vector products during LWE sample computation—we implement platform-specific vectorized routines. On x86-64 platforms supporting AVX512, we leverage 512-bit vector registers to process eight 64-bit coefficients in parallel. The NTT, which is used for efficient polynomial multiplication in some variants, is also optimized using AVX512 intrinsics to perform butterfly operations on multiple coefficients simultaneously [12]. On ARM Cortex-M4 microcontrollers, which lack wide vector units but feature a single-cycle 32-bit multiplier, we employ hand-optimized assembly routines that maximize register usage and minimize pipeline stalls [23,60]. These low-level optimizations are encapsulated behind a unified API, allowing the same high-level KEM logic to run efficiently across diverse hardware targets without modification.

The combination of algorithmic sparsity and hardware acceleration creates a synergistic effect: the sparse structure reduces the total number of operations, while vectorization speeds up the operations that remain. This dual-layer optimization is essential for making the computational overhead of on-the-fly matrix generation acceptable in practice. Benchmarks show that, despite the increased number of PRNG calls, the total cycle count for encapsulation and decapsulation remains within feasible limits for battery-powered IoT devices, especially when amortized over the lifetime of a session key [20,21].

5.3. Cache-Aware and In-Place Computation

On embedded systems with limited or no cache hierarchy, memory access patterns can dominate execution time and energy consumption [19,20]. A naive implementation that allocates separate buffers for every intermediate value can quickly exhaust available RAM and cause excessive memory traffic. Merkle-LWE addresses this through a disciplined approach to memory management that emphasizes in-place computation and cache-friendly data layouts.

All core operations are designed to reuse memory buffers wherever possible. During key generation, the temporary buffer used to construct the Merkle tree is reused for error pattern generation. During encapsulation, the buffer holding the ephemeral secret is repurposed to store the LWE sample $u$. This in-place strategy minimizes peak RAM usage, which is capped at 8–24 KB depending on the security level, well within the capabilities of modern microcontrollers [23]. The system also avoids dynamic allocation during cryptographic operations; all necessary buffers are allocated upfront during context initialization, preventing heap fragmentation and unpredictable memory usage.

Data structures are laid out to maximize spatial locality. The indices of the sparse secret are sorted in ascending order, ensuring that memory accesses during polynomial multiplication follow a predictable, sequential pattern. This contrasts with a random access pattern, which would cause frequent cache misses on platforms with small caches or no cache at all [22,58]. Similarly, the Merkle tree nodes are stored in a breadth-first layout, enabling efficient traversal during path generation and verification. These seemingly minor layout decisions have a significant impact on real-world performance: cache miss rates are reduced by up to 35% compared to naive implementations, translating directly into lower energy consumption and faster execution [20].
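The breadth-first layout can be illustrated with a short sketch of the index arithmetic. It assumes a complete binary tree stored heap-style in a 1-based array (root at index 1, children of node i at 2i and 2i + 1); the helper names are hypothetical. A depth of 7 is used in the usage note only because a 448-byte path of 64-byte hashes would contain seven nodes, which is an inference rather than a stated parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Heap-style breadth-first layout, 1-based: root at 1, children of i at
 * 2i and 2i+1. Leaves of a depth-d tree occupy indices 2^d .. 2^(d+1)-1. */
static size_t bf_parent(size_t i)  { return i >> 1; }
static size_t bf_sibling(size_t i) { return i ^ (size_t)1; }

/* Write the array indices of the authentication path for `leaf` into
 * path[], from the leaf level upward. Returns the number of entries,
 * which equals the tree depth. */
size_t bf_auth_path(size_t leaf, unsigned depth, size_t *path) {
    size_t node, n = 0;
    assert(leaf < ((size_t)1 << depth));
    node = ((size_t)1 << depth) + leaf;
    while (node > 1) {
        path[n++] = bf_sibling(node);
        node = bf_parent(node);
    }
    return n;
}
```

For a tree of depth 7, leaf 5 sits at array index 133, and its path visits indices 132, 67, 32, 17, 9, 5, and 3. Sibling and parent are one XOR and one shift away, and nodes of the same level are contiguous in memory.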

The bit-packing routines used to compress LWE samples into ciphertexts are also optimized for memory efficiency. Instead of allocating a separate output buffer, the packing is performed directly into the ciphertext buffer, and the unpacking during decapsulation reads directly from the received ciphertext. This eliminates an entire copy operation and reduces the memory footprint of the decapsulation routine by several hundred bytes. Together, these cache-aware and in-place techniques ensure that Merkle-LWE’s memory efficiency extends beyond static storage to encompass the entire runtime memory behaviour of the system.
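A minimal sketch of packing directly into the destination buffer is given below. It assumes 12-bit coefficients (sufficient for q = 3329) packed two per three bytes, as in Kyber's uncompressed encoding; the actual bit widths used for the Merkle-LWE ciphertext are not stated here, so the width and the names `pack12` and `unpack12` are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MLWE_N 256

/* Pack N 12-bit coefficients into 3*N/2 bytes, writing straight into the
 * caller's ciphertext buffer (no intermediate output buffer). */
void pack12(const uint16_t coeffs[MLWE_N], uint8_t *dst) {
    size_t i;
    assert(dst != NULL);
    for (i = 0; i < MLWE_N / 2; i++) {
        uint16_t c0 = coeffs[2 * i] & 0x0FFF;
        uint16_t c1 = coeffs[2 * i + 1] & 0x0FFF;
        dst[3 * i + 0] = (uint8_t)(c0 & 0xFF);
        dst[3 * i + 1] = (uint8_t)((c0 >> 8) | ((c1 & 0x0F) << 4));
        dst[3 * i + 2] = (uint8_t)(c1 >> 4);
    }
}

/* Inverse of pack12: read coefficients directly from the received bytes. */
void unpack12(const uint8_t *src, uint16_t coeffs[MLWE_N]) {
    size_t i;
    assert(src != NULL);
    for (i = 0; i < MLWE_N / 2; i++) {
        coeffs[2 * i] =
            (uint16_t)(src[3 * i] | ((uint16_t)(src[3 * i + 1] & 0x0F) << 8));
        coeffs[2 * i + 1] =
            (uint16_t)((src[3 * i + 1] >> 4) | ((uint16_t)src[3 * i + 2] << 4));
    }
}
```

Under this assumption one polynomial occupies 384 bytes, and the only working memory beyond the coefficient array is the ciphertext buffer itself.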

5.4. Constant-Time and Side-Channel Mitigations

Security against side-channel attacks is non-negotiable in embedded cryptography, where adversaries may have physical access to the device and can measure timing, power consumption, or electromagnetic emissions [26]. Merkle-LWE incorporates a comprehensive suite of countermeasures to ensure that all operations execute in constant time, independent of secret values [27].

All core arithmetic operations—modular addition, multiplication, and reduction—are implemented using constant-time algorithms. Conditional branches and memory accesses that depend on secret data are strictly avoided; instead, operations are performed unconditionally, and results are masked or selected using bitwise operations [28]. For example, the comparison of shared secrets during decapsulation uses a constant-time “memcmp” that always processes the entire buffer, regardless of where a mismatch might occur. Similarly, the Fisher-Yates shuffle used in sparse secret generation employs a constant-time swap that does not leak information about the selected indices.
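The two building blocks mentioned above, a comparison that always scans the whole buffer and a branch-free selection, can be sketched as follows. The function names `ct_compare` and `ct_cmov` are hypothetical; the sketch shows the general masking technique rather than the library's actual routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if the buffers are equal and 1 otherwise. Every byte is read
 * regardless of where the first mismatch occurs. */
int ct_compare(const uint8_t *a, const uint8_t *b, size_t len) {
    uint8_t diff = 0;
    size_t i;
    for (i = 0; i < len; i++) diff |= (uint8_t)(a[i] ^ b[i]);
    /* diff == 0 -> 0xFF >> 8 = 0; diff >= 1 -> at least 0x100 >> 8 = 1 */
    return (int)(((uint32_t)diff + 0xFFu) >> 8);
}

/* Copy src into dst if flag == 1, leave dst unchanged if flag == 0,
 * without branching on flag. */
void ct_cmov(uint8_t *dst, const uint8_t *src, size_t len, uint8_t flag) {
    uint8_t mask = (uint8_t)(0u - (unsigned)(flag & 1u));  /* 0x00 or 0xFF */
    size_t i;
    assert(dst != NULL && src != NULL);
    for (i = 0; i < len; i++)
        dst[i] = (uint8_t)(dst[i] ^ (mask & (dst[i] ^ src[i])));
}
```

A decapsulation routine would typically feed the result of `ct_compare` into `ct_cmov` to choose between the real shared secret and a rejection value. Whether a compiler preserves the branch-free structure still has to be checked on the generated code.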

The Merkle tree verification process is particularly vulnerable to timing attacks if implemented naively, as an early termination upon hash mismatch could leak information about the path. Our implementation computes the entire authentication path and compares the final root using a constant-time equality check, ensuring that the verification time is independent of the correctness of the input. All hash operations use SHA3-512, whose sponge construction is inherently resistant to length-extension attacks and provides strong security guarantees in the quantum random oracle model [33].
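The verification flow without early termination can be sketched as below. The two-to-one hash is passed in as a callback so that a real SHA3-512 implementation can be supplied; `placeholder_hash2` is NOT a hash function in any cryptographic sense and exists only so that the control flow can be exercised. All names are hypothetical, and the ordering convention (bit `lvl` of the leaf index selects left or right) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 64  /* SHA3-512 digest size in bytes */

/* Caller-supplied compression step: out = H(left || right). */
typedef void (*hash2_fn)(const uint8_t left[HASH_LEN],
                         const uint8_t right[HASH_LEN],
                         uint8_t out[HASH_LEN]);

/* Placeholder only, NOT cryptographic: substitute SHA3-512 in real use. */
void placeholder_hash2(const uint8_t left[HASH_LEN],
                       const uint8_t right[HASH_LEN], uint8_t out[HASH_LEN]) {
    size_t i;
    for (i = 0; i < HASH_LEN; i++)
        out[i] = (uint8_t)(3u * left[i] + 5u * right[i] + 7u * i + 1u);
}

/* Returns 1 if `path` (depth sibling hashes, leaf level first) links
 * `leaf` at position `leaf_index` to `root`, and 0 otherwise. All levels
 * are always hashed; only the final root is compared. */
int merkle_verify(hash2_fn h, const uint8_t leaf[HASH_LEN],
                  uint32_t leaf_index, const uint8_t *path, unsigned depth,
                  const uint8_t root[HASH_LEN]) {
    uint8_t node[HASH_LEN], left[HASH_LEN], right[HASH_LEN], diff = 0;
    unsigned lvl;
    size_t i;
    assert(h != NULL && depth < 32);
    memcpy(node, leaf, HASH_LEN);
    for (lvl = 0; lvl < depth; lvl++) {
        const uint8_t *sib = path + (size_t)lvl * HASH_LEN;
        /* m = 0xFF when the current node is a right child, else 0x00 */
        uint8_t m = (uint8_t)(0u - ((leaf_index >> lvl) & 1u));
        uint8_t nm = (uint8_t)~m;
        for (i = 0; i < HASH_LEN; i++) {  /* branch-free ordering */
            left[i]  = (uint8_t)((node[i] & nm) | (sib[i] & m));
            right[i] = (uint8_t)((sib[i] & nm) | (node[i] & m));
        }
        h(left, right, node);
    }
    for (i = 0; i < HASH_LEN; i++) diff |= (uint8_t)(node[i] ^ root[i]);
    return (int)(1u ^ (((uint32_t)diff + 0xFFu) >> 8));
}
```

The loop count depends only on the public tree depth, so the running time is the same for valid and invalid paths, provided the supplied hash is itself constant-time.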

Sensitive data is protected throughout its lifecycle. Private key material is stored in locked memory regions where possible, and all temporary buffers containing secrets are zeroized immediately after use via “secure_memzero”, a function that uses volatile pointers to prevent compiler optimizations from removing the wipe operation [38]. The PRNG state is also securely erased after key generation to prevent recovery of past or future outputs.
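A minimal portable sketch of such a wipe routine is shown below. The name “secure_memzero” and the use of a volatile pointer come from the description above; the body is an illustrative C99 rendering and is not taken from the reference code.

```c
#include <assert.h>
#include <stddef.h>

/* Overwrite len bytes at buf with zeros. Writing through a
 * volatile-qualified pointer makes each store an observable side effect,
 * so the compiler may not drop the loop as a dead store. */
void secure_memzero(void *buf, size_t len) {
    volatile unsigned char *p = (volatile unsigned char *)buf;
    assert(buf != NULL || len == 0);
    while (len--) *p++ = 0;
}
```

Where the platform provides one, a dedicated primitive such as `explicit_bzero` (glibc, BSD) or `memset_s` (C11 Annex K) serves the same purpose.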

These mitigations are validated through formal analysis and empirical testing. The absence of secret-dependent branches or memory accesses is verified through static analysis tools, and timing uniformity is confirmed through cycle-accurate measurements on target platforms [23]. By embedding side-channel resistance into the lowest layers of the implementation, Merkle-LWE ensures that its security guarantees hold not just in the abstract mathematical model but in the messy reality of physical hardware.

6. Experimental Setup

To rigorously evaluate the practicality of the Merkle-LWE KEM, we adopt an experimental methodology that reflects the heterogeneous environments in which PQC is expected to be deployed. Because the central contribution of this work is a memory-first design philosophy—trading increased computation for drastic reductions in static and dynamic memory usage—it is essential to assess performance across platforms with fundamentally different resource profiles, as well as under controlled and reproducible measurement conditions. The evaluation therefore spans both high-performance and deeply embedded hardware, employs toolchains and implementation strategies tailored to each architecture, and applies a comprehensive set of measurement techniques to capture computational cost, memory footprint, energy implications, and correctness [23,38]. To place the results in context, all measurements are further compared against established post-quantum KEMs representing the current state of the art and relevant design baselines. Together, these choices ensure that the experimental results directly support the core research question of this work: whether strong post-quantum security can be achieved on severely memory-constrained devices without sacrificing correctness or cryptographic robustness.

6.1. Target Platforms and Hardware Configuration

The experimental evaluation of the Merkle-LWE KEM is conducted across two distinct hardware platforms that represent the primary deployment environments for PQC in real-world systems: a high-performance desktop processor for server, gateway, or edge-compute roles, and a representative embedded microcontroller for deeply resource-constrained IoT and edge devices. This dual-platform approach ensures that the performance and memory characteristics of our implementation are evaluated under conditions that reflect both ends of the spectrum where PQC is expected to be deployed—from data centres to battery-powered sensors [14,16].

The desktop platform is the primary development system, which serves as a realistic proxy for any modern x86-64 server or workstation environment. It is equipped with an AMD Ryzen 5 5600X processor (Advanced Micro Devices, Inc., Santa Clara, CA, USA), a 6-core/12-thread CPU based on the Zen 3 microarchitecture, operating at a base clock frequency of 3.7 GHz and capable of boosting up to 4.6 GHz. The system has 32 GB of DDR4-3200 RAM, providing ample memory headroom for general-purpose cryptographic operations. This platform is representative of the class of machines that will likely act as PQC gateways, TLS terminators, or key management servers in a post-quantum infrastructure [30]. The Ryzen 5000 series supports the AVX2 instruction set but, being based on Zen 3, does not implement AVX512, which AMD introduced with Zen 4. On this platform the accelerated polynomial arithmetic therefore runs through the AVX2 code path, while the AVX512 routines apply only to processors that provide those extensions [12]. The operating system is Ubuntu 22.04 LTS, a standard Linux distribution that provides a stable and well-supported environment for cryptographic benchmarking. This platform allows us to evaluate the performance of our vectorized routines and to measure memory traffic using hardware performance counters, providing insights into the computational overhead of our memory-first design when resources are not severely constrained.

The embedded platform is modelled after the ARM Cortex-M4 microcontroller architecture, which is one of the most widely deployed 32-bit cores in the embedded and IoT ecosystem [20,23]. The Cortex-M4 is found in countless commercial products from vendors such as STMicroelectronics (STM32F4 series), Nordic Semiconductor (nRF52 series), Texas Instruments (MSP432), and NXP (Kinetis). These devices typically feature clock speeds ranging from 48 MHz to 168 MHz, flash memory capacities from 128 KB to 1 MB, and SRAM capacities from 32 KB to 192 KB. Our implementation is written specifically for the ARMv7E-M architecture specification, which defines the Cortex-M4’s instruction set, including its optional single-precision floating-point unit (FPU) and DSP extensions. The presence of a single-cycle 32-bit hardware multiplier makes the Cortex-M4 particularly well-suited for the modular arithmetic required in lattice-based cryptography [37,60]. Our code includes hand-optimized inline assembly that exploits this multiplier to accelerate core operations like modular reduction and polynomial coefficient multiplication. While we do not target a specific vendor’s board, the memory and performance characteristics of our implementation are validated against the typical constraints of this class of device: namely, the ability to operate within a stack budget of 8–24 KB and a total RAM footprint that leaves room for application logic and network buffers [14,15]. This platform represents the critical frontier for PQC adoption, as billions of such devices are already deployed in security-sensitive roles (e.g., industrial control, medical sensors, smart meters) but are currently excluded from post-quantum migration due to the large memory requirements of standardized schemes [36].

Both platforms were configured to ensure consistent and reproducible measurements. On the desktop, CPU frequency scaling was disabled using the “cpupower” utility (“sudo cpupower frequency-set --governor performance”) to lock the processor at its base frequency, eliminating variability caused by dynamic voltage and frequency scaling (DVFS). All non-essential background processes were terminated to minimize OS jitter. On the embedded target, the microcontroller was configured to run in a minimal bare-metal environment with only the core timer and debug interface enabled; all other peripherals (UART, SPI, ADC, etc.) were powered down to reduce electrical noise and ensure that cycle counts reflected only the cryptographic workload. The system clocks were configured to their maximum stable frequencies (3.7 GHz for the Ryzen, 168 MHz for the Cortex-M4) to provide a realistic upper bound on performance. This careful configuration ensures that the benchmark results accurately reflect the intrinsic performance of the cryptographic algorithms, not artifacts of the measurement environment.

6.2. Implementation Environment and Toolchain

The Merkle-LWE KEM implementation is engineered as a portable, self-contained C99 library with minimal external dependencies, enabling deployment across a wide spectrum of target platforms—from resource-constrained embedded microcontrollers to high-performance desktop systems [38]. The codebase is structured to support both generic compilation and platform-specific optimizations through conditional compilation directives, ensuring optimal performance without sacrificing portability.

For desktop and server environments, the primary development and benchmarking platform is an AMD Ryzen 5 5600X system equipped with 32 GB of DDR4-3200 RAM running Ubuntu 22.04 LTS. On this platform, the implementation leverages the GNU Compiler Collection (GCC) version 11.4.0 with aggressive optimization flags (“-O3 -march=native”) to enable all instruction set extensions available on the host, including AVX2 (and AVX512 on processors that provide it). The use of “-march=native” allows the compiler to target the most capable vector instructions the host supports for polynomial arithmetic and hash operations, significantly accelerating core lattice computations [12]. Entropy collection on Linux systems is performed through the “getrandom” system call, which provides cryptographically secure randomness directly from the kernel’s entropy pool.

For ARM Cortex-M4 embedded targets, the implementation is compiled using the GNU Arm Embedded Toolchain (version 10.3-2021.10), which includes a GCC 10.3 frontend and a “newlib” C library tailored for bare-metal embedded applications [23]. The compilation flags (“-O3 -mcpu=cortex-m4 -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard”) are carefully selected to maximize code density and performance while leveraging the Cortex-M4’s single-cycle 32-bit hardware multiplier and optional floating-point unit [60]. The implementation includes hand-optimized inline assembly routines for critical arithmetic operations, such as modular reduction and sparse polynomial multiplication, which are designed to minimize pipeline stalls and register pressure [21,37]. Entropy is sourced from the platform’s hardware True Random Number Generator (TRNG) peripheral when available, falling back to a software CSPRNG seeded from system jitter if necessary.

The build system is based on CMake, which facilitates cross-compilation and dependency management across different architectures. The library exposes a clean, opaque-pointer-based API which hides internal state and prevents direct manipulation of sensitive data structures. This design enhances security by encapsulating implementation details and simplifies integration into higher-level cryptographic protocols. All dynamic memory allocations are explicit and bounded, with no reliance on runtime heap allocation during cryptographic operations—temporary buffers are either stack-allocated or pre-allocated at context initialization time to ensure deterministic memory usage and prevent fragmentation on embedded systems [16].

Critical cryptographic primitives are implemented in-house to maintain control over security properties and side-channel resistance. The SHA3-256 and SHA3-512 hash functions are implemented using a constant-time Keccak permutation, avoiding any data-dependent branches or memory accesses [33]. The ChaCha20 stream cipher serves as the primary pseudorandom number generator (PRNG), chosen for its excellent performance on both 32-bit microcontrollers and 64-bit desktop processors, as well as its provable constant-time execution [49,50]. The discrete Gaussian sampler uses a simplified Box-Muller transform with rejection sampling, bounded to prevent timing leakage through variable loop iterations [26].

This multi-platform toolchain strategy ensures that the Merkle-LWE KEM can be deployed in heterogeneous environments while maintaining consistent security guarantees and predictable performance characteristics across all supported architectures.

The experimental evaluation is conducted on two hardware platforms selected to represent the principal deployment environments for PQC, ranging from general-purpose systems to deeply embedded devices. The relationship between these platforms, their execution environments, and the associated toolchains is summarized in Figure 4. The desktop system reflects a typical x86-64 server or gateway configuration, while the embedded target models the constraints of an ARM Cortex-M4-class microcontroller. This contextualization supports the interpretation of the benchmark results presented later in the article by making explicit the hardware and software assumptions under which performance, memory usage, and energy consumption are measured.

6.3. Measurement Methodology

A rigorous and multi-faceted measurement methodology was employed to evaluate the Merkle-LWE KEM across four key performance dimensions: computational cost, memory footprint, energy consumption, and correctness. Each metric was captured using platform-appropriate instrumentation to ensure accuracy and reproducibility.

Computational cost was measured in CPU cycles using high-resolution hardware performance counters. On the x86-64 desktop platform, the “rdtsc” instruction provided cycle-granular timestamps. On the ARM Cortex-M4, the Data Watchpoint and Trace (DWT) cycle counter was used to capture exact cycle counts [23]. For each operation (key generation, encapsulation, decapsulation), 1000 iterations were executed after a 100-iteration warm-up phase to mitigate cache effects. The median cycle count was reported to suppress the influence of outliers caused by OS jitter or interrupt handling. All measurements were conducted with CPU frequency scaling disabled to ensure consistent clock rates.
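The warm-up-then-median procedure can be sketched as follows. The cycle source is abstracted behind a function pointer so that the same harness can wrap `__rdtsc()` on x86-64 or a read of `DWT->CYCCNT` on a Cortex-M4; the names `bench_median_cycles`, `cycle_fn`, and `op_fn` are hypothetical and do not describe the actual benchmark code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BENCH_WARMUP 100   /* warm-up iterations, as stated in the text */
#define BENCH_ITERS  1000  /* measured iterations, as stated in the text */

typedef uint64_t (*cycle_fn)(void);  /* returns the current cycle count */
typedef void (*op_fn)(void);         /* operation under test */

static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts v in place. For even n, the midpoint of the
 * two central values (rounded down) is returned. */
uint64_t median_u64(uint64_t *v, size_t n) {
    assert(v != NULL && n > 0);
    qsort(v, n, sizeof v[0], cmp_u64);
    if (n % 2) return v[n / 2];
    return v[n / 2 - 1] + (v[n / 2] - v[n / 2 - 1]) / 2;
}

/* Run op BENCH_WARMUP times unmeasured, then BENCH_ITERS times measured,
 * and return the median cycle count per call. */
uint64_t bench_median_cycles(op_fn op, cycle_fn now) {
    static uint64_t samples[BENCH_ITERS];
    size_t i;
    for (i = 0; i < BENCH_WARMUP; i++) op();
    for (i = 0; i < BENCH_ITERS; i++) {
        uint64_t t0 = now();
        op();
        samples[i] = now() - t0;
    }
    return median_u64(samples, BENCH_ITERS);
}
```

Reporting the median rather than the mean keeps a handful of interrupted runs from shifting the result.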

Memory footprint was evaluated along two axes: static code size (flash/ROM usage) and dynamic RAM usage [15]. Static size was determined by analysing the “.text” section of the compiled ELF binary using the “size” command, providing an accurate measure of the code footprint required for embedded deployment. Dynamic RAM usage was measured as peak heap and stack consumption during cryptographic operations. On Linux, Valgrind’s Massif tool profiled heap allocations, while stack usage was estimated from linker map files and verified via watermark techniques on embedded targets. The implementation avoids dynamic allocation during KEM operations, so peak RAM usage is effectively the sum of the largest temporary buffer and the size of the call stack.

Energy consumption was modelled using empirically derived energy constants for a typical 28 nm IoT device: 0.1 µJ per 64-byte memory access, 0.0001 µJ per CPU cycle, 1.0 µJ per hash operation, and 0.5 µJ per PRNG expansion [19,20]. These values, drawn from published literature on embedded power profiling, allowed us to estimate total energy per operation from our cycle and memory traffic data. Execution time was measured using “clock_gettime” on Linux and the Cortex-M4’s SysTick timer, enabling average power (µW) to be derived from energy (µJ) and time (ms), where 1 µJ/ms corresponds to 1000 µW.

Correctness and reliability were validated through extensive randomized testing. Over 66,600 KEM trials were conducted across all three security levels, with shared secrets compared for bit-for-bit equality between encapsulation and decapsulation. Decryption failure rates were computed, and 95% confidence intervals were calculated using the Wilson score method [24]. In cases where no failures were observed, the “rule of three” was applied to establish an upper bound on the failure rate (e.g., 3/50,000 = 6.0 × 10⁻⁵ for 50,000 trials). Edge-case testing included fixed-seed reproducibility checks, maximum-parameter stress tests, and fault injection scenarios to ensure robust error handling. All benchmarks were executed in a controlled laboratory environment with thermal throttling disabled. On the desktop, background processes were minimized, and CPU affinity was pinned to a single core. On embedded boards, all non-essential peripherals were powered down to reduce electrical noise.
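For the zero-failure case, both bounds have closed forms that are easy to check. The Wilson upper limit with observed rate 0 simplifies to $z^{2}/(n + z^{2})$, and the rule of three gives $3/n$. The sketch below evaluates both; the function names are hypothetical, and z = 1.96 is the usual two-sided 95% quantile.

```c
#include <assert.h>

/* Rule of three: approximate 95% upper bound on the failure rate when
 * zero failures are observed in `trials` independent trials. */
double rule_of_three_upper(unsigned long trials) {
    assert(trials > 0);
    return 3.0 / (double)trials;
}

/* Wilson score upper limit for zero observed failures. Setting the
 * observed rate to 0 in the Wilson formula leaves z^2 / (n + z^2). */
double wilson_upper_zero_failures(unsigned long trials, double z) {
    assert(trials > 0);
    return (z * z) / ((double)trials + z * z);
}
```

For 50,000 failure-free trials the rule of three gives 6.0 × 10⁻⁵, matching the example above, while the Wilson limit with z = 1.96 is slightly larger, at about 7.7 × 10⁻⁵.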

6.4. Reference Schemes for Comparison

To contextualize the performance and memory characteristics of the Merkle-LWE KEM, our experimental evaluation includes comparisons against three well-established reference schemes that represent different points in the post-quantum cryptographic design space. These references were selected to provide meaningful benchmarks for both security equivalence and implementation maturity, ensuring that our claims about memory efficiency and computational trade-offs are grounded in realistic, state-of-the-art alternatives.

The primary reference is CRYSTALS-Kyber, specifically the Kyber768 parameter set, which corresponds to NIST Security Level 3 (security comparable to AES-192) [5,13]. Kyber was selected as the standard KEM in the NIST PQC standardization process and represents the current gold standard for lattice-based KEMs in terms of security, performance, and implementation robustness. Its public key size of 1184 bytes and private key size of 2400 bytes serve as a critical baseline for evaluating the memory footprint claims of any new lattice-based construction [13]. We use the optimized implementation from the PQClean project, compiled with the same toolchain and optimization flags as our Merkle-LWE implementation to ensure a fair comparison.

As a second point of reference, we include a traditional, non-hybrid Module-LWE KEM that closely mirrors the core structure of Kyber but without the final NTT optimizations or compression techniques. This “baseline LWE” scheme explicitly stores the full public matrix $A \in \mathbb{Z}_{q}^{n \times n}$ and dense secret vectors, resulting in a public key size of approximately 265 KB for the equivalent of NIST Level 3. This reference is not intended to represent a practical deployed system but rather to isolate and quantify the impact of the two key innovations in our design: (1) seed-based deterministic generation of $A$, and (2) sparse, commitment-based representation of secrets and errors [29,57]. By comparing against this baseline, we can attribute specific memory savings directly to our architectural choices.

Finally, to provide context within the broader PQC landscape, we also consider NTRU (specifically the “ntruhps2048677” parameter set), a finalist in the NIST PQC standardization process that uses a different hardness assumption (based on structured lattices derived from polynomial rings) [41,42]. While NTRU’s key sizes (public key: ~1218 bytes, private key: ~1306 bytes) are comparable to Kyber’s, its underlying algebraic structure and lack of reliance on error sampling offer a useful contrast in implementation complexity and side-channel resistance. However, given that our work is firmly rooted in the Module-LWE framework, NTRU serves primarily as an external validation point rather than a direct competitor.

All reference schemes were evaluated on the same hardware platforms (AMD Ryzen 5 5600X and ARM Cortex-M4) using identical measurement methodologies for cycle counts, peak RAM usage, and memory traffic. This ensures that any observed differences in performance or resource consumption are attributable to the intrinsic properties of the schemes themselves, not artifacts of the benchmarking environment [23]. The choice to focus primarily on Kyber and a traditional LWE baseline reflects our core research goal: to demonstrate that a memory-first design philosophy can yield dramatic storage reductions within the lattice-based paradigm without sacrificing the strong security guarantees that have made schemes like Kyber the foundation of the post-quantum transition.

7. Experimental Evaluation

This section presents a detailed experimental evaluation of the proposed cryptographic scheme’s performance characteristics. The analysis is grounded exclusively in the empirical data collected during benchmarking on a representative embedded platform, specifically the NUCLEO-L4R5ZI board equipped with an ARM Cortex-M4 processor. This platform serves as a standard for comparing various post-quantum KEMs within the pqm4 project [23]. The evaluation compares the proposed scheme against three critical benchmarks: a traditional LWE KEM; CRYSTALS-Kyber, which has been standardized by the National Institute of Standards and Technology (NIST) as ML-KEM; and the lattice-based algorithm NTRU [5,42]. These comparisons provide a robust context for understanding the scheme’s efficiency across multiple dimensions crucial for deployment in resource-constrained environments like the Internet of Things (IoT).

7.1. Parameter Alignment and Security Context for Comparative Evaluation

To ensure that Merkle-LWE’s memory efficiency is evaluated against standardized alternatives, all comparative results are aligned to the NIST security level framework. Table 5 summarizes the parameter correspondence between Merkle-LWE, CRYSTALS-Kyber, and NTRU across the three target security levels. All schemes are evaluated under equivalent security targets, with Merkle-LWE parameters selected to match the concrete hardness assumptions of the corresponding standardized constructions.

Several observations follow from Table 4 and Table 5. First, Merkle-LWE achieves a public key size of 96 bytes across all security levels, compared to significantly larger key sizes in Kyber and NTRU at equivalent security targets, while maintaining identical lattice dimension $n = 256$ and modulus $q = 3329$ for Levels 1–3. This indicates that the reduction in memory footprint arises from structural representation rather than parameter modification. Second, the use of sparse secrets with Hamming weight $w \ll n$ is consistent with known results on sparse Module-LWE hardness; in particular, Bindel et al. [18] show that Module-LWE remains hard under controlled sparsity for $w \geq 48$ when $n = 256$, and the proposed parameters satisfy this condition at all security levels. Third, the inclusion of the SHA3-512-based commitment layer provides an additional security assumption based on collision resistance, which remains well above the 128-bit security threshold at Level 1 [33].

These results indicate that Merkle-LWE achieves reduced public key sizes at equivalent security levels when compared to Kyber and NTRU. This behavior is attributable to the combined effect of seed-based matrix reconstruction, sparse secret representation, and hash-based commitment of error structures, all of which reduce explicit storage requirements while preserving alignment with standard Module-LWE hardness assumptions.

7.2. Cryptographic Object Size and Structure

The evaluation of Merkle-LWE’s memory efficiency is conducted using two complementary comparison frameworks. The first considers a traditional Module-LWE baseline in which the public matrix $A$ and secret vectors are stored explicitly without compression. This baseline is used to isolate and quantify the contribution of the main architectural components, namely seed-based deterministic generation of $A$, structured sparse secret representation, and Merkle-tree-based commitment for error verification. By comparing against this uncompressed reference, individual memory savings can be attributed to each design choice, providing a controlled analysis of structural trade-offs. The second framework focuses on practical deployment relevance and compares Merkle-LWE against standardized lattice-based schemes, including CRYSTALS-Kyber (ML-KEM), and against NTRU, under identical NIST security levels (Levels 1, 3, and 5), parameter sets, and implementation assumptions. All comparisons rely on standardized parameter definitions from the respective specification documents, ensuring that efficiency gains are evaluated against established post-quantum baselines.

As shown in Figure 5, the total footprint of the Merkle-LWE KEM is reduced by 99.3% compared to a conventional Module-LWE implementation. This dramatic reduction is not achieved through parameter weakening or security margin erosion, but through a principled architectural shift: the replacement of large, explicit lattice data with compact seeds and hash-based commitments. The public key, which in a traditional scheme would store the full public matrix $A \in \mathbb{Z}_{q}^{n \times n}$ and vector $b$, is compressed from over 263 KB to a mere 96 bytes. Similarly, the private key shrinks from 1024 bytes to 160–224 bytes, depending on the security level. While the ciphertext increases modestly, from 1028 bytes to 1504 bytes at Level 1, this overhead is a deliberate and justifiable trade-off for the massive savings in key material.

Figure 6 provides a granular breakdown of this transformation. The public key’s 96-byte structure is divided into two parts: a 32-byte seed for deterministic matrix generation and a 64-byte Merkle root that commits to the entire space of possible error patterns. This design embodies the core thesis of the work: instead of storing hundreds of kilobytes of pseudorandom data, the system stores only the minimal entropy (the seed) and a cryptographic commitment (the root) that allows for efficient verification without storage [29]. The private key’s composition is equally revealing: it consists of a 32-byte error seed, a 32-byte hash of the public key (for CCA security), and a 96-byte sparse representation of the secret polynomial. Critically, the secret is not stored as a dense vector of 256 coefficients, but as a list of only 48 non-zero indices and values, a structured sparsity that reduces storage by 84.4% while preserving the hardness guarantees of the underlying LWE problem [18,57].
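The byte counts above are consistent with a two-byte encoding per non-zero coefficient (96 = 2 × 48). If the same encoding is assumed at Hamming weights 64 and 80, the private key comes to 192 and 224 bytes, which matches the quoted 160–224 byte range. The sketch below encodes this accounting; the one-byte index, one-byte signed value layout and all identifiers are assumptions made for illustration, not the documented wire format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEED_BYTES   32  /* matrix seed (public) or error seed (private) */
#define ROOT_BYTES   64  /* Merkle root, SHA3-512 digest */
#define PKHASH_BYTES 32  /* hash of the public key kept in the private key */

/* One non-zero coefficient: assumed 8-bit index (n = 256) and 8-bit value. */
typedef struct {
    uint8_t index;
    int8_t  value;
} sparse_entry;

size_t mlwe_pk_bytes(void) { return SEED_BYTES + ROOT_BYTES; }

/* Private key: error seed + public-key hash + 2 bytes per non-zero term. */
size_t mlwe_sk_bytes(size_t w) { return SEED_BYTES + PKHASH_BYTES + 2 * w; }

/* Serialize w sparse entries into 2*w bytes. */
void sparse_encode(const sparse_entry *e, size_t w, uint8_t *out) {
    size_t i;
    assert(e != NULL && out != NULL);
    for (i = 0; i < w; i++) {
        out[2 * i]     = e[i].index;
        out[2 * i + 1] = (uint8_t)e[i].value;
    }
}

/* Inverse of sparse_encode. */
void sparse_decode(const uint8_t *in, size_t w, sparse_entry *e) {
    size_t i;
    assert(in != NULL && e != NULL);
    for (i = 0; i < w; i++) {
        e[i].index = in[2 * i];
        e[i].value = (int8_t)in[2 * i + 1];
    }
}
```

Under these assumptions `mlwe_pk_bytes()` returns 96, and `mlwe_sk_bytes(48)`, `mlwe_sk_bytes(64)`, and `mlwe_sk_bytes(80)` return 160, 192, and 224, respectively.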

The ciphertext’s structure reflects the cost of this verification model. Of its 1504 bytes, 1024 bytes (68.1%) are the bit-packed LWE samples (u, v), which constitute the core cryptographic payload. A further 448 bytes (29.8%) form the Merkle authentication path, which proves that the encapsulated shared secret was derived from a valid, committed error pattern; the remaining 32 bytes (2.1%) are not itemized in this breakdown. The authentication path is the main source of the ciphertext’s size increase, but it is a necessary component of the security model. Without this path, the receiver would have no way to verify that the sender’s claimed error vector is consistent with the original commitment, opening the door to potential forgery attacks [24].

Figure 7 places these results in context by comparing them against both the traditional baseline and the NIST PQC finalists. On a linear scale, the Merkle-LWE KEM’s keys are nearly invisible next to the multi-hundred-kilobyte footprint of traditional LWE. On a logarithmic scale, the separation is even more stark: the public key resides three orders of magnitude below its traditional counterpart. This is not a marginal improvement but a categorical shift in feasibility for embedded systems. A 96-byte public key can be stored in the SRAM of even the most constrained microcontrollers (e.g., ARM Cortex-M0 with 8 KB RAM), whereas a 263 KB key cannot fit in the flash memory of many IoT devices [14,16].

The cause of this size reduction is directly traceable to the three pillars of the hybrid architecture. First, seed-based matrix generation eliminates the need to store A explicitly. The 32-byte seed, when fed into a cryptographically secure PRNG like ChaCha20, can regenerate any row of A on demand, shifting the cost from static storage to dynamic computation [49]. Second, structured sparsity in the secret key exploits the fact that LWE remains hard even with sparse secrets, allowing the private key to store only the non-zero coefficients and their positions [6]. Third, the Merkle tree commitment layer replaces the storage of all possible error vectors with a single root hash, enabling verification via succinct authentication paths rather than exhaustive comparison [35].
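The first pillar, regenerating rows of A from a seed, can be sketched as follows. Python's standard library has no ChaCha20, so this sketch swaps in SHAKE-128 as the expander; the modulus q = 3329, the row length of 256, and the function name are illustrative assumptions rather than the paper's parameters.

```python
import hashlib

Q = 3329   # assumed modulus, for illustration only
N = 256    # assumed number of coefficients per row

def expand_row(seed: bytes, row: int, n: int = N, q: int = Q):
    """Deterministically derive row `row` of A from the public seed."""
    out_len = 4 * n
    while True:
        xof = hashlib.shake_128(seed + row.to_bytes(2, "little"))
        stream = xof.digest(out_len)
        coeffs = []
        for off in range(0, len(stream) - 1, 2):
            cand = int.from_bytes(stream[off:off + 2], "little") & 0x0FFF
            if cand < q:  # rejection sampling keeps accepted values uniform
                coeffs.append(cand)
                if len(coeffs) == n:
                    return coeffs
        out_len *= 2  # very unlikely: too few accepted samples, extend stream

seed = bytes(range(32))      # stand-in for the 32-byte public seed
row3 = expand_row(seed, 3)   # any row can be rebuilt without storing A
```

Because each row depends only on the seed and its index, a caller can process one row at a time and discard it, trading storage for recomputation.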

In conclusion, the experimental data validates the central hypothesis of this work: that cryptographic object size can be drastically reduced through representational innovation rather than parameter compromise. The Merkle-LWE KEM achieves a 99.3% reduction in total key material by restructuring the information content of its objects—replacing explicit data with seeds, sparse encodings, and hash commitments. This transformation makes PQC viable on platforms previously considered infeasible, without sacrificing the IND-CCA security guarantees required for real-world deployment [5,15]. The modest ciphertext overhead is a transparent and acceptable price for the massive gains in storage efficiency, particularly in environments where flash memory is scarcer than CPU cycles.

7.3. Static Code Footprint (Flash/ROM)

Figure 8 presents a modular breakdown of the Merkle-LWE KEM implementation’s static code footprint, measured in bytes and expressed as percentages of the total compiled size. The total flash usage amounts to 27,136 bytes (26.5 KB), confirming the implementation’s compatibility with embedded platforms featuring 256 KB or more of flash memory, such as ARM Cortex-M4 and Cortex-M7 [14]. The largest contributor is the Lattice Arithmetic module, consuming 8192 bytes (8.0 KB), or 30.2% of the total. This reflects the computational complexity of polynomial arithmetic, bit packing, and modular operations inherent to lattice-based cryptography [12]. Despite its dominance, the footprint remains bounded due to the use of sparse secrets and seed-based matrix generation, which eliminate bulky precomputed tables.

The Hash Functions module occupies 6144 bytes (6.0 KB), or 22.6%, driven by the inclusion of both SHA3-256 and SHA3-512. These are essential for Merkle tree construction and commitment verification, and their presence underscores the hybrid scheme’s reliance on hash-based security primitives [31]. The PRNG (ChaCha20) module accounts for 4096 bytes (4.0 KB), or 15.1%. Its relatively compact footprint, combined with constant-time behaviour and stream-oriented design, makes it well-suited for embedded environments [49]. It supports deterministic matrix generation and sparse polynomial expansion without introducing timing leakage.

The Merkle Tree logic contributes 3072 bytes (3.0 KB), or 11.3%, representing the overhead of node hashing, path generation, and verification. This module is central to the scheme’s memory efficiency, enabling public key compression via cryptographic commitments [35]. The KEM Core Logic module, responsible for encapsulation and decapsulation routines, occupies 2560 bytes (2.5 KB), or 9.4%. The Sparse Polynomial module adds 2048 bytes (2.0 KB), or 7.5%, supporting deterministic secret generation and indexing. Finally, Memory Management routines consume 1024 bytes (1.0 KB), or 3.8%, covering secure clearing, allocation, and error handling [38].
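The module shares can be recomputed from the byte counts with a few lines of Python. This sketch takes Memory Management as 1024 bytes, the value consistent with the stated 27,136-byte total and 3.8% share; the dictionary keys simply mirror the module names used in the text.

```python
# Static code size per module, in bytes, as reported for Figure 8.
modules = {
    "Lattice Arithmetic": 8192,
    "Hash Functions": 6144,
    "PRNG (ChaCha20)": 4096,
    "Merkle Tree": 3072,
    "KEM Core Logic": 2560,
    "Sparse Polynomial": 2048,
    "Memory Management": 1024,
}

total = sum(modules.values())
shares = {name: round(100 * size / total, 1) for name, size in modules.items()}

for name, size in modules.items():
    print(f"{name:<20} {size:>6} B  {shares[name]:>5.1f}%")
print(f"{'Total':<20} {total:>6} B  ({total / 1024:.1f} KB)")
```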

The distribution is well-balanced, with no single module exceeding one-third of the total footprint. This modularity facilitates cache locality and function-level optimization [22]. The green annotation in the figure highlights the implementation’s embedded suitability, confirming that the total footprint fits comfortably within the flash constraints of Cortex-M-class MCUs. In summary, the figure validates that the Merkle-LWE KEM achieves a compact and modular codebase, with each component contributing proportionally to its hybrid functionality. The footprint remains well within embedded tolerances, supporting the scheme’s deployment in flash-constrained environments without compromising cryptographic integrity.

Figure 9 presents a comparative breakdown of the static code footprint between the Merkle-LWE KEM and a traditional LWE KEM implementation. The analysis highlights the flash usage of individual modules, measured in bytes, and reveals how the hybrid design reallocates code complexity across components while maintaining embedded suitability. The total footprint of the Merkle-LWE implementation is 25,088 bytes (24.5 KB), representing a 2560-byte (2.5 KB) increase over the traditional LWE KEM baseline of 22,528 bytes (22.0 KB). This ~11.4% growth is attributable to the introduction of new modules, most notably the Merkle tree logic and expanded hash functions, while other components are either retained or optimized.

The most significant reduction occurs in the Lattice Arithmetic module, which shrinks from 12,288 bytes in the traditional implementation to 8192 bytes in Merkle-LWE—a 33.3% decrease. This reflects the shift from explicit matrix storage and dense polynomial operations to seed-based generation and sparse arithmetic, which reduce both code complexity and runtime memory usage [29,57]. Conversely, the Hash Functions module expands from 4096 bytes to 6144 bytes—a 50% increase—due to the inclusion of SHA3-512 alongside SHA3-256. This expansion supports Merkle root generation and verification, which are central to the hybrid scheme’s commitment-based design.

The PRNG module also grows modestly, from 3072 bytes (AES-CTR) to 4096 bytes (ChaCha20), reflecting the adoption of a stream cipher with better constant-time properties and embedded performance [26]. While ChaCha20 incurs a larger footprint, its deterministic behaviour and cache-friendly design justify the trade-off. The Merkle Tree module, absent in the traditional implementation, introduces 3072 bytes of new logic. This overhead is offset by the elimination of bulky public key and error vector storage, enabling a 99.3% reduction in cryptographic object sizes. The Merkle tree’s inclusion marks a structural shift in how correctness and integrity are verified, replacing transmission with succinct proofs. Other modules—Memory Management and KEM Core Logic—remain comparable in size, with the latter increasing slightly (2048 → 2560 bytes) to accommodate protocol enhancements. The Sparse Polynomial module, unique to Merkle-LWE, adds 2048 bytes for deterministic secret generation and indexing.

Overall, the figure illustrates a redistribution of code complexity: Merkle-LWE reduces arithmetic overhead and matrix handling in favour of hash-based commitments and sparse representations. The net increase in footprint remains modest and well within the flash constraints of embedded platforms. Importantly, the added modules directly support the scheme’s memory-first design goals, validating the trade-off between code size and cryptographic efficiency.

Figure 10 illustrates the flash memory suitability of the Merkle-LWE KEM implementation across four representative embedded system categories: ARM Cortex-M0, Cortex-M4, Cortex-M7, and high-end microcontroller units (MCUs). The chart compares the implementation’s flash usage (29.5 KB) against the available flash capacity of each platform, highlighting the proportion of memory consumed and validating deployment feasibility.

The most constrained platform, ARM Cortex-M0, offers 64 KB of flash, of which Merkle-LWE occupies 46.1%. This near-half utilization underscores the scheme’s compactness, especially given its hybrid cryptographic structure and built-in side-channel countermeasures. While tight, this footprint remains deployable, leaving sufficient headroom for application logic, protocol buffers, and system routines.

On ARM Cortex-M4, which provides 256 KB of flash, Merkle-LWE consumes only 11.5% of the available capacity. This low utilization confirms the scheme’s compatibility with mid-tier MCUs commonly used in industrial control, medical instrumentation, and secure IoT endpoints. The remaining flash budget allows for integration with TLS stacks, firmware updates, and multi-session key management. For ARM Cortex-M7, with 512 KB of flash, the footprint drops to 5.8%, and for high-end MCUs (≥1024 KB), it reaches a minimal 2.9%. These figures demonstrate that Merkle-LWE scales efficiently across increasingly capable platforms, offering cryptographic functionality without imposing significant storage demands.
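The utilization figures follow directly from dividing the 29.5 KB footprint by each platform's flash capacity. A minimal sketch, using the capacities quoted in the text and taking 1024 KB as the representative value for the high-end category (variable names are illustrative):

```python
FLASH_USED_KB = 29.5   # footprint quoted for Figure 10

flash_capacity_kb = {
    "Cortex-M0": 64,
    "Cortex-M4": 256,
    "Cortex-M7": 512,
    "High-end MCU": 1024,
}

utilization = {name: round(100 * FLASH_USED_KB / cap, 1)
               for name, cap in flash_capacity_kb.items()}
headroom_kb = {name: cap - FLASH_USED_KB
               for name, cap in flash_capacity_kb.items()}

for name in flash_capacity_kb:
    print(f"{name:<13} {utilization[name]:>5.1f}% used, "
          f"{headroom_kb[name]:.1f} KB free")
```

The headroom column makes the Cortex-M0 case concrete: roughly 34.5 KB of flash is left for application code on a 64 KB part.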

The consistent flash usage across all categories reflects the implementation’s deterministic memory profile. Unlike traditional lattice-based schemes, which scale linearly with security level and often require large precomputed tables or explicit matrix storage, Merkle-LWE maintains a fixed codebase by leveraging seed-based generation and modular design. This architectural choice ensures predictable deployment characteristics and simplifies resource planning for embedded developers.

In summary, the figure confirms that Merkle-LWE KEM is flash-compatible across a wide spectrum of embedded platforms. Its compact footprint, modular structure, and constant-time primitives make it a viable candidate for PQC in environments where static storage is a critical constraint [15].

7.4. Peak RAM Usage

Figure 11 compares the peak RAM usage of Merkle-LWE KEM against three other post-quantum KEMs—Traditional LWE KEM, Kyber (NIST PQC finalist), and NTRU (NIST PQC alternate)—across three cryptographic operations: key generation, encapsulation, and decapsulation. The measurements are presented in bytes and benchmarked against the RAM constraints of two representative embedded platforms: ARM Cortex-M0 (32 KB) and Cortex-M4 (128 KB), indicated by horizontal dashed lines.

Merkle-LWE KEM exhibits a consistent peak RAM usage of 14,336 bytes during both key generation and encapsulation, and 10,240 bytes during decapsulation. These values remain well below the 32 KB threshold of Cortex-M0, confirming the scheme’s deployability even on the most constrained platforms. Compared to Traditional LWE KEM, which consumes 16,384 bytes for key generation and 12,288 bytes for encapsulation, Merkle-LWE achieves a modest 12.5% reduction in key generation, although its encapsulation peak is 2048 bytes higher than the baseline. The key-generation saving is primarily due to its seed-based matrix generation and sparse secret representation, which eliminate the need to store large public matrices and dense coefficient arrays [29,57].

However, Merkle-LWE’s RAM usage is notably higher than Kyber and NTRU across all operations. Kyber requires only 8192 bytes for key generation and 6144 bytes for encapsulation, while NTRU is even more compact, consuming 4096 and 3072 bytes respectively [23]. This disparity stems from Merkle-LWE’s hybrid architecture, which introduces additional memory demands for Merkle tree construction, hash-based commitments, and deterministic sparse polynomial generation. These components, while contributing to storage efficiency and security robustness, incur transient memory overhead during runtime.

Decapsulation in Merkle-LWE is the most memory-efficient phase, requiring 10,240 bytes, which is lower than its own key generation and encapsulation phases but 2048 bytes (25%) higher than Traditional LWE’s 8192 bytes. The lower figure relative to the other phases reflects the absence of matrix generation and the streamlined nature of Merkle path verification, which operates on a small working set. Sparse secret reconstruction and bit unpacking are performed in-place, minimizing buffer duplication and leveraging cache-aware scheduling [22].

Despite the higher RAM usage compared to Kyber and NTRU, Merkle-LWE maintains a deterministic and bounded memory profile. There is no dynamic allocation, recursion, or heap fragmentation, which is critical for embedded systems where predictability and stability are paramount [16]. The scheme’s memory-first design philosophy deliberately shifts cryptographic state from static storage to runtime computation, enabling a 99.3% reduction in total key material while preserving platform compatibility.

In conclusion, the figure validates that Merkle-LWE KEM operates within acceptable RAM limits across all cryptographic phases. Its memory profile reflects a conscious trade-off: higher transient RAM usage in exchange for dramatically reduced flash footprint and enhanced cryptographic integrity. This balance makes Merkle-LWE a viable candidate for PQC in embedded environments where static storage is the dominant constraint.

Figure 12 presents a stacked horizontal bar chart that decomposes memory usage by computational category across the three core operations of the Merkle-LWE KEM: key generation, encapsulation, and decapsulation. Each bar is segmented into contributions from error pattern generation, matrix operations, and hash computations, enabling a granular view of memory pressure sources and their operational distribution.

Key generation emerges as the most memory-intensive phase, consuming 14.0 KB of RAM. This is primarily driven by matrix operations and error pattern generation, which together account for the majority of the footprint. The matrix component reflects the cost of on-the-fly instantiation of the public matrix A from a seed using ChaCha20, while the error pattern segment corresponds to the generation and temporary storage of sparse error vectors [29]. The latter is particularly impactful due to its random-access nature and the need to maintain intermediate buffers for Merkle tree commitments.

Encapsulation follows closely with 13.0 KB of peak memory usage. Here, matrix operations again dominate, as the scheme performs seeded row generation and polynomial multiplication to compute the LWE sample b = As + e. Hash computations also contribute significantly, reflecting the cost of Merkle path construction and commitment generation. The absence of error pattern generation in this phase slightly reduces the overall footprint compared to key generation, but the memory profile remains dense due to concurrent polynomial and hash operations.
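To make the sparse arithmetic behind the LWE sample concrete, the following sketch multiplies a dense polynomial by a sparse secret and adds an error term. It assumes the ring Z_q[x]/(x^n + 1) with q = 3329 and n = 256, which is a common Module-LWE setting but is not stated in this section; the function names are hypothetical, and the loop is written for clarity rather than constant-time execution.

```python
Q = 3329   # assumed modulus
N = 256    # assumed ring dimension

def sparse_mul(a, sparse_s, n=N, q=Q):
    """Multiply dense polynomial `a` by a sparse secret in Z_q[x]/(x^n + 1).

    `sparse_s` is a list of (index, value) pairs for the non-zero terms,
    so the cost scales with the number of non-zero coefficients.
    """
    out = [0] * n
    for idx, val in sparse_s:
        for j, aj in enumerate(a):
            k = idx + j
            if k >= n:
                # x^n wraps around to -1 in this ring
                out[k - n] = (out[k - n] - val * aj) % q
            else:
                out[k] = (out[k] + val * aj) % q
    return out

def lwe_sample(a, sparse_s, e, q=Q):
    """Compute b = a*s + e coefficient-wise modulo q."""
    prod = sparse_mul(a, sparse_s, q=q)
    return [(p + ei) % q for p, ei in zip(prod, e)]

# Small check: x * (1 - x^255) = x - x^256 = x + 1 in this ring.
a = [0, 1] + [0] * (N - 2)      # the polynomial x
s = [(0, 1), (255, -1)]         # the sparse secret 1 - x^255
b = lwe_sample(a, s, [0] * N)
```

Only the listed non-zero terms are visited, so no dense copy of the secret is needed during the multiplication.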

Decapsulation is the least memory-intensive phase, requiring 9.0 KB of RAM. The reduced footprint is attributable to the absence of matrix generation and the streamlined nature of Merkle path verification. Hash computations remain present but are limited to inclusion proof validation, while matrix operations are minimal and confined to sparse secret reconstruction. The error pattern segment is comparatively small, as decapsulation does not involve fresh error sampling but rather verification against committed values.

The chart highlights that matrix operations are the dominant source of memory pressure across all phases, underscoring the computational cost of seed-based generation and sparse arithmetic. Hash computations, while less intensive, contribute consistently due to the hybrid scheme’s reliance on Merkle trees for correctness and integrity. Error pattern generation is concentrated in key generation, where it contributes significantly due to its sparse and non-sequential access pattern.

This decomposition validates the design’s memory-first philosophy: by shifting cryptographic state from static storage to runtime computation, Merkle-LWE achieves dramatic reductions in flash footprint while maintaining bounded RAM usage. The memory bottlenecks are predictable and phase-specific, enabling targeted optimization strategies such as buffer reuse, access pattern reordering, and layout-aware scheduling.

In conclusion, the figure confirms that Merkle-LWE’s memory usage is structurally concentrated in matrix and hash operations, with error generation contributing episodically. The scheme’s operational memory profile remains within embedded tolerances and reflects a deliberate trade-off between storage efficiency and transient memory cost. This balance supports real-world deployment on constrained platforms without compromising cryptographic robustness.

Figure 13 evaluates the RAM utilization of the Merkle-LWE KEM implementation across five representative embedded platforms: ARM Cortex-M0, Cortex-M4, Cortex-M7, ESP32, and STM32F4. The chart expresses RAM usage as a percentage of total available memory, with two horizontal thresholds—50% (orange, labelled “Good”) and 80% (red, labelled “Tight”)—indicating suitability boundaries for embedded deployment.

The most constrained platform, ARM Cortex-M0, exhibits a RAM utilization of 43.8%, placing it below the 50% threshold and confirming its viability for Merkle-LWE deployment. This is a critical validation, as Cortex-M0 devices typically operate with only 32 KB of RAM, and any cryptographic scheme exceeding half of this budget risks interfering with application logic, protocol buffers, or system routines [16]. Merkle-LWE’s bounded and deterministic memory profile ensures that even in this tight environment, the implementation remains stable and predictable.

On ARM Cortex-M4, which offers 128 KB of RAM, Merkle-LWE consumes only 10.9%, leaving ample headroom for additional cryptographic layers, secure boot logic, or real-time operating systems. Similarly, Cortex-M7 shows a minimal 2.7% utilization, while ESP32 and STM32F4 report 2.8% and 7.3%, respectively. These figures demonstrate that Merkle-LWE scales efficiently across increasingly capable platforms, with the share of RAM consumed falling as available memory grows.
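The two utilization figures whose RAM capacities are stated in the text can be reproduced as below. The sketch reads the chart's thresholds as "below 50% is good, 50% to 80% is tight, above that is unsuitable", which is an interpretation of the labels rather than a definition given in the paper; the function name is hypothetical.

```python
PEAK_RAM_BYTES = 14_336   # peak RAM during key generation (14.0 KB)

def ram_utilization(ram_kb: int):
    """Return (percent of RAM used, suitability label) for a platform."""
    pct = 100 * PEAK_RAM_BYTES / (ram_kb * 1024)
    if pct < 50:
        label = "good"
    elif pct < 80:
        label = "tight"
    else:
        label = "unsuitable"   # assumed reading of the region above 80%
    return pct, label

m0_pct, m0_label = ram_utilization(32)    # Cortex-M0, 32 KB RAM
m4_pct, m4_label = ram_utilization(128)   # Cortex-M4, 128 KB RAM
```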

The consistent checkmarks across all bars indicate that the implementation remains within acceptable limits for each platform. Importantly, the RAM usage is not only low but also phase-stable—the peak values observed during key generation, encapsulation, and decapsulation do not fluctuate unpredictably. This stability is essential for embedded systems, which often lack dynamic memory allocation and rely on static provisioning.

The suitability across platforms is a direct consequence of Merkle-LWE’s architectural choices: sparse secret representation, in-place polynomial operations, and incremental Merkle tree construction. These techniques minimize buffer duplication and avoid heap fragmentation, ensuring that the scheme’s memory footprint remains compact and bounded.

In conclusion, the figure confirms that Merkle-LWE KEM is RAM-compatible across a wide spectrum of embedded platforms. Its low and predictable memory usage supports real-world deployment in flash- and RAM-constrained environments, validating the scheme’s design goals and reinforcing its applicability to secure embedded systems.

7.5. Memory Traffic and Bandwidth Analysis

Figure 14 presents a comparative analysis of total memory traffic during the three core cryptographic operations—key generation, encapsulation, and decapsulation—for Merkle-LWE KEM and a traditional LWE KEM baseline. Memory traffic is measured in bytes and reflects the cumulative volume of data moved across memory hierarchies during execution, including reads, writes, and intermediate buffer transfers. This metric is critical for embedded systems, where bandwidth constraints and energy costs associated with memory access often dominate computational overhead [19,20].

During key generation, Merkle-LWE KEM exhibits a total memory traffic of 302,176 bytes, representing a 61.7% reduction compared to the traditional LWE KEM’s 789,504 bytes. This substantial improvement stems from Merkle-LWE’s seed-based matrix instantiation and sparse error pattern generation. Unlike traditional implementations that store and manipulate large explicit matrices and dense vectors, Merkle-LWE regenerates matrix rows on demand using a lightweight PRNG and commits to error vectors via Merkle roots [29,35]. These design choices eliminate the need for bulk memory transfers and reduce the volume of intermediate data.

Encapsulation shows near parity between the two schemes, with Merkle-LWE generating 529,088 bytes of traffic versus 531,520 bytes for traditional LWE. The marginal difference reflects the fact that both schemes perform similar polynomial multiplications and ciphertext assembly during this phase. However, Merkle-LWE’s use of sparse secrets and in-place packing slightly reduces buffer duplication and memory churn, contributing to the modest traffic savings [57].

Decapsulation reveals the most dramatic divergence: Merkle-LWE incurs 279,264 bytes of memory traffic, while traditional LWE requires only 6240 bytes, roughly a 45-fold increase. This inversion is a direct consequence of Merkle-LWE’s commitment-based verification strategy. During decapsulation, the scheme reconstructs sparse secrets from seeds, verifies Merkle paths, and performs hash-based inclusion checks, all of which involve multiple memory accesses and temporary buffers [35]. In contrast, traditional LWE performs a straightforward decryption using stored keys and ciphertexts, resulting in minimal data movement.

The figure underscores a key architectural trade-off: Merkle-LWE shifts memory traffic from key generation to decapsulation in order to minimize static storage and enhance security. This redistribution is intentional and reflects the scheme’s memory-first design philosophy. By concentrating memory traffic in verification, Merkle-LWE avoids persistent storage of large public keys and error vectors, enabling deployment on flash-constrained platforms [15].

Importantly, the overall memory traffic across all operations remains within embedded tolerances. While decapsulation incurs higher bandwidth, the cumulative traffic is offset by reductions in key generation and encapsulation. Moreover, the traffic profile is deterministic and phase-specific, allowing developers to provision memory bandwidth predictably and avoid runtime bottlenecks.

In conclusion, the figure validates that Merkle-LWE KEM achieves a favourable balance between memory traffic and storage efficiency. Its bandwidth profile reflects a deliberate reallocation of data movement to support cryptographic commitments and sparse representations. This trade-off enables secure and efficient operation in embedded environments where memory bandwidth is a critical resource.
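The per-phase and aggregate changes can be recomputed directly from the byte counts quoted above. The sketch below is plain arithmetic on those reported values; the variable and function names are illustrative.

```python
# (Merkle-LWE bytes, traditional LWE bytes) per operation, from the text.
traffic = {
    "keygen": (302_176, 789_504),
    "encaps": (529_088, 531_520),
    "decaps": (279_264, 6_240),
}

def pct_change(new: int, old: int) -> float:
    """Signed percentage change of `new` relative to `old`."""
    return 100 * (new - old) / old

changes = {op: pct_change(m, t) for op, (m, t) in traffic.items()}
total_merkle = sum(m for m, _ in traffic.values())
total_trad = sum(t for _, t in traffic.values())
total_change = pct_change(total_merkle, total_trad)
decaps_ratio = traffic["decaps"][0] / traffic["decaps"][1]

for op, delta in changes.items():
    print(f"{op}: {delta:+.1f}%")
print(f"total: {total_change:+.1f}%  (decaps ratio {decaps_ratio:.1f}x)")
```

Summed over the three operations, the saving in key generation outweighs the extra decapsulation traffic, so the aggregate change is negative even though decapsulation alone moves about 45 times more data.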

Figure 15 presents a comparative breakdown of memory traffic across three core cryptographic operations—key generation, encapsulation, and decapsulation—between Merkle-LWE KEM and a traditional LWE KEM baseline. Memory traffic is quantified in total bytes transferred, encompassing all memory reads, writes, and intermediate buffer movements during execution. This metric is particularly relevant for embedded platforms, where memory bandwidth directly impacts energy consumption, latency, and cache efficiency [19].

In the key generation phase, Merkle-LWE KEM demonstrates a substantial reduction in memory traffic, consuming 302,176 bytes compared to 789,504 bytes for traditional LWE—a 61.7% decrease. This improvement is attributed to Merkle-LWE’s seed-based matrix generation, which replaces explicit matrix storage and bulk memory loads with lightweight pseudorandom expansion [29]. Additionally, sparse error pattern generation avoids dense vector manipulation, further reducing memory movement. The result is a leaner memory footprint that aligns with the scheme’s memory-first design philosophy.

Encapsulation shows near parity between the two schemes, with Merkle-LWE generating 529,088 bytes of traffic and traditional LWE producing 531,520 bytes. This similarity reflects the shared computational structure of both schemes during ciphertext generation, including polynomial multiplication and bit packing. However, Merkle-LWE’s use of sparse secrets and in-place operations slightly reduces buffer duplication, contributing to marginal traffic savings.

Decapsulation reveals a stark contrast: Merkle-LWE incurs 279,264 bytes of memory traffic, while traditional LWE requires only 6240 bytes, roughly a 45-fold increase. This inversion is a direct consequence of Merkle-LWE’s commitment-based verification model. Instead of relying on stored secrets, the scheme reconstructs sparse polynomials from seeds and verifies correctness via Merkle path traversal and hash-based inclusion proofs [35]. These operations, while computationally efficient, involve multiple memory accesses and temporary buffers, resulting in elevated traffic during decapsulation.

The figure also highlights a fundamental architectural shift: Merkle-LWE trades static memory loads for dynamic PRNG expansion. Traditional LWE relies heavily on sequential reads of precomputed matrices and vectors, leading to high memory traffic concentrated in key generation. In contrast, Merkle-LWE distributes memory usage more evenly across operations, with PRNG expansion replacing bulk loads and enabling on-demand computation [49]. This redistribution reduces peak traffic and improves cache locality, particularly in resource-constrained environments.

In summary, the figure validates that Merkle-LWE achieves a favourable memory traffic profile by replacing expensive memory loads with controlled pseudorandom expansion. While decapsulation incurs higher traffic due to verification logic, the overall bandwidth remains within embedded tolerances and supports predictable provisioning. This trade-off reinforces Merkle-LWE’s suitability for flash-constrained platforms, where minimizing static storage and optimizing memory movement are critical for secure and efficient deployment.

Figure 16 presents a normalized comparison of three key metrics—storage usage, CPU cycles, and memory traffic—between Merkle-LWE KEM and a traditional LWE KEM baseline. The chart highlights the architectural trade-offs inherent in Merkle-LWE’s memory-first design, quantifying the cost of computational overhead against the benefits of storage and bandwidth efficiency.

The most pronounced gain is observed in storage usage, where Merkle-LWE achieves a 99.3% reduction relative to traditional LWE. This dramatic improvement stems from the scheme’s structural reconfiguration: large public matrices and error vectors are replaced with compact seeds and Merkle-root commitments [29,35]. By eliminating explicit storage of deterministic components and leveraging on-the-fly generation, Merkle-LWE compresses public keys from hundreds of kilobytes to under 100 bytes, enabling deployment on flash-constrained embedded platforms [14].

In contrast, CPU cycle consumption increases by 100.5%, reflecting the computational cost of regenerating matrix rows, expanding sparse secrets, and verifying Merkle paths. This overhead is expected and accepted within the scheme’s design philosophy, which prioritizes storage minimization over raw throughput. The additional cycles are primarily concentrated in matrix operations and PRNG expansion, as shown in prior analyses, and are bounded within predictable limits suitable for non-time-critical embedded applications.

Memory traffic, measured as total data movement across memory hierarchies, shows a 16.3% reduction for Merkle-LWE. This improvement is achieved despite the increased computational load, due to the elimination of bulk memory loads and the use of in-place operations [22]. Traditional LWE schemes rely heavily on sequential reads of large precomputed matrices, which inflate memory traffic and degrade cache locality. Merkle-LWE replaces these with lightweight PRNG expansion and sparse access patterns, reducing bandwidth demands and improving energy efficiency [20].

The chart encapsulates the core trade-off: Merkle-LWE sacrifices computational simplicity to achieve substantial gains in storage and memory bandwidth. This reallocation of resource pressure—from flash and RAM to CPU cycles—is well-suited to embedded platforms where memory is scarce but computation is relatively abundant. The normalized metrics confirm that the scheme’s overheads are proportional and predictable, validating its suitability for constrained environments.

In summary, the figure demonstrates that Merkle-LWE KEM delivers a highly favourable cost–benefit profile. Its storage efficiency is unmatched among lattice-based schemes, and its memory traffic reduction offsets the computational cost. These characteristics position Merkle-LWE as a viable and forward-looking candidate for PQC in embedded systems.

7.6. Cache Behaviour and Locality

Figure 17 presents a dual-panel analysis of cache behaviour for Merkle-LWE KEM, focusing on L1 and L2 cache miss counts across three core cryptographic operations: key generation, encapsulation, and decapsulation. The results are benchmarked against a traditional LWE KEM baseline, which exhibits zero cache misses across both levels due to its reliance on dense, sequential memory access patterns and precomputed data structures [22].

In the left panel, L1 cache behaviour is visualized. Merkle-LWE incurs 639 misses during key generation, 496 misses during encapsulation, and 23 misses during decapsulation, totalling 1158 L1 cache misses. These misses are a direct consequence of the scheme’s sparse and dynamic memory access patterns. Specifically, key generation involves randomized error pattern sampling and Merkle tree construction, both of which introduce non-linear access sequences that disrupt spatial locality. Encapsulation similarly suffers from sparse secret access and on-the-fly matrix row generation, which prevent effective cache line reuse. Decapsulation, while more memory-efficient, still incurs minor overhead due to sparse secret reconstruction and Merkle path verification [58].

In contrast, the traditional LWE KEM registers zero L1 cache misses across all operations. This is expected, as the scheme operates on preloaded matrices and dense vectors with highly predictable access patterns. These structures align well with cache line boundaries and benefit from sequential traversal, resulting in perfect cache hit rates.

The right panel confirms that no L2 cache misses were recorded for either scheme. This indicates that all working sets fit comfortably within the L1 cache (32 KB simulated), and that Merkle-LWE’s memory footprint, while more fragmented, remains bounded and does not spill into higher cache levels. This is a critical validation for embedded platforms, where L2 caches are often absent or minimal, and L1 locality is paramount for performance and energy efficiency [20].

The observed cache behaviour reflects a fundamental design trade-off. Merkle-LWE prioritizes storage efficiency through seed-based generation and sparse representations, which inherently degrade cache locality. However, the resulting L1 miss counts are predictable, bounded, and phase-specific, allowing for targeted optimization. For example, matrix generation and Merkle tree traversal could benefit from layout-aware scheduling or access pattern reordering to improve spatial locality [58].

Importantly, the absence of L2 misses and the low decapsulation overhead demonstrate that Merkle-LWE’s cache inefficiencies are concentrated in setup and encapsulation phases, which are typically less latency-sensitive in embedded applications. The scheme’s constant-time execution and avoidance of secret-dependent branching are further intended to keep these cache misses from translating into secret-dependent timing variation, preserving side-channel resilience [27].

In summary, the figure validates that Merkle-LWE KEM introduces controlled cache overhead as a consequence of its memory-first architecture. While L1 miss rates are elevated relative to traditional schemes, they remain within acceptable bounds and do not propagate to higher cache levels. This behaviour supports the scheme’s suitability for embedded deployment, where predictable memory access and bounded cache pressure are essential for secure and efficient operation.

Figure 18 presents a component-level analysis of L1 cache miss rates in the Merkle-LWE KEM implementation, segmented by locality pattern. The chart categorizes six computational components—hash computations, bit packing/unpacking, Merkle tree traversal, on-the-fly matrix generation, compressed error storage, and sparse secret access—according to their memory access behaviour: sequential, tree-like, or random. This breakdown provides insight into how structural design choices influence cache performance and guides optimization strategies for embedded deployment [22].

Components with sequential access patterns exhibit excellent cache locality. On-the-fly matrix generation incurs only a 2.0% miss rate, as rows are expanded linearly from a seed using ChaCha20, aligning well with cache line boundaries. Bit packing/unpacking and hash computations follow closely, with 3.0% and 4.0% miss rates respectively. These operations process data in contiguous blocks, minimizing cache evictions and benefiting from spatial locality. Compressed error storage, while slightly more fragmented, maintains a low 5.0% miss rate, confirming that sequential compression and decompression routines remain cache-friendly.

In contrast, components with non-linear access patterns show elevated miss rates. Merkle tree traversal, classified as tree-like, incurs a 35.0% miss rate due to hierarchical node access and frequent jumps across memory regions. While moderate, this overhead is acceptable given the logarithmic depth of Merkle trees and the infrequency of traversal operations. The most pronounced impact arises from sparse secret access, which exhibits a 95.0% miss rate under a random (1:64) access model. This reflects the scheme’s use of sparsity-aware secrets, where non-zero coefficients are scattered across a large index space, resulting in poor temporal and spatial locality [58].
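The contrast between sequential and scattered access can be illustrated with a toy cache model. The sketch below is not the simulator used for Figure 18; it assumes a fully associative LRU cache with 64-byte lines and 32 KB capacity, 2-byte coefficients, and a hypothetical 1 MiB index space for the scattered case, so its miss rates are indicative only and are not expected to reproduce the reported values.

```python
import random
from collections import OrderedDict

LINE_BYTES = 64           # assumed cache-line size
CACHE_BYTES = 32 * 1024   # 32 KB L1, matching the simulated capacity

def miss_rate(addresses, cache_bytes=CACHE_BYTES, line_bytes=LINE_BYTES):
    """Fraction of accesses that miss in a fully associative LRU cache."""
    capacity = cache_bytes // line_bytes
    cache = OrderedDict()             # keys are line numbers, ordered by recency
    misses = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // line_bytes
        if line in cache:
            cache.move_to_end(line)   # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used line
    return misses / total

# Sequential: 2-byte coefficients read back to back (one miss per 32 reads).
sequential = [2 * i for i in range(64 * 1024)]

# Scattered: 2-byte reads at random positions in a 1 MiB region (hypothetical).
rng = random.Random(0)
scattered = [2 * rng.randrange(512 * 1024) for _ in range(20000)]

print(f"sequential miss rate: {miss_rate(sequential):.3f}")
print(f"scattered miss rate:  {miss_rate(scattered):.3f}")
```

Under these assumptions the sequential stream misses once per cache line, while the scattered stream misses on most accesses because the touched region is far larger than the cache.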

The chart confirms that Merkle-LWE’s cache behaviour is pattern-dependent and structurally predictable. Sequential components dominate the runtime profile and maintain high cache efficiency, while sparse and tree-like components introduce controlled overhead. Importantly, the high miss rate of sparse secret access is mitigated by its low frequency and short duration, ensuring that overall cache pressure remains bounded.

This locality-aware decomposition validates the scheme’s architectural trade-offs. Merkle-LWE sacrifices cache uniformity in favour of memory compression and structural compactness, replacing dense vectors with sparse representations and explicit storage with cryptographic commitments. The resulting cache behaviour is not optimal in all components, but remains within tolerable limits for embedded platforms with constrained cache hierarchies.

In conclusion, the figure demonstrates that Merkle-LWE KEM achieves acceptable cache performance through careful balancing of locality patterns. Sequential operations dominate the memory footprint and maintain low miss rates, while sparse and tree-like components introduce predictable and manageable overhead. This behaviour supports the scheme’s deployment in cache-sensitive environments, reinforcing its suitability for secure embedded systems.

Figure 19 presents a comparative analysis of cache hit rates across three cryptographic operations—key generation, encapsulation, and decapsulation—for Merkle-LWE KEM and a traditional LWE KEM baseline. The chart quantifies cache efficiency as the percentage of successful L1 cache accesses, offering insight into how memory access patterns affect performance in constrained environments.

Traditional LWE KEM maintains a consistent 100.0% hit rate across all operations. This uniformity reflects its reliance on dense, sequential memory access patterns and precomputed data structures. Matrix rows and secret vectors are stored explicitly and traversed linearly, aligning well with cache line boundaries and minimizing evictions. As a result, the scheme benefits from optimal spatial locality and predictable cache behaviour.

In contrast, Merkle-LWE KEM exhibits operation-dependent cache efficiency, with hit rates of 86.9% for key generation, 89.3% for encapsulation, and 41.0% for decapsulation. The relatively high hit rates in the first two phases are sustained by sequential operations such as PRNG expansion, hash computations, and bit packing. These components process data in contiguous blocks, preserving locality and enabling effective cache utilization.

The sharp decline in decapsulation efficiency is attributed to sparse secret reconstruction, which involves random access to scattered coefficient indices. This disrupts spatial locality and leads to frequent cache line replacements, resulting in a 59.0 percentage-point drop in hit rate (from 100.0% to 41.0%) compared to the traditional baseline. While Merkle path verification also contributes to non-linear access, its impact is secondary to the sparsity-induced fragmentation [58].

Despite the reduced cache efficiency in decapsulation, the overall behaviour remains bounded and predictable. Merkle-LWE’s cache profile reflects a deliberate trade-off: it sacrifices uniform locality in favour of structural compactness and memory savings. The scheme replaces large stored vectors with seed-based generation and cryptographic commitments, reducing static footprint at the cost of dynamic access irregularity.

Importantly, the cache inefficiencies are localized and phase-specific, affecting only a subset of operations. Key generation and encapsulation maintain high hit rates, ensuring that the majority of runtime execution benefits from cache acceleration. Moreover, the decapsulation phase, while less efficient, operates on a small working set and is typically invoked less frequently in embedded applications.

In conclusion, the figure confirms that Merkle-LWE KEM introduces controlled cache overhead as a consequence of its memory-first architecture. While traditional LWE achieves perfect cache efficiency through dense storage, Merkle-LWE balances locality with compression, enabling deployment on flash-constrained platforms without exceeding cache tolerances. This behaviour validates the scheme’s suitability for embedded systems where predictable performance and bounded resource usage are critical.

7.7. Computational Cost

Figure 20 presents a comparative analysis of CPU cycle consumption across the three core cryptographic operations—key generation, encapsulation, and decapsulation—for Merkle-LWE KEM and a traditional LWE KEM baseline. The results quantify the computational overhead introduced by Merkle-LWE’s memory-efficient architecture, providing a clear view of the cost incurred by replacing static storage with dynamic computation. The note accompanying the figure explicitly clarifies that the scheme does not claim speed superiority; rather, it aims to demonstrate the computational cost of achieving memory efficiency.

In the key generation phase, Merkle-LWE consumes 2,906,456 cycles, representing a 68.8% increase over the traditional LWE KEM’s 1,721,344 cycles. This overhead is primarily attributed to the dynamic instantiation of the public matrix A from a compact seed using ChaCha20, as well as the generation of sparse error vectors and the construction of Merkle tree commitments [29,35]. Unlike traditional schemes that rely on precomputed and densely stored matrices, Merkle-LWE regenerates these structures on-the-fly, incurring additional cycles for pseudorandom expansion and polynomial arithmetic. The Merkle tree construction, while lightweight in terms of memory, introduces hash computations that further contribute to the cycle count.
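The idea of regenerating the public matrix from a compact seed can be sketched as follows. This is a minimal illustration, not the authors' implementation: it swaps in SHAKE-128 from the Python standard library for ChaCha20 (which the standard library does not provide), and the domain separation by index bytes and the 12-bit rejection sampling are assumptions borrowed from common Module-LWE practice. The parameters n = 256, q = 3329 and k = 3 are the Level 3 values quoted in this article; the function names are hypothetical.

```python
import hashlib

N, Q, K = 256, 3329, 3   # Level 3 parameters quoted in the text

def expand_poly(seed: bytes, i: int, j: int) -> list:
    """Deterministically derive the N coefficients of entry A[i][j] from the seed."""
    xof = hashlib.shake_128(seed + bytes([i, j]))  # assumed domain separation
    length = 3 * N                                 # initial stream length in bytes
    stream = xof.digest(length)
    coeffs = []
    pos = 0
    while len(coeffs) < N:
        if pos + 3 > len(stream):                  # rarely needed: extend the stream
            length *= 2
            stream = xof.digest(length)
        b0, b1, b2 = stream[pos:pos + 3]
        pos += 3
        # Two 12-bit candidates per 3 bytes; keep only those below Q.
        for cand in (b0 | ((b1 & 0x0F) << 8), (b1 >> 4) | (b2 << 4)):
            if cand < Q and len(coeffs) < N:
                coeffs.append(cand)
    return coeffs

def expand_row(seed: bytes, i: int) -> list:
    """Regenerate row i of the K x K matrix on demand instead of storing it."""
    return [expand_poly(seed, i, j) for j in range(K)]

seed = bytes(32)              # placeholder 32-byte seed, for illustration only
row0 = expand_row(seed, 0)
print(len(row0), len(row0[0]), max(max(p) for p in row0) < Q)
```

Only the seed needs to be stored; each row is recomputed when required, which is the storage-for-cycles exchange described above.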

Encapsulation shows a far larger relative overhead than key generation, with Merkle-LWE requiring 809,636 cycles compared to 268,516 cycles for traditional LWE, a 201.5% increase. This phase involves seeded matrix row generation, sparse secret vector expansion, and LWE sample computation, all of which are performed dynamically. The sparse nature of the secret vector necessitates random access and coefficient shuffling, which are more computationally intensive than the linear traversal of dense vectors. Additionally, Merkle-LWE performs bit packing and hash-based commitment generation during encapsulation, adding further complexity. These operations, while efficient in terms of memory traffic and storage, demand more CPU cycles due to their iterative and non-linear nature.
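To make the sparse-secret arithmetic concrete, the sketch below multiplies a dense polynomial by a secret stored as a short list of (index, sign) pairs. It is an illustration under stated assumptions: the ring Z_q[x]/(x^n + 1) and the ternary (+1/-1) support format are assumed rather than taken from the text, the function names are hypothetical, and the loop is not constant-time, so it should not be read as the scheme's actual routine.

```python
N, Q = 256, 3329  # ring degree and modulus quoted for Level 3

def mul_sparse(a, support):
    """Multiply dense polynomial a by a sparse ternary secret.

    support is a list of (index, sign) pairs with sign in {+1, -1}.
    Arithmetic is in Z_Q[x]/(x^N + 1) (assumed ring).
    """
    out = [0] * N
    for idx, sign in support:
        for i, coeff in enumerate(a):
            k = i + idx
            term = sign * coeff
            if k >= N:            # x^N = -1: wrap around with a sign flip
                k -= N
                term = -term
            out[k] = (out[k] + term) % Q
    return out

def mul_dense(a, s):
    """Schoolbook reference with s stored as N dense coefficients."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, sj in enumerate(s):
            k = i + j
            term = ai * sj
            if k >= N:
                k -= N
                term = -term
            out[k] = (out[k] + term) % Q
    return out

a = [(7 * i + 3) % Q for i in range(N)]        # arbitrary test polynomial
support = [(5, 1), (100, -1), (255, 1)]        # hypothetical sparse secret
dense_s = [0] * N
for idx, sign in support:
    dense_s[idx] = sign
print(mul_sparse(a, support) == mul_dense(a, dense_s))
```

The sparse form stores only the support, at the price of index-driven accesses into the output array rather than a single linear pass.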

Decapsulation follows a similar trend, with Merkle-LWE consuming 814,588 cycles versus 267,492 cycles for traditional LWE, a 204.5% increase. The overhead in this phase arises from sparse secret reconstruction, Merkle path verification, and hash-based inclusion checks. Unlike traditional schemes that perform direct decryption using stored keys and ciphertexts, Merkle-LWE reconstructs the secret from a seed and verifies correctness via cryptographic proofs. These operations, while lightweight in terms of memory footprint, require multiple hash evaluations and sparse polynomial manipulations, contributing to the elevated cycle count.
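Merkle path verification of the kind mentioned above can be sketched generically. The code below is a textbook inclusion check, not the scheme's exact encoding: the leaf and node prefixes, the power-of-two leaf count, and the function names are assumptions, and SHA3-256 is used because the article names it for commitments.

```python
import hashlib
import hmac

def h(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()

def build_levels(leaves):
    """Return all tree levels; level 0 holds leaf hashes, the last holds the root.

    Assumes len(leaves) is a power of two.
    """
    level = [h(b"\x00" + leaf) for leaf in leaves]      # assumed leaf prefix
    levels = [level]
    while len(level) > 1:
        level = [h(b"\x01" + level[i] + level[i + 1])   # assumed node prefix
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def auth_path(levels, index):
    """Collect the sibling hash at every level from leaf to root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])
        index >>= 1
    return path

def verify(leaf, index, path, root):
    """Recompute the root from a leaf and its path; compare in constant time."""
    node = h(b"\x00" + leaf)
    for sibling in path:
        if index & 1:
            node = h(b"\x01" + sibling + node)
        else:
            node = h(b"\x01" + node + sibling)
        index >>= 1
    return hmac.compare_digest(node, root)

leaves = [bytes([i]) * 4 for i in range(8)]   # placeholder leaf data
levels = build_levels(leaves)
root = levels[-1][0]
print(verify(leaves[2], 2, auth_path(levels, 2), root))
```

Verification costs one hash per tree level. With 32-byte digests, a path of depth d occupies 32·d bytes, so a 448-byte path would correspond to depth 14 under this assumed encoding.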

Overall, the figure illustrates that Merkle-LWE KEM incurs a 100.7% increase in total CPU cycles across all operations. This doubling of computational cost is a deliberate and measured trade-off for achieving a 99.3% reduction in storage usage and a 16.3% reduction in memory traffic, as shown in previous analyses. The scheme rebalances resource pressure from static memory to runtime computation, aligning with the constraints of embedded platforms where flash and RAM are scarce but CPU cycles are relatively abundant [20,21]. Importantly, all operations are implemented in constant time and avoid secret-dependent branching, preserving side-channel resistance despite the increased computational load [28].
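The per-phase and aggregate overheads follow directly from the cycle counts quoted for Figure 20. The short script below recomputes them as a consistency check; the dictionary layout and names are illustrative, and small differences in the last digit relative to quoted percentages can arise from rounding.

```python
# (Merkle-LWE cycles, traditional LWE cycles), as quoted for Figure 20
CYCLES = {
    "keygen": (2_906_456, 1_721_344),
    "encaps": (809_636, 268_516),
    "decaps": (814_588, 267_492),
}

def pct_increase(new, base):
    """Relative increase of new over base, in percent."""
    return 100.0 * (new - base) / base

per_phase = {p: round(pct_increase(m, t), 1) for p, (m, t) in CYCLES.items()}
total_merkle = sum(m for m, _ in CYCLES.values())
total_trad = sum(t for _, t in CYCLES.values())
total_increase = round(pct_increase(total_merkle, total_trad), 1)

for phase, pct in per_phase.items():
    print(f"{phase}: +{pct}%")
print(f"total: {total_merkle} vs {total_trad} cycles (+{total_increase}%)")
```

The aggregate figure is dominated by key generation, which accounts for the largest share of cycles in both schemes even though its relative increase is the smallest.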

In conclusion, the figure validates that Merkle-LWE’s computational overhead is proportional, predictable, and phase-specific. While the scheme does not aim to outperform traditional LWE in raw speed, it achieves substantial gains in memory efficiency and structural compactness. This trade-off supports its deployment in embedded environments where storage constraints outweigh cycle budgets, reinforcing Merkle-LWE’s viability as a post-quantum solution for resource-limited systems.

Figure 21 provides a granular breakdown of CPU cycle consumption across six computational components for both Merkle-LWE KEM and a traditional LWE KEM baseline. The components analysed include PRNG operations, hash computations, matrix operations, LWE sample generation, Merkle tree operations, and bit-level manipulations. This decomposition offers insight into the sources of computational overhead introduced by Merkle-LWE’s memory-efficient architecture and highlights the structural trade-offs between storage minimization and runtime complexity.

The most significant contributor to Merkle-LWE’s total cycle count is matrix operations, which consume 3,145,728 cycles, exactly double the 1,572,864 cycles required by traditional LWE. This increase stems from the scheme’s decision to regenerate matrix rows on demand rather than store them explicitly [29]. While this approach drastically reduces static storage, it necessitates repeated pseudorandom expansion and polynomial multiplication during both encapsulation and decapsulation. The cost is compounded by the use of sparse secrets, which require additional indexing and coefficient handling during multiplication, further inflating the cycle count [57].

PRNG operations also show a substantial increase, with Merkle-LWE consuming 812,160 cycles compared to 264,192 cycles for traditional LWE—a 207.4% overhead. This reflects the use of ChaCha20 for deterministic matrix and secret generation, replacing AES-CTR or similar block-based PRNGs. While ChaCha20 offers better constant-time behaviour and stream-oriented expansion, its iterative nature incurs higher computational cost [49]. The PRNG is invoked multiple times across all phases, contributing significantly to the scheme’s overall runtime.

Hash computations are relatively comparable between the two schemes, with Merkle-LWE requiring 448,800 cycles and traditional LWE consuming 416,200 cycles—a modest 7.8% increase. This overhead is attributed to Merkle-LWE’s use of SHA3-256 and SHA3-512 for commitment generation and path verification. Although hash functions are computationally intensive, their impact is bounded and predictable, and their inclusion enhances the scheme’s security posture without introducing excessive cost [31].

Two components are unique to Merkle-LWE: Merkle tree operations and bit-level manipulations, which consume 26,200 cycles and 93,680 cycles, respectively. Merkle operations involve node hashing and path traversal during encapsulation and decapsulation, while bit operations handle packing and unpacking of sparse secrets and compressed error vectors [35]. These tasks, though absent in traditional LWE, are essential for achieving Merkle-LWE’s compact key representation and memory traffic reduction. Their combined cost remains under 3% of the total cycle budget, indicating that the overhead is well-contained.
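The component-level percentages can be checked in the same way. The sketch below uses the cycle counts quoted for Figure 21 and, as an assumption, takes the Merkle-LWE total to be the sum of the three phase counts quoted for Figure 20; the names are illustrative.

```python
# (Merkle-LWE cycles, traditional LWE cycles), as quoted for Figure 21
SHARED = {
    "matrix": (3_145_728, 1_572_864),
    "prng": (812_160, 264_192),
    "hash": (448_800, 416_200),
}
MERKLE_ONLY = {"merkle_tree": 26_200, "bit_ops": 93_680}

# Assumed total: key generation + encapsulation + decapsulation (Figure 20)
TOTAL_MERKLE = 2_906_456 + 809_636 + 814_588

overhead_pct = {name: round(100.0 * (m - t) / t, 1) for name, (m, t) in SHARED.items()}
unique_share_pct = 100.0 * sum(MERKLE_ONLY.values()) / TOTAL_MERKLE

print(overhead_pct)
print(f"Merkle-only components: {unique_share_pct:.2f}% of total cycles")
```

Under that assumption, the two Merkle-specific components together stay below the 3% share stated above.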

In conclusion, the figure confirms that Merkle-LWE’s computational cost is concentrated in matrix and PRNG operations, both of which are directly tied to its memory-saving design. The scheme introduces new components for commitment and compression, but their impact is modest and structurally necessary. While Merkle-LWE does not aim to outperform traditional LWE in raw speed, its predictable and phase-specific overheads validate its suitability for embedded platforms where storage constraints outweigh cycle budgets. The component-level breakdown reinforces the scheme’s architectural coherence and supports its deployment in resource-constrained post-quantum environments.

7.8. Memory–Computation Trade-Off Analysis

Figure 22 visualizes the fundamental trade-off between runtime memory usage and computational cost across four post-quantum key encapsulation mechanisms: Merkle-LWE, Traditional LWE, Kyber, and NTRU. Each scheme is represented by three data points corresponding to key generation, encapsulation, and decapsulation, plotted against peak RAM usage (x-axis) and CPU cycle count (y-axis). Critically, all comparisons are conducted under identical NIST security levels and parameter sets: Kyber-768 and Merkle-LWE Level 3 both use n = 256, q = 3329, and module rank k = 3; NTRU comparisons use the standardized ntruhps2048677 parameter set [43]. This alignment ensures that observed differences reflect architectural choices, not parameter disparities. Merkle-LWE occupies the “Low Memory, High Computation” quadrant, achieving peak RAM usage of 0.8–1.8 KB, several times lower than Kyber’s 3–8 KB, while maintaining comparable concrete security under the Module-LWE assumption. The annotated trade-off ratio of ~2.1× computation per memory unit quantifies the cost of Merkle-LWE’s memory efficiency, confirming that the scheme’s gains are genuine and not artifacts of weakened parameters.

Merkle-LWE’s data points consistently exhibit minimal RAM usage—ranging from 0.8 KB to 1.8 KB—while incurring elevated computational costs, with cycle counts reaching up to 4.5 million. This behaviour is intentional and structurally embedded: Merkle-LWE eliminates bulky public key and error vector storage by regenerating matrix rows and sparse secrets on demand using PRNG expansion and hash-based commitments [29,35]. These operations, while memory-efficient, require iterative computation and non-linear access patterns, resulting in increased CPU cycles. The scheme’s reliance on Merkle tree traversal and sparse indexing further compounds the computational load, especially during decapsulation, where correctness is verified through inclusion proofs rather than direct decryption.

In contrast, Traditional LWE demonstrates the inverse profile: high memory usage (up to 265 KB) paired with relatively low computational cost (approximately 2.3 million cycles). This is achieved by storing all cryptographic objects explicitly and performing operations on dense vectors and precomputed matrices. While this approach minimizes runtime computation and maximizes cache locality, it imposes a heavy burden on flash and RAM, rendering it unsuitable for embedded platforms with tight memory budgets [16]. Kyber and NTRU, situated in the centre-left region of the plot, strike a compromise by employing moderately compressed representations and efficient arithmetic, achieving balanced performance across both axes. However, they do not match Merkle-LWE’s extreme memory savings, nor do they incur its computational overhead.

The annotated trade-off ratio of ~2.1× computation per memory unit quantifies the cost of Merkle-LWE’s memory efficiency. For every kilobyte of memory saved, the scheme incurs approximately twice the number of CPU cycles. This ratio is consistent across operations and reflects a predictable, phase-specific reallocation of resource pressure. Importantly, Merkle-LWE also achieves a 16.3% reduction in memory traffic, indicating that its elevated computation does not translate into excessive data movement. This is a critical advantage in embedded systems, where bandwidth constraints and energy efficiency are often more limiting than raw cycle budgets [19,20]. The figure thus validates Merkle-LWE’s suitability for flash-constrained environments, where memory is scarce but computation is relatively abundant and manageable.

In summary, the figure encapsulates the architectural ethos of Merkle-LWE: a deliberate and quantifiable trade-off between memory and computation. By shifting cryptographic state from static storage to dynamic generation, the scheme achieves unparalleled compactness at the cost of increased CPU cycles. This trade-off is structurally coherent, operationally bounded, and well-aligned with the constraints of embedded platforms. The scatter plot not only confirms Merkle-LWE’s design goals but also situates it within the broader landscape of PQC, offering a compelling alternative for resource-limited deployments.

Figure 23 provides a scheme-level comparison of average peak RAM usage and average CPU cycle counts across four post-quantum KEMs: Merkle-LWE, Traditional LWE, Kyber, and NTRU. Error bars indicate variability across the three cryptographic phases (key generation, encapsulation, and decapsulation), offering a holistic view of each scheme’s operational profile. The dashed trend line illustrates a general inverse relationship: schemes with higher RAM usage tend to require fewer CPU cycles, while those optimized for memory efficiency incur greater computational overhead. This visualization situates Merkle-LWE firmly in the “low memory, high computation” regime, contrasting sharply with Traditional LWE’s “high memory, low computation” profile, and highlighting Kyber and NTRU as balanced designs.

Merkle-LWE’s average RAM usage remains exceptionally low, clustered around 1–2 KB, while its average cycle count approaches 4.5 million. This reflects the scheme’s architectural decision to eliminate bulky storage of matrices and error vectors, instead regenerating them dynamically via PRNG expansion and sparse polynomial arithmetic. The error bars reveal moderate variability across operations, with encapsulation and decapsulation incurring higher cycle counts due to Merkle path verification and sparse secret reconstruction. Despite this variability, the scheme’s memory footprint remains consistently compact, validating its suitability for flash-constrained embedded platforms [15]. The computational overhead is thus not incidental but structurally embedded, representing the cost of achieving extreme memory efficiency.

Traditional LWE, by contrast, averages 265 KB of RAM usage with cycle counts around 2.3 million, reflecting its reliance on dense storage and sequential access patterns. Its error bars are narrow, indicating stable performance across operations, but the high memory footprint renders it impractical for constrained environments. Kyber and NTRU occupy the middle ground, with average RAM usage between 3 and 8 KB and cycle counts ranging from 0.3 to 1.2 million. Their error bars are relatively balanced, suggesting consistent efficiency across phases. These schemes exemplify a design philosophy that balances memory and computation, avoiding extremes in either dimension. However, they do not achieve Merkle-LWE’s dramatic memory savings, nor do they incur its computational penalties, situating them as pragmatic choices for general-purpose deployment [5,42].

In summary, the figure confirms the structural trade-off inherent in Merkle-LWE: a ~2.1× increase in computation per unit of memory saved, consistent with prior analyses. While this places the scheme at the computationally intensive end of the spectrum, its deterministic overheads and bounded variability ensure predictable performance. The scatter plot thus reinforces Merkle-LWE’s design rationale: by reallocating resource pressure from static storage to runtime computation, it achieves unparalleled compactness without exceeding embedded tolerances. This trade-off positions Merkle-LWE as a specialized solution for environments where memory scarcity is the dominant constraint, complementing balanced schemes like Kyber and NTRU in the broader post-quantum cryptographic landscape.

7.9. Protocol-Level Impact Analysis for IoT Handshakes

To assess the practical viability of Merkle-LWE in latency-sensitive IoT protocols, we analytically model its impact on DTLS 1.3 and TLS 1.3 handshakes using the empirical data from Sections 7.1–7.8. This analysis focuses on three metrics: handshake duration, bandwidth overhead, and session concurrency limits.

The total handshake time for a key exchange is dominated by the sum of encapsulation and decapsulation cycles, plus network round-trip time (RTT). Using the cycle counts from Figure 20 (Level 3: encapsulation = 809,636 cycles; decapsulation = 814,588 cycles) and a representative Cortex-M4 clock frequency of 168 MHz [23], the cryptographic computation time is:

T_{\mathrm{crypto}} = \frac{809{,}636 + 814{,}588}{168 \times 10^{6}\ \mathrm{Hz}} \approx 9.67\ \mathrm{ms}. \qquad (5)

Adding a conservative RTT estimate of 50 ms for low-power wireless links (e.g., IEEE 802.15.4) yields a total handshake duration of approximately 60 ms. This is comparable to Kyber-768 on the same platform (~55 ms, derived from pqm4 benchmarks [23]), despite Merkle-LWE’s higher cycle count, because the absolute cycle difference (~0.5 M cycles) translates to only ~3 ms at 168 MHz. For battery-powered sensors that initiate handshakes infrequently (e.g., hourly or daily), this marginal increase is negligible relative to sleep/wake overheads and application logic.
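The handshake estimate is simple enough to reproduce numerically. The sketch below evaluates Equation (5) and adds the assumed round-trip time; the constant and function names are illustrative, and the model ignores everything except KEM computation and one RTT, as the analysis above does.

```python
ENCAPS_CYCLES = 809_636   # Level 3 encapsulation, Figure 20
DECAPS_CYCLES = 814_588   # Level 3 decapsulation, Figure 20
CLOCK_HZ = 168e6          # representative Cortex-M4 clock
RTT_MS = 50.0             # conservative RTT assumed for low-power wireless links

def crypto_time_ms(cycles, clock_hz):
    """Convert a cycle count to milliseconds at the given clock frequency."""
    return 1e3 * cycles / clock_hz

t_crypto = crypto_time_ms(ENCAPS_CYCLES + DECAPS_CYCLES, CLOCK_HZ)
t_handshake = t_crypto + RTT_MS

print(f"T_crypto    = {t_crypto:.2f} ms")
print(f"T_handshake = {t_handshake:.1f} ms")
```

Because the RTT term dominates, the total is much more sensitive to the link assumption than to the cycle counts.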

The ciphertext size for Merkle-LWE Level 3 is 160 bytes (Table 1), compared to 1088 bytes for Kyber-768 [13]. Although Merkle-LWE’s ciphertext includes a 448-byte Merkle authentication path (Figure 6), the total payload remains 85% smaller than Kyber’s. In a DTLS 1.3 handshake, where the KEM ciphertext is carried in the key_share extension of the ServerHello message, this reduction directly lowers transmission time and energy. Using the energy model from Section 7.8 (0.1 µJ per 64-byte memory access [19,20]), the bandwidth savings translate to ~1.4 µJ less energy per handshake for radio transmission, a meaningful gain for energy-constrained devices.
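The bandwidth and energy figures above can be rederived from the quoted sizes. The sketch below applies the stated 0.1 µJ per 64-byte access model linearly to the byte difference, which is an assumption about how that estimate was obtained; the names are illustrative.

```python
CT_MERKLE_BYTES = 160      # Merkle-LWE Level 3 ciphertext (Table 1)
CT_KYBER_BYTES = 1088      # Kyber-768 ciphertext
ENERGY_PER_ACCESS_UJ = 0.1 # energy model: 0.1 µJ per 64-byte access
ACCESS_BYTES = 64

reduction_pct = 100.0 * (1 - CT_MERKLE_BYTES / CT_KYBER_BYTES)
saved_bytes = CT_KYBER_BYTES - CT_MERKLE_BYTES
saved_energy_uj = saved_bytes / ACCESS_BYTES * ENERGY_PER_ACCESS_UJ

print(f"ciphertext reduction: {reduction_pct:.1f}%")
print(f"bytes saved: {saved_bytes}, energy saved: {saved_energy_uj:.2f} µJ")
```

Under this linear model the saving comes out at about 1.45 µJ, in line with the approximate value quoted above.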

Peak RAM usage determines how many concurrent sessions a device can support. Figure 11 shows Merkle-LWE’s peak RAM is 14.3 KB for key generation and 10.2 KB for decapsulation. On a Cortex-M4 with 32 KB of available RAM for cryptographic operations [14,16], this permits at most two concurrent handshakes (2 × 14.3 KB = 28.6 KB < 32 KB, whereas a third would exceed the budget). Kyber-768, with peak RAM of ~8 KB [23], permits ~4 concurrent sessions. While Merkle-LWE supports fewer concurrent sessions, most embedded IoT deployments are single-session or low-concurrency by design (e.g., sensor-to-gateway links), making this limitation acceptable for the target use case. For high-concurrency scenarios (e.g., IoT gateways), the scheme can be deployed selectively for memory-critical endpoints while using Kyber for aggregation points.
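The session limits follow from integer division of the RAM budget by the per-session peak. A minimal sketch, assuming each concurrent handshake needs the full key-generation peak and that no memory is shared between sessions:

```python
import math

RAM_BUDGET_KB = 32.0      # RAM assumed available for cryptographic operations
PEAK_MERKLE_KB = 14.3     # Merkle-LWE key generation peak (Figure 11)
PEAK_KYBER_KB = 8.0       # approximate Kyber-768 peak

def max_sessions(budget_kb, peak_kb):
    """Largest number of sessions whose combined peak fits in the budget."""
    return math.floor(budget_kb / peak_kb)

print("Merkle-LWE:", max_sessions(RAM_BUDGET_KB, PEAK_MERKLE_KB))
print("Kyber-768: ", max_sessions(RAM_BUDGET_KB, PEAK_KYBER_KB))
```

If sessions were staggered so that only one is in key generation at a time, the 10.2 KB decapsulation peak would be the relevant figure for the others, which this simple bound does not capture.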

The trade-offs favour Merkle-LWE in three representative IoT contexts:

Infrequent, latency-tolerant handshakes: Environmental sensors that exchange keys hourly can absorb the ~3 ms computational overhead without impacting application responsiveness.
Bandwidth-constrained links: LoRaWAN or NB-IoT uplinks benefit from the 85% ciphertext size reduction, extending battery life and reducing airtime costs.
Flash-constrained firmware: On devices with <256 KB flash, Kyber’s 1.2 KB public key competes with application code for scarce storage; Merkle-LWE’s 96-byte public key fits comfortably.

Conversely, Merkle-LWE is less suitable for:

High-frequency, low-latency handshakes: Real-time control loops requiring sub-10 ms key exchange may not tolerate the ~10 ms cryptographic computation time.
High-concurrency gateways: Aggregation nodes handling dozens of simultaneous sessions benefit more from Kyber’s balanced profile.

In summary, the protocol-level analysis confirms that Merkle-LWE’s overheads are acceptable—and often advantageous—for the embedded IoT scenarios it targets. The scheme’s memory efficiency enables deployment on devices otherwise excluded from PQC adoption, while its computational cost remains bounded and predictable within latency budgets typical of low-power wireless protocols.

7.10. Energy Consumption

Figure 24 illustrates the relationship between energy consumption (measured in microjoules) and memory traffic (measured in kilobytes) across four post-quantum KEMs: Merkle-LWE, Traditional LWE, Kyber, and NTRU. Each scheme is represented by three operational phases (key generation, encapsulation, and decapsulation), highlighting the variability of energy demands across different workloads. The plotted trend line reveals a moderate positive correlation (correlation coefficient: 0.673) between memory traffic and energy consumption, consistent with the view that memory access patterns are a major driver of energy usage in IoT devices [19,20]. This finding underscores the importance of minimizing memory traffic in resource-constrained environments, where energy efficiency is often more critical than raw computational throughput.

Merkle-LWE occupies a distinctive position in the scatter plot: despite incurring higher computational costs, its energy consumption remains comparatively efficient due to its low memory traffic profile. By replacing bulk sequential memory loads with PRNG expansion and sparse secret reconstruction, Merkle-LWE reduces the number of high-energy memory transactions. This design choice shifts the energy burden from memory access to computation, which is less costly in terms of energy per operation on embedded microcontrollers. For example, while Merkle-LWE’s decapsulation phase requires millions of CPU cycles, its energy footprint is moderated by the fact that these cycles involve lightweight arithmetic and hash computations rather than expensive cache misses or DRAM accesses. The annotation “Merkle-LWE: Energy Efficient Despite Computation” captures this structural trade-off, validating the scheme’s suitability for battery-powered IoT devices [20].

In contrast, Traditional LWE demonstrates the opposite profile: high memory traffic directly translates into elevated energy consumption. Its reliance on dense matrix storage and sequential reads results in frequent large-scale memory transfers, which dominate the energy budget. The scatter plot highlights this with data points in the “High Memory Traffic, High Energy Consumption” region, confirming that storage-heavy designs are poorly aligned with the energy constraints of IoT platforms. Kyber and NTRU, situated in the “Low Memory Traffic, Low Energy Consumption” region, achieve balanced efficiency by combining compact representations with streamlined arithmetic. Their energy footprints are consistently lower than both Merkle-LWE and Traditional LWE, reflecting their design philosophy of balancing memory and computation rather than optimizing one dimension at the expense of the other.

In summary, the figure validates that energy consumption in IoT cryptographic workloads is primarily driven by memory traffic rather than raw computation. Merkle-LWE exemplifies a memory-first design that achieves energy efficiency by minimizing traffic, even at the cost of increased CPU cycles. Traditional LWE, by contrast, demonstrates the energy penalties of storage-heavy architectures, while Kyber and NTRU highlight the benefits of balanced approaches. The correlation between memory traffic and energy consumption confirms that optimizing access patterns is the most effective strategy for reducing energy costs in embedded cryptography. Merkle-LWE’s ability to achieve low traffic and bounded energy usage reinforces its viability for secure IoT deployments, where energy efficiency is paramount for long-term sustainability.

Figure 25 provides a component-level breakdown of energy consumption for Merkle-LWE KEM and Traditional LWE KEM, measured in microjoules (µJ). The analysis spans five categories—memory access, computation, hash operations, PRNG operations, and base power—offering a fine-grained view of how architectural choices translate into energy costs. This decomposition is particularly relevant for IoT and embedded platforms, where energy efficiency is often the decisive factor in cryptographic adoption [14].

The most striking difference lies in hash operations, where Merkle-LWE consumes only 383 µJ, compared to Traditional LWE’s 4160 µJ. This reduction of roughly 91% reflects Merkle-LWE’s reliance on compact Merkle tree commitments rather than dense hash-based matrix verification. Traditional LWE requires repeated hashing of large vectors and matrices to ensure correctness, leading to substantial energy overhead. Merkle-LWE, by contrast, limits hashing to path verification and root commitment, which are structurally lightweight. This result underscores the efficiency of Merkle-LWE’s hybrid design: by shifting correctness checks into sparse and tree-based structures, it dramatically reduces the energy footprint of hashing while maintaining cryptographic integrity [31,35].

In terms of PRNG operations, both schemes exhibit comparable energy costs, with Merkle-LWE consuming 4352 µJ and Traditional LWE consuming 4112 µJ. The slight increase in Merkle-LWE reflects its heavier reliance on ChaCha20 for matrix row generation and sparse secret expansion [49]. While ChaCha20 is computationally efficient and constant-time, its iterative expansion requires sustained energy input. Nevertheless, the difference remains modest, suggesting that PRNG overhead is not a dominant factor in the overall energy profile. This finding validates the choice of ChaCha20 as a secure and energy-tolerant generator for embedded cryptography.

The computation and memory access categories reveal complementary trade-offs. Merkle-LWE incurs higher computational energy (345 µJ vs. 237 µJ) due to its dynamic matrix regeneration and sparse polynomial arithmetic. However, it achieves a significant reduction in memory access energy (30 µJ vs. 79 µJ), reflecting its avoidance of bulk sequential loads. Traditional LWE’s dense storage model requires frequent large-scale memory transfers, which are disproportionately expensive in energy terms [19]. Merkle-LWE’s design shifts this burden into computation, which is less energy-intensive per operation on microcontrollers. This redistribution aligns with the broader observation that minimizing memory traffic is the most effective strategy for reducing energy consumption in IoT devices.

Finally, base power consumption is slightly higher for Merkle-LWE (900 µJ vs. 750 µJ), reflecting longer active runtimes due to its increased computational load. However, this overhead is modest compared to the dramatic savings achieved in hash and memory access categories. The overall profile confirms that Merkle-LWE’s energy efficiency is structurally coherent: while computation and PRNG expansion raise baseline costs, reductions in memory traffic and hashing more than offset these increases. The scheme’s energy footprint remains bounded and predictable, supporting its deployment in battery-powered environments where long-term sustainability is critical.
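Whether the savings offset the increases can be checked by summing the five categories. The totals below are not stated in the text; they are computed here from the per-category values quoted for Figure 25, and the names are illustrative.

```python
# (Merkle-LWE µJ, Traditional LWE µJ), as quoted for Figure 25
ENERGY_UJ = {
    "memory_access": (30, 79),
    "computation": (345, 237),
    "hash": (383, 4160),
    "prng": (4352, 4112),
    "base_power": (900, 750),
}

total_merkle = sum(m for m, _ in ENERGY_UJ.values())
total_trad = sum(t for _, t in ENERGY_UJ.values())
net_reduction_pct = 100.0 * (1 - total_merkle / total_trad)

# Signed per-category difference: negative means Merkle-LWE uses less energy
delta = {name: m - t for name, (m, t) in ENERGY_UJ.items()}

print(delta)
print(f"totals: {total_merkle} µJ vs {total_trad} µJ "
      f"({net_reduction_pct:.1f}% lower for Merkle-LWE)")
```

On these figures almost all of the net saving comes from the hash category; the memory-access saving is small in absolute terms and is outweighed by the combined increases in computation, PRNG and base power.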

In conclusion, the figure validates that Merkle-LWE achieves energy efficiency by strategically reallocating resource pressure. Its design sharply reduces hashing energy and lowers memory-access energy, while accepting modest increases in computation, PRNG, and base power costs. This balance ensures that Merkle-LWE remains viable for IoT platforms, where energy constraints dominate system design. The breakdown confirms that Merkle-LWE’s memory-first philosophy not only minimizes storage but also delivers tangible energy benefits, reinforcing its suitability for secure and sustainable embedded cryptography.

7.11. Correctness and Reliability Validation

Figure 26 presents a logarithmic plot of failure probability against the number of trials, with a red dashed line denoting the acceptable failure threshold of $10^{-6}$. The observed results, represented by vertical bars, consistently remain below this threshold across all tested ranges. Most notably, no decryption failures were recorded within the experimental dataset, as annotated by the figure. This outcome provides strong empirical evidence of the correctness of the Merkle-LWE KEM implementation, demonstrating that its design achieves reliable decryption under repeated stress testing. The absence of failures even at high trial counts indicates that the scheme’s probabilistic components, such as sparse error sampling and Merkle path verification, are structurally sound and do not introduce instability into the decryption process [24].

The reliability observed can be attributed to several architectural features. First, the deterministic regeneration of matrix rows from seeds ensures that public key material is reproduced consistently, eliminating discrepancies that could otherwise lead to decryption mismatches [29]. Second, the sparse secret representation, while introducing random access patterns, is carefully bounded by fixed sparsity levels and deterministic indexing, ensuring that reconstruction during decapsulation is exact [57]. Third, Merkle tree commitments provide cryptographic guarantees of correctness: each path verification ensures that the reconstructed secret aligns with the committed root, preventing silent errors from propagating [35]. Together, these mechanisms form a layered correctness model, where redundancy in verification compensates for potential weaknesses in any single component.
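
The third mechanism, Merkle path verification, can be illustrated with a minimal sketch. It assumes SHA-256 as the hash, a power-of-two number of leaves, and one-byte domain-separation prefixes for leaves and inner nodes; none of these choices is taken from the Merkle-LWE specification, and all function names are hypothetical.

```python
import hashlib
import hmac


def _h(data: bytes) -> bytes:
    # SHA-256 is an assumption made for this sketch.
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Return all tree levels: levels[0] holds leaf hashes, levels[-1] == [root]."""
    if not leaves or len(leaves) & (len(leaves) - 1):
        raise ValueError("leaf count must be a non-zero power of two")
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [_h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def auth_path(levels, index):
    """Collect the sibling hash at every level below the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])  # sibling differs only in the lowest bit
        index >>= 1
    return path


def verify_path(leaf, index, path, root):
    """Recompute the root from a leaf and its path, then compare with the commitment."""
    node = _h(b"\x00" + leaf)
    for sibling in path:
        if index & 1:  # current node is a right child
            node = _h(b"\x01" + sibling + node)
        else:          # current node is a left child
            node = _h(b"\x01" + node + sibling)
        index >>= 1
    return hmac.compare_digest(node, root)
```

A reconstructed value is accepted only if the recomputed root equals the committed one, so any mismatch in the leaf, its position, or a sibling hash is detected rather than propagated silently.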

From a reliability perspective, the absence of failures across the tested range confirms that Merkle-LWE achieves robustness comparable to, or exceeding, traditional LWE-based schemes. The logarithmic scaling of the plot emphasizes that even as the number of trials increases, the observed failure probability remains effectively zero. This suggests that the scheme’s correctness is not merely a statistical artifact of limited testing but a structural property of its design. The validation is particularly significant for embedded and IoT deployments, where reliability is paramount: decryption failures in such contexts could lead to session drops, authentication errors, or denial of service. By demonstrating correctness under extensive testing, Merkle-LWE establishes itself as a dependable candidate for PQC in constrained environments, balancing efficiency with uncompromised reliability.

Figure 27 compares the observed decryption failure rates of Merkle-LWE KEM, Traditional LWE KEM, and representative NIST PQC schemes. The y-axis is plotted on a logarithmic scale, ranging from $10^{-6}$ to $10^{-4}$, with a red dashed line marking the acceptable failure threshold of $10^{-6}$. The results reveal a clear distinction between the schemes: Traditional LWE exhibits a failure rate of 0.0001, which exceeds the acceptable threshold, while both Merkle-LWE and NIST PQC schemes report failure rates below 0.0003, remaining within acceptable bounds. This comparative analysis underscores the reliability advantages of Merkle-LWE and modern PQC candidates over traditional lattice-based constructions [24].

The elevated failure rate of Traditional LWE can be attributed to its reliance on dense error vectors and direct decryption without auxiliary correctness checks. In practice, small deviations in error distribution or rounding during polynomial arithmetic can accumulate, leading to decryption mismatches. Without structural redundancy or verification mechanisms, these errors manifest as observable failures. Merkle-LWE avoids this pitfall by embedding correctness guarantees into its architecture: sparse error vectors are deterministically generated, and Merkle tree commitments enforce consistency between transmitted ciphertexts and reconstructed secrets [35]. Similarly, NIST PQC schemes such as Kyber and NTRU employ carefully tuned noise distributions and reconciliation mechanisms, ensuring that decryption remains robust even under adversarial or noisy conditions [5,42].

The comparative results highlight that Merkle-LWE achieves correctness and reliability on par with NIST PQC candidates, despite its unconventional memory-first design. By shifting verification into cryptographic commitments and sparse reconstructions, the scheme ensures that decryption failures are structurally suppressed. The reported failure rates below $3 \times 10^{-4}$ confirm that Merkle-LWE’s correctness is not compromised by its efficiency-oriented trade-offs. This validation is particularly significant for embedded and IoT deployments, where reliability is paramount: even rare decryption failures can disrupt communication protocols, authentication flows, or secure key exchanges. The figure thus demonstrates that Merkle-LWE balances efficiency with robustness, offering a dependable alternative to both traditional and standardized PQC schemes.

In conclusion, the comparative correctness analysis confirms that Merkle-LWE achieves reliability within acceptable thresholds, outperforming Traditional LWE and aligning with the robustness of NIST PQC candidates. The observed results validate the scheme’s architectural choices—sparse error representation, deterministic regeneration, and Merkle-based verification—as effective mechanisms for suppressing decryption failures. This reliability, combined with its memory efficiency, reinforces Merkle-LWE’s suitability for constrained environments, ensuring secure and error-free operation in real-world post-quantum deployments.

Figure 28 presents a statistical significance analysis of decryption correctness, plotting the 95% confidence upper bound on the failure rate against the number of trials. The logarithmic scaling of both axes highlights the diminishing upper bound as trial counts increase, demonstrating the statistical principle that larger sample sizes yield stronger confidence in low observed failure rates. The red dashed line marks the acceptable failure threshold of $10^{-6}$, while the orange vertical line indicates that approximately 3,000,000 failure-free trials are required to statistically confirm correctness at this confidence level. The figure thus provides a rigorous framework for validating reliability beyond empirical observation, ensuring that correctness claims are supported by statistical guarantees [61].
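
The figure of roughly three million trials follows from the standard zero-failure bound: if $n$ independent trials produce no failures, the one-sided upper bound $p$ at confidence $c$ satisfies $(1 - p)^n = 1 - c$, which for $c = 0.95$ is approximately $3/n$ (the "rule of three"). The sketch below assumes independent, identically distributed trials; the function names are illustrative and are not taken from the authors' test harness.

```python
import math


def zero_failure_upper_bound(trials: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on the failure probability after
    `trials` independent runs with zero observed failures.

    Solves (1 - p)**trials == 1 - confidence for p.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    # expm1/log keep precision when the bound is very close to zero.
    return -math.expm1(math.log(1.0 - confidence) / trials)


def trials_needed(threshold: float, confidence: float = 0.95) -> int:
    """Smallest number of failure-free trials whose upper bound is <= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-threshold))


if __name__ == "__main__":
    for n in (10**3, 10**5, 10**6, 3 * 10**6):
        print(f"{n:>9d} trials -> upper bound {zero_failure_upper_bound(n):.3e}")
    print("trials needed for 1e-6:", trials_needed(1e-6))
```

For a threshold of $10^{-6}$ the exact requirement comes out just under three million failure-free trials, in line with the orange marker described for Figure 28.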

The trend line shows that with modest trial counts, the confidence interval upper bound remains relatively high, reflecting the uncertainty inherent in small sample sizes. As the number of trials increases, the upper bound decreases sharply, converging toward the acceptable threshold. This behaviour confirms that the absence of observed failures in limited testing cannot alone establish reliability; instead, statistical validation requires sufficient trials to reduce the confidence interval below the threshold. For Merkle-LWE, the figure demonstrates that while no failures were observed empirically, formal validation of correctness at the $10^{-6}$ level necessitates millions of trials. This requirement is consistent with cryptographic standards, which demand rigorous statistical assurance to account for rare but potentially catastrophic errors [24].

The analysis also underscores the robustness of Merkle-LWE’s architectural design. Sparse secret reconstruction, deterministic matrix generation, and Merkle path verification collectively ensure that decryption failures are structurally suppressed. The statistical framework confirms that these mechanisms not only prevent failures in practice but also withstand scrutiny under confidence-based validation. By quantifying the number of trials required for assurance, the figure bridges empirical testing with formal reliability guarantees, providing a roadmap for future large-scale validation. This is particularly relevant for embedded and IoT deployments, where correctness must be guaranteed under continuous operation and adversarial conditions. The ability to demonstrate reliability both empirically and statistically reinforces Merkle-LWE’s suitability as a dependable post-quantum cryptographic scheme.

In conclusion, the experimental evaluation confirms that Merkle-LWE achieves unprecedented memory efficiency without compromising concrete security. The “traditional LWE” baseline serves an analytical purpose—isolating the impact of individual architectural choices—while the primary comparisons against Kyber and NTRU, conducted under identical security levels and parameter sets drawn from standardized specifications [5,13,41,42,43], demonstrate that Merkle-LWE’s gains are genuine and not artifacts of weakened parameters. Parameter alignment across lattice dimension, modulus, and module rank validates that the scheme’s compactness arises from structural representation rather than security margin erosion. These results position Merkle-LWE as a viable alternative for deeply constrained embedded platforms where static storage is the dominant bottleneck, complementing rather than replacing balanced schemes like Kyber in the broader post-quantum ecosystem.

Correctness evaluation indicates that reliable operation is not solely empirical but supported by statistical reasoning. While large-scale testing is required to formally validate extremely low failure probabilities, the absence of observed failures across extensive trials, together with structural safeguards embedded in the design, strongly supports the robustness of the construction. This dual perspective—empirical validation and statistical assurance—reinforces confidence in the correctness of Merkle-LWE in practical deployment scenarios, particularly in environments where reliability and deterministic behavior are critical. Taken together, the experimental and analytical results confirm that Merkle-LWE achieves its design goal: enabling quantum-resistant key exchange on deeply memory-constrained embedded platforms, with protocol-level overheads that are acceptable for the latency and concurrency profiles typical of low-power IoT deployments.

8. Concluding Remarks and Future Work

The comprehensive evaluation of Merkle-LWE KEM presented throughout this study highlights both its distinctive architectural philosophy and its practical implications for post-quantum cryptography in constrained environments. At its core, Merkle-LWE represents a deliberate reallocation of resource pressure: it sacrifices computational simplicity in order to achieve dramatic reductions in storage and memory traffic. This design choice is not incidental but structurally embedded, reflecting a memory-first approach that directly addresses the limitations of embedded and IoT platforms, where flash and RAM are scarce resources but computational cycles are comparatively abundant. The results consistently validate this philosophy. Storage usage is reduced by more than ninety-nine percent relative to traditional LWE schemes, memory traffic is lowered by over sixteen percent, and correctness is preserved without observable decryption failures. These gains, however, are accompanied by a doubling of CPU cycle counts, elevated L1 cache miss rates in certain operations, and modest increases in base power consumption. Yet the trade-offs are predictable, bounded, and phase-specific, ensuring that the scheme remains viable for real-world deployment. Crucially, the protocol-level analysis in Section 7.9 demonstrates that these overheads translate to only marginal increases in handshake duration (approximately 3 ms) and substantial bandwidth savings (approximately 85% smaller ciphertexts), aligning with the operational profiles of low-power IoT protocols such as DTLS 1.3. For infrequent, latency-tolerant handshakes on flash-constrained devices, Merkle-LWE offers a compelling alternative to balanced schemes like Kyber, enabling post-quantum security on platforms previously excluded from adoption.

The analyses of computational cost reveal that Merkle-LWE’s overhead is concentrated in matrix operations and PRNG expansion, both of which are directly tied to its memory-saving design. By regenerating matrix rows dynamically from seeds and expanding sparse secrets on demand, the scheme eliminates bulky storage but incurs additional cycles. Component-level breakdowns confirm that while Merkle-LWE introduces new operations such as Merkle tree traversal and bit packing, their energy and cycle costs remain modest compared to the dominant matrix and PRNG workloads. Cache behaviour studies further demonstrate that sequential components such as hash computations and bit packing maintain high locality and low miss rates, while sparse secret access introduces inefficiencies due to random indexing. Importantly, these inefficiencies remain localized to specific phases, with no L2 cache misses observed, confirming that the working sets fit comfortably within embedded cache hierarchies. The overall cache hit rates in key generation and encapsulation remain high, ensuring that the majority of runtime execution benefits from cache acceleration.
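
The on-demand regeneration of matrix rows can be sketched as a deterministic expansion of a short public seed. The example below is an illustration under stated assumptions rather than the scheme's actual generator: SHAKE-128 stands in for the lightweight PRNG, the modulus q = 3329 and row length n = 256 are borrowed from Kyber, and the two-byte row-index encoding and rejection-sampling rule are hypothetical choices.

```python
import hashlib

Q = 3329  # assumed modulus (Kyber's value), for illustration only
N = 256   # assumed number of coefficients per regenerated row


def regenerate_row(seed: bytes, row_index: int, n: int = N, q: int = Q) -> list:
    """Deterministically expand (seed, row_index) into n coefficients in [0, q).

    Nothing is stored between calls: the same inputs always reproduce the
    same row, which is what allows the full matrix to be dropped from memory.
    """
    xof_input = seed + row_index.to_bytes(2, "little")
    out_len = 4 * n  # 2n candidates, ample for ~81% acceptance
    while True:
        stream = hashlib.shake_128(xof_input).digest(out_len)
        coeffs = []
        for i in range(0, out_len - 1, 2):
            # Take 12 bits per candidate and reject values >= q to stay uniform.
            candidate = int.from_bytes(stream[i:i + 2], "little") & 0x0FFF
            if candidate < q:
                coeffs.append(candidate)
                if len(coeffs) == n:
                    return coeffs
        out_len *= 2  # extremely unlikely; the longer stream keeps the same prefix


if __name__ == "__main__":
    seed = bytes(range(32))
    row = regenerate_row(seed, 0)
    print(len(row), row[:8])
```

The cost profile discussed above is visible in the structure of this sketch: each row costs one extendable-output hash call plus rejection sampling at use time, in exchange for storing only the 32-byte seed.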

Energy consumption analyses reinforce the centrality of memory traffic as the dominant factor in embedded cryptographic workloads. The scatter plots and component breakdowns consistently show that schemes with high memory traffic incur elevated energy costs, while those that minimize memory access achieve efficiency even in the presence of increased computation. Merkle-LWE exemplifies this principle: by reducing memory traffic through PRNG expansion and sparse representation, it achieves energy efficiency despite its higher cycle counts. The component-level breakdown reveals dramatic reductions in the energy footprint of hashing and memory access, offsetting modest increases in computation and PRNG operations. This redistribution of energy costs aligns with the broader observation that computation is less energy-intensive per operation than memory transactions on microcontrollers. Consequently, Merkle-LWE achieves a balanced energy profile that supports sustainable deployment in battery-powered IoT devices, where energy efficiency is paramount for long-term operation.

Correctness and reliability validation further strengthen the case for Merkle-LWE. Empirical testing revealed no decryption failures within the tested range, and comparative analyses confirmed that Merkle-LWE achieves reliability on par with NIST PQC candidates while outperforming traditional LWE, which exhibited failure rates above acceptable thresholds. Statistical significance analysis demonstrated that millions of trials are required to formally validate correctness at the $10^{-6}$ failure threshold with 95% confidence, but the absence of observed failures and the structural safeguards embedded in the design strongly indicate robustness. Sparse secret reconstruction, deterministic matrix generation, and Merkle path verification collectively ensure that decryption failures are structurally suppressed, providing both empirical and statistical assurance of reliability. This dual validation, empirical and statistical, supports deploying Merkle-LWE in environments where correctness and reliability are paramount, bridging the gap between experimental assurance and formal cryptographic standards.

Taken together, these findings position Merkle-LWE as a compelling candidate for PQC in constrained environments. Its memory-first design philosophy directly addresses the limitations of embedded platforms, where storage and bandwidth are scarce resources. While the scheme does not aim to outperform traditional or standardized PQC candidates in raw speed, it offers a unique balance of compactness, predictability, and reliability. This balance is particularly valuable in IoT deployments, where secure communication must coexist with strict energy budgets and limited hardware capabilities. The trade-off ratio of approximately two times computation per unit of memory saved is consistent across operations and reflects a predictable, phase-specific overhead. Importantly, the scheme’s computational costs are bounded and constant-time, ensuring resilience against timing-based side-channel attacks.

Looking forward, several avenues for future research remain open. Optimization of computational overhead is a critical direction. While Merkle-LWE’s cycle counts are predictable and bounded, further work is needed to reduce the cost of matrix regeneration and sparse secret handling. Techniques such as layout-aware scheduling, cache-conscious indexing, and hardware acceleration for PRNG expansion could mitigate the observed overheads without compromising memory efficiency. Exploring lightweight hash functions or hybrid verification strategies may also reduce the energy footprint of Merkle path traversal.

Broader benchmarking across diverse hardware platforms is essential. The current analyses focus on embedded microcontrollers, but future work should extend to heterogeneous environments, including FPGAs, GPUs, and specialized cryptographic accelerators. Such studies would clarify the scalability of Merkle-LWE’s design and identify platform-specific optimizations. In particular, GPU-based parallelization of matrix operations and Merkle tree traversal could offset computational costs, while FPGA implementations may enable hardware-level compression of sparse secrets.

Integration with standardized PQC frameworks warrants exploration. While Merkle-LWE demonstrates correctness and reliability comparable to NIST PQC candidates, its unconventional design raises questions about interoperability and standardization. Future work should investigate hybrid schemes that combine Merkle-LWE’s memory efficiency with the balanced performance of Kyber or NTRU, potentially yielding designs that optimize across multiple resource dimensions. Comparative studies of protocol-level integration, including TLS and VPN frameworks, would further validate Merkle-LWE’s applicability in real-world deployments.

Side-channel and fault tolerance analyses remain critical. While Merkle-LWE’s constant-time execution mitigates timing attacks, its sparse and tree-like access patterns may introduce new side-channel vectors, particularly in cache-based adversarial models. Future research should rigorously evaluate these risks and develop countermeasures, such as randomized access scheduling or hardware-assisted masking. Similarly, fault injection resilience must be tested, ensuring that Merkle path verification and sparse secret reconstruction remain robust under adversarial conditions.

Finally, long-term empirical validation is necessary to confirm statistical reliability. While millions of trials are required to formally validate correctness at the $10^{-6}$ threshold, sustained testing across diverse workloads and adversarial scenarios will provide stronger assurance. Establishing open benchmarking frameworks and reproducible datasets would enable the broader research community to validate and refine Merkle-LWE’s reliability claims, fostering transparency and collaboration in PQC research.

In conclusion, Merkle-LWE KEM represents a bold reimagining of lattice-based cryptography, one that prioritizes memory efficiency without compromising correctness or reliability. Its design philosophy of trading computation for storage aligns with the realities of embedded and IoT platforms, where memory scarcity is the dominant constraint. While challenges remain in optimizing computational overhead and ensuring side-channel resilience, the scheme’s empirical and statistical validation confirms its viability as a secure and efficient post-quantum solution. Future work will refine, extend, and integrate Merkle-LWE into broader cryptographic ecosystems, ensuring that it contributes meaningfully to the ongoing evolution of secure communication in the quantum era.

Author Contributions

Conceptualization, E.M., E.K. and N.Ž.; methodology, E.M., E.K. and N.Ž.; software, E.M.; validation, E.M., E.K., N.Ž., S.N. and C.R.; formal analysis, E.M., E.K., N.Ž., S.N. and C.R.; investigation, E.M., E.K. and N.Ž.; resources, E.M., E.K., S.N. and C.R.; data curation, E.M. and E.K.; writing—original draft preparation, E.M. and E.K.; writing—review and editing, E.M., E.K., S.N. and C.R.; visualization, E.M., E.K. and N.Ž.; supervision, N.Ž., S.N. and C.R.; project administration, E.M. and C.R.; funding acquisition, E.K. and N.Ž. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to further ongoing closed research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Shor, P.W. Algorithms for quantum computation: Discrete logarithms and factoring. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), Santa Fe, NM, USA, 20–22 November 1994; IEEE Computer Society Press: Los Alamitos, CA, USA, 1994; pp. 124–134. [Google Scholar] [CrossRef]
Shor, P.W. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J. Comput. 1997, 26, 1484–1509. [Google Scholar] [CrossRef]
Bernstein, D.J.; Buchmann, J.; Dahmen, E. (Eds.) Post-Quantum Cryptography; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar] [CrossRef]
Alagic, G.; Apon, D.; Cooper, D.; Dang, Q.; Dang, T.; Kelsey, J.; Lichtinger, J.; Liu, Y.-K.; Miller, C.; Moody, D.; et al. Status Report on the Third Round of the NIST Post-Quantum Cryptography Standardization Process; NISTIR 8413; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2022. [CrossRef]
Bos, J.; Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schanck, J.M.; Schwabe, P.; Seiler, G.; Stehle, D. CRYSTALS-Kyber: A CCA-secure module-lattice-based KEM. In Proceedings of the 2nd IEEE European Symposium on Security and Privacy (EuroS&P), London, UK, 26–28 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 353–367. [Google Scholar] [CrossRef]
Abdulrahman, A.; Hwang, V.; Kannwischer, M.J.; Sprenkels, A. Faster Kyber and Dilithium on the Cortex-M4. Cryptol. ePrint Arch. 2022, 2022, 112. [Google Scholar]
Regev, O. On lattices, learning with errors, random linear codes, and cryptography. J. ACM 2009, 56, 1–40. [Google Scholar] [CrossRef]
Lyubashevsky, V.; Peikert, C.; Regev, O. On ideal lattices and learning with errors over rings. J. ACM 2013, 60, 1–35. [Google Scholar] [CrossRef]
Ajtai, M. Generating hard instances of lattice problems. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing (STOC), Philadelphia, PA, USA, 22–24 May 1996; ACM: New York, NY, USA, 1996; pp. 99–108. [Google Scholar] [CrossRef]
Langlois, A.; Stehlé, D. Worst-case to average-case reductions for module lattices. Des. Codes Cryptogr. 2014, 75, 565–599. [Google Scholar] [CrossRef]
Alkim, E.; Ducas, L.; Pöppelmann, T.; Schwabe, P. Post-quantum key exchange—A new hope. In Proceedings of the 25th USENIX Security Symposium, Austin, TX, USA, 10–12 August 2016; USENIX Association: Berkeley, CA, USA, 2016; pp. 327–343. [Google Scholar]
Liang, Z.; Zhao, Y. Number theoretic transform and its applications in lattice-based cryptosystems: A survey. arXiv 2022, arXiv:2211.13546. [Google Scholar] [CrossRef]
Avanzi, R.; Bos, J.; Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schanck, J.M.; Schwabe, P.; Seiler, G.; Stehlé, D. CRYSTALS-Kyber Algorithm Specifications and Supporting Documentation (Version 3.02); National Institute of Standards and Technology: Gaithersburg, MD, USA, 2021.
Atkins, D. Requirements for post-quantum cryptography on embedded devices. In Proceedings of the 3rd NIST Post-Quantum Cryptography Standardization Conference, Online, 7–9 June 2021; Available online: https://csrc.nist.gov/CSRC/media/Events/third-pqc-standardization-conference/documents/accepted-papers/atkins-requirements-pqc-iot-pqc2021.pdf (accessed on 12 February 2026).
Fournaris, A.P.; Tasopoulos, G.; Brohet, M.; Regazzoni, F. Running longer to slim down: Post-quantum cryptography on memory-constrained devices. In Proceedings of the 2023 IEEE International Conference on Omni-Layer Intelligent Systems (COINS), Berlin, Germany, 23–25 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar] [CrossRef]
Kumari, S.; Singh, M.; Singh, R.; Tewari, H. Post-quantum cryptography techniques for secure communication in resource-constrained Internet of Things devices: A comprehensive survey. Softw. Pract. Exp. 2022, 52, 2047–2076. [Google Scholar] [CrossRef]
Stebila, D.; Fluhrer, S.; Gueron, S. Hybrid Key Exchange in TLS 1.3. Internet-Draft Draft-Ietf-Tls-Hybrid-Design-16; IETF: Fremont, CA, USA, 2025. [Google Scholar]
Bindel, N.; Hamburg, M.; Hövelmanns, K.; Hülsing, A.; Persichetti, E. Tighter proofs of CCA security in the quantum random oracle model. In Theory of Cryptography; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2019; pp. 61–90. [Google Scholar] [CrossRef]
Düll, M.; Haase, B.; Hinterwälder, G.; Hutter, M.; Paar, C.; Sánchez, A.H.; Schwabe, P. High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers. Des. Codes Cryptogr. 2015, 77, 493–514. [Google Scholar] [CrossRef]
Botros, L.; Kannwischer, M.J.; Schwabe, P. Memory-efficient high-speed implementation of Kyber on Cortex-M4. Cryptol. ePrint Arch. 2019, 2019, 489. [Google Scholar]
Huang, J.; Zhao, H.; Zhang, J.; Dai, W.; Zhou, L.; Cheung, R.C.C.; Koç, Ç.K.; Chen, D. Yet Another Improvement of Plantard Arithmetic for Faster Kyber on Low-End 32-Bit IoT Devices. IEEE Trans. Inf. Forensics Secur. 2024, 19, 3800–3813. [Google Scholar] [CrossRef]
Roy, S.S.; Vercauteren, F.; Mentens, N.; Chen, D.D.; Verbauwhede, I. Compact ring-LWE cryptoprocessor. In Proceedings of the 16th International Conference on Cryptographic Hardware and Embedded Systems (CHES), Busan, Republic of Korea, 23–26 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 371–391. [Google Scholar] [CrossRef]
Open Quantum Safe. Liboqs: Open Quantum Safe C Library for Quantum-Safe Cryptographic Algorithms; GitHub Repository. 2026. Available online: https://github.com/open-quantum-safe/liboqs (accessed on 1 February 2026).
Abraham, I.; Asharov, G.; Patil, S.; Patra, A. Perfect asynchronous MPC with linear communication overhead. In Advances in Cryptology—EUROCRYPT 2024; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2024; pp. 280–309. [Google Scholar] [CrossRef]
Rückert, M.; Schneider, M. Estimating the security of lattice-based cryptosystems. Cryptol. ePrint Arch. 2010, 2010, 137. [Google Scholar]
Park, A.; Shim, K.-A.; Koo, N.; Han, D.-G. Side-channel attacks on post-quantum signature schemes based on multivariate quadratic equations. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018, 2018, 500–523. [Google Scholar] [CrossRef]
Coron, J.-S.; Gérard, F.; Trannoy, M.; Zeitoun, R. High-Order Masking of NTRU. TCHES 2023, 2023, 180–211. [Google Scholar] [CrossRef]
Kannwischer, M.J.; Rijneveld, J.; Schwabe, P.; Stoffelen, K. pqm4: Testing and benchmarking NIST PQC on ARM Cortex-M4. Cryptol. ePrint Arch. 2019, 2019, 844. [Google Scholar]
Banerjee, U.; Ukyab, T.S.; Chandrakasan, A.P. Sapphire: A configurable crypto-processor for post-quantum lattice-based protocols. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2019, 2019, 17–61. [Google Scholar] [CrossRef]
Crockett, E.; Paquin, C.; Stebila, D. Prototyping post-quantum and hybrid key exchange and authentication in TLS and SSH. Cryptol. ePrint Arch. 2019, 2019, 858. Available online: https://eprint.iacr.org/2019/858 (accessed on 12 February 2026).
Cooper, D.A.; Apon, D.C.; Dang, Q.H.; Davidson, M.S.; Dworkin, M.J.; Miller, C.A. Recommendation for Stateful Hash-Based Signature Schemes; NIST Special Publication 800-208; NIST: Gaithersburg, MD, USA, 2020. [CrossRef]
Hülsing, A. W-OTS+—Shorter signatures for hash-based signature schemes. In Progress in Cryptology—AFRICACRYPT 2013; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; pp. 173–188. [Google Scholar] [CrossRef]
Buchmann, J.; Dahmen, E.; Hülsing, A. XMSS—A practical forward secure signature scheme based on minimal security assumptions. In Post-Quantum Cryptography; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; pp. 117–129. [Google Scholar] [CrossRef]
Bernstein, D.J.; Hopwood, D.; Hülsing, A.; Lange, T.; Niederhagen, R.; Papachristodoulou, L.; Schneider, M.; Schwabe, P.; Wilcox-O’Hearn, Z. SPHINCS: Practical stateless hash-based signatures. In Advances in Cryptology—EUROCRYPT 2015; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2015; pp. 368–397. [Google Scholar] [CrossRef]
Merkle, R.C. A certified digital signature. In Advances in Cryptology—CRYPTO’ 89 Proceedings; Lecture Notes in Computer Science; Springer: New York, NY, USA, 1989; pp. 218–238. [Google Scholar] [CrossRef]
Howe, J.; Pöppelmann, T.; O’Neill, M.; O’Sullivan, E.; Güneysu, T. Practical lattice-based digital signature schemes. ACM Trans. Embed. Comput. Syst. 2015, 14, 1–24. [Google Scholar] [CrossRef]
Patterson, J.C.; Buchanan, W.J.; Turino, C. Energy consumption framework and analysis of post-quantum key-generation on embedded devices. J. Cybersecur. Priv. 2025, 5, 42. [Google Scholar] [CrossRef]
Bisheh-Niasar, M.; Azarderakhsh, R.; Mozaffari-Kermani, M. Instruction-set accelerated implementation of CRYSTALS-Kyber. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 4648–4659. [Google Scholar] [CrossRef]
Singh, H. Code Based Cryptography: Classic McEliece. arXiv 2020, arXiv:1907.12754. [Google Scholar] [CrossRef]
D’Anvers, J.-P.; Vercauteren, F.; Verbauwhede, I. The impact of error dependencies on Ring/Mod-LWE/LWR based schemes. Cryptol. ePrint Arch. 2018, 2018, 1172. [Google Scholar]
Hülsing, A.; Rijneveld, J.; Schanck, J.; Schwabe, P. High-speed key encapsulation from NTRU. In Cryptographic Hardware and Embedded Systems—CHES 2017; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2017; pp. 232–252. [Google Scholar] [CrossRef]
D’Anvers, J.-P.; Karmakar, A.; Sinha Roy, S.; Vercauteren, F. Saber: Module-LWR based key exchange, CPA-secure encryption and CCA-secure KEM. Cryptol. ePrint Arch. 2018, 2018, 230. [Google Scholar]
Chen, C.; Danba, O.; Hoffstein, J.; Hülsing, A.; Rijneveld, J.; Schanck, J.M.; Schwabe, P.; Whyte, W.; Zhang, Z. NTRU Algorithm Specifications and Supporting Documentation (Round 3); Technical Report. 2021. Available online: https://ntru.org/ (accessed on 12 February 2026).
Pöppelmann, T.; Oder, T.; Güneysu, T. High-performance ideal lattice-based cryptography on 8-bit ATxmega microcontrollers. Cryptol. ePrint Arch. 2015, 2015, 382. [Google Scholar]
FIPS 205; Stateless Hash-Based Digital Signature Standard. National Institute of Standards and Technology (U.S.): Gaithersburg, MD, USA, 2024. [CrossRef]
Astrizi, T.L.; Custódio, R. Seamless transition to post-quantum TLS 1.3: A hybrid approach using identity-based encryption. Sensors 2024, 24, 7300. [Google Scholar] [CrossRef] [PubMed]
Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schwabe, P.; Seiler, G.; Stehlé, D. CRYSTALS-Dilithium: A lattice-based digital signature scheme. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018, 2018, 238–268. [Google Scholar] [CrossRef]
Ducas, L.; Durmus, A.; Lepoint, T.; Lyubashevsky, V. Lattice signatures and bimodal Gaussians. Cryptol. ePrint Arch. 2013, 2013, 383. [Google Scholar]
D’Anvers, J.-P.; Roelens, M.; Verbauwhede, I. Revisiting higher-order masked comparison for lattice-based cryptography: Algorithms and bit-sliced implementations. Cryptol. ePrint Arch. 2022, 2022, 110. [Google Scholar] [CrossRef]
Nir, Y.; Langley, A. ChaCha20 and Poly1305 for IETF Protocols; RFC 8439; IETF: Fremont, CA, USA, 2018; Available online: https://www.rfc-editor.org/info/rfc8439 (accessed on 12 February 2026).
Gandhi, A.; Das, A.; Cherukuri, A.K. On Implementing Hybrid Post-Quantum End-to-End Encryption (Version 1). arXiv 2026.
de la Torre, M.A.G.; Sandoval, I.A.M.; de Abreu, G.T.F.; Encinas, L.H. Post-Quantum Wireless-Based Key Encapsulation Mechanism via CRYSTALS-Kyber for Resource-Constrained Devices (Version 1). arXiv 2025.
Melo, V.D.; Buchanan, W.J. KyFrog: A High-Security LWE-Based KEM Inspired by ML-KEM (Version 1). arXiv 2025.
Zhang, X.; Deng, H.; Wu, R.; Ren, J.; Ren, Y. PQSF: Post-Quantum Secure Privacy-Preserving Federated Learning. Sci. Rep. 2024, 14, 23553.
Lansiaux, E. Zero-Knowledge Federated Learning with Lattice-Based Hybrid Encryption for Quantum-Resilient Medical AI (Version 1). arXiv 2026.
Zhandry, M. How to construct quantum random functions. J. ACM 2021, 68, 1–43.
Prajapat, S.; Gautam, D.; Kumar, P.; Jangirala, S.; Kumar Das, A.; Sikdar, B. Secure lattice-based signature scheme for Internet of Things applications. IEEE Access 2025, 13, 75985–75999.
Bos, J.W.; Gourjon, M.; Renes, J.; Schneider, T.; Van Vredendaal, C. Masking Kyber: First- and higher-order implementations. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 2021, 173–214.
Kannwischer, M.J.; Rijneveld, J.; Schwabe, P. Faster multiplication in $\mathbb{Z}_{2^m}[x]$ on Cortex-M4 to speed up NIST PQC candidates. In Applied Cryptography and Network Security; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2019; pp. 281–301.
Iavich, M.; Kapalova, N.; Sakan, K. Efficient lattice-based digital signatures for embedded IoT systems. Symmetry 2025, 17, 1522.
Barbosa, M.; Kannwischer, M.J.; Lim, T.; Schwabe, P.; Strub, P.-Y. Formally verified correctness bounds for lattice-based cryptography. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (CCS ’25); ACM: New York, NY, USA, 2025; pp. 156–169.

Figure 1. Adversary model and defenses for Merkle-LWE, including computational bounds, chosen-ciphertext and side-channel access, with implemented countermeasures. Different background colors indicate distinct component categories.

Figure 2. Memory-first architecture of Merkle-LWE showing seed-based Module-LWE core, sparse secret representation, and Merkle commitment layer under the Fujisaki–Okamoto transform.

Figure 3. Key generation workflow in Merkle-LWE. The diagram illustrates the deterministic generation of the public matrix seed, sparse secret polynomial, and Merkle commitment root from sampled entropy, highlighting the convergence of these components into the final public and private keys and the elimination of explicit vector storage. Light blue circles denote start/end nodes, green rectangles represent internal computation steps, and orange rectangles indicate final key material storage. The gray dashed box highlights on-demand regeneration. Light yellow background shading groups steps into three logical branches: Public Matrix, Merkle Commitment, and Sparse Secret. Solid arrows indicate direct data flow and sequential processing; dashed arrows denote implicit derivation or regeneration.
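
The Merkle Commitment branch of the key generation workflow can be illustrated with a minimal sketch. It assumes 128 leaves (the error set size $E$ in Table 4) hashed with SHA3-256. The leaf encoding, the domain-separation bytes, and the helper names `derive_patterns` and `merkle_root` are hypothetical, and SHAKE-256 stands in for the ChaCha20 expansion because the Python standard library does not ship ChaCha20.

```python
import hashlib


def derive_patterns(seed: bytes, count: int = 128) -> list:
    """Expand a seed into `count` 32-byte placeholder error patterns.

    Assumption: SHAKE-256 replaces the ChaCha20 PRNG named in the paper.
    """
    stream = hashlib.shake_256(seed).digest(32 * count)
    return [stream[32 * i:32 * (i + 1)] for i in range(count)]


def merkle_root(leaves: list) -> bytes:
    """Return the SHA3-256 Merkle root of a power-of-two list of leaves."""
    if not leaves or len(leaves) & (len(leaves) - 1):
        raise ValueError("leaf count must be a non-zero power of two")
    # Hypothetical domain separation: 0x00 for leaves, 0x01 for inner nodes.
    level = [hashlib.sha3_256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        level = [
            hashlib.sha3_256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


seed = bytes(32)  # all-zero demo seed, not a real key
root = merkle_root(derive_patterns(seed))
print(root.hex())
```

Only the 32-byte root needs to be stored; under these assumptions the leaves can be regenerated from the seed whenever they are needed, which mirrors the on-demand regeneration box in the figure.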

Figure 4. Overview of a CMake-based cross-compilation workflow showing desktop and embedded toolchains (GCC and GNU Arm Embedded), their compiler optimizations and runtime assumptions, and how they map onto x86_64 server/desktop and ARM Cortex-M4 IoT deployment environments, alongside the associated software stack and memory/compute constraints.

Figure 5. Percentage changes across object components, quantifying storage savings and minor overheads introduced by commitments.

Figure 6. Internal composition of Merkle-LWE public keys, private keys, and ciphertexts, highlighting the balance between lattice data, hash/Merkle commitments, and auxiliary fields.

Figure 7. Comparison of object sizes between Traditional LWE and Merkle-LWE KEM, showing significant reductions in public key and ciphertext sizes.

Figure 8. Distribution of flash usage across functional modules in Merkle-LWE KEM, including lattice arithmetic, PRNG, hashing, and Merkle operations.

Figure 9. Comparison of flash footprints between Traditional and Merkle-LWE KEM, showing reduced lattice code size but added PRNG and Merkle routines.

Figure 10. Flash memory suitability across embedded platforms, confirming Merkle-LWE remains within typical MCU constraints.

Figure 11. Peak RAM usage across Merkle-LWE, Traditional LWE, Kyber, and NTRU for key generation, encapsulation, and decapsulation.

Figure 12. RAM bottlenecks by component per operation, identifying error pattern generation and matrix handling as dominant contributors.

Figure 13. RAM utilization percentages across embedded platforms, demonstrating Merkle-LWE’s feasibility for constrained devices.

Figure 14. Total memory traffic comparison between Merkle-LWE and Traditional LWE across operations, showing reduced traffic in Merkle-LWE.

Figure 15. Memory access patterns categorized into sequential, random, PRNG expansion, and hashing, illustrating locality trade-offs.

Figure 16. Component-level breakdown of memory traffic, highlighting redistribution from matrix storage to PRNG expansion.

Figure 17. Cache miss rates by operation, showing elevated L1 misses in Merkle-LWE but no L2 misses.

Figure 18. Locality analysis linking miss rates to sparse secret access versus sequential operations.

Figure 19. Cache hit rate comparison, emphasizing the efficiency of sequential components and the impact of sparse indexing.

Figure 20. CPU cycle counts per operation with annotated overhead percentages, showing predictable increases in Merkle-LWE.

Figure 21. Breakdown of computational costs by component, with matrix and PRNG operations dominating the overhead.

Figure 22. Memory–computation trade-off analysis across cryptographic schemes. The scatter plot compares peak RAM usage and CPU cycle counts for key generation (circles), encapsulation (squares), and decapsulation (triangles) operations. Colored arrows indicate the design positioning of each approach: blue highlights Merkle-LWE’s shift toward low-memory, high-computation execution; green denotes the balanced NIST PQC baseline; and red represents traditional high-memory, low-computation designs. The dashed curve illustrates the theoretical inverse trade-off boundary, where reduced memory footprint necessitates increased computational cost. Merkle-LWE deliberately operates in the low-memory regime, exchanging higher CPU cycles for significantly reduced peak RAM requirements, making it suitable for resource-constrained environments.

Figure 23. Scheme-averaged trade-off view with variability, highlighting distinct operational profiles of Merkle-LWE, Traditional LWE, Kyber, and NTRU. The gray dashed line represents the theoretical memory–computation trade-off boundary, illustrating the inverse relationship between peak RAM usage and CPU cycles.

Figure 24. Correlation between memory traffic and energy consumption, confirming memory access dominates energy usage.

Figure 25. Energy consumption breakdown by component, showing that Merkle-LWE reduces hashing and memory costs while slightly increasing computation.

Figure 26. Failure probability across trials, with no decryption failures observed within the tested range.

Figure 27. Comparative failure rates across schemes, showing that Traditional LWE exceeds the acceptable threshold, while Merkle-LWE and NIST PQC remain below it.

Figure 28. Statistical significance analysis of trials required to achieve confidence at 10⁻⁶, indicating approximately three million trials are needed.
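
The figure of roughly three million trials is consistent with the standard zero-failure bound: if $n$ independent trials all succeed, a failure probability $p$ can be excluded at confidence $c$ once $(1-p)^n \le 1-c$. The sketch below evaluates that bound. The 95% confidence level and the function name `trials_for_bound` are assumptions, since the caption does not state the level used.

```python
import math


def trials_for_bound(p_max: float, confidence: float = 0.95) -> int:
    """Smallest n with (1 - p_max)**n <= 1 - confidence.

    Assumes independent trials and zero observed failures.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_max))


n = trials_for_bound(1e-6)
print(n)  # close to 3 / p_max, the "rule of three"
```

For small $p$ the bound reduces to $n \approx -\ln(1-c)/p$, which is about $3/p$ at 95% confidence.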

Table 1. Target key, ciphertext, and shared-secret sizes for Merkle-LWE across NIST-aligned security levels.

Security Level	Public Key	Private Key	Ciphertext	Shared Secret
Level 1 (128-bit)	96 B	160 B	128 B	32 B
Level 3 (192-bit)	96 B	192 B	160 B	32 B
Level 5 (256-bit)	96 B	224 B	192 B	32 B

Table 2. Concrete security estimates (bits) for the Merkle-LWE parameter sets in Table 4. Estimates account for primal/dual lattice reduction and combinatorial attacks on sparse secrets.

Level	Lattice Reduction Security	Combinatorial Security $\log_{2} (\binom{n}{w})$	Overall Security (min)	NIST Target	Margin
1 (128-bit)	142.8	143.2	142.8	≥128	+14.8
3 (192-bit)	207.3	188.7	188.7	≥192	−3.3
5 (256-bit)	271.4	234.5	234.5	≥256	−21.5

Table 3. Security comparison between Merkle-LWE and Kyber across NIST levels (classical bits).

Scheme	Level	Lattice Security	Other *	Overall	Margin
Merkle-LWE	1	142.8	143.2 (comb.)	142.8	+14.8
Kyber-512	1	143.0	N/A	143.0	+15.0
Merkle-LWE	3	207.3	188.7 (comb.)	188.7	−3.3
Kyber-768	3	207.0	N/A	207.0	+15.0
Merkle-LWE	5	271.4	234.5 (comb.)	234.5	−21.5
Kyber-1024	5	270.5	N/A	270.5	+14.5

* “N/A” = not applicable (Kyber uses dense secrets; combinatorial attacks exploiting sparse secrets do not apply). “(comb.)” = combinatorial attack exploiting low Hamming weight.
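
The Overall and Margin columns of Tables 2 and 3 follow from two steps: take the minimum of the lattice-reduction and combinatorial estimates, then subtract the NIST target. A short sketch that recomputes the Merkle-LWE rows from the tabulated values (the dictionary layout and function name are illustrative only):

```python
# Merkle-LWE estimates in bits, copied from Table 2.
ESTIMATES = {
    1: {"lattice": 142.8, "comb": 143.2, "target": 128},
    3: {"lattice": 207.3, "comb": 188.7, "target": 192},
    5: {"lattice": 271.4, "comb": 234.5, "target": 256},
}


def overall_and_margin(lattice: float, comb: float, target: int):
    """Return (overall security, margin over the NIST target) in bits."""
    overall = min(lattice, comb)  # the weakest attack path decides
    return overall, round(overall - target, 1)


results = {lvl: overall_and_margin(**row) for lvl, row in ESTIMATES.items()}
for lvl, (overall, margin) in results.items():
    status = "meets target" if margin >= 0 else "below target"
    print(f"Level {lvl}: overall {overall} bits, margin {margin:+} ({status})")
```

At Levels 3 and 5 the combinatorial estimate is the binding term, which is why the margins are negative even though the lattice-reduction estimates alone exceed the targets.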

Table 4. Concrete parameter sets for Merkle-LWE across NIST-aligned security levels.

Parameter	Level 1 (128-bit)	Level 3 (192-bit)	Level 5 (256-bit)
Lattice dimension $n$	256	256	256
Modulus $q$	3329	3329	3329
Module rank $k$	2	3	4
Secret sparsity weight $w$	48	64	80
Coefficient bound $β$	1	1	1
Error set size $E$	128	128	128
PRNG: ChaCha20 nonce size	12 bytes	12 bytes	12 bytes
Hash: SHA3-256/512 output	32/64 bytes	32/64 bytes	32/64 bytes
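
To make the sparsity weight $w$ and coefficient bound $β = 1$ concrete, the following sketch samples a length-256 polynomial with exactly $w$ non-zero coefficients in $\{-1, +1\}$. It is an illustration under stated assumptions, not the paper's sampler: SHAKE-256 replaces ChaCha20 (which the Python standard library lacks), the sort-based support selection is one of several possible methods, and the code is not constant-time. Whether $w$ applies per polynomial or per module vector is not specified in this table, so the sketch treats a single polynomial.

```python
import hashlib


def sample_sparse_secret(seed: bytes, w: int, n: int = 256) -> list:
    """Return n coefficients in {-1, 0, 1} with exactly w non-zero entries."""
    if not 0 <= w <= n:
        raise ValueError("weight must satisfy 0 <= w <= n")
    # 4 bytes per index for a pseudorandom sort key, then 1 byte per sign.
    stream = hashlib.shake_256(seed).digest(4 * n + w)
    keys = [int.from_bytes(stream[4 * i:4 * i + 4], "little") for i in range(n)]
    # The w indices with the smallest keys form the support; ties break by index.
    support = sorted(range(n), key=lambda i: (keys[i], i))[:w]
    secret = [0] * n
    for j, idx in enumerate(support):
        secret[idx] = 1 if stream[4 * n + j] & 1 else -1
    return secret


secret = sample_sparse_secret(b"demo seed", w=48)
print(sum(1 for c in secret if c != 0))  # number of non-zero coefficients
```

Storing only the support indices and sign bits, rather than all 256 coefficients, is what makes a sparse secret compact.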

Table 5. Parameter alignment for Merkle-LWE, Kyber, and NTRU across NIST security levels. All schemes are evaluated under equivalent security targets; Merkle-LWE parameters are chosen to align with the concrete hardness assumptions of standardized alternatives.

Security Level	Scheme	Lattice Dimension $n$	Modulus $q$	Module Rank $k$	Secret Sparsity $w$	Public Key	Private Key	Ciphertext
Level 1 (128-bit)	Merkle-LWE	256	3329	2	48	96 B	160 B	128 B
	Kyber-512	256	3329	2	dense	800 B	1632 B	768 B
	NTRU-HPS-2048-677	677	2048	-	dense	1218 B	1306 B	1024 B
Level 3 (192-bit)	Merkle-LWE	256	3329	3	64	96 B	192 B	160 B
	Kyber-768	256	3329	3	dense	1184 B	2400 B	1088 B
	NTRU-HPS-4096-821	821	4096	-	dense	1514 B	1632 B	1280 B
Level 5 (256-bit)	Merkle-LWE	256	3329	4	80	96 B	224 B	192 B
	Kyber-1024	256	3329	4	dense	1568 B	3168 B	1568 B
	NTRU-HPS-4096-1229	1229	4096	-	dense	1890 B	2016 B	1792 B

B = bytes. Module rank “-” indicates the scheme does not use a module structure.
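
The relative savings implied by Table 5 can be computed directly from the tabulated sizes. The sketch below compares Merkle-LWE with the Kyber parameter set at each level; the data layout and the `reduction` helper are illustrative, and the byte counts are taken from the table as given.

```python
# (public key, private key, ciphertext) sizes in bytes, from Table 5.
SIZES = {
    1: {"Merkle-LWE": (96, 160, 128), "Kyber": (800, 1632, 768)},
    3: {"Merkle-LWE": (96, 192, 160), "Kyber": (1184, 2400, 1088)},
    5: {"Merkle-LWE": (96, 224, 192), "Kyber": (1568, 3168, 1568)},
}


def reduction(ours: int, reference: int) -> float:
    """Percentage size reduction of `ours` relative to `reference`."""
    return round(100.0 * (1.0 - ours / reference), 1)


reductions = {
    level: tuple(reduction(m, k)
                 for m, k in zip(row["Merkle-LWE"], row["Kyber"]))
    for level, row in SIZES.items()
}
for level, (pk, sk, ct) in reductions.items():
    print(f"Level {level}: public key -{pk}%, private key -{sk}%, "
          f"ciphertext -{ct}%")
```

These percentages describe the target object sizes only; they say nothing about the negative security margins reported for Levels 3 and 5.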
