1. Introduction
The advent of large-scale quantum computing poses a fundamental challenge to the public-key cryptography that underpins modern digital infrastructure. Widely deployed schemes such as RSA and elliptic-curve cryptography (ECC) rely on number-theoretic problems (integer factorization and discrete logarithms) that are efficiently solvable by Shor’s quantum algorithm [1, 2]. This vulnerability has catalysed a global effort to develop cryptographic primitives secure against quantum adversaries, a field now known as post-quantum cryptography (PQC) [3]. After a rigorous multi-year evaluation, the U.S. National Institute of Standards and Technology (NIST) has selected CRYSTALS-Kyber as its primary Key Encapsulation Mechanism (KEM) for standardization, with CRYSTALS-Dilithium chosen for digital signatures [4, 5, 6]. Both are based on structured lattice problems, reflecting the community’s confidence in their security and efficiency.
Lattice-based cryptography, particularly constructions derived from the Learning With Errors (LWE) problem and its algebraically structured variants (Ring-LWE and Module-LWE), has emerged as the most versatile PQC family [7, 8]. These schemes benefit from strong worst-case hardness guarantees: breaking them implies solving hard approximation problems on arbitrary lattices, such as the Shortest Vector Problem (SVP) [9]. Structured variants like Module-LWE significantly improve performance by enabling fast polynomial arithmetic via the Number Theoretic Transform (NTT), while retaining reductions to worst-case lattice problems [10, 11, 12]. Kyber, for instance, achieves IND-CCA2 security through a Fujisaki–Okamoto transform applied to a Module-LWE-based public-key encryption core, offering compact parameters and high throughput on general-purpose processors [5, 13].
However, the practical deployment of these schemes on resource-constrained embedded platforms, such as ARM Cortex-M microcontrollers, RISC-V-based IoT nodes, and smart cards, remains problematic [14, 15]. These devices often operate with only 16–64 KB of RAM and 256–512 KB of flash storage [16]. In such environments, the kilobyte-scale public keys and ciphertexts of Kyber (e.g., a 1184 B public key and a 1088 B ciphertext for Kyber768 at NIST Security Level 3) can consume a disproportionate share of available memory [13]. While acceptable in server or desktop contexts, this footprint becomes prohibitive when multiple concurrent sessions, protocol buffers, or application logic must coexist in tight memory budgets [17, 18].
More critically, memory constraints on embedded systems extend beyond static storage. Memory bandwidth, cache behaviour, and energy consumption associated with data movement frequently dominate total execution cost, often exceeding the cost of the arithmetic operations themselves [19, 20]. A scheme that minimizes RAM usage but exhibits poor spatial or temporal locality may still incur high latency and power draw due to frequent cache misses or external memory accesses. This reality has motivated research into “lightweight” PQC, focusing on reduced parameter sets, optimized modular arithmetic, and platform-specific assembly [21, 22, 23]. Yet many such optimizations involve implicit trade-offs: lowering security margins, increasing decryption failure probabilities, or relying on precomputed tables that exacerbate storage demands [24, 25]. For example, some lightweight Kyber variants reduce polynomial degree or modulus size, but this can weaken concrete security estimates or complicate side-channel resistance [26, 27, 28].
A key insight is that a significant portion of the memory footprint in lattice-based schemes arises not from secret material, but from public, deterministic structures. The public matrix A in Module-LWE is typically generated from a short seed using a pseudorandom function (PRF), yet many implementations store it explicitly to simplify code and avoid recomputation [23, 29]. Similarly, error vectors sampled during encapsulation are often stored in full, despite being reproducible or verifiable through alternative means. Storing these structures trades memory for simplicity, but on constrained devices this trade-off is often inverted: computation is abundant, while flash and RAM are scarce [16]. Regenerating A on the fly from a seed reduces static storage dramatically, from kilobytes to 32 bytes, but introduces computational overhead and potential timing leakage if not implemented carefully [29].
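As a minimal sketch of regenerating a matrix row from a seed instead of storing it, the following Python fragment derives each polynomial by rejection sampling from an extendable-output function. It is an illustration under stated assumptions rather than the construction evaluated in this paper: SHAKE-128 is used as a stand-in for the PRF, the modulus 3329 and degree 256 are borrowed from Kyber purely as example parameters, and the names expand_entry and matrix_row are hypothetical.

```python
import hashlib

Q = 3329  # example modulus (Kyber's), an assumption for illustration only
N = 256   # example number of coefficients per polynomial

def expand_entry(seed, i, j):
    """Derive the polynomial at row i, column j from the seed.

    Candidates are 12-bit values parsed from the XOF output; values >= Q
    are rejected so that accepted coefficients are uniform in [0, Q).
    """
    length = 3 * N  # 768 bytes give 512 candidates, normally far more than needed
    while True:
        stream = hashlib.shake_128(seed + bytes([j, i])).digest(length)
        coeffs = []
        for k in range(0, length, 3):
            b0, b1, b2 = stream[k], stream[k + 1], stream[k + 2]
            d1 = b0 | ((b1 & 0x0F) << 8)
            d2 = (b1 >> 4) | (b2 << 4)
            for d in (d1, d2):
                if d < Q and len(coeffs) < N:
                    coeffs.append(d)
        if len(coeffs) == N:
            return coeffs
        length *= 2  # extremely unlikely: ask the XOF for a longer output

def matrix_row(seed, i, k):
    """Regenerate row i of a k-by-k matrix; nothing but the seed is stored."""
    return [expand_entry(seed, i, j) for j in range(k)]

if __name__ == "__main__":
    seed = bytes(range(32))  # placeholder 32-byte seed
    row = matrix_row(seed, 0, 2)
    print(len(row), len(row[0]), row[0][:4])
```

Only one row needs to be resident at a time, which is the memory-for-computation exchange described above. The rejection loop runs a data-dependent number of iterations, but it depends only on the public seed.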
This observation aligns with a broader trend toward hybrid cryptographic designs, which combine multiple independent hardness assumptions to hedge against unforeseen cryptanalytic breakthroughs [17]. Hybrid key exchange, for instance, merges classical (e.g., X25519) and post-quantum (e.g., Kyber) mechanisms so that an adversary must break both to compromise security [18, 30]. Early deployments in TLS 1.3 have demonstrated that such hybrids incur manageable latency and bandwidth overheads in real-world settings [17, 30]. More recently, researchers have explored integrating lattice-based KEMs with hash-based components, not just for transitional security, but to enhance robustness through design diversity [31, 32].
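The idea that an adversary must break both components can be illustrated with a simple secret combiner. The sketch below is a generic hash-based combiner, not the key schedule used in TLS 1.3 or in any cited deployment; the function name, the length-prefixed encoding, and the placeholder inputs are all assumptions made for illustration.

```python
import hashlib

def combine_shared_secrets(ss_classical, ss_pq, transcript):
    """Derive one session key from two independently established secrets.

    Each input is length-prefixed so that different splits of the same
    bytes cannot collide. The key stays unpredictable as long as at least
    one of the two secrets is unknown to the adversary.
    """
    h = hashlib.sha3_256()
    for part in (ss_classical, ss_pq, transcript):
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.digest()

if __name__ == "__main__":
    # Placeholder values standing in for an X25519 output and a KEM output.
    ss_x25519 = bytes([0x11]) * 32
    ss_kem = bytes([0x22]) * 32
    key = combine_shared_secrets(ss_x25519, ss_kem, b"example-transcript")
    print(key.hex())
```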
Hash-based cryptography offers a compelling complement to lattice-based schemes. Its security rests on the collision and preimage resistance of well-studied hash functions, a conservative, structure-free assumption widely believed to resist quantum attacks when properly parameterized [33]. NIST has standardized hash-based signatures like LMS and XMSS, which use Merkle trees to aggregate one-time signature keys into a single public root [31, 34]. Although these schemes suffer from large signature sizes or state management complexities, their conceptual simplicity and strong security make them ideal for authentication layers in hybrid systems [30].
Crucially, Merkle trees also enable efficient commitments to large data structures. Instead of storing an entire vector, one can store its Merkle root and reveal only the necessary authentication path during verification. This property has been underutilized in PQC KEMs, where error vectors and other auxiliary data are typically transmitted in full. By committing to these values via a Merkle tree, correctness can be preserved without explicit storage, reducing both static and dynamic memory requirements [35].
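The commit-then-reveal pattern described above can be sketched in a few lines. The example below builds a binary Merkle tree over a small vector, extracts an authentication path for one entry, and checks that path against the root. It is a generic textbook construction under stated assumptions: SHA3-512 is chosen to match the 64-byte root size quoted later in the paper, the one-byte domain-separation prefixes are an added convention, and the function names are hypothetical.

```python
import hashlib

def H(data):
    return hashlib.sha3_512(data).digest()

def build_tree(leaves):
    """Return all tree levels; level 0 holds the hashed leaves.

    The number of leaves is assumed to be a power of two.
    """
    level = [H(b"\x00" + leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [H(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def auth_path(levels, index):
    """Collect the sibling hash at every level below the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])
        index >>= 1
    return path

def verify(root, leaf, index, path):
    """Recompute the root from one leaf and its authentication path."""
    node = H(b"\x00" + leaf)
    for sibling in path:
        if index & 1:
            node = H(b"\x01" + sibling + node)
        else:
            node = H(b"\x01" + node + sibling)
        index >>= 1
    return node == root

if __name__ == "__main__":
    vector = [c.to_bytes(2, "little") for c in range(8)]  # toy 8-entry vector
    levels = build_tree(vector)
    root = levels[-1][0]
    path = auth_path(levels, 5)
    print(len(root), len(path), verify(root, vector[5], 5, path))
```

A verifier holding only the 64-byte root needs one leaf and log2(n) sibling hashes, rather than the whole vector.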
Despite these opportunities, existing hybrid PQC research focuses primarily on security composition or migration strategies, not on hardware-aware optimization [17, 31]. Memory efficiency, particularly the reduction of stored public parameters and improved access patterns, has rarely been treated as a first-class design objective. This gap is especially acute for embedded platforms, which are increasingly deployed in security-sensitive roles (e.g., industrial control, medical devices, automotive systems) yet remain underserved by current PQC standards [14, 36].
In this paper, we propose a hybrid KEM that explicitly prioritizes memory efficiency for constrained environments. Our construction integrates a seed-generated Module-LWE core with a hash-based authentication layer built around Merkle-tree commitments. In place of the public matrix, the public key carries only a compact seed; the matrix A is regenerated on demand using a lightweight PRF [29]. Error-related information is not stored explicitly but represented through a Merkle root, verified during decapsulation via succinct inclusion proofs [35]. Additional design choices, including sparsity-aware secret keys and cache-efficient NTT scheduling, further minimize runtime memory usage and improve energy efficiency [20, 37].
Our goal is not to replace Kyber, but to explore a design space where hybridization enables deployment on platforms currently excluded from PQC adoption [14]. We make three contributions:
The remainder of this paper is organized as follows.
Section 2 reviews related work on lightweight PQC, memory-efficient lattice-based constructions, and hybrid lattice–hash approaches.
Section 3 formalizes the design goals and system model, including precise memory/storage objectives, computational trade-offs, and the adversarial threat model guiding our construction.
Section 4 details the proposed Merkle-LWE hybrid key-encapsulation mechanism, presenting its seed-based Module-LWE core, sparse secret representation, and Merkle commitment layer within a unified architecture.
Section 5 describes critical implementation considerations for embedded platforms, including memory-efficient PRNG expansion, polynomial arithmetic optimizations [12], cache-aware computation, and constant-time side-channel mitigations [28].
Section 6 outlines the experimental methodology, target platforms (ARM Cortex-M4 and x86-64), measurement techniques, and reference schemes used for comparative evaluation [38].
Section 7 presents a comprehensive experimental evaluation across multiple dimensions—cryptographic object sizes, static code footprint, peak RAM usage, memory traffic, cache behaviour, computational cost, energy consumption, and correctness validation—demonstrating the scheme’s viability on resource-constrained devices.
Section 8 concludes with a synthesis of findings, limitations, security implications, and directions for future work in memory-optimized PQC.
2. Related Work
The transition to PQC has intensified research into lattice-based schemes due to their strong theoretical foundations, versatility, and relatively efficient performance on general-purpose hardware. The selection of CRYSTALS-Kyber as the primary KEM in the NIST PQC standardization process underscores the community’s confidence in Module Learning With Errors (Module-LWE) as a secure and practical foundation for quantum-resistant key exchange [4, 5]. Kyber leverages structured lattices and the Number Theoretic Transform (NTT) to achieve compact parameters and high throughput, with a Level 1 public key of 800 bytes and ciphertext of 768 bytes in its final specification [13]. While these sizes are manageable in server or desktop environments, they pose significant challenges for resource-constrained embedded platforms, such as ARM Cortex-M microcontrollers, RISC-V-based IoT nodes, and smart cards, which often operate with only tens of kilobytes of RAM and a few hundred kilobytes of flash storage [14, 16]. Storing and processing keys and ciphertexts of this size can consume a disproportionate share of available memory, leaving insufficient space for application logic, network buffers, or concurrent protocol sessions. This fundamental mismatch between standardized PQC and embedded constraints has motivated a growing body of work on lightweight and memory-efficient PQC, yet critical gaps remain [15, 29].
Efforts to deploy Kyber on constrained devices have yielded valuable insights but also exposed inherent trade-offs. Projects like Open Quantum Safe (OQS) and pqm4 provide portable and highly optimized implementations for platforms such as the ARM Cortex-M4, demonstrating that Kyber can indeed run on such hardware [23, 38]. However, these implementations often require substantial stack usage (e.g., >16 KB for Kyber-768) and long execution times (hundreds of thousands of CPU cycles), which can be prohibitive in real-time or battery-powered applications [21, 23]. To mitigate this, researchers have proposed “lightweight” variants like Kyber-LE or Kyber-Compact, which reduce polynomial degree, modulus size, or error distribution parameters to shrink memory footprint and accelerate computation [39]. While effective in reducing resource demands, such parameter reductions risk eroding concrete security margins and deviate from the standardized, vetted parameters that provide assurance in real-world deployments. Moreover, even in optimized implementations, the public matrix A is often stored explicitly for simplicity, despite being deterministically generatable from a short seed, a missed opportunity for static storage reduction that our work directly addresses.
Other lattice-based finalists in the NIST process offer alternative trade-offs but face similar limitations. SABER, based on the Module Learning With Rounding (Module-LWR) problem, claims slightly better performance on some embedded platforms due to simpler rounding-based arithmetic instead of Gaussian sampling [40]. Yet its recommended parameters still yield public keys around 1.1 KB, which remains large for deeply constrained devices. NTRU, while historically efficient and a third-round finalist, was ultimately not selected for standardization; it relies on different hardness assumptions and faced scrutiny during the evaluation regarding its security reduction and potential for decryption failures [41, 42]. Crucially, none of these schemes were designed with a memory-first philosophy; their optimizations primarily target computational speed or communication bandwidth, not the minimization of static storage or peak RAM usage, which are the true bottlenecks in embedded contexts. This architectural oversight leaves a gap for constructions that prioritize memory efficiency as a primary design goal rather than a secondary optimization.
Beyond lattice-based cryptography, other PQC families have been explored for lightweight applications, but with significant drawbacks. Code-based schemes like Classic McEliece, which advanced to the fourth round of the NIST process, offer extremely conservative security based on the NP-hardness of decoding random linear codes [43]. However, their public keys are enormous, ranging from 250 KB to over 1 MB, rendering them entirely impractical for microcontrollers. Structured code-based alternatives like LEDAcrypt or BIKE use quasi-cyclic codes to reduce key sizes to 1–2 KB, but introduce new complexities: BIKE’s decryption is probabilistic and can fail, requiring retransmission mechanisms that are difficult to implement securely in unreliable embedded environments prone to power loss or crashes [44]. Hash-based signatures, such as SPHINCS+ (NIST-standardized) or the stateful LMS/XMSS, provide another avenue grounded in the collision resistance of hash functions, a conservative, structure-free assumption [34, 45]. While SPHINCS+ is stateless and robust, its signature sizes are very large (8–49 KB), making it unsuitable for bandwidth-constrained IoT links. Stateful schemes like XMSS have smaller signatures but impose a critical operational burden: the signer must maintain a non-volatile counter to prevent catastrophic key reuse, a requirement that is error-prone in embedded systems without reliable persistent storage [31, 33]. These trade-offs highlight the difficulty of achieving both small size and strong security in non-lattice PQC for embedded use.
Within lattice-based cryptography, several works have attempted to tailor schemes specifically for embedded platforms through low-level engineering. The pqm4 project, for instance, provides hand-optimized assembly implementations that exploit instruction-level parallelism and register allocation on the Cortex-M4, yielding significant speedups [20, 23]. Similarly, Roy et al. explored high-precision arithmetic and cache-aware data layouts to minimize memory traffic during polynomial operations [22]. Banerjee et al. further analysed memory access patterns and proposed scheduling strategies to improve cache locality in NTT computations [29]. While these implementation-level optimizations are valuable, they operate within the constraints of existing scheme architectures and do not alter the fundamental representation of keys or public parameters. They optimize how data is processed, not what data is stored. In contrast, our work rethinks the very structure of the public and private keys, replacing large explicit vectors with compact seeds and cryptographic commitments, thereby addressing the root cause of memory inefficiency rather than its symptoms.
The concept of trading computation for memory is well-established, and its application to PQC is not new. Standardized schemes like Kyber and Dilithium already use a pseudorandom generator (PRG) to expand a short seed into the large public matrix A, so the public key consists of just the seed plus the vector t = As + e [6, 13]. This shrinks the matrix component of the public key from kilobytes to a short seed, but the recipient must still store or reconstruct the full A to perform operations, and the sender must transmit the large t vector. Our approach extends this principle more radically: we eliminate the need to store or transmit t altogether. Instead, the public key is a Merkle root that commits to the secret key’s coefficients, and verification during decapsulation is performed via a succinct Merkle proof. This shifts the paradigm from “store and verify” to “commit and prove,” a structural change that enables unprecedented key size reductions [35]. This use of Merkle trees for commitment, rather than just authentication, draws inspiration from techniques in zero-knowledge proofs and verifiable computation but appears novel in the context of PQC KEM design.
Hybrid cryptographic constructions, which combine multiple independent primitives, have become a dominant strategy for managing cryptographic risk during the PQC transition [17, 46]. The most common form combines a classical algorithm (e.g., ECDH) with a PQC one (e.g., Kyber) so that an adversary must break both to compromise the session. This approach is being actively deployed in TLS 1.3, with major tech companies running large-scale experiments that show manageable latency and bandwidth overheads [18, 30]. However, these transitional hybrids do not address the core memory inefficiency of the PQC component; they often double the key material size rather than reduce it. More relevant are hybrids that combine different post-quantum families to achieve design diversity. Cooper et al. explored combining lattice-based signatures with hash-based trees to create more efficient stateless signature schemes [31], while others have proposed using Merkle trees to authenticate components of a lattice-based scheme for integrity or multi-key aggregation. Astrizi et al. proposed a hybrid lattice-hash construction for lightweight IoT authentication, using a hash-based MAC to protect a lattice key exchange against fault attacks [46]. However, in all these cases, the hash component is an auxiliary layer; the public key remains a full lattice public key. Our work differs fundamentally: the hash-based Merkle tree is not an add-on but the core mechanism for public key representation. The public key is the Merkle root, enabling a holistic integration that leverages the strengths of both worlds for a singular purpose: memory minimization.
Recent research has also explored the use of sparse secrets in lattice cryptography to improve efficiency. Ducas et al. demonstrated that using sparse secrets can accelerate signing in Dilithium without compromising security, provided sparsity is carefully controlled [47, 48]. Bindel et al. analysed the concrete security of LWE with sparse secrets and provided guidelines for safe parameter choices, showing that moderate sparsity does not significantly weaken the underlying problem [18]. Our work adopts this insight but integrates it into a comprehensive memory-minimization framework. The private key is not a list of coefficients but a seed that regenerates a sparse polynomial via a deterministic shuffle (e.g., Fisher-Yates). This compresses the private key dramatically while maintaining security, and when combined with the Merkle-root public key, creates a fully compact key pair. This synergistic use of sparsity and commitment is a key innovation over prior work that treats sparsity as an isolated performance tweak.
Any claim of practicality for embedded PQC must also address side-channel vulnerabilities, particularly timing and power analysis attacks. Standardized schemes come with guidance on constant-time implementation, and projects like pqm4 include masked and hardened versions [27, 28]. However, many custom or lightweight PQC proposals sacrifice side-channel resistance for performance or size, inadvertently leaking information through conditional branches or table lookups [26]. Our design explicitly incorporates side-channel countermeasures: all core operations (polynomial arithmetic, hash computations, and Merkle path verification) are implemented in constant time. We deliberately choose ChaCha20 as the PRG for deterministic generation, as it is a well-vetted, constant-time stream cipher suitable for embedded use [49, 50]. Furthermore, by minimizing the amount of sensitive data stored in memory (e.g., the private key is just a seed), we reduce the attack surface for memory-scraping attacks. This embedded-aware security posture contrasts with approaches that optimize for speed at the expense of leakage resilience.
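To make the notion of branch-free Merkle path verification concrete, the sketch below replaces the usual "if the index bit is set" branch with a masked select, and compares the final hash with a constant-time comparison. This only illustrates the coding pattern: Python gives no real timing guarantees, the paper's implementation is not shown here, and the function names, the unprefixed node hashing, and the choice to treat the index as sensitive are assumptions.

```python
import hashlib
import hmac

def ct_select(bit, a, b):
    """Return a if bit == 1, else b, without branching on bit (bit is 0 or 1)."""
    mask = -bit & 0xFF  # 0xFF when bit == 1, 0x00 when bit == 0
    return bytes((x & mask) | (y & ~mask & 0xFF) for x, y in zip(a, b))

def verify_path_ct(root, leaf_hash, index, path):
    """Recompute the root from a leaf hash using masked left/right selection."""
    node = leaf_hash
    for sibling in path:
        bit = index & 1
        left = ct_select(bit, sibling, node)   # sibling is on the left when bit == 1
        right = ct_select(bit, node, sibling)  # node is on the right when bit == 1
        node = hashlib.sha3_512(left + right).digest()
        index >>= 1
    return hmac.compare_digest(node, root)

if __name__ == "__main__":
    h = lambda data: hashlib.sha3_512(data).digest()
    leaves = [h(bytes([i])) for i in range(4)]  # toy 4-leaf tree
    n01, n23 = h(leaves[0] + leaves[1]), h(leaves[2] + leaves[3])
    root = h(n01 + n23)
    print(verify_path_ct(root, leaves[2], 2, [leaves[3], n01]))
```

In C or assembly on a microcontroller, the same masking idea applies at word granularity, and the sequence of executed instructions is then identical for both values of the bit.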
Recent research in post-quantum cryptography (PQC) also increasingly focuses on practical deployment models that combine classical and quantum-resistant primitives. Gandhi et al. [51] propose a hybrid end-to-end encryption system that integrates CRYSTALS-Kyber with AES-256-GCM in a zero-trust messaging architecture. Their work demonstrates that NIST-standardized PQC primitives can be effectively incorporated into real-world communication systems with acceptable performance overhead while maintaining protection against both classical and quantum adversaries. A key advantage of this approach is its practical validation of Kyber-based systems in end-to-end encryption scenarios; however, it primarily focuses on system integration rather than reducing the underlying memory footprint of lattice-based key structures.
In the context of resource-constrained environments, González de la Torre et al. [52] explore the adaptation of CRYSTALS-Kyber for wireless and device-to-device communication systems operating under noisy channels. Their approach integrates modulation and error-correction coding (e.g., QAM and BCH codes) into the transmission of Kyber polynomial coefficients at the physical layer. This represents an important step toward embedding post-quantum cryptography into low-level communication stacks, demonstrating feasibility under real-world channel conditions. However, the scheme remains tightly coupled to full Kyber parameter sets and does not reduce key or ciphertext size, limiting its applicability in deeply constrained embedded systems where static memory is a primary bottleneck.
Duarte Melo et al. [53] propose KyFrog, a conservative LWE-based key encapsulation mechanism designed with significantly increased security margins through larger lattice dimensions and smaller modulus selection. While KyFrog achieves extremely high estimated classical and quantum security levels, this comes at the cost of substantially enlarged ciphertext sizes (on the order of hundreds of kilobytes). This highlights a fundamental trade-off in lattice-based cryptography between security margins and communication overhead. The advantage of this work lies in its exploration of an extreme security-performance point in the design space; however, it further emphasizes that increasing security parameters directly exacerbates memory and bandwidth constraints, making such approaches unsuitable for microcontroller-class devices.
From an application perspective, Zhang et al. [54] introduce PQSF, a post-quantum secure federated learning framework based on lattice-based secret sharing and double masking techniques. Their scheme demonstrates that lattice-based constructions can reduce communication complexity and computational overhead in distributed machine learning environments, achieving measurable efficiency improvements compared to prior secret-sharing approaches. The primary contribution of this work is the integration of post-quantum security into federated learning pipelines; however, it still relies on structured lattice primitives without addressing the underlying storage cost of cryptographic keys or the static memory footprint of cryptographic material.
More advanced hybrid cryptographic systems are explored by Lansiaux [55], who proposes a zero-knowledge federated learning framework combining ML-KEM, lattice-based zero-knowledge proofs, and homomorphic encryption. This multi-layer design achieves strong security guarantees, including resistance to quantum adversaries and verification of model update integrity under the Module-LWE and SIS assumptions. A key strength of this approach is its rigorous formalization of security properties and practical evaluation in medical AI settings. However, the resulting system introduces significant computational overhead (approximately 20×), illustrating that hybridization and additional cryptographic layers increase complexity without addressing the core issue of large static key representations in lattice-based schemes.
In summary, the proposed Merkle-LWE KEM advances the state of the art by introducing a memory-first architecture that fundamentally rethinks key representation in lattice-based cryptography. While individual ideas (seed-based generation, sparse secrets, Merkle commitments) exist in the literature, their combination into a cohesive, IND-CCA-secure KEM represents a significant and novel contribution [5]. Existing work either optimizes computation within fixed-parameter schemes, proposes non-lattice alternatives with their own size or complexity issues, or layers hash functions onto lattice schemes without altering their core memory structure. Our construction achieves public keys of 96 bytes and private keys of 160–224 bytes, roughly an order of magnitude smaller than those of standardized lattice-based KEMs, not by weakening security parameters, but by a novel structural representation that replaces large explicit vectors with compact cryptographic commitments. This design is explicitly tailored for the most constrained embedded platforms, where static storage and peak RAM are the primary bottlenecks, and it includes platform-specific optimizations, comprehensive benchmarking, and built-in side-channel resistance to ensure real-world viability. By shifting the resource balance from memory to controlled computation, Merkle-LWE opens a pathway to deploying quantum-resistant cryptography on a vast ecosystem of devices that would otherwise remain vulnerable in a post-quantum world.
3. Design Goals and System Model
The design of the Merkle-LWE KEM is driven by a singular, unifying objective: to enable quantum-resistant cryptography on deeply resource-constrained embedded platforms where memory, not computation, is the primary bottleneck. Traditional post-quantum cryptographic schemes, including NIST-standardized lattice-based constructions like CRYSTALS-Kyber, are optimized for general-purpose computing environments where gigabytes of RAM and storage are available [4, 5]. However, these assumptions break down dramatically in the context of microcontrollers, IoT sensors, and other embedded systems that operate with tens of kilobytes of RAM and at most a few hundred kilobytes of flash memory [14, 16]. In such environments, even a few kilobytes of public key material can consume a significant fraction of total available resources, precluding the use of otherwise secure PQC primitives [15].
To address this gap, our system adopts a memory-first design philosophy. Rather than treating memory footprint as a secondary optimization target, we treat it as the central constraint around which all other design decisions are made. This leads to a deliberate inversion of the traditional cost model: we accept higher computational overhead and increased ciphertext size in exchange for drastic reductions in static storage and peak runtime memory usage. The result is a hybrid KEM that achieves public keys as small as 96 bytes and private keys between 160 and 224 bytes, representing a 99.3% reduction in total key size compared to conventional LWE implementations, while maintaining IND-CCA security against quantum adversaries.
This section formalizes the system model, adversarial assumptions, and quantitative design targets that underpin the Merkle-LWE architecture. We begin by articulating precise memory and storage objectives, then discuss the computational and energy implications of our design choices, and finally define the threat model and attack surface relevant to embedded deployments.
3.1. Memory and Storage Objectives
The core innovation of Merkle-LWE lies in its radical rethinking of how cryptographic state is represented and stored. In conventional lattice-based KEMs, the public key consists of an explicit matrix A and a vector t = As + e, both of which are stored in full [7, 13]. For Kyber768 (NIST Level 3), this results in a public key of 1184 bytes and a private key of approximately 2400 bytes [13]. While manageable on servers, these sizes are prohibitive for devices with 32–256 KB of flash memory, especially when multiple keys or concurrent sessions are required [17, 18].
Merkle-LWE eliminates this overhead through two synergistic techniques:
Seed-Based Deterministic Generation: Instead of storing the public matrix A, the public key contains only a 32-byte seed. The matrix is regenerated on the fly using a lightweight pseudorandom generator (ChaCha20) whenever needed [49]. This reduces the public key component from kilobytes to tens of bytes without compromising security, as the seed uniquely determines A [29, 56].
Merkle Tree Commitments for Secret Representation: The private key does not store the full secret vector s. Instead, it stores a seed that generates a sparse polynomial via a Fisher-Yates shuffle, and the public key includes only the Merkle root of the secret’s coefficients. During encapsulation, the sender transmits a Merkle authentication path alongside the LWE sample, allowing the receiver to verify correctness without storing the entire secret or error vector [35].
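The seed-to-sparse-polynomial step can be sketched as a partial Fisher-Yates shuffle driven by a deterministic byte stream. The fragment below is a hedged illustration, not the scheme's reference code: SHAKE-256 stands in for the ChaCha20 generator named above (the Python standard library has no ChaCha20), the ternary coefficient set {-1, 0, +1} is an assumption, the weight of 16 out of 256 follows the example figure given later in Section 3.2, and the names SeedStream, uniform_below, and sparse_secret are hypothetical.

```python
import hashlib

class SeedStream:
    """Deterministic byte stream derived from a seed (SHAKE-256 stand-in)."""

    def __init__(self, seed):
        self._seed = seed
        self._buf = hashlib.shake_256(seed).digest(256)
        self._pos = 0

    def read(self, n):
        # A longer SHAKE output extends the shorter one, so regrowing the
        # buffer never changes bytes that were already consumed.
        while self._pos + n > len(self._buf):
            self._buf = hashlib.shake_256(self._seed).digest(2 * len(self._buf))
        out = self._buf[self._pos:self._pos + n]
        self._pos += n
        return out

def uniform_below(stream, bound):
    """Sample uniformly from [0, bound) using rejection to avoid modulo bias."""
    limit = 65536 - (65536 % bound)
    while True:
        v = int.from_bytes(stream.read(2), "little")
        if v < limit:
            return v % bound

def sparse_secret(seed, n=256, weight=16):
    """Regenerate a sparse ternary polynomial of fixed Hamming weight."""
    stream = SeedStream(seed)
    idx = list(range(n))
    for i in range(weight):  # partial Fisher-Yates: only `weight` swaps needed
        j = i + uniform_below(stream, n - i)
        idx[i], idx[j] = idx[j], idx[i]
    s = [0] * n
    for pos in idx[:weight]:
        s[pos] = 1 if stream.read(1)[0] & 1 else -1
    return s

if __name__ == "__main__":
    s = sparse_secret(bytes([1]) * 32)
    print(sum(1 for c in s if c != 0), [i for i, c in enumerate(s) if c != 0])
```

Only the seed is kept at rest; the coefficient vector exists in RAM only while it is in use. A hardened implementation would also need the shuffle's memory accesses to be independent of the secret positions, which this sketch does not attempt.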
These mechanisms yield the concrete size targets shown in Table 1 across three NIST-aligned security levels.
Critically, the public key size remains constant (96 B = 32 B seed + 64 B SHA3-512 Merkle root) across all security levels, as the Merkle root is independent of the underlying lattice dimension. This is a stark contrast to traditional schemes, where public key size scales linearly with security level [13].
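The contrast between a constant-size and a linearly growing public key can be checked with a few lines of arithmetic. The sketch below uses the standard Kyber encoding (k polynomials of 256 coefficients at 12 bits each, plus a 32-byte seed) and the 32 B + 64 B decomposition stated above; the helper names are hypothetical.

```python
def kyber_pk_bytes(k, n=256, coeff_bits=12, seed_bytes=32):
    """Public key size for module rank k: encoded vector t plus the matrix seed."""
    return k * n * coeff_bits // 8 + seed_bytes

def merkle_lwe_pk_bytes(seed_bytes=32, root_bytes=64):
    """Seed plus SHA3-512 Merkle root; independent of the lattice dimension."""
    return seed_bytes + root_bytes

if __name__ == "__main__":
    for name, k in [("Kyber512", 2), ("Kyber768", 3), ("Kyber1024", 4)]:
        print(name, kyber_pk_bytes(k), merkle_lwe_pk_bytes())
    # Prints 800, 1184 and 1568 bytes for Kyber against a constant 96 bytes.
```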
Beyond static storage, we also constrain runtime memory usage. Our implementation targets a peak RAM consumption of 8–24 KB depending on the security level, which fits comfortably within the memory budgets of common ARM Cortex-M and RISC-V microcontrollers [23]. This is achieved through:
These design choices ensure that the scheme remains deployable on platforms with as little as 32 KB of RAM, a class of devices that constitutes the majority of the embedded ecosystem but has been largely excluded from current PQC standardization efforts [14, 16].
3.2. Computational and Energy Considerations
The memory savings in Merkle-LWE come at a deliberate computational cost. By regenerating matrices and verifying Merkle paths instead of storing and loading data, we shift the resource burden from memory to CPU cycles. This trade-off is rational in embedded contexts for several reasons:
First, modern microcontrollers often have ample computational headroom relative to their memory constraints. An ARM Cortex-M4, for example, can execute hundreds of millions of instructions per second but may be limited to 128 KB of flash and 32 KB of RAM [21, 23]. In such cases, spending extra cycles to avoid memory allocation is a favourable exchange.
Second, memory access is frequently more energy-intensive than computation on battery-powered devices. Studies have shown that reading a word from external flash can consume 10–100× more energy than performing an arithmetic operation in registers [19, 20]. By minimizing memory traffic, particularly repeated reads of large public parameters, Merkle-LWE reduces overall energy consumption despite higher computational load [20].
Our benchmarking confirms this trade-off quantitatively. Compared to a traditional LWE KEM with explicit storage:
Key generation incurs ~41.7% more CPU cycles due to Merkle tree construction and PRNG expansion.
Encapsulation and decapsulation require ~725.6% more cycles due to on-the-fly matrix row generation and Merkle path verification.
However, memory traffic is reduced by 45–48% across all operations, as fewer repeated loads of large data structures are needed.
The energy profile reflects this balance. While peak power draw may increase slightly during active computation, the total energy per operation is lower because the device spends less time waiting for memory and can return to low-power sleep states more quickly. On an IoT sensor that performs key exchange once per hour, this translates to extended battery life—a critical metric for real-world deployment [
15,
20].
To mitigate the computational overhead, we employ several optimizations:
ChaCha20 as the PRG: Chosen for its speed, constant-time implementation, and suitability for embedded platforms [
49].
Structured sparsity: Secret keys have controlled Hamming weight (e.g., 48 non-zero coefficients out of 256 at Level 1), enabling efficient sparse polynomial multiplication [
37,
57].
Platform-specific assembly: Hand-optimized routines for ARM Cortex-M4 and for x86 with AVX-512 reduce cycle counts where feasible [
23,
37].
Cache-aware scheduling: Polynomial operations are ordered to maximize spatial and temporal locality, reducing cache misses [
22,
58].
Importantly, we do not claim speed superiority over existing PQC schemes. Instead, we demonstrate that a different optimization objective, memory minimization, can yield a viable alternative for a specific, underserved class of devices. The computational cost is a deliberate trade-off: it is the price paid for the reduced memory footprint.
3.3. Adversarial Capabilities and Attack Surface
The security model for Merkle-LWE assumes an adversary with the following capabilities, consistent with standard definitions for embedded PQC [
14,
25]:
Notably, we assume the adversary cannot physically extract secrets from secure memory (e.g., via invasive probing), modify firmware, or induce permanent faults (though transient fault resistance is partially addressed via constant-time design). To address this threat model, Merkle-LWE incorporates multiple layers of defence:
Provable Security: The base KEM is provably IND-CPA secure under the Module-LWE assumption. The Fujisaki-Okamoto transform elevates this to IND-CCA security in the random oracle model, assuming the hash functions (SHA3-256/512) behave as random oracles [
18].
Constant-Time Implementation: All core operations, including polynomial arithmetic, ChaCha20 expansion, and Merkle path verification, are implemented in constant time to prevent timing and cache-based side-channel leakage [
28,
49]. Branches and memory accesses are independent of secret values.
Secure Memory Handling: Sensitive data (e.g., private keys, intermediate secrets) are zeroized immediately after use via “secure_memzero”, and dynamic allocations are minimized to reduce heap-based attack surfaces [
38].
Hybrid Hardness Assumptions: By combining lattice-based hardness (Module-LWE) with hash-based commitments (SHA3), the scheme benefits from design diversity. An attacker would need to break both assumptions simultaneously to compromise security—a significantly higher bar than attacking either primitive alone [
17,
46].
Error Containment: The use of structured sparsity and bounded error distributions ensures that decryption failures are negligible, preventing attacks that exploit failure information [
24].
The security assumptions underlying the Merkle-LWE KEM are formalized through an explicit adversarial model that captures both cryptographic and implementation-level threats. This model, summarized in Figure 1, considers a computationally bounded adversary with access to passive observation of public communication, adaptive chosen-ciphertext queries, and side-channel leakage such as timing or power measurements, while excluding invasive physical attacks and permanent fault injection. Within this threat model, security is derived from a combination of provable guarantees under the Module-LWE assumption and practical countermeasures such as constant-time execution and secure memory handling.
The Merkle tree itself introduces no new vulnerabilities. Its role is purely authenticating: it allows the verifier to confirm that a claimed coefficient belongs to the committed secret without revealing the entire secret [
35]. The security of this mechanism relies solely on the collision resistance of SHA3-512, a well-studied and conservative assumption [
31]. Finally, the system is designed to be robust against implementation errors common in embedded contexts:
Deterministic operation: All randomness is derived from system entropy via ChaCha20, eliminating risks from poor RNG seeding [
49].
Explicit error handling: Every function returns detailed error codes, enabling callers to handle failures securely (e.g., by outputting a random shared secret on decapsulation failure).
Memory safety: The API uses opaque pointers and explicit allocation/deallocation, reducing the risk of buffer overflows or use-after-free bugs.
In summary, Merkle-LWE provides a balanced security posture tailored to embedded environments: it offers strong theoretical guarantees against adaptive chosen-ciphertext attacks while incorporating practical countermeasures against the side-channel and implementation threats most relevant to resource-constrained hardware.
3.4. Formal Security Reduction
The security analysis relies on the following assumptions: the Module-LWE problem instantiated with parameters
is
-hard; SHA3-512 is collision-resistant with advantage
; and ChaCha20 is a secure pseudorandom function. For any probabilistic polynomial-time adversary
issuing at most
random oracle queries and
decapsulation queries, the IND-CCA advantage is bounded by
where
denotes the shared secret space and
is the security parameter.
The security argument proceeds via a sequence of hybrid transformations. All ChaCha20-generated outputs are replaced with uniformly random strings, and indistinguishability of this transformation reduces to the pseudorandom function security of ChaCha20 with advantage loss bounded by . Under this replacement, the public matrix , the secret vectors and , and all sampled error terms become computationally indistinguishable from uniformly random elements, ensuring that deterministic generation from short seeds does not introduce exploitable algebraic structure.
Correctness and soundness of ciphertext validation are enforced through a Merkle commitment predicate of the form
Any adversary producing a distinct error value that satisfies the same verification predicate implies a collision in SHA3-512, which occurs with probability at most . Consequently, each valid ciphertext corresponds to a unique opening consistent with the committed Merkle root, establishing computational binding of the error representation.
In the decapsulation procedure, ciphertext validity is determined exclusively through Merkle path verification, replacing the explicit re-encryption equality check used in standard Fujisaki–Okamoto transforms. Because the commitment is computationally binding, each valid ciphertext defines a unique admissible error structure, and the FO consistency condition is preserved under this uniqueness property. This substitution eliminates the need to explicitly reconstruct or store intermediate vectors such as , while preserving IND-CCA security guarantees under the ROM.
Indistinguishability from the Module-LWE distribution follows by replacing ciphertext components with uniformly random samples. Any adversary distinguishing this hybrid from the real construction can be transformed into an algorithm solving Module-LWE with advantage . Simulation of random oracle and decapsulation queries is performed using a programmed oracle consistent with the Merkle structure, and the only abort event occurs if the adversary queries the random oracle on the challenge shared secret, which happens with probability at most .
The additional overhead introduced by the reduction arises from random oracle programming and Merkle authentication, contributing factors bounded by and , respectively. For the selected parameter regime, where and , these terms remain negligible compared to the dominant Module-LWE hardness assumption. The Merkle layer introduces no algebraic interaction with lattice samples, serving solely as a structural constraint on admissible error representations. Constant-time implementation of all verification procedures ensures that no timing or cache-based leakage is introduced. Under standard lattice-reduction and combinatorial bounds, the construction does not expand the adversarial attack surface beyond assumptions of Module-LWE hardness, PRF security, and hash collision resistance.
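Collecting the four losses named in the hybrid argument above gives a bound of the following general shape. This is only a sketch under stated assumptions: the symbol names are placeholders chosen here for illustration, and any multipliers in the number of random-oracle or decapsulation queries are deliberately omitted because they depend on details of the reduction.

```latex
% Placeholder symbols (assumptions, not the authors' notation):
%   \epsilon_{PRF}  : PRF advantage against ChaCha20
%   \epsilon_{CR}   : collision-finding advantage against SHA3-512
%   \epsilon_{MLWE} : distinguishing advantage against Module-LWE
%   Pr[abort]       : probability that the simulation aborts
\mathrm{Adv}^{\mathrm{IND\text{-}CCA}}_{\mathcal{A}}
  \;\le\;
  \epsilon_{\mathrm{PRF}}
  + \epsilon_{\mathrm{CR}}
  + \epsilon_{\mathrm{MLWE}}
  + \Pr[\mathsf{abort}]
```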
3.5. Concrete Security Analysis for Sparse Secret Parameters
The security of Merkle-LWE relies critically on the hardness of the Module-LWE problem with sparse secrets. While the concrete parameter sets for all three NIST security levels are specified in Section 4.8, this section provides explicit security estimates validating that those parameters achieve the intended security margins under both lattice-reduction and combinatorial attack models.
We estimate concrete security using the standard lattice-estimation framework referenced in [
25], accounting for both primal and dual lattice reduction attacks. For sparse-secret LWE, the analysis must consider two distinct attack vectors: (i) lattice reduction attacks that exploit the algebraic structure of the problem, and (ii) combinatorial attacks that exploit the low Hamming weight of the secret. Our parameter selection ensures resistance against both.
For lattice reduction attacks, we compute the core-SVP hardness using the BKZ simulator with state-of-the-art blocksize estimates. For Module-LWE with module rank
, the effective lattice dimension is
, and the security estimate accounts for the algebraic structure via module lattice reduction techniques [
10]. For combinatorial attacks on sparse secrets, we apply the analysis of Bindel et al. [
18], which shows that the best known attack complexity is approximately
Table 2 presents the concrete security estimates for the parameter sets defined in Section 4.8.
The estimates reveal three key insights. For Level 1, with Hamming weight $h = 48$, the combinatorial attack complexity is 143.2 bits, providing a comfortable 15.2-bit margin above the 128-bit target. The lattice reduction security of 142.8 bits is comparable to Kyber-512’s estimated 143 bits [
13]. For Levels 3 and 5, the combinatorial security falls slightly short of the NIST targets (−3.3 and −21.5 bits, respectively). However, this is offset by the lattice reduction security, which exceeds the targets by 15.3 and 15.4 bits. An adversary must break both assumptions simultaneously; the hybrid design with SHA3-512 commitments adds an independent 256-bit classical collision-resistance bound, creating a composite security model where the effective strength is the combination of both components [
17,
46].
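The margins quoted above follow directly from subtracting the NIST target from each reported estimate. The short sketch below reproduces that arithmetic; the bit estimates are the ones reported in this paper, while the 192-bit and 256-bit targets for Levels 3 and 5 and the helper name `margins` are assumptions made for this illustration.

```python
# Security estimates in bits, as reported in the text for Levels 1, 3 and 5.
NIST_TARGET = {1: 128, 3: 192, 5: 256}          # Levels 3 and 5 targets assumed
COMBINATORIAL = {1: 143.2, 3: 188.7, 5: 234.5}  # combinatorial attack estimates
LATTICE = {1: 142.8, 3: 207.3, 5: 271.4}        # lattice-reduction estimates


def margins(estimates, targets):
    """Return (estimate - target) per level, rounded to one decimal place."""
    return {lvl: round(estimates[lvl] - targets[lvl], 1) for lvl in targets}


comb_margin = margins(COMBINATORIAL, NIST_TARGET)
latt_margin = margins(LATTICE, NIST_TARGET)

for lvl in (1, 3, 5):
    print(f"Level {lvl}: combinatorial {comb_margin[lvl]:+.1f} bits, "
          f"lattice {latt_margin[lvl]:+.1f} bits")
```

Keeping the two attack models in separate tables makes it easy to see at which levels each estimate clears or misses its target.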
Table 3 compares Merkle-LWE’s estimates against Kyber at equivalent levels.
The comparison confirms that Merkle-LWE achieves comparable lattice-reduction security to Kyber at all levels. The combinatorial attack vector is unique to sparse secrets, but our parameters keep this attack complexity within acceptable bounds, and the hybrid SHA3-512 layer provides additional security depth.
The shortfalls at Levels 3 and 5 are acceptable because: (i) the hybrid security model requires breaking both lattice and hash components [
17,
46]; (ii) NIST targets are conservative, and the Level 1 parameter exceeds the target comfortably; and (iii) practical combinatorial attacks face additional constraints such as memory requirements and parallelization limits [
18]. The concrete security analysis demonstrates that Merkle-LWE’s sparse secret parameters achieve security levels comparable to standardized schemes while enabling substantial key size reductions. The hybrid design with SHA3-512 commitments provides additional security depth, ensuring robustness even when individual components face marginal shortfalls.
4. Proposed Hybrid KEM Construction
4.1. Overview of the Hybrid Architecture
The Merkle-LWE KEM represents a paradigm shift in post-quantum cryptographic design, specifically engineered to address the acute memory constraints of deeply embedded systems such as ARM Cortex-M microcontrollers, RISC-V-based IoT sensors, and other resource-limited platforms. Traditional lattice-based KEMs, including the NIST-standardized CRYSTALS-Kyber, prioritize computational throughput and communication bandwidth for general-purpose computing environments [
4,
5]. While highly effective on servers and desktops, their public keys—often exceeding 1000 bytes—and substantial RAM requirements render them impractical for devices operating with only tens of kilobytes of flash and RAM [
13]. Our construction directly confronts this gap by inverting the conventional cost model: we deliberately accept higher computational overhead in exchange for drastic reductions in static storage and peak runtime memory usage. This “memory-first” philosophy is not merely an optimization but a foundational design principle that permeates every layer of the architecture.
At its core, Merkle-LWE achieves unprecedented compactness by replacing explicit storage of large, deterministic structures with compact cryptographic commitments. The public key is reduced from over a kilobyte to a mere 96 bytes, while the private key shrinks to between 160 and 224 bytes, depending on the security level. This 99.3% reduction in total key material is accomplished not through parameter weakening, which would erode security margins, but through a structural reimagining of how key material is represented and verified. The architecture is built upon three synergistic pillars that work in concert to minimize memory footprint while preserving IND-CCA security against quantum adversaries.
The first pillar is a seed-based Module-LWE core. In a naïve instantiation, the public matrix $\mathbf{A}$ is stored in full, consuming kilobytes of precious flash memory. In Merkle-LWE, $\mathbf{A}$ is never stored. Instead, the public key contains only a 32-byte seed from which $\mathbf{A}$ can be deterministically regenerated on the fly using a cryptographically secure pseudorandom generator (PRG), specifically ChaCha20 [
49]. This simple yet powerful technique shifts the burden from static storage to controlled computation, a rational trade-off on devices where CPU cycles are abundant relative to memory [
29].
The second pillar is a structured sparsity model for the secret key. Rather than representing the secret vector s as a dense polynomial with n coefficients, our scheme uses a sparse representation with a carefully chosen Hamming weight (e.g., 48 non-zero coefficients out of 256 at Level 1). These non-zero coefficients are bounded in magnitude and generated deterministically from a seed via a Fisher-Yates shuffle. This approach drastically reduces the private key size and accelerates polynomial multiplication, as operations need only be performed on the non-zero elements. Critically, the sparsity parameters are selected to maintain the hardness guarantees of the underlying Module-LWE problem, drawing on the analysis of Bindel et al. to ensure that the security reduction to worst-case lattice problems remains valid [
18].
The third and most innovative pillar is a hash-based Merkle commitment layer. This component addresses the storage of auxiliary data, such as error vectors and ephemeral secrets, which in traditional schemes are either transmitted in full or require large precomputed tables. In Merkle-LWE, these values are not stored explicitly. Instead, their integrity is ensured via Merkle tree commitments. During key generation, a set of candidate error patterns is committed to in a Merkle tree, and the root of this tree becomes part of the public key. During encapsulation, the sender includes a succinct Merkle authentication path in the ciphertext, allowing the receiver to verify the correctness of the claimed error pattern without requiring its explicit transmission [
35]. This mechanism transforms the public key from a collection of large vectors into a compact descriptor—a Merkle root—that authenticates a much larger, implicitly defined set of values.
These three pillars are unified under the Fujisaki-Okamoto transform to achieve IND-CCA security in the random oracle model [
5,
18]. All operations are implemented in constant time to resist timing and cache-based side-channel attacks, a critical requirement for embedded deployments [
27,
28]. The resulting KEM is not intended to replace Kyber on general-purpose hardware; rather, it is a specialized solution for a specific, underserved class of devices. Its significance lies in its ability to bring post-quantum security within reach of platforms that would otherwise remain vulnerable in a quantum future [
14]. By making memory efficiency a primary design objective, Merkle-LWE opens a new pathway for PQC adoption in the vast and growing ecosystem of embedded and IoT devices.
The Merkle-LWE KEM adopts a memory-first architectural approach that departs from conventional lattice-based constructions by prioritizing reductions in static and dynamic memory usage over raw computational throughput. As illustrated in Figure 2, the construction is structured around three complementary components: seed-based deterministic generation of the public matrix, a sparse representation of the secret key, and a Merkle-tree-based commitment mechanism for auxiliary data. These components are combined under the Fujisaki–Okamoto transform to achieve IND-CCA security in the random oracle model. The figure frames the interaction between these elements and provides the architectural context for the design decisions discussed in this section, clarifying how the scheme achieves compact key material without weakening its underlying security assumptions.
4.2. Seed-Based Module-LWE Public Key Generation
In classical Module-LWE constructions, the public key is conceptually defined as a composite object consisting of a public matrix $\mathbf{A}$ and a vector $\mathbf{t}$. A naïve instantiation would require explicit storage of $\mathbf{A}$
, leading to prohibitive memory costs (e.g.,
coefficients), which motivates the use of pseudorandom generation. In modern lattice-based KEMs such as CRYSTALS-Kyber [
13], the matrix $\mathbf{A}$ is deterministically generated from a short seed $\rho$, and the public key consists of $(\rho, \mathbf{t})$, where $\mathbf{t} = \mathbf{A}\mathbf{s} + \mathbf{e}$. Consequently, storing the seed alone is sufficient to reconstruct $\mathbf{A}$ when needed, eliminating the need for explicit matrix storage.
Our contribution, however, is not the use of seeded matrix generation per se, but rather the elimination of explicit transmission and storage of the vector $\mathbf{t}$ through integration with a Merkle-tree commitment layer. Specifically, whereas Kyber requires the receiver to store or reconstruct $\mathbf{t}$ (contributing approximately 800 bytes to the public key at Level 1), Merkle-LWE commits to the structure of the secret $\mathbf{s}$ via a Merkle root, enabling verification of LWE samples without explicitly storing or transmitting $\mathbf{t}$. This structural distinction, replacing explicit vector storage with cryptographic commitments that authenticate sparse error patterns, constitutes the key novelty of our memory-first design. The public key in Merkle-LWE consists solely of a 32-byte seed for $\mathbf{A}$ and a 64-byte SHA3-512 Merkle root, totaling 96 bytes independent of lattice dimension, whereas Kyber’s public key scales with security level due to explicit representation of $\mathbf{t}$ [
13].
The public key generation process begins with the secure sampling of a 32-byte public seed using the system’s entropy source (e.g., “getrandom” on Linux or “BCryptGenRandom” on Windows). This seed is the sole component of the public key that relates to the lattice structure. It is stored directly in the public key buffer. The actual matrix $\mathbf{A}$ is never materialized in persistent storage. Instead, during any operation that requires $\mathbf{A}$, such as encapsulation or decapsulation, the matrix is regenerated on the fly, row by row, using the ChaCha20 stream cipher as the PRG [49]. For a given row index $i$, a nonce derived from $i$ is combined with the public seed to initialize ChaCha20, which then expands to produce the $i$-th row of $\mathbf{A}$. This approach ensures that at no point does the entire matrix reside in RAM, reducing peak memory usage from hundreds of kilobytes to a few hundred bytes for temporary buffers [
29].
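The row-by-row regeneration described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: SHAKE-256 from the Python standard library stands in for the ChaCha20 keystream, the modulus q = 3329 is borrowed from Kyber, and the names `expand_row` and `matvec_streaming`, the 4-byte row nonce, and the treatment of the matrix as a plain matrix over Z_q are all assumptions made for this example.

```python
import hashlib

Q = 3329   # hypothetical modulus (borrowed from Kyber for illustration)
N = 256    # number of coefficients per row in this sketch


def expand_row(public_seed, row_index, n=N, q=Q):
    """Regenerate one row of A from (seed, row index) by rejection sampling."""
    nonce = row_index.to_bytes(4, "little")
    # Stand-in for the ChaCha20 keystream: an XOF keyed by seed || nonce.
    xof = hashlib.shake_256(public_seed + nonce)
    out_len = 4 * n
    while True:
        stream = xof.digest(out_len)
        row = []
        for off in range(0, out_len - 1, 2):
            # Take 12 bits per candidate and keep it only if it is below q,
            # so that accepted values are uniform in [0, q).
            cand = int.from_bytes(stream[off:off + 2], "little") & 0x0FFF
            if cand < q:
                row.append(cand)
                if len(row) == n:
                    return row
        out_len *= 2  # very unlikely: ask the XOF for a longer stream


def matvec_streaming(public_seed, vec, rows, q=Q):
    """Compute A*vec mod q one row at a time; A is never held in full."""
    result = []
    for i in range(rows):
        row = expand_row(public_seed, i, len(vec), q)  # only this row is in memory
        result.append(sum(a * v for a, v in zip(row, vec)) % q)
    return result
```

Only one row buffer is alive at any time, so the working set is proportional to a single row rather than to the whole matrix, at the cost of re-running the expansion whenever a row is needed again.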
This design preserves the semantic security of the underlying Module-LWE problem. Because the public seed is sampled uniformly at random and ChaCha20 is assumed to be a secure PRG, the regenerated matrix $\mathbf{A}$ is computationally indistinguishable from a truly random matrix, maintaining the IND-CPA security of the base scheme [
29,
56]. The use of ChaCha20 is deliberate: it is a well-vetted, constant-time stream cipher that is particularly efficient on embedded platforms, offering a good balance between speed and security [
49]. Furthermore, by regenerating $\mathbf{A}$ on demand, the scheme avoids the long-term storage of sensitive intermediate values, thereby reducing the attack surface for memory-scraping attacks that could occur if the device were physically compromised.
However, the public key is not just the seed. To enable verification and to complete the hybrid construction, the public key also includes a 64-byte Merkle root. This root is computed by first generating the sparse secret polynomial s from a separate secret seed (a process detailed in Section 5.3). The coefficients of s are then hashed to form the leaves of a Merkle tree, and the root of this tree is appended to the public seed to form the complete public key: $pk = \mathsf{seed}_{\mathbf{A}} \,\|\, \mathsf{root}$. This 96-byte structure is remarkably compact. The Merkle root serves a dual purpose: it commits to the secret key’s structure, enabling the receiver to verify its authenticity during decapsulation, and it forms the foundation of the hybrid security model by integrating hash-based assumptions with lattice-based ones [
17,
46].
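A minimal sketch of how the 96-byte public key could be assembled from the public seed and a Merkle root over the secret coefficients is given below. The leaf encoding (2-byte index followed by a signed 2-byte coefficient), the duplication of the last node on odd levels, and the function names are assumptions made for this illustration; the text only fixes SHA3-512 as the hash and the 32 + 64 byte layout.

```python
import hashlib


def leaf_hash(index, coeff):
    """Hash one coefficient into a leaf; the byte encoding is an assumption."""
    data = index.to_bytes(2, "little") + coeff.to_bytes(2, "little", signed=True)
    return hashlib.sha3_512(data).digest()


def merkle_root(leaves):
    """Fold a list of leaf hashes into a single 64-byte SHA3-512 root."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:              # odd level: duplicate the last node
            level.append(level[-1])
        level = [hashlib.sha3_512(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


def build_public_key(public_seed, secret_coeffs):
    """Return seed || root: 32 bytes of seed plus 64 bytes of Merkle root."""
    if len(public_seed) != 32:
        raise ValueError("public seed must be 32 bytes")
    root = merkle_root([leaf_hash(i, c) for i, c in enumerate(secret_coeffs)])
    return public_seed + root
```

The output length depends only on the seed and digest sizes, which is why the public key stays at 96 bytes regardless of how many coefficients are committed.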
This integration is crucial for the overall security posture. An adversary attempting to forge a public key cannot simply provide a random seed and a random root; they must ensure that the root is a valid Merkle commitment to a secret that is consistent with the lattice instance defined by the seed. This linkage between the lattice and hash components creates a more robust security model, as an attacker would need to break both the Module-LWE assumption and the collision resistance of SHA3-512 to mount a successful attack [
18,
33]. The public key, therefore, is not a passive container of data but an active cryptographic statement that binds together two independent hardness assumptions. This design not only achieves extreme memory efficiency but also enhances security through diversity, making Merkle-LWE a compelling solution for environments where both resource constraints and long-term security are paramount concerns.
4.3. Structured and Sparse Secret Key Design
The secret key in a lattice-based cryptosystem is traditionally represented as a dense vector or polynomial, in which every coefficient is an integer sampled independently from a specific distribution, often a discrete Gaussian [
7,
37]. This representation is straightforward and aligns with the theoretical security proofs that underpin the LWE problem. However, it is highly inefficient from a memory perspective, especially on embedded platforms. For a lattice dimension of $n = 256$, a dense secret key would require at least 256 bytes of storage just for the coefficients, not including any metadata or auxiliary data. In resource-constrained environments, this overhead is significant and can be a primary barrier to deployment.
Merkle-LWE addresses this challenge through a deliberate and structured approach to sparsity. Instead of a dense vector, the secret key s is represented as a sparse polynomial with a precisely controlled Hamming weight. Specifically, for our three NIST-aligned security levels, the number of non-zero coefficients is set to 48, 64, and 80 for Levels 1, 3, and 5, respectively. Each of these non-zero coefficients is further bounded in magnitude; for instance, at Level 1, they are sampled from the set
. This dual constraint—on both the number of non-zero elements and their magnitude—dramatically reduces the entropy and, consequently, the storage requirements of the secret key [
57].
The generation of this sparse secret is a deterministic process driven by a 32-byte secret seed. The algorithm proceeds in two stages. First, a Fisher-Yates shuffle is used to select a set of distinct indices from the range $[0, n-1]$. This shuffle is itself seeded by the secret seed, ensuring that the selection of indices is both unpredictable to an outside observer and perfectly reproducible by anyone who possesses the seed. Second, for each selected index, a coefficient value is sampled from the bounded set using ChaCha20, again seeded by the secret seed [
49]. This process yields a complete, sparse polynomial s that is functionally identical to a randomly sampled secret from a security standpoint but is orders of magnitude more compact in its internal representation.
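The two-stage generation can be sketched as below. Several details are assumptions made for illustration only: SHAKE-256 stands in for ChaCha20 because the Python standard library does not ship ChaCha20, the coefficient set is taken to be {-1, +1}, and the `SeedStream` helper with its `idx` and `val` labels is hypothetical. The sketch is also not constant time, since rejection sampling takes a data-dependent number of steps.

```python
import hashlib


class SeedStream:
    """Deterministic byte stream derived from a seed (SHAKE-256 stand-in)."""

    def __init__(self, seed, label):
        self._xof = hashlib.shake_256(label + seed)
        self._pos = 0

    def read(self, k):
        out = self._xof.digest(self._pos + k)[self._pos:]
        self._pos += k
        return out

    def below(self, bound):
        """Uniform integer in [0, bound) by rejection sampling (bound <= 65536)."""
        limit = 65536 - (65536 % bound)
        while True:
            v = int.from_bytes(self.read(2), "little")
            if v < limit:
                return v % bound


def sparse_secret(secret_seed, n=256, h=48, values=(-1, 1)):
    """Return the secret as a sorted list of (index, coefficient) pairs."""
    idx_stream = SeedStream(secret_seed, b"idx")
    val_stream = SeedStream(secret_seed, b"val")
    perm = list(range(n))
    for i in range(h):                 # partial Fisher-Yates: only h swaps
        j = i + idx_stream.below(n - i)
        perm[i], perm[j] = perm[j], perm[i]
    indices = sorted(perm[:h])         # ascending order for sequential access
    return [(i, values[val_stream.below(len(values))]) for i in indices]
```

Because everything is derived from the seed, calling `sparse_secret` twice with the same 32-byte seed yields the same index and value list, which is what allows the private key to store the seed instead of the polynomial.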
The private key, therefore, does not store the full list of 256 coefficients. Instead, it stores only the 32-byte secret seed, along with a hash of the public key (for CCA security) and an error seed (used in the decapsulation process). This results in a private key size of just 160–224 bytes across all security levels, a reduction of over 80% compared to a naïve dense representation. This compactness is not just a static benefit; it also translates into dynamic performance gains during cryptographic operations. Polynomial multiplication, a core operation in LWE-based schemes, becomes significantly faster when one of the operands is sparse. The computational complexity drops from $O(n^2)$ for dense schoolbook multiplication to $O(n \cdot h)$ for sparse multiplication, where $h$ is the Hamming weight. In our case, with $n = 256$ and $h = 48$, this represents a speedup of over 5× [
37].
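The saving from sparsity can be illustrated with a small sketch that multiplies a dense polynomial by a sparse one. The ring Z_q[x]/(x^n + 1), the modulus q = 3329, and both function names are assumptions made for this example; the dense schoolbook routine is included only as a reference to check the sparse one against.

```python
Q = 3329  # hypothetical modulus, used only for illustration


def sparse_mul(a, sparse_s, q=Q):
    """Multiply dense a by a sparse s given as (index, value) pairs.

    Works in Z_q[x]/(x^n + 1); cost is n * h coefficient updates.
    """
    n = len(a)
    out = [0] * n
    for j, v in sparse_s:              # loop over the h non-zero terms only
        for i, ai in enumerate(a):
            k = i + j
            if k < n:
                out[k] += v * ai
            else:
                out[k - n] -= v * ai   # wrap-around picks up a sign: x^n = -1
    return [c % q for c in out]


def dense_mul(a, b, q=Q):
    """Schoolbook reference in the same ring; cost is n * n updates."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                out[k] += a[i] * b[j]
            else:
                out[k - n] -= a[i] * b[j]
    return [c % q for c in out]
```

With n = 256 and h = 48 the inner update runs 12,288 times in `sparse_mul` against 65,536 times in `dense_mul`, a ratio of about 5.3.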
Critically, this move to sparsity does not come at the cost of security. The hardness of the LWE problem with sparse secrets has been rigorously analyzed by Bindel et al. [
18], who established concrete lower bounds on the required Hamming weight to maintain security against both lattice reduction and combinatorial attacks. Our chosen parameters, $h = 48$, $64$, and $80$ for Levels 1, 3, and 5, respectively, are explicitly selected to exceed these conservative thresholds, as validated by the detailed security analysis in Section 3.5. The combinatorial attack complexity
yields 143.2, 188.7, and 234.5 bits of security across the three levels, while lattice reduction attacks require 142.8, 207.3, and 271.4 bits of effort. These estimates confirm that the security reduction from worst-case lattice problems (like Module-SIVP) to the average-case Module-LWE problem remains valid, even with structured sparsity. In essence, we leverage a well-understood property of the LWE problem: its hardness is robust to certain forms of structure in the secret, provided that structure is not so extreme as to make the problem trivial.
Furthermore, the structured nature of our sparsity model allows for additional implementation-level optimizations that further enhance its suitability for embedded platforms. During the key generation phase, the list of non-zero indices is sorted in ascending order. This simple step has a profound impact on the memory access patterns during polynomial multiplication. On microcontrollers with limited cache or no cache at all, sequential or predictable memory accesses are far more efficient than random ones [
22,
58]. By processing the non-zero coefficients in a sorted order, our implementation maximizes spatial locality, reducing the number of cache misses and improving overall energy efficiency. This attention to low-level detail demonstrates how our high-level design goal of memory efficiency cascades down into every layer of the implementation.
In summary, the structured and sparse secret key design is a cornerstone of the Merkle-LWE architecture. It is a deliberate engineering choice that exploits a known property of the underlying hardness assumption to achieve a dual win: a drastic reduction in memory footprint and a significant acceleration of core arithmetic operations. This design transforms the secret key from a passive data structure into an active, optimized component of the system, enabling post-quantum security on devices where every byte and every CPU cycle counts.
4.4. Hash-Based Merkle Commitment Layer
While the seed-based generation of the public matrix and the sparse representation of the secret key address the storage of the primary cryptographic objects, a significant source of memory overhead in traditional KEMs remains: the handling of auxiliary data, particularly the error vectors used in the LWE samples. In a standard encapsulation, the sender must sample an error vector $\mathbf{e}$, use it to compute the shared secret, and then either transmit it explicitly or rely on the receiver to reconstruct it from a common random string [
6,
13]. Both approaches have drawbacks. Transmitting e in full adds to the ciphertext size, while relying on a common random string requires the receiver to store or recompute a large set of potential errors, which is memory-intensive.
The Merkle commitment layer is the innovative solution that resolves this dilemma. It leverages the properties of Merkle trees—a fundamental construct in cryptography—to provide a succinct and verifiable way to handle this auxiliary data without explicit storage or transmission [
35]. The core idea is to commit to a large set of precomputed, valid error patterns during the key generation phase. This commitment is a single, fixed-size hash value: the root of a Merkle tree whose leaves are the hashes of the individual error patterns.
During key generation, a set of 128 or 256 candidate error vectors is generated. Each vector is a sparse polynomial, similar in structure to the secret key, with its own bounded coefficients. The SHA3-512 hash of each error vector is computed to form a leaf in the Merkle tree. The tree is then constructed in the standard way, with each parent node being the hash of its two children, until a single root hash is produced. This root is not stored in the public key but is kept as an internal state within the private key. Its purpose is to serve as a binding commitment to the entire set of error patterns.
During the encapsulation process, the sender selects one of these precomputed error patterns to use in the LWE sample. Instead of sending the entire error vector, the sender includes two pieces of information in the ciphertext: (1) the index of the selected error pattern within the set, and (2) the Merkle authentication path for that leaf. The authentication path is a sequence of sibling hashes that, when combined with the leaf hash, allows a verifier to reconstruct the Merkle root. The size of this path is logarithmic in the number of leaves; for a tree with 256 leaves, the path consists of 8 hashes, or 512 bytes [
31,
33].
On the receiver’s side, during decapsulation, the process is reversed. The receiver, who possesses the private key (and thus the seed used to generate the error set), can regenerate the entire set of error patterns. Using the index from the ciphertext, the receiver selects the claimed error pattern and computes its hash. It then uses the provided authentication path to verify that this hash indeed leads to the committed Merkle root. If the verification succeeds, the receiver is assured that the error pattern is authentic and was part of the original committed set. If the verification fails, it indicates tampering, and the decapsulation algorithm outputs a random shared secret, as required by the IND-CCA security definition [
5,
18].
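The receiver-side check can be illustrated as follows. This is a sketch under stated assumptions: the rejection value is derived from a hypothetical secret `reject_seed` and the ciphertext in the style of implicit rejection, the accepted secret is taken as SHA3-256 of the pattern bytes, and the function names are not from the paper. Python offers no constant-time guarantee, so the final selection only shows the logic.

```python
import hashlib
import hmac

def h(data: bytes) -> bytes:
    return hashlib.sha3_512(data).digest()

def root_from_path(leaf: bytes, index: int, path: list[bytes]) -> bytes:
    """Recompute the root from a leaf hash, its public index, and sibling hashes."""
    node = leaf
    for sibling in path:
        # The index is public (it travels in the ciphertext), so branching on it is fine.
        node = h(sibling + node) if index & 1 else h(node + sibling)
        index >>= 1
    return node

def shared_secret(pattern: bytes, index: int, path: list[bytes], root: bytes,
                  reject_seed: bytes, ciphertext: bytes) -> bytes:
    """Return the real secret if the path verifies, else a pseudorandom one."""
    ok = hmac.compare_digest(root_from_path(h(pattern), index, path), root)
    accepted = hashlib.sha3_256(pattern).digest()
    rejected = hashlib.sha3_256(reject_seed + ciphertext).digest()  # assumed rejection rule
    return accepted if ok else rejected  # not constant-time in Python

# Toy 4-leaf tree built by hand to exercise the check.
patterns = [bytes([i]) * 8 for i in range(4)]
leaves = [h(p) for p in patterns]
n01, n23 = h(leaves[0] + leaves[1]), h(leaves[2] + leaves[3])
root = h(n01 + n23)
path_for_2 = [leaves[3], n01]

good = shared_secret(patterns[2], 2, path_for_2, root, b"z" * 32, b"ct")
bad = shared_secret(patterns[1], 2, path_for_2, root, b"z" * 32, b"ct")
```

Here `good` is derived from the authentic pattern, while `bad` (a pattern that does not match the path) falls through to the rejection value.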
This mechanism provides several critical benefits. First, it completely eliminates the need to store the full set of error vectors in the private key. The private key only needs to store the seed for the error set, not the set itself. Second, it prevents an adversary from submitting a malformed or malicious error vector in an attempt to learn information about the secret key, a class of attack known as a decryption failure attack [
24]. The Merkle verification acts as a gatekeeper, ensuring that only pre-approved, safe error patterns are processed. Third, it enhances the overall security model by integrating a second, independent hardness assumption—the collision resistance of SHA3-512—into the core protocol. An attacker would need to not only solve the Module-LWE problem but also find a collision in SHA3-512 to forge a valid authentication path for a malicious error [
17,
46].
The choice of SHA3-512 for the hash function is deliberate. Unlike older hash functions based on the Merkle-Damgård construction (like SHA-256), SHA3 is built on a sponge construction, which offers different and arguably stronger security properties, particularly in the context of quantum adversaries [
33]. Its 512-bit output provides a comfortable security margin against both classical and quantum collision-finding attacks. The Merkle commitment layer, therefore, is not a mere add-on but an integral and security-critical component of the Merkle-LWE architecture, enabling its unique combination of extreme compactness and robust security.
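As a rough guide to the margin mentioned above, the generic collision bounds for an ideal hash can be computed directly. The sketch assumes the classical birthday bound (half the output length) and the Brassard–Høyer–Tapp quantum query bound (one third of the output length); both are generic estimates that ignore memory cost and are not figures taken from the paper.

```python
def generic_collision_bits(output_bits: int) -> tuple[float, float]:
    """Generic collision-search cost exponents for an ideal hash function."""
    classical = output_bits / 2  # birthday bound
    quantum = output_bits / 3    # BHT query bound (assumption: ideal quantum memory)
    return classical, quantum

classical, quantum = generic_collision_bits(512)
print(classical, round(quantum, 1))  # 256.0 170.7
```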
4.5. Key Generation Algorithm
The key generation algorithm in Merkle-LWE is designed to produce extremely compact public and private keys while maintaining IND-CCA security under the Module-LWE assumption. The core innovation lies in its representation: rather than storing large explicit matrices or dense secret vectors, the algorithm outputs only cryptographic seeds and commitments that enable on-the-fly reconstruction and verification. This approach directly addresses the memory constraints of embedded platforms by shifting the resource burden from static storage to controlled computation [
29].
The algorithm begins by securely sampling two 32-byte seeds using the system’s entropy source (“system_random_bytes”). The first seed, denoted as “seed_A”, serves as the public seed for deterministic regeneration of the Module-LWE public matrix A. The second seed, “seed_s”, is used to generate a sparse secret polynomial s with a precisely controlled Hamming weight (e.g., 48 non-zero coefficients for Level 1). This sparsity is not an ad hoc optimization but a deliberate design choice that reduces both storage and arithmetic complexity while preserving the hardness guarantees of the underlying lattice problem, as validated by the analysis of Bindel et al. [
18].
The sparse secret polynomial s is generated through a two-stage process. First, a Fisher-Yates shuffle is applied to the index set of coefficient positions using “seed_s” to select distinct positions for the non-zero coefficients. This ensures an unbiased and unpredictable distribution of support. Second, coefficient values are sampled from a small bounded range using ChaCha20, again seeded by “seed_s” [
49]. The resulting structure, a list of indices and corresponding small integer values, is never stored in full. Instead, the private key retains only “seed_s”, allowing the secret to be reconstructed deterministically during decapsulation.
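The two-stage expansion can be sketched as below. Several choices here are assumptions made only for illustration: SHAKE-256 from the Python standard library stands in for ChaCha20, the ring dimension is taken as 256, the non-zero coefficients are drawn from {-1, +1}, and the names `Stream` and `sparse_secret` are hypothetical.

```python
import hashlib

N = 256  # ring dimension (assumed)
H = 48   # Hamming weight quoted for Level 1

class Stream:
    """Deterministic byte stream from a seed (SHAKE-256 stand-in for ChaCha20)."""
    def __init__(self, seed: bytes, size: int = 4096):
        self.buf = hashlib.shake_256(seed).digest(size)
        self.pos = 0

    def byte(self) -> int:
        b = self.buf[self.pos]
        self.pos += 1
        return b

    def below(self, bound: int) -> int:
        """Uniform integer in [0, bound) for bound <= 256, via rejection sampling."""
        limit = 256 - (256 % bound)
        while True:
            b = self.byte()
            if b < limit:
                return b % bound

def sparse_secret(seed: bytes, n: int = N, weight: int = H) -> list[tuple[int, int]]:
    s = Stream(seed)
    idx = list(range(n))
    # Partial Fisher-Yates: only the first `weight` positions need shuffling.
    for i in range(weight):
        j = i + s.below(n - i)
        idx[i], idx[j] = idx[j], idx[i]
    support = sorted(idx[:weight])
    # Assumed coefficient range {-1, +1}; one stream byte per coefficient.
    coeffs = [1 if s.byte() & 1 else -1 for _ in range(weight)]
    return list(zip(support, coeffs))

secret = sparse_secret(b"\x00" * 32)
print(len(secret))  # 48
```

Because the output depends only on the seed, calling `sparse_secret` again with the same seed reproduces the same index/value list, which is the property the private key relies on.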
The public key is formed by combining “seed_A” with a Merkle root that commits to the secret s. To construct this commitment, each coefficient of s is hashed using SHA3-512 to form the leaves of a Merkle tree. The tree is then built bottom-up, with each internal node being the hash of its two children. The root of this tree, a 64-byte value, becomes the second component of the public key. Thus, the complete public key is the concatenation of “seed_A” and the Merkle root, totaling just 96 bytes regardless of the security level. This compactness is achieved without weakening parameters; it is a structural property of the hybrid design.
The private key comprises three components: “seed_s” (32 bytes), a hash of the public key (32 bytes, for CCA security via the Fujisaki–Okamoto transform), and metadata about the security level. Its total size ranges from 160 to 224 bytes across security levels, a reduction of over 90% compared to traditional LWE schemes [
13]. Critically, all sensitive data is handled in constant-time, and temporary buffers (e.g., for the Merkle tree) are zeroized after use via “secure_memzero” [
28,
38]. The algorithm is also designed to respect strict memory bounds: peak RAM usage is capped at 8–24 KB depending on the security level, making it suitable for microcontrollers with limited stack space [
23].
The end-to-end workflow of the key generation process is illustrated in Figure 3. As depicted, the algorithm initiates by sampling independent seeds from the system entropy source, which then diverge into three parallel processing branches. The first branch retains the public matrix seed (“seed_A”) directly within the public key structure to enable deterministic on-the-fly regeneration. The second branch drives the Fisher-Yates shuffle and bounded coefficient sampling to construct the sparse secret polynomial s without storing it in memory. The third branch commits these coefficients via SHA3-512 hashing to build the Merkle tree, extracting the root hash that binds the secret structure to the public key. These branches converge to assemble the final key pair, demonstrating how the scheme eliminates explicit matrix storage and achieves a compact 96-byte public key while ensuring all sensitive intermediate buffers are securely zeroized.
This design achieves a delicate balance: it provides the verifier with a succinct, verifiable statement about the secret (the Merkle root) while enabling the owner to reconstruct the full secret from a minimal seed. The security of this construction relies on two independent assumptions—the hardness of Module-LWE and the collision resistance of SHA3-512—creating a robust foundation that is resistant to unforeseen breakthroughs in either domain [
17,
46].
4.6. Encapsulation Algorithm
The encapsulation algorithm in Merkle-LWE transforms the compact public key into a shared secret and an associated ciphertext, adhering to the IND-CCA security model through the Fujisaki–Okamoto transform [
5,
18]. The sender begins by parsing the public key and uses “seed_A” to regenerate rows of the public matrix A on the fly via ChaCha20. This avoids the need to store the full matrix, which would consume hundreds of kilobytes, and instead trades memory for computation, a rational exchange in embedded contexts [
29].
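Seed-based regeneration of one matrix row can be illustrated as follows. This is a sketch under assumptions that are not taken from the paper: SHAKE-128 replaces ChaCha20 as the expander, the modulus is set to 3329 and the module rank to 2 (Kyber-like values), and the function names are hypothetical.

```python
import hashlib

N = 256   # coefficients per polynomial (assumed)
Q = 3329  # modulus (assumed, Kyber-like; must be at most 4096 for the 12-bit mask)
K = 2     # module rank (assumed)

def matrix_poly(seed_a: bytes, row: int, col: int, n: int = N, q: int = Q) -> list[int]:
    """Expand entry (row, col) of A into n coefficients in [0, q)."""
    xof = hashlib.shake_128(seed_a + bytes([row, col])).digest(4 * n * 2)
    coeffs = []
    for k in range(0, len(xof), 2):
        # Take a 12-bit candidate and reject values >= q to stay uniform.
        cand = int.from_bytes(xof[k:k + 2], "little") & 0x0FFF
        if cand < q:
            coeffs.append(cand)
            if len(coeffs) == n:
                return coeffs
    raise RuntimeError("XOF output exhausted; request more bytes")

def matrix_row(seed_a: bytes, row: int, k: int = K) -> list[list[int]]:
    """Regenerate a single row of A; nothing else needs to be held in memory."""
    return [matrix_poly(seed_a, row, col) for col in range(k)]

seed_a = b"\x01" * 32
row0 = matrix_row(seed_a, 0)
```

Only one row (here two polynomials of 256 coefficients) exists at any time, so peak memory is independent of the number of rows in A.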
Next, the sender generates an ephemeral sparse secret using a fresh ChaCha20 stream. Like the long-term secret, it has a controlled Hamming weight and bounded coefficients, ensuring that polynomial multiplication remains efficient [
57]. The LWE sample is then computed using the regenerated public structure, where
and
, with
implicitly defined as
and
sampled from a predefined error set. Critically,
is never explicitly materialized; instead, all required operations are derived from the public seed and the Merkle-committed secret structure, preserving both compactness and verifiability.
The security of the component
relies on the hardness of the Module-LWE problem instantiated with sparse secrets. The ephemeral vector
is sampled with controlled Hamming weight
(48, 64, or 80 non-zero coefficients for Levels 1, 3, and 5, respectively) and coefficients bounded by
, ensuring that recovering
from
reduces to solving a sparse Module-LWE instance. This variant remains computationally infeasible under standard lattice assumptions: for Level 1 parameters (
,
,
,
), the best known primal and dual lattice reduction attacks require approximately
operations, exceeding the 128-bit quantum security target [
18]. Furthermore, the inclusion of the error term
in the second component
provides additional noise flooding, preventing algebraic recovery of
even under partial leakage of
. All parameters are selected to satisfy the concrete security bounds and validated using the lattice estimator [
25].
The error vector e is not transmitted explicitly. Instead, the sender selects an error pattern from a set of 128 or 256 precomputed candidates and includes a Merkle authentication path in the ciphertext. This path proves that the chosen error is a valid member of the committed set without revealing the entire set. The shared secret is derived from e using a key derivation function (KDF), specifically SHA3-256, to ensure uniformity and independence.
The ciphertext is assembled as
, where
denotes bit-packing to minimize bandwidth. The inclusion of the leaf index allows the receiver to locate the correct error pattern during decapsulation. The total ciphertext size ranges from 128 to 192 bytes, slightly larger than some traditional schemes due to the Merkle path overhead, but this is a deliberate trade-off for the massive reduction in key sizes [
35].
All operations are implemented in constant-time to prevent side-channel leakage. Matrix row generation, sparse multiplication, and hash computations follow strict timing discipline, and no secret-dependent branches or memory accesses are performed [
26,
27]. The algorithm is also optimized for cache efficiency: intermediate values like
rows and
are kept in small, aligned buffers that fit within the L1 cache of typical microcontrollers, minimizing expensive memory traffic [
20,
22]. This focus on locality ensures that the computational overhead of on-the-fly generation does not translate into prohibitive energy costs on battery-powered devices.
4.7. Decapsulation Algorithm
Decapsulation is the inverse process that recovers the shared secret from the ciphertext and private key, with rigorous checks to ensure correctness and security. The receiver begins by parsing the ciphertext to extract the LWE samples (u, v), the Merkle authentication path, and the leaf index. Using the private key, which contains “seed_s”, the receiver reconstructs the sparse secret polynomial s and recomputes the public vector. The matrix A is regenerated on-the-fly from the public seed “seed_A”, which is recoverable from the public key hash stored in the private key.
The core security mechanism in decapsulation is the Merkle path verification, which ensures that the decrypted error pattern belongs to the set committed in the public key. Upon receiving a ciphertext, the receiver identifies the claimed error pattern via its leaf index and verifies it against the public Merkle root using the provided authentication path [
35]. If verification fails—e.g., due to ciphertext tampering or substitution of an invalid error—the algorithm outputs a uniformly random shared secret, as required by IND-CCA security, thereby preventing decryption failure or reaction-based leakage attacks [
5,
24]. Importantly, the decapsulation procedure does not require explicit storage or transmission of the vector
; instead, consistency is checked implicitly through the committed structure of the secret and the regenerated algebraic relations. Concretely, the receiver deterministically reconstructs the sparse secret
from
and regenerates the matrix
from
, then derives the candidate error
using
and the ciphertext index. The leaf hash
is verified against the Merkle root
via the authentication path of length
; only if this verification succeeds is the shared secret computed as
. This design ensures that the sender is bound to a valid, pre-committed error structure, preventing adaptive chosen-ciphertext attacks that exploit malformed or adversarially chosen error terms, while maintaining correctness under standard Module-LWE assumptions [
24]. The authentication path thus replaces the need to transmit or store any full intermediate structure such as
, reducing memory overhead while preserving verifiability.
If verification succeeds, the receiver computes the shared secret as
and compares it to the sender’s value. The correctness of this process is guaranteed by the binding property of the Merkle commitment: only the true owner of the secret can produce a valid authentication path for a given error pattern. All operations are performed in constant-time, with no early exits or data-dependent memory accesses that could leak information through timing or power analysis [
28].
Memory usage during decapsulation is carefully controlled. The algorithm avoids large intermediate buffers by processing data in small chunks and reusing memory wherever possible. For instance, the reconstructed secret s and the LWE samples are stored in overlapping buffers to minimize peak RAM usage, which remains below 24 KB even at the highest security level. This makes the algorithm viable for deployment on platforms like the ARM Cortex-M4, which often have only 32–64 KB of RAM available for application code [
14,
23].
In summary, the decapsulation algorithm embodies the security and efficiency principles of the Merkle-LWE design. It leverages the hybrid structure to shift verification from algebraic checks to cryptographic commitments, enabling a high degree of confidence in the authenticity of the ciphertext while maintaining the extreme memory efficiency that defines the scheme.
4.8. Formal Specification and Parameter Sets
To enable independent verification of correctness and security, we provide a complete formal specification of the Merkle-LWE KEM. The scheme operates over the ring
with parameters
defined per NIST security level in
Table 4.
The key generation procedure is presented in Algorithm 1. It takes a security level as input, samples randomness to derive seeds for matrix generation, secret construction, and error reconstruction, and produces a public–private key pair. A sparse secret vector is generated according to the prescribed sparsity parameters, and a Merkle tree commitment is computed over its coefficients to ensure binding. The public key consists of the matrix seed and the Merkle root, while the secret key retains the seeds required for deterministic reconstruction.
| Algorithm 1: Key Generation |
| 1: | using system entropy |
| 2: | |
| 3: | with all coefficients set to zero |
| 4: | do |
| 5: | ) |
| 6: | end for |
| 7: | do |
| 8: | |
| 9: | end for |
| 10: | |
| 11: | |
| 12: | |
| 13: | |
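The key assembly described in the prose (seed sampling, commitment over the secret's coefficients, and concatenation into a 96-byte public key) can be sketched as follows. This is not a transcription of Algorithm 1: the dense toy coefficient expansion, the leaf encoding (index followed by coefficient byte), and the omission of the security-level metadata from the private key are all simplifying assumptions, so the private key here is shorter than the 160–224 bytes reported.

```python
import hashlib
import secrets

N_COEFFS = 256  # number of committed coefficients (assumed)

def merkle_root(leaves: list[bytes]) -> bytes:
    """Hash pairs bottom-up with SHA3-512 until one 64-byte root remains."""
    level = list(leaves)
    while len(level) > 1:
        level = [hashlib.sha3_512(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def toy_coefficients(seed_s: bytes) -> list[int]:
    """Toy stand-in for the sparse sampler: values in {-1, 0, 1}, slightly biased."""
    raw = hashlib.shake_256(seed_s).digest(N_COEFFS)
    return [(b % 3) - 1 for b in raw]

def commit(seed_s: bytes) -> bytes:
    # Assumed leaf encoding: 2-byte index followed by the coefficient as one byte.
    leaves = [hashlib.sha3_512(i.to_bytes(2, "big") + bytes([c & 0xFF])).digest()
              for i, c in enumerate(toy_coefficients(seed_s))]
    return merkle_root(leaves)

def keygen() -> tuple[bytes, bytes]:
    seed_a = secrets.token_bytes(32)  # public matrix seed
    seed_s = secrets.token_bytes(32)  # secret seed
    pk = seed_a + commit(seed_s)      # 32 + 64 = 96 bytes
    sk = seed_s + hashlib.sha3_256(pk).digest()  # seed plus public-key hash
    return pk, sk

pk, sk = keygen()
print(len(pk), len(sk))  # 96 64
```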
The encapsulation procedure, formalised in Algorithm 2, takes a public key as input and produces a ciphertext together with a shared secret. It deterministically reconstructs the public matrix, samples an ephemeral sparse secret, and selects an error value from a predefined set along with its Merkle authentication path. These components are used to compute the LWE sample, from which both the ciphertext and shared secret are derived.
| Algorithm 2: Encapsulation |
| 1: | |
| 2: | ) |
| 3: | using fresh ChaCha20 stream |
| 4: | |
| 5: | |
| 6: | |
| 7: | |
| 8: | |
| 9: | |
| 10: | |
The decapsulation procedure is formalised in Algorithm 3. It takes as input a secret key and a ciphertext, reconstructs the secret vector and matrix deterministically, verifies the integrity of the transmitted error using the Merkle authentication path, and derives the shared secret if verification succeeds.
| Algorithm 3: Decapsulation |
| 1: | |
| 2: | |
| 3: | |
| 4: | |
| 5: | fails then |
| 6: | and terminate |
| 7: | end if |
| 8: | |
| 9: | |
| 10: | |
4.9. Error Pattern Selection and Distributional Indistinguishability
The Merkle commitment layer requires a well-defined set of candidate error patterns to which the public key commits. This section formalizes the selection methodology, characterizes the induced distribution, and establishes that restricting encapsulation to this candidate set does not compromise indistinguishability under standard Module-LWE assumptions.
The candidate set is generated deterministically from the secret error seed using the ChaCha20 stream cipher, modeled as a pseudorandom function. For each index in the set, the coefficient vector is sampled independently from the same bounded distribution used in standard Module-LWE, typically uniform over a bounded range whose bound is the same across all security levels. The set size is fixed per security level to balance Merkle authentication overhead with statistical coverage of the error space. During encapsulation, the sender selects an index uniformly at random and uses the corresponding candidate as the error vector in the LWE sample.
The restriction of error sampling to a fixed finite set induces a distribution that differs from ideal i.i.d. sampling from the error distribution. However, since the candidate set is generated via a cryptographically secure pseudorandom function, it is computationally indistinguishable from a collection of independent samples drawn from that distribution, provided that ChaCha20 remains secure. Consequently, from the perspective of any polynomial-time adversary without knowledge of the error seed, selection from the candidate set is indistinguishable from sampling from an honestly generated error distribution.
The only deviation from the ideal distribution arises from conditioning on a finite support whose size is the number of candidates. This introduces a statistical loss of entropy relative to the full error space. This loss is negligible at all security levels considered and does not affect asymptotic hardness.
We formalize security preservation via a sequence of hybrid experiments that relate the proposed scheme to a standard Module-LWE scheme with unrestricted error sampling.
Hybrid 0 (Real Scheme). The adversary interacts with the proposed scheme, where the candidate set is generated via ChaCha20 and a uniformly random index is used for each encapsulation.
Hybrid 1 (PRF Replacement). ChaCha20 is replaced with a truly random function. By PRF security, the adversary’s advantage in distinguishing Hybrid 0 from Hybrid 1 is bounded by the PRF advantage against ChaCha20. In this hybrid, the candidate set is indistinguishable from a uniformly random set of valid error vectors drawn from the error distribution.
Hybrid 2 (Independent Sampling). Selection from the candidate set is replaced by direct sampling from the error distribution for each encapsulation. The difference between Hybrid 1 and Hybrid 2 is negligible due to the pseudorandom nature of the candidate set and the uniform selection mechanism.
Hybrid 3 (Standard Module-LWE). The experiment is replaced with standard Module-LWE sampling. Distinguishing Hybrid 2 from Hybrid 3 reduces to solving the Module-LWE problem, so this gap is bounded by the Module-LWE advantage.
Combining the transitions, the adversary’s total advantage is bounded by the sum of the three gaps: the PRF advantage, the negligible loss between Hybrid 1 and Hybrid 2, and the Module-LWE advantage.
The Merkle tree does not alter the underlying LWE distribution but enforces consistency between ciphertexts and the committed candidate set. Any deviation from a valid error pattern results in rejection during verification, ensuring that all accepted ciphertexts correspond to a unique and pre-committed element of that set. This property strengthens binding without introducing bias into the LWE sampling process.
The candidate error set is a pseudorandomly generated finite ensemble derived from a secure PRF and sampled uniformly during encapsulation. While this induces a restricted sampling space relative to standard Module-LWE, the restriction is computationally hidden from adversaries and introduces only negligible statistical loss. The hybrid reduction confirms that the induced distribution remains indistinguishable from standard LWE sampling under PRF and Module-LWE assumptions, ensuring that the hardness of the underlying problem is preserved and that no exploitable structural bias is introduced by the Merkle-constrained error selection process.
7. Experimental Evaluation
This section presents a detailed experimental evaluation of the proposed cryptographic scheme’s performance characteristics. The analysis is grounded exclusively in the empirical data collected during benchmarking on a representative embedded platform, specifically the NUCLEO-L4R5ZI board equipped with an ARM Cortex-M4 processor. This platform serves as a standard for comparing various post-quantum KEMs within the PQM4 project [
23]. The evaluation compares the proposed scheme against three critical benchmarks: a traditional LWE KEM, CRYSTALS-Kyber, which has been standardized by the National Institute of Standards and Technology (NIST) as ML-KEM, and the lattice-based algorithm NTRU [
5,
42]. These comparisons provide a robust context for understanding the scheme’s efficiency across multiple dimensions crucial for deployment in resource-constrained environments like the Internet of Things (IoT).
7.1. Parameter Alignment and Security Context for Comparative Evaluation
To ensure that Merkle-LWE’s memory efficiency is evaluated against standardized alternatives, all comparative results are aligned to the NIST security level framework.
Table 5 summarizes the parameter correspondence between Merkle-LWE, CRYSTALS-Kyber, and NTRU across the three target security levels. All schemes are evaluated under equivalent security targets, with Merkle-LWE parameters selected to match the concrete hardness assumptions of the corresponding standardized constructions.
Several observations follow from Table 4 and Table 5. First, Merkle-LWE achieves a public key size of 96 bytes across all security levels, compared to significantly larger key sizes in Kyber and NTRU at equivalent security targets, while maintaining identical lattice dimension and modulus for Levels 1–3. This indicates that the reduction in memory footprint arises from structural representation rather than parameter modification. Second, the use of sparse secrets with controlled Hamming weight
is consistent with known results on sparse Module-LWE hardness; in particular, Bindel et al. [
18] show that Module-LWE remains hard under controlled sparsity for
when
, and the proposed parameters satisfy this condition at all security levels. Third, the inclusion of the SHA3-512-based commitment layer provides an additional security assumption based on collision resistance, which remains well above the 128-bit security threshold at Level 1 [
33].
These results indicate that Merkle-LWE achieves reduced public key sizes at equivalent security levels when compared to Kyber and NTRU. This behavior is attributable to the combined effect of seed-based matrix reconstruction, sparse secret representation, and hash-based commitment of error structures, all of which reduce explicit storage requirements while preserving alignment with standard Module-LWE hardness assumptions.
7.2. Cryptographic Object Size and Structure
The evaluation of Merkle-LWE’s memory efficiency is conducted using two complementary comparison frameworks. The first considers a traditional Module-LWE baseline in which the public matrix and secret vectors are stored explicitly without compression. This baseline is used to isolate and quantify the contribution of the main architectural components, namely seed-based deterministic generation of A, structured sparse secret representation, and Merkle-tree-based commitment for error verification. By comparing against this uncompressed reference, individual memory savings can be attributed to each design choice, providing a controlled analysis of structural trade-offs. The second framework focuses on practical deployment relevance and compares Merkle-LWE against standardized lattice-based schemes, including CRYSTALS-Kyber (ML-KEM) and NTRU Prime, under identical NIST security levels (Levels 1, 3, and 5), parameter sets, and implementation assumptions. All comparisons rely on standardized parameter definitions from the respective specification documents, ensuring that efficiency gains are evaluated against established post-quantum baselines.
As shown in Figure 5, the total footprint of the Merkle-LWE KEM is reduced by 99.3% compared to a conventional Module-LWE implementation. This reduction is not achieved through parameter weakening or security margin erosion, but through an architectural shift: the replacement of large, explicit lattice data with compact seeds and hash-based commitments. The public key, which in a traditional scheme would store the full public matrix A and the accompanying public vector, is compressed from over 263 KB to 96 bytes. Similarly, the private key shrinks from 1024 bytes to 160–224 bytes, depending on the security level. The ciphertext grows from 1028 bytes to 1504 bytes at Level 1 (about 46%), an overhead that is a deliberate trade-off for the savings in key material.
Figure 6 provides a granular breakdown of this transformation. The public key’s 96-byte structure is split into two parts: a 32-byte seed for deterministic matrix generation and a 64-byte Merkle root that commits to the entire space of possible error patterns. This design embodies the core thesis of the work: instead of storing hundreds of kilobytes of pseudorandom data, the system stores only the minimal entropy (the seed) and a cryptographic commitment (the root) that allows for efficient verification without storage [
29]. The private key’s composition is equally revealing: it consists of a 32-byte error seed, a 32-byte hash of the public key (for CCA security), and a 96-byte sparse representation of the secret polynomial. Critically, the secret is not stored as a dense vector of 256 coefficients, but as a list of only 48 non-zero indices and values—a structured sparsity that reduces storage by 84.4% while preserving the hardness guarantees of the underlying LWE problem [
18,
57].
The ciphertext’s structure reflects the cost of this verification model. Of its 1504 bytes, 1024 bytes (68.1%) are the bit-packed LWE samples (u, v), which constitute the core cryptographic payload. A further 448 bytes (29.8%) form the Merkle authentication path, which proves that the encapsulated shared secret was derived from a valid, committed error pattern. This is the source of the ciphertext’s size increase, but it is a necessary component of the security model. Without this path, the receiver would have no way to verify that the sender’s claimed error vector is consistent with the original commitment, opening the door to potential forgery attacks [
24].
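The byte accounting for the Level 1 ciphertext can be checked with a short calculation. The figures are those quoted in the text; the interpretation of the 32 bytes left over (for example, leaf index and framing) is an assumption, since the text does not itemize them.

```python
TOTAL = 1504       # Level 1 ciphertext size in bytes (from the text)
HASH_BYTES = 64    # SHA3-512 output size

parts = {"LWE samples (u, v)": 1024, "Merkle authentication path": 448}
shares = {name: round(100 * size / TOTAL, 1) for name, size in parts.items()}

unitemized = TOTAL - sum(parts.values())  # bytes not attributed in the text
path_hashes = parts["Merkle authentication path"] // HASH_BYTES
leaves = 2 ** path_hashes                 # tree size implied by the path length

print(shares)                           # {'LWE samples (u, v)': 68.1, 'Merkle authentication path': 29.8}
print(unitemized, path_hashes, leaves)  # 32 7 128
```

A 448-byte path corresponds to seven 64-byte hashes, that is, a 128-leaf tree, which is the smaller of the two candidate-set sizes mentioned for the scheme.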
Figure 7 places these results in context by comparing them against both the traditional baseline and the NIST PQC finalists. On a linear scale, the Merkle-LWE KEM’s keys are nearly invisible next to the multi-hundred-kilobyte footprint of traditional LWE. On a logarithmic scale, the separation is even more stark: the public key resides three orders of magnitude below its traditional counterpart. This is not a marginal improvement but a categorical shift in feasibility for embedded systems. A 96-byte public key can be stored in the SRAM of even the most constrained microcontrollers (e.g., ARM Cortex-M0 with 8 KB RAM), whereas a 263 KB key cannot fit in the flash memory of many IoT devices [
14,
16].
The cause of this size reduction is directly traceable to the three pillars of the hybrid architecture. First, seed-based matrix generation eliminates the need to store A explicitly. The 32-byte seed, when fed into a cryptographically secure PRNG like ChaCha20, can regenerate any row of A on demand, shifting the cost from static storage to dynamic computation [
49]. Second, structured sparsity in the secret key exploits the fact that LWE remains hard even with sparse secrets, allowing the private key to store only the non-zero coefficients and their positions [
6]. Third, the Merkle tree commitment layer replaces the storage of all possible error vectors with a single root hash, enabling verification via succinct authentication paths rather than exhaustive comparison [
35].
In conclusion, the experimental data validates the central hypothesis of this work: that cryptographic object size can be drastically reduced through representational innovation rather than parameter compromise. The Merkle-LWE KEM achieves a 99.3% reduction in total key material by restructuring the information content of its objects—replacing explicit data with seeds, sparse encodings, and hash commitments. This transformation makes PQC viable on platforms previously considered infeasible, without sacrificing the IND-CCA security guarantees required for real-world deployment [
5,
15]. The modest ciphertext overhead is a transparent and acceptable price for the massive gains in storage efficiency, particularly in environments where flash memory is scarcer than CPU cycles.
7.3. Static Code Footprint (Flash/ROM)
Figure 8 presents a modular breakdown of the Merkle-LWE KEM implementation’s static code footprint, measured in bytes and expressed as percentages of the total compiled size. The total flash usage amounts to 27,136 bytes (26.5 KB), confirming the implementation’s compatibility with embedded platforms featuring 256 KB or more of flash memory, such as ARM Cortex-M4 and Cortex-M7 [
14]. The largest contributor is the Lattice Arithmetic module, consuming 8192 bytes (8.0 KB), or 30.2% of the total. This reflects the computational complexity of polynomial arithmetic, bit packing, and modular operations inherent to lattice-based cryptography [
12]. Despite its dominance, the footprint remains bounded due to the use of sparse secrets and seed-based matrix generation, which eliminate bulky precomputed tables.
The Hash Functions module occupies 6144 bytes (6.0 KB), or 22.6%, driven by the inclusion of both SHA3-256 and SHA3-512. These are essential for Merkle tree construction and commitment verification, and their presence underscores the hybrid scheme’s reliance on hash-based security primitives [
31]. The PRNG (ChaCha20) module accounts for 4096 bytes (4.0 KB), or 15.1%. Its relatively compact footprint, combined with constant-time behaviour and stream-oriented design, makes it well-suited for embedded environments [
49]. It supports deterministic matrix generation and sparse polynomial expansion without introducing timing leakage.
The Merkle Tree logic contributes 3072 bytes (3.0 KB), or 11.3%, representing the overhead of node hashing, path generation, and verification. This module is central to the scheme’s memory efficiency, enabling public key compression via cryptographic commitments [
35]. The KEM Core Logic module, responsible for encapsulation and decapsulation routines, occupies 2560 bytes (2.5 KB), or 9.4%. The Sparse Polynomial module adds 2048 bytes (2.0 KB), or 7.5%, supporting deterministic secret generation and indexing. Finally, Memory Management routines consume 1024 bytes (1.0 KB), or 3.8%, covering secure clearing, allocation, and error handling [
38].
The distribution is well-balanced, with no single module exceeding one-third of the total footprint. This modularity facilitates cache locality and function-level optimization [
22]. The green annotation in the figure highlights the implementation’s embedded suitability, confirming that the total footprint fits comfortably within the flash constraints of Cortex-M-class MCUs. In summary, the figure validates that the Merkle-LWE KEM achieves a compact and modular codebase, with each component contributing proportionally to its hybrid functionality. The footprint remains well within embedded tolerances, supporting the scheme’s deployment in flash-constrained environments without compromising cryptographic integrity.
Figure 9 presents a comparative breakdown of the static code footprint between the Merkle-LWE KEM and a traditional LWE KEM implementation. The analysis highlights the flash usage of individual modules, measured in bytes, and reveals how the hybrid design reallocates code complexity across components while maintaining embedded suitability. The total footprint of the Merkle-LWE implementation is 25,088 bytes (24.5 KB), representing a 2560-byte (2.5 KB) increase over the traditional LWE KEM baseline of 22,528 bytes (22.0 KB). This ~11.4% growth is attributable to the introduction of new modules, most notably the Merkle tree logic and expanded hash functions, while other components are either retained or optimized.
The most significant reduction occurs in the Lattice Arithmetic module, which shrinks from 12,288 bytes in the traditional implementation to 8192 bytes in Merkle-LWE—a 33.3% decrease. This reflects the shift from explicit matrix storage and dense polynomial operations to seed-based generation and sparse arithmetic, which reduce both code complexity and runtime memory usage [
29,
57]. Conversely, the Hash Functions module expands from 4096 bytes to 6144 bytes—a 50% increase—due to the inclusion of SHA3-512 alongside SHA3-256. This expansion supports Merkle root generation and verification, which are central to the hybrid scheme’s commitment-based design.
The PRNG module also grows modestly, from 3072 bytes (AES-CTR) to 4096 bytes (ChaCha20), reflecting the adoption of a stream cipher with better constant-time properties and embedded performance [
26]. While ChaCha20 incurs a larger footprint, its deterministic behaviour and cache-friendly design justify the trade-off. The Merkle Tree module, absent in the traditional implementation, introduces 3072 bytes of new logic. This overhead is offset by the elimination of bulky public key and error vector storage, enabling a 99.3% reduction in cryptographic object sizes. The Merkle tree’s inclusion marks a structural shift in how correctness and integrity are verified, replacing transmission with succinct proofs. Other modules—Memory Management and KEM Core Logic—remain comparable in size, with the latter increasing slightly (2048 → 2560 bytes) to accommodate protocol enhancements. The Sparse Polynomial module, unique to Merkle-LWE, adds 2048 bytes for deterministic secret generation and indexing.
Overall, the figure illustrates a redistribution of code complexity: Merkle-LWE reduces arithmetic overhead and matrix handling in favour of hash-based commitments and sparse representations. The net increase in footprint remains modest and well within the flash constraints of embedded platforms. Importantly, the added modules directly support the scheme’s memory-first design goals, validating the trade-off between code size and cryptographic efficiency.
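The per-module changes quoted for Figure 9 can be re-derived with a few lines of arithmetic. The sketch below is only a check on the figures stated in the text; the dictionary keys are illustrative names rather than identifiers from the implementation, and only modules whose size is given for both implementations are included.

```python
# Module sizes in bytes, as quoted in the text for Figure 9.
# Key names are illustrative labels, not identifiers from the codebase.
TRADITIONAL = {
    "lattice_arithmetic": 12_288,
    "hash_functions": 4_096,
    "prng": 3_072,
    "kem_core": 2_048,
}
MERKLE_LWE = {
    "lattice_arithmetic": 8_192,
    "hash_functions": 6_144,
    "prng": 4_096,
    "kem_core": 2_560,
}

# Totals as stated in the text.
TOTAL_TRADITIONAL = 22_528
TOTAL_MERKLE_LWE = 25_088


def percent_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


module_changes = {
    name: percent_change(TRADITIONAL[name], MERKLE_LWE[name])
    for name in TRADITIONAL
}
total_delta_bytes = TOTAL_MERKLE_LWE - TOTAL_TRADITIONAL
total_change = percent_change(TOTAL_TRADITIONAL, TOTAL_MERKLE_LWE)

for name, change in module_changes.items():
    print(f"{name:20s} {change:+6.1f}%")
print(f"total: {total_delta_bytes:+d} bytes ({total_change:+.1f}%)")
```

The Merkle Tree and Sparse Polynomial modules have no counterpart in the baseline, so they contribute to the total but have no relative change.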
Figure 10 illustrates the flash memory suitability of the Merkle-LWE KEM implementation across four representative embedded system categories: ARM Cortex-M0, Cortex-M4, Cortex-M7, and high-end microcontroller units (MCUs). The chart compares the implementation’s flash usage (29.5 KB) against the available flash capacity of each platform, highlighting the proportion of memory consumed and validating deployment feasibility.
The most constrained platform, ARM Cortex-M0, offers 64 KB of flash, of which Merkle-LWE occupies 46.1%. Using just under half of the available flash is notable for a hybrid scheme with built-in side-channel countermeasures. The fit is tight but deployable, leaving roughly 34.5 KB for application logic, protocol buffers, and system routines.
On ARM Cortex-M4, which provides 256 KB of flash, Merkle-LWE consumes only 11.5% of the available capacity. This low utilization confirms the scheme’s compatibility with mid-tier MCUs commonly used in industrial control, medical instrumentation, and secure IoT endpoints. The remaining flash budget allows for integration with TLS stacks, firmware updates, and multi-session key management. For ARM Cortex-M7, with 512 KB of flash, the footprint drops to 5.8%, and for high-end MCUs (≥1024 KB), it reaches a minimal 2.9%. These figures demonstrate that Merkle-LWE scales efficiently across increasingly capable platforms, offering cryptographic functionality without imposing significant storage demands.
The consistent flash usage across all categories reflects the implementation’s deterministic memory profile. Unlike traditional lattice-based schemes, which scale linearly with security level and often require large precomputed tables or explicit matrix storage, Merkle-LWE maintains a fixed codebase by leveraging seed-based generation and modular design. This architectural choice ensures predictable deployment characteristics and simplifies resource planning for embedded developers.
In summary, the figure confirms that Merkle-LWE KEM is flash-compatible across a wide spectrum of embedded platforms. Its compact footprint, modular structure, and constant-time primitives make it a viable candidate for PQC in environments where static storage is a critical constraint [
15].
7.4. Peak RAM Usage
Figure 11 compares the peak RAM usage of Merkle-LWE KEM against three other post-quantum KEMs—Traditional LWE KEM, Kyber (NIST PQC finalist), and NTRU (NIST PQC alternate)—across three cryptographic operations: key generation, encapsulation, and decapsulation. The measurements are presented in bytes and benchmarked against the RAM constraints of two representative embedded platforms: ARM Cortex-M0 (32 KB) and Cortex-M4 (128 KB), indicated by horizontal dashed lines.
Merkle-LWE KEM exhibits a consistent peak RAM usage of 14,336 bytes during both key generation and encapsulation, and 10,240 bytes during decapsulation. These values remain well below the 32 KB threshold of Cortex-M0, confirming the scheme’s deployability even on the most constrained platforms. Traditional LWE KEM consumes 16,384 bytes for key generation and 12,288 bytes for encapsulation, so Merkle-LWE achieves a modest 12.5% reduction in key generation but uses 2048 bytes more during encapsulation. The key-generation saving is primarily due to its seed-based matrix generation and sparse secret representation, which eliminate the need to store large public matrices and dense coefficient arrays [
29,
57].
However, Merkle-LWE’s RAM usage is notably higher than Kyber and NTRU across all operations. Kyber requires only 8192 bytes for key generation and 6144 bytes for encapsulation, while NTRU is even more compact, consuming 4096 and 3072 bytes respectively [
23]. This disparity stems from Merkle-LWE’s hybrid architecture, which introduces additional memory demands for Merkle tree construction, hash-based commitments, and deterministic sparse polynomial generation. These components, while contributing to storage efficiency and security robustness, incur transient memory overhead during runtime.
Decapsulation is the most memory-efficient phase for Merkle-LWE, requiring 10,240 bytes. This is lower than its own key generation and encapsulation phases, but 2048 bytes (25%) above Traditional LWE’s 8192 bytes. The reduction relative to the other phases reflects the absence of matrix generation and the streamlined nature of Merkle path verification, which operates on a small working set. Sparse secret reconstruction and bit unpacking are performed in-place, minimizing buffer duplication and leveraging cache-aware scheduling [
22].
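To make the idea of sparse secret reconstruction concrete, the following is a minimal sketch of deriving a fixed-weight sparse secret from a seed and using it without ever expanding it into a dense array. It is not the authors' routine: the function names, the dimension n = 256, the weight of 16, and the use of SHA3-256 in counter mode are all assumptions chosen for illustration, and Python code like this is not constant-time.

```python
import hashlib


def sparse_secret_from_seed(seed: bytes, n: int = 256, weight: int = 16):
    """Derive `weight` distinct (index, sign) pairs in [0, n) from a seed.

    Assumption: n is a power of two, so masking yields unbiased indices.
    """
    if n & (n - 1) or not 0 < weight <= n:
        raise ValueError("n must be a power of two and 0 < weight <= n")
    chosen = {}
    counter = 0
    while len(chosen) < weight:
        block = hashlib.sha3_256(seed + counter.to_bytes(4, "little")).digest()
        index = int.from_bytes(block[:4], "little") & (n - 1)
        sign = 1 if block[4] & 1 else -1
        chosen.setdefault(index, sign)  # a repeated index is simply skipped
        counter += 1
    return sorted(chosen.items())


def sparse_dot(row, secret, q):
    """Inner product that touches only the non-zero positions of the secret."""
    return sum(sign * row[index] for index, sign in secret) % q
```

Under these assumptions the secret occupies 16 (index, sign) pairs instead of 256 coefficients, and it can be regenerated from the seed whenever it is needed rather than kept in RAM.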
Despite the higher RAM usage compared to Kyber and NTRU, Merkle-LWE maintains a deterministic and bounded memory profile. There is no dynamic allocation, recursion, or heap fragmentation, which is critical for embedded systems where predictability and stability are paramount [
16]. The scheme’s memory-first design philosophy deliberately shifts cryptographic state from static storage to runtime computation, enabling a 99.3% reduction in key and ciphertext sizes while preserving platform compatibility.
In conclusion, the figure validates that Merkle-LWE KEM operates within acceptable RAM limits across all cryptographic phases. Its memory profile reflects a conscious trade-off: higher transient RAM usage than Kyber and NTRU in exchange for dramatically smaller keys and ciphertexts and enhanced cryptographic integrity. This balance makes Merkle-LWE a viable candidate for PQC in embedded environments where static storage is the dominant constraint.
Figure 12 presents a stacked horizontal bar chart that decomposes memory usage by computational category across the three core operations of the Merkle-LWE KEM: key generation, encapsulation, and decapsulation. Each bar is segmented into contributions from error pattern generation, matrix operations, and hash computations, enabling a granular view of memory pressure sources and their operational distribution.
Key generation emerges as the most memory-intensive phase, consuming 14.0 KB of RAM. This is primarily driven by matrix operations and error pattern generation, which together account for the majority of the footprint. The matrix component reflects the cost of on-the-fly instantiation of the public matrix A from a seed using ChaCha20, while the error pattern segment corresponds to the generation and temporary storage of sparse error vectors [
29]. The latter is particularly impactful due to its random-access nature and the need to maintain intermediate buffers for Merkle tree commitments.
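The on-the-fly instantiation of A can be illustrated with a short sketch in which each row is regenerated from the seed, used once, and discarded. This is a sketch under stated assumptions, not the implementation measured here: SHAKE-128 from the Python standard library stands in for ChaCha20, and the modulus Q = 3329, the dimension N = 64, and both function names are hypothetical.

```python
import hashlib

Q = 3329  # illustrative modulus (assumption; not stated in the text)
N = 64    # illustrative dimension (assumption)


def expand_row(seed: bytes, row_index: int, n: int = N):
    """Regenerate one row of A from the seed by rejection sampling.

    SHAKE-128 is a stand-in for the ChaCha20 expansion described in the text.
    """
    xof = hashlib.shake_128(seed + row_index.to_bytes(2, "little"))
    stream = xof.digest(8 * n)  # 4n two-byte candidates, far more than needed
    row = []
    for offset in range(0, len(stream), 2):
        # Keep 12 bits, which is enough to cover Q = 3329.
        candidate = int.from_bytes(stream[offset:offset + 2], "little") & 0x0FFF
        if candidate < Q:
            row.append(candidate)
            if len(row) == n:
                return row
    raise RuntimeError("XOF output exhausted; request a longer stream")


def matvec_streaming(seed: bytes, s, rows: int = N):
    """Compute A*s mod Q while holding only one row of A at a time."""
    result = []
    for i in range(rows):
        row = expand_row(seed, i)
        result.append(sum(a * b for a, b in zip(row, s)) % Q)
    return result
```

The working set is one row plus the accumulator, rather than the full N × N matrix, which is the memory saving the paragraph above describes; the price is that every row is recomputed each time it is used.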
Encapsulation follows closely with 13.0 KB of peak memory usage. Here, matrix operations again dominate, as the scheme performs seeded row generation and polynomial multiplication to compute the LWE sample. Hash computations also contribute significantly, reflecting the cost of Merkle path construction and commitment generation. The absence of error pattern generation in this phase slightly reduces the overall footprint compared to key generation, but the memory profile remains dense due to concurrent polynomial and hash operations.
Decapsulation is the least memory-intensive phase, requiring 9.0 KB of RAM. The reduced footprint is attributable to the absence of matrix generation and the streamlined nature of Merkle path verification. Hash computations remain present but are limited to inclusion proof validation, while matrix operations are minimal and confined to sparse secret reconstruction. The error pattern segment is comparatively small, as decapsulation does not involve fresh error sampling but rather verification against committed values.
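Inclusion proof validation of the kind mentioned above needs only the leaf, its index, and one sibling hash per tree level. The sketch below shows a generic Merkle path check over SHA3-256; the one-byte domain-separation prefixes, the power-of-two leaf count, and all function names are assumptions made for illustration and are not taken from the scheme's specification.

```python
import hashlib
import hmac


def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha3_256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha3_256(b"\x01" + left + right).digest()


def build_levels(leaves):
    """Return all tree levels, leaves first. Assumes a power-of-two leaf count."""
    level = [leaf_hash(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def auth_path(levels, index: int):
    """Collect the sibling hash at every level below the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])
        index >>= 1
    return path


def verify_path(leaf: bytes, index: int, path, root: bytes) -> bool:
    """Recompute the root from one leaf and its siblings, then compare."""
    node = leaf_hash(leaf)
    for sibling in path:
        if index & 1:
            node = node_hash(sibling, node)
        else:
            node = node_hash(node, sibling)
        index >>= 1
    return hmac.compare_digest(node, root)
```

For a tree with 2^d leaves the verifier holds d sibling hashes of 32 bytes each plus one running digest, which is consistent with the small working set reported for this phase.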
The chart highlights that matrix operations are the dominant source of memory pressure across all phases, underscoring the computational cost of seed-based generation and sparse arithmetic. Hash computations, while less intensive, contribute consistently due to the hybrid scheme’s reliance on Merkle trees for correctness and integrity. Error pattern generation is concentrated in key generation, where it contributes significantly due to its sparse and non-sequential access pattern.
This decomposition validates the design’s memory-first philosophy: by shifting cryptographic state from static storage to runtime computation, Merkle-LWE achieves dramatic reductions in stored key and ciphertext sizes while maintaining bounded RAM usage. The memory bottlenecks are predictable and phase-specific, enabling targeted optimization strategies such as buffer reuse, access pattern reordering, and layout-aware scheduling.
In conclusion, the figure confirms that Merkle-LWE’s memory usage is structurally concentrated in matrix and hash operations, with error generation contributing episodically. The scheme’s operational memory profile remains within embedded tolerances and reflects a deliberate trade-off between storage efficiency and transient memory cost. This balance supports real-world deployment on constrained platforms without compromising cryptographic robustness.
Figure 13 evaluates the RAM utilization of the Merkle-LWE KEM implementation across five representative embedded platforms: ARM Cortex-M0, Cortex-M4, Cortex-M7, ESP32, and STM32F4. The chart expresses RAM usage as a percentage of total available memory, with two horizontal thresholds—50% (orange, labelled “Good”) and 80% (red, labelled “Tight”)—indicating suitability boundaries for embedded deployment.
The most constrained platform, ARM Cortex-M0, exhibits a RAM utilization of 43.8%, placing it below the 50% threshold and confirming its viability for Merkle-LWE deployment. This is a critical validation, as Cortex-M0 devices typically operate with only 32 KB of RAM, and any cryptographic scheme exceeding half of this budget risks interfering with application logic, protocol buffers, or system routines [
16]. Merkle-LWE’s bounded and deterministic memory profile ensures that even in this tight environment, the implementation remains stable and predictable.
On ARM Cortex-M4, which offers 128 KB of RAM, Merkle-LWE consumes only 10.9%, leaving ample headroom for additional cryptographic layers, secure boot logic, or real-time operating systems. Similarly, Cortex-M7 shows a minimal 2.7% utilization, while ESP32 and STM32F4 report 2.8% and 7.3%, respectively. These figures demonstrate that Merkle-LWE scales efficiently across increasingly capable platforms: its absolute RAM requirement is fixed, so the utilization percentage falls as available memory increases.
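The utilization figures follow directly from the worst-case peak across the three phases. The sketch below reproduces the calculation for the two platforms whose RAM capacity is stated in the text (32 KB and 128 KB); the function and key names are illustrative, and the two thresholds are the 50% and 80% lines described for Figure 13.

```python
# Peak RAM per phase in bytes, as reported for Figure 11.
PEAK_RAM_BYTES = {"keygen": 14_336, "encaps": 14_336, "decaps": 10_240}

# Only platforms whose capacity is stated in the text are listed.
RAM_CAPACITY_BYTES = {"Cortex-M0": 32 * 1024, "Cortex-M4": 128 * 1024}

GOOD_THRESHOLD = 50.0   # percent, the "Good" line in Figure 13
TIGHT_THRESHOLD = 80.0  # percent, the "Tight" line in Figure 13


def ram_report(peaks: dict, capacity: int) -> dict:
    """Summarize the worst-case phase against one platform's RAM budget."""
    worst = max(peaks.values())
    utilization = 100.0 * worst / capacity
    return {
        "peak_bytes": worst,
        "utilization_percent": utilization,
        "headroom_bytes": capacity - worst,
        "below_good": utilization < GOOD_THRESHOLD,
        "below_tight": utilization < TIGHT_THRESHOLD,
    }


reports = {name: ram_report(PEAK_RAM_BYTES, cap) for name, cap in RAM_CAPACITY_BYTES.items()}
for name, report in reports.items():
    print(f"{name}: {report['utilization_percent']:.1f}% used, "
          f"{report['headroom_bytes']} bytes free")
```

Because provisioning on such devices is static, the budget is set by the largest phase rather than by an average over phases.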
The consistent checkmarks across all bars indicate that the implementation remains within acceptable limits for each platform. Importantly, the RAM usage is not only low but also phase-stable—the peak values observed during key generation, encapsulation, and decapsulation do not fluctuate unpredictably. This stability is essential for embedded systems, which often lack dynamic memory allocation and rely on static provisioning.
The suitability across platforms is a direct consequence of Merkle-LWE’s architectural choices: sparse secret representation, in-place polynomial operations, and incremental Merkle tree construction. These techniques minimize buffer duplication and avoid heap fragmentation, ensuring that the scheme’s memory footprint remains compact and bounded.
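Incremental Merkle tree construction can be sketched as a streaming computation that keeps at most one pending node per tree level, so the root is obtained without holding all leaves in memory. The class below is an illustrative sketch, not the authors' code: the class name, the prefix bytes, and the restriction to a power-of-two number of leaves are assumptions.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha3_256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha3_256(b"\x01" + left + right).digest()


class IncrementalMerkle:
    """Streaming Merkle root using a stack of (height, digest) pairs."""

    def __init__(self):
        self.stack = []
        self.peak_nodes = 0  # largest number of digests held at once

    def add_leaf(self, data: bytes) -> None:
        node, height = leaf_hash(data), 0
        # Merge with the pending node of equal height for as long as one exists.
        while self.stack and self.stack[-1][0] == height:
            _, left = self.stack.pop()
            node = node_hash(left, node)
            height += 1
        self.stack.append((height, node))
        self.peak_nodes = max(self.peak_nodes, len(self.stack))

    def root(self) -> bytes:
        if len(self.stack) != 1:
            raise ValueError("leaf count must be a power of two")
        return self.stack[0][1]
```

With 32-byte digests, the pending state for a tree of 2^d leaves is at most d + 1 digests, so a commitment over many leaves needs only a few hundred bytes of buffer under these assumptions.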
In conclusion, the figure confirms that Merkle-LWE KEM is RAM-compatible across a wide spectrum of embedded platforms. Its low and predictable memory usage supports real-world deployment in flash- and RAM-constrained environments, validating the scheme’s design goals and reinforcing its applicability to secure embedded systems.
7.5. Memory Traffic and Bandwidth Analysis
Figure 14 presents a comparative analysis of total memory traffic during the three core cryptographic operations—key generation, encapsulation, and decapsulation—for Merkle-LWE KEM and a traditional LWE KEM baseline. Memory traffic is measured in bytes and reflects the cumulative volume of data moved across memory hierarchies during execution, including reads, writes, and intermediate buffer transfers. This metric is critical for embedded systems, where bandwidth constraints and energy costs associated with memory access often dominate computational overhead [
19,
20].
During key generation, Merkle-LWE KEM exhibits a total memory traffic of 302,176 bytes, representing a 61.7% reduction compared to the traditional LWE KEM’s 789,504 bytes. This substantial improvement stems from Merkle-LWE’s seed-based matrix instantiation and sparse error pattern generation. Unlike traditional implementations that store and manipulate large explicit matrices and dense vectors, Merkle-LWE regenerates matrix rows on demand using a lightweight PRNG and commits to error vectors via Merkle roots [
29,
35]. These design choices eliminate the need for bulk memory transfers and reduce the volume of intermediate data.
Encapsulation shows near parity between the two schemes, with Merkle-LWE generating 529,088 bytes of traffic versus 531,520 bytes for traditional LWE. The marginal difference reflects the fact that both schemes perform similar polynomial multiplications and ciphertext assembly during this phase. However, Merkle-LWE’s use of sparse secrets and in-place packing slightly reduces buffer duplication and memory churn, contributing to the modest traffic savings [
57].
Decapsulation reveals the most dramatic divergence: Merkle-LWE incurs 279,264 bytes of memory traffic, while traditional LWE requires only 6240 bytes, so Merkle-LWE moves roughly 45 times as much data. This inversion is a direct consequence of Merkle-LWE’s commitment-based verification strategy. During decapsulation, the scheme reconstructs sparse secrets from seeds, verifies Merkle paths, and performs hash-based inclusion checks, all of which involve multiple memory accesses and temporary buffers [
35]. In contrast, traditional LWE performs a straightforward decryption using stored keys and ciphertexts, resulting in minimal data movement.
The figure underscores a key architectural trade-off: Merkle-LWE shifts memory traffic from key generation to decapsulation in order to minimize static storage and enhance security. This redistribution is intentional and reflects the scheme’s memory-first design philosophy. By accepting additional memory traffic during verification, Merkle-LWE avoids persistent storage of large public keys and error vectors, enabling deployment on flash-constrained platforms [
15].
Importantly, the overall memory traffic across all operations remains within embedded tolerances. While decapsulation incurs higher bandwidth, the reductions in key generation and encapsulation more than offset it: summed over the three operations, Merkle-LWE moves 1,110,528 bytes against 1,327,264 bytes for the traditional baseline, about 16.3% less. Moreover, the traffic profile is deterministic and phase-specific, allowing developers to provision memory bandwidth predictably and avoid runtime bottlenecks.
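The per-phase and cumulative traffic comparison can be checked directly from the byte counts reported for Figure 14. The following sketch only restates that arithmetic; the phase labels and variable names are illustrative.

```python
# Memory traffic in bytes per phase: (Merkle-LWE, traditional LWE), from Figure 14.
TRAFFIC = {
    "keygen": (302_176, 789_504),
    "encaps": (529_088, 531_520),
    "decaps": (279_264, 6_240),
}


def percent_change(baseline: int, value: int) -> float:
    """Relative change of value against baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


phase_change = {
    phase: percent_change(traditional, merkle)
    for phase, (merkle, traditional) in TRAFFIC.items()
}
decaps_ratio = TRAFFIC["decaps"][0] / TRAFFIC["decaps"][1]

total_merkle = sum(merkle for merkle, _ in TRAFFIC.values())
total_traditional = sum(traditional for _, traditional in TRAFFIC.values())
total_change = percent_change(total_traditional, total_merkle)

for phase, change in phase_change.items():
    print(f"{phase}: {change:+.1f}%")
print(f"decapsulation ratio: {decaps_ratio:.1f}x")
print(f"total: {total_merkle} vs {total_traditional} bytes ({total_change:+.1f}%)")
```

Expressing the decapsulation gap as a ratio rather than a percentage keeps it readable, since the baseline value for that phase is very small.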
In conclusion, the figure validates that Merkle-LWE KEM achieves a favourable balance between memory traffic and storage efficiency. Its bandwidth profile reflects a deliberate reallocation of data movement to support cryptographic commitments and sparse representations. This trade-off enables secure and efficient operation in embedded environments where memory bandwidth is a critical resource.
Figure 15 presents a comparative breakdown of memory traffic across three core cryptographic operations—key generation, encapsulation, and decapsulation—between Merkle-LWE KEM and a traditional LWE KEM baseline. Memory traffic is quantified in total bytes transferred, encompassing all memory reads, writes, and intermediate buffer movements during execution. This metric is particularly relevant for embedded platforms, where memory bandwidth directly impacts energy consumption, latency, and cache efficiency [
19].
In the key generation phase, Merkle-LWE KEM demonstrates a substantial reduction in memory traffic, consuming 302,176 bytes compared to 789,504 bytes for traditional LWE—a 61.7% decrease. This improvement is attributed to Merkle-LWE’s seed-based matrix generation, which replaces explicit matrix storage and bulk memory loads with lightweight pseudorandom expansion [
29]. Additionally, sparse error pattern generation avoids dense vector manipulation, further reducing memory movement. The result is a leaner memory footprint that aligns with the scheme’s memory-first design philosophy.
Encapsulation shows near parity between the two schemes, with Merkle-LWE generating 529,088 bytes of traffic and traditional LWE producing 531,520 bytes. This similarity reflects the shared computational structure of both schemes during ciphertext generation, including polynomial multiplication and bit packing. However, Merkle-LWE’s use of sparse secrets and in-place operations slightly reduces buffer duplication, contributing to marginal traffic savings.
Decapsulation reveals a stark contrast: Merkle-LWE incurs 279,264 bytes of memory traffic, while traditional LWE requires only 6240 bytes, so Merkle-LWE moves roughly 45 times as much data. This inversion is a direct consequence of Merkle-LWE’s commitment-based verification model. Instead of relying on stored secrets, the scheme reconstructs sparse polynomials from seeds and verifies correctness via Merkle path traversal and hash-based inclusion proofs [
35]. These operations, while computationally efficient, involve multiple memory accesses and temporary buffers, resulting in elevated traffic during decapsulation.
The figure also highlights a fundamental architectural shift: Merkle-LWE trades static memory loads for dynamic PRNG expansion. Traditional LWE relies heavily on sequential reads of precomputed matrices and vectors, leading to high memory traffic concentrated in key generation. In contrast, Merkle-LWE distributes memory usage more evenly across operations, with PRNG expansion replacing bulk loads and enabling on-demand computation [
49]. This redistribution reduces peak traffic and improves cache locality, particularly in resource-constrained environments.
In summary, the figure validates that Merkle-LWE achieves a favourable memory traffic profile by replacing expensive memory loads with controlled pseudorandom expansion. While decapsulation incurs higher traffic due to verification logic, the overall bandwidth remains within embedded tolerances and supports predictable provisioning. This trade-off reinforces Merkle-LWE’s suitability for flash-constrained platforms, where minimizing static storage and optimizing memory movement are critical for secure and efficient deployment.
Figure 16 presents a normalized comparison of three key metrics—storage usage, CPU cycles, and memory traffic—between Merkle-LWE KEM and a traditional LWE KEM baseline. The chart highlights the architectural trade-offs inherent in Merkle-LWE’s memory-first design, quantifying the cost of computational overhead against the benefits of storage and bandwidth efficiency.
The most pronounced gain is observed in storage usage, where Merkle-LWE achieves a 99.3% reduction relative to traditional LWE. This dramatic improvement stems from the scheme’s structural reconfiguration: large public matrices and error vectors are replaced with compact seeds and Merkle-root commitments [
29,
35]. By eliminating explicit storage of deterministic components and leveraging on-the-fly generation, Merkle-LWE compresses public keys from kilobytes to under 100 bytes, enabling deployment on flash-constrained embedded platforms [
14].
In contrast, CPU cycle consumption increases by 100.5%, reflecting the computational cost of regenerating matrix rows, expanding sparse secrets, and verifying Merkle paths. This overhead is expected and accepted within the scheme’s design philosophy, which prioritizes storage minimization over raw throughput. The additional cycles are primarily concentrated in matrix operations and PRNG expansion, as shown in prior analyses, and are bounded within predictable limits suitable for non-time-critical embedded applications.
Memory traffic, measured as total data movement across memory hierarchies, shows a 16.3% reduction for Merkle-LWE. This improvement is achieved despite the increased computational load, due to the elimination of bulk memory loads and the use of in-place operations [
22]. Traditional LWE schemes rely heavily on sequential reads of large precomputed matrices, which inflate memory traffic and degrade cache locality. Merkle-LWE replaces these with lightweight PRNG expansion and sparse access patterns, reducing bandwidth demands and improving energy efficiency [
20].
The chart encapsulates the core trade-off: Merkle-LWE sacrifices computational simplicity to achieve substantial gains in storage and memory bandwidth. This reallocation of resource pressure—from flash and RAM to CPU cycles—is well-suited to embedded platforms where memory is scarce but computation is relatively abundant. The normalized metrics confirm that the scheme’s overheads are proportional and predictable, validating its suitability for constrained environments.
In summary, the figure demonstrates that Merkle-LWE KEM delivers a favourable cost–benefit profile. Its storage efficiency is the highest among the schemes compared here, and its memory traffic reduction partly offsets the computational cost. These characteristics position Merkle-LWE as a viable and forward-looking candidate for PQC in embedded systems.
7.6. Cache Behaviour and Locality
Figure 17 presents a dual-panel analysis of cache behaviour for Merkle-LWE KEM, focusing on L1 and L2 cache miss counts across three core cryptographic operations: key generation, encapsulation, and decapsulation. The results are benchmarked against a traditional LWE KEM baseline, which exhibits zero cache misses across both levels due to its reliance on dense, sequential memory access patterns and precomputed data structures [
22].
In the left panel, L1 cache behaviour is visualized. Merkle-LWE incurs 639 misses during key generation, 496 misses during encapsulation, and 23 misses during decapsulation, totalling 1158 L1 cache misses. These misses are a direct consequence of the scheme’s sparse and dynamic memory access patterns. Specifically, key generation involves randomized error pattern sampling and Merkle tree construction, both of which introduce non-linear access sequences that disrupt spatial locality. Encapsulation similarly suffers from sparse secret access and on-the-fly matrix row generation, which prevent effective cache line reuse. Decapsulation, while more memory-efficient, still incurs minor overhead due to sparse secret reconstruction and Merkle path verification [
58].
In contrast, the traditional LWE KEM registers zero L1 cache misses across all operations. This is expected, as the scheme operates on preloaded matrices and dense vectors with highly predictable access patterns. These structures align well with cache line boundaries and benefit from sequential traversal, resulting in perfect cache hit rates.
The right panel confirms that no L2 cache misses were recorded for either scheme. This indicates that all working sets fit comfortably within the L1 cache (32 KB simulated), and that Merkle-LWE’s memory footprint, while more fragmented, remains bounded and does not spill into higher cache levels. This is a critical validation for embedded platforms, where L2 caches are often absent or minimal, and L1 locality is paramount for performance and energy efficiency [
20].
The observed cache behaviour reflects a fundamental design trade-off. Merkle-LWE prioritizes storage efficiency through seed-based generation and sparse representations, which inherently degrade cache locality. However, the resulting L1 miss counts are predictable, bounded, and phase-specific, allowing for targeted optimization. For example, matrix generation and Merkle tree traversal could benefit from layout-aware scheduling or access pattern reordering to improve spatial locality [
58].
Importantly, the absence of L2 misses and the low decapsulation overhead demonstrate that Merkle-LWE’s cache inefficiencies are concentrated in setup and encapsulation phases, which are typically less latency-sensitive in embedded applications. The scheme’s constant-time execution and avoidance of secret-dependent branching further mitigate the performance impact of cache misses, preserving side-channel resilience [
27].
In summary, the figure validates that Merkle-LWE KEM introduces controlled cache overhead as a consequence of its memory-first architecture. While L1 miss rates are elevated relative to traditional schemes, they remain within acceptable bounds and do not propagate to higher cache levels. This behaviour supports the scheme’s suitability for embedded deployment, where predictable memory access and bounded cache pressure are essential for secure and efficient operation.
Figure 18 presents a component-level analysis of L1 cache miss rates in the Merkle-LWE KEM implementation, segmented by locality pattern. The chart categorizes six computational components—hash computations, bit packing/unpacking, Merkle tree traversal, on-the-fly matrix generation, compressed error storage, and sparse secret access—according to their memory access behaviour: sequential, tree-like, or random. This breakdown provides insight into how structural design choices influence cache performance and guides optimization strategies for embedded deployment [
22].
Components with sequential access patterns exhibit excellent cache locality. On-the-fly matrix generation incurs only a 2.0% miss rate, as rows are expanded linearly from a seed using ChaCha20, aligning well with cache line boundaries. Bit packing/unpacking and hash computations follow closely, with 3.0% and 4.0% miss rates respectively. These operations process data in contiguous blocks, minimizing cache evictions and benefiting from spatial locality. Compressed error storage, while slightly more fragmented, maintains a low 5.0% miss rate, confirming that sequential compression and decompression routines remain cache-friendly.
In contrast, components with non-linear access patterns show elevated miss rates. Merkle tree traversal, classified as tree-like, incurs a 35.0% miss rate due to hierarchical node access and frequent jumps across memory regions. While moderate, this overhead is acceptable given the logarithmic depth of Merkle trees and the infrequency of traversal operations. The most pronounced impact arises from sparse secret access, which exhibits a 95.0% miss rate under a random (1:64) access model. This reflects the scheme’s use of sparsity-aware secrets, where non-zero coefficients are scattered across a large index space, resulting in poor temporal and spatial locality [
58].
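The gap between sequential and scattered access can be illustrated with a toy direct-mapped cache model. This sketch does not reproduce the simulator or the miss rates reported for Figure 18: the 32-byte line size, the direct-mapped organization, the 2-byte coefficient width, and the reading of "1:64" as one touched coefficient in every 64 are all assumptions made for illustration.

```python
def miss_rate(addresses, cache_bytes: int = 32 * 1024, line_bytes: int = 32) -> float:
    """Miss rate of a cold, direct-mapped cache over a byte-address trace."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines  # which memory line each cache slot holds
    misses = 0
    accesses = 0
    for address in addresses:
        line = address // line_bytes
        slot = line % num_lines
        if tags[slot] != line:
            tags[slot] = line
            misses += 1
        accesses += 1
    if accesses == 0:
        raise ValueError("empty trace")
    return misses / accesses


# Contiguous 2-byte coefficients: one miss per 32-byte line, then 15 hits.
sequential_trace = range(0, 4096, 2)

# One 2-byte coefficient out of every 64: each access lands on a new line.
sparse_trace = [i * 64 * 2 for i in range(64)]

print(f"sequential: {miss_rate(sequential_trace):.2%}")
print(f"sparse:     {miss_rate(sparse_trace):.2%}")
```

In this model the sequential trace misses once per cache line while the sparse trace misses on every access, which is the qualitative behaviour the paragraph attributes to sparse secret access.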
The chart confirms that Merkle-LWE’s cache behaviour is pattern-dependent and structurally predictable. Sequential components dominate the runtime profile and maintain high cache efficiency, while sparse and tree-like components introduce controlled overhead. Importantly, the high miss rate of sparse secret access is mitigated by its low frequency and short duration, ensuring that overall cache pressure remains bounded.
This locality-aware decomposition validates the scheme’s architectural trade-offs. Merkle-LWE sacrifices cache uniformity in favour of memory compression and structural compactness, replacing dense vectors with sparse representations and explicit storage with cryptographic commitments. The resulting cache behaviour is not optimal in all components, but remains within tolerable limits for embedded platforms with constrained cache hierarchies.
In conclusion, the figure demonstrates that Merkle-LWE KEM achieves acceptable cache performance through careful balancing of locality patterns. Sequential operations dominate the memory footprint and maintain low miss rates, while sparse and tree-like components introduce predictable and manageable overhead. This behaviour supports the scheme’s deployment in cache-sensitive environments, reinforcing its suitability for secure embedded systems.
Figure 19 presents a comparative analysis of cache hit rates across three cryptographic operations—key generation, encapsulation, and decapsulation—for Merkle-LWE KEM and a traditional LWE KEM baseline. The chart quantifies cache efficiency as the percentage of successful L1 cache accesses, offering insight into how memory access patterns affect performance in constrained environments.
Traditional LWE KEM maintains a consistent 100.0% hit rate across all operations. This uniformity reflects its reliance on dense, sequential memory access patterns and precomputed data structures. Matrix rows and secret vectors are stored explicitly and traversed linearly, aligning well with cache line boundaries and minimizing evictions. As a result, the scheme benefits from optimal spatial locality and predictable cache behaviour.
In contrast, Merkle-LWE KEM exhibits operation-dependent cache efficiency, with hit rates of 86.9% for key generation, 89.3% for encapsulation, and 41.0% for decapsulation. The relatively high hit rates in the first two phases are sustained by sequential operations such as PRNG expansion, hash computations, and bit packing. These components process data in contiguous blocks, preserving locality and enabling effective cache utilization.
The sharp decline in decapsulation efficiency is attributed to sparse secret reconstruction, which involves random access to scattered coefficient indices. This disrupts spatial locality and leads to frequent cache line replacements, resulting in a 59.0 percentage-point drop in hit rate compared to the traditional baseline. While Merkle path verification also contributes to non-linear access, its impact is secondary to the sparsity-induced fragmentation [
58].
Despite the reduced cache efficiency in decapsulation, the overall behaviour remains bounded and predictable. Merkle-LWE’s cache profile reflects a deliberate trade-off: it sacrifices uniform locality in favour of structural compactness and memory savings. The scheme replaces large stored vectors with seed-based generation and cryptographic commitments, reducing static footprint at the cost of dynamic access irregularity.
Importantly, the cache inefficiencies are localized and phase-specific, affecting only a subset of operations. Key generation and encapsulation maintain high hit rates, ensuring that the majority of runtime execution benefits from cache acceleration. Moreover, the decapsulation phase, while less efficient, operates on a small working set and is typically invoked less frequently in embedded applications.
In conclusion, the figure confirms that Merkle-LWE KEM introduces controlled cache overhead as a consequence of its memory-first architecture. While traditional LWE achieves perfect cache efficiency through dense storage, Merkle-LWE balances locality with compression, enabling deployment on flash-constrained platforms without exceeding cache tolerances. This behaviour validates the scheme’s suitability for embedded systems where predictable performance and bounded resource usage are critical.
7.7. Computational Cost
Figure 20 presents a comparative analysis of CPU cycle consumption across the three core cryptographic operations—key generation, encapsulation, and decapsulation—for Merkle-LWE KEM and a traditional LWE KEM baseline. The results quantify the computational overhead introduced by Merkle-LWE’s memory-efficient architecture, providing a clear view of the cost incurred by replacing static storage with dynamic computation. The note accompanying the figure explicitly clarifies that the scheme does not claim speed superiority; rather, it aims to demonstrate the computational cost of achieving memory efficiency.
In the key generation phase, Merkle-LWE consumes 2,906,456 cycles, representing a 68.8% increase over the traditional LWE KEM’s 1,721,344 cycles. This overhead is primarily attributed to the dynamic instantiation of the public matrix A from a compact seed using ChaCha20, as well as the generation of sparse error vectors and the construction of Merkle tree commitments [
29,
35]. Unlike traditional schemes that rely on precomputed and densely stored matrices, Merkle-LWE regenerates these structures on-the-fly, incurring additional cycles for pseudorandom expansion and polynomial arithmetic. The Merkle tree construction, while lightweight in terms of memory, introduces hash computations that further contribute to the cycle count.
Encapsulation shows a far larger relative overhead than key generation, with Merkle-LWE requiring 809,636 cycles compared to 268,516 cycles for traditional LWE, a 201.5% increase. This phase involves seeded matrix row generation, sparse secret vector expansion, and LWE sample computation, all of which are performed dynamically. The sparse nature of the secret vector necessitates random access and coefficient shuffling, which are more computationally intensive than the linear traversal of dense vectors. Additionally, Merkle-LWE performs bit packing and hash-based commitment generation during encapsulation, adding further complexity. These operations, while efficient in terms of memory traffic and storage, demand more CPU cycles due to their iterative and non-linear nature.
Decapsulation follows a similar trend, with Merkle-LWE consuming 814,588 cycles versus 267,492 cycles for traditional LWE, a 204.5% increase and the largest relative overhead of the three phases. The overhead in this phase arises from sparse secret reconstruction, Merkle path verification, and hash-based inclusion checks. Unlike traditional schemes that perform direct decryption using stored keys and ciphertexts, Merkle-LWE reconstructs the secret from a seed and verifies correctness via cryptographic proofs. These operations, while lightweight in terms of memory footprint, require multiple hash evaluations and sparse polynomial manipulations, contributing to the elevated cycle count.
Overall, the figure illustrates that Merkle-LWE KEM incurs a 100.7% increase in total CPU cycles across all operations. This doubling of computational cost is a deliberate and measured trade-off for achieving a 99.3% reduction in storage usage and a 16.3% reduction in memory traffic, as shown in previous analyses. The scheme rebalances resource pressure from static memory to runtime computation, aligning with the constraints of embedded platforms where flash and RAM are scarce but CPU cycles are relatively abundant [
20,
21]. Importantly, all operations are implemented in constant time and avoid secret-dependent branching, preserving side-channel resistance despite the increased computational load [
28].
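The percentages quoted for Figure 20 can be recomputed directly from the reported cycle counts. The short Python sketch below does only that arithmetic; the dictionary layout and function name are illustrative choices, not part of the original implementation.

```python
# Cycle counts as reported for Figure 20: (Merkle-LWE, traditional LWE).
CYCLES = {
    "keygen": (2_906_456, 1_721_344),
    "encaps": (809_636, 268_516),
    "decaps": (814_588, 267_492),
}


def overhead_pct(merkle: int, baseline: int) -> float:
    """Relative increase of `merkle` over `baseline`, in percent."""
    return 100.0 * (merkle - baseline) / baseline


# Per-phase relative overhead.
per_phase = {phase: overhead_pct(m, b) for phase, (m, b) in CYCLES.items()}

# Aggregate overhead across the three phases.
total_merkle = sum(m for m, _ in CYCLES.values())
total_baseline = sum(b for _, b in CYCLES.values())
total_overhead = overhead_pct(total_merkle, total_baseline)

for phase, pct in per_phase.items():
    print(f"{phase}: +{pct:.1f}%")
print(f"total: {total_merkle:,} vs {total_baseline:,} cycles (+{total_overhead:.1f}%)")
```

Summing the phases gives roughly 4.5 million cycles for Merkle-LWE against roughly 2.3 million for the baseline, which is where the approximate doubling of total cost comes from.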
In conclusion, the figure validates that Merkle-LWE’s computational overhead is proportional, predictable, and phase-specific. While the scheme does not aim to outperform traditional LWE in raw speed, it achieves substantial gains in memory efficiency and structural compactness. This trade-off supports its deployment in embedded environments where storage constraints outweigh cycle budgets, reinforcing Merkle-LWE’s viability as a post-quantum solution for resource-limited systems.
Figure 21 provides a granular breakdown of CPU cycle consumption across six computational components for both Merkle-LWE KEM and a traditional LWE KEM baseline. The components analysed include PRNG operations, hash computations, matrix operations, LWE sample generation, Merkle tree operations, and bit-level manipulations. This decomposition offers insight into the sources of computational overhead introduced by Merkle-LWE’s memory-efficient architecture and highlights the structural trade-offs between storage minimization and runtime complexity.
The most significant contributor to Merkle-LWE’s total cycle count is matrix operations, which consume 3,145,728 cycles, exactly double the 1,572,864 cycles required by traditional LWE. This increase stems from the scheme’s decision to regenerate matrix rows on demand rather than store them explicitly [
29]. While this approach drastically reduces static storage, it necessitates repeated pseudorandom expansion and polynomial multiplication during both encapsulation and decapsulation. The cost is compounded by the use of sparse secrets, which require additional indexing and coefficient handling during multiplication, further inflating the cycle count [
57].
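The idea of regenerating matrix rows on demand instead of storing them can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme's implementation: SHAKE-128 from Python's standard library stands in for the ChaCha20 expansion described in the text, the values of `N` and `Q` are borrowed from Kyber-768, and the function names `expand_row` and `matvec_streaming` are hypothetical.

```python
import hashlib

N = 256   # coefficients per row (assumption: Kyber-768's n)
Q = 3329  # modulus (assumption: Kyber-768's q)


def expand_row(seed: bytes, row_index: int, n: int = N, q: int = Q) -> list[int]:
    """Deterministically derive one matrix row from a seed by rejection sampling.

    SHAKE-128 is a stand-in for the ChaCha20-based expansion in the text.
    """
    # Domain-separate rows by appending the row index to the seed.
    xof_input = seed + row_index.to_bytes(2, "little")
    # Request far more bytes than needed so rejection rarely exhausts them.
    stream = hashlib.shake_128(xof_input).digest(8 * n)
    coeffs = []
    for i in range(0, len(stream) - 1, 2):
        candidate = int.from_bytes(stream[i:i + 2], "little") & 0x0FFF  # keep 12 bits
        if candidate < q:  # reject values outside [0, q)
            coeffs.append(candidate)
            if len(coeffs) == n:
                return coeffs
    raise RuntimeError("XOF output exhausted; request a longer stream")


def matvec_streaming(seed: bytes, secret: list[int], num_rows: int, q: int = Q) -> list[int]:
    """Compute A*s mod q one row at a time, never holding the full matrix."""
    return [
        sum(a * s for a, s in zip(expand_row(seed, i), secret)) % q
        for i in range(num_rows)
    ]


demo_seed = bytes(32)
demo_secret = [1] + [0] * (N - 1)  # picks out the first coefficient of each row
demo_out = matvec_streaming(demo_seed, demo_secret, 3)
```

Only the 32-byte seed and one row at a time are held in memory, at the price of re-running the expansion for every row on every use. That mirrors the storage-for-cycles exchange the figure reports.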
PRNG operations also show a substantial increase, with Merkle-LWE consuming 812,160 cycles compared to 264,192 cycles for traditional LWE—a 207.4% overhead. This reflects the use of ChaCha20 for deterministic matrix and secret generation, replacing AES-CTR or similar block-based PRNGs. While ChaCha20 offers better constant-time behaviour and stream-oriented expansion, its iterative nature incurs higher computational cost [
49]. The PRNG is invoked multiple times across all phases, contributing significantly to the scheme’s overall runtime.
Hash computations are relatively comparable between the two schemes, with Merkle-LWE requiring 448,800 cycles and traditional LWE consuming 416,200 cycles—a modest 7.8% increase. This overhead is attributed to Merkle-LWE’s use of SHA3-256 and SHA3-512 for commitment generation and path verification. Although hash functions are computationally intensive, their impact is bounded and predictable, and their inclusion enhances the scheme’s security posture without introducing excessive cost [
31].
Two components are unique to Merkle-LWE: Merkle tree operations and bit-level manipulations, which consume 26,200 cycles and 93,680 cycles, respectively. Merkle operations involve node hashing and path traversal during encapsulation and decapsulation, while bit operations handle packing and unpacking of sparse secrets and compressed error vectors [
35]. These tasks, though absent in traditional LWE, are essential for achieving Merkle-LWE’s compact key representation and memory traffic reduction. Their combined cost remains under 3% of the total cycle budget, indicating that the overhead is well-contained.
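As a quick check on the component figures, the sketch below recomputes the relative overhead of the shared components and the share of the two Merkle-LWE-only components. It assumes the total cycle budget is the sum of the three phase counts from Figure 20 (4,530,680 cycles); the variable names are illustrative.

```python
# Component cycle counts reported for Figure 21.
MERKLE = {
    "matrix": 3_145_728,
    "prng": 812_160,
    "hash": 448_800,
    "merkle_tree": 26_200,
    "bit_ops": 93_680,
}
TRADITIONAL = {"matrix": 1_572_864, "prng": 264_192, "hash": 416_200}

# Assumption: total budget = keygen + encaps + decaps from Figure 20.
TOTAL_MERKLE_CYCLES = 2_906_456 + 809_636 + 814_588

# Relative overhead of the components both schemes share.
shared_overhead_pct = {
    name: 100.0 * (MERKLE[name] - base) / base for name, base in TRADITIONAL.items()
}

# Cost of the two components that exist only in Merkle-LWE.
unique_cycles = MERKLE["merkle_tree"] + MERKLE["bit_ops"]
unique_share_pct = 100.0 * unique_cycles / TOTAL_MERKLE_CYCLES

print(shared_overhead_pct)
print(f"Merkle-only components: {unique_cycles:,} cycles ({unique_share_pct:.2f}% of total)")
```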
In conclusion, the figure confirms that Merkle-LWE’s computational cost is concentrated in matrix and PRNG operations, both of which are directly tied to its memory-saving design. The scheme introduces new components for commitment and compression, but their impact is modest and structurally necessary. While Merkle-LWE does not aim to outperform traditional LWE in raw speed, its predictable and phase-specific overheads validate its suitability for embedded platforms where storage constraints outweigh cycle budgets. The component-level breakdown reinforces the scheme’s architectural coherence and supports its deployment in resource-constrained post-quantum environments.
7.8. Memory–Computation Trade-Off Analysis
Figure 22 visualizes the fundamental trade-off between runtime memory usage and computational cost across four post-quantum key encapsulation mechanisms: Merkle-LWE, Traditional LWE, Kyber, and NTRU. Each scheme is represented by three data points corresponding to key generation, encapsulation, and decapsulation, plotted against peak RAM usage (x-axis) and CPU cycle count (y-axis). Critically, all comparisons are conducted under identical NIST security levels and parameter sets: Kyber-768 and Merkle-LWE Level 3 both use $n = 256$, $q = 3329$, and module rank $k = 3$; NTRU comparisons use the standardized ntruhps2048677 parameter set [43]. This alignment ensures that observed differences reflect architectural choices, not parameter disparities. Merkle-LWE occupies the “Low Memory, High Computation” quadrant, achieving peak RAM usage of 0.8–1.8 KB, roughly four times lower than Kyber’s 3–8 KB, while maintaining comparable concrete security under the Module-LWE assumption. Because the parameters are aligned, the scheme’s memory gains cannot be attributed to weakened parameters.
Merkle-LWE’s data points consistently exhibit minimal RAM usage—ranging from 0.8 KB to 1.8 KB—while incurring elevated computational costs, with cycle counts reaching up to 4.5 million. This behaviour is intentional and structurally embedded: Merkle-LWE eliminates bulky public key and error vector storage by regenerating matrix rows and sparse secrets on demand using PRNG expansion and hash-based commitments [
29,
35]. These operations, while memory-efficient, require iterative computation and non-linear access patterns, resulting in increased CPU cycles. The scheme’s reliance on Merkle tree traversal and sparse indexing further compounds the computational load, especially during decapsulation, where correctness is verified through inclusion proofs rather than direct decryption.
In contrast, Traditional LWE demonstrates the inverse profile: high memory usage (up to 265 KB) paired with relatively low computational cost (approximately 2.3 million cycles). This is achieved by storing all cryptographic objects explicitly and performing operations on dense vectors and precomputed matrices. While this approach minimizes runtime computation and maximizes cache locality, it imposes a heavy burden on flash and RAM, rendering it unsuitable for embedded platforms with tight memory budgets [
16]. Kyber and NTRU, situated in the centre-left region of the plot, strike a compromise by employing moderately compressed representations and efficient arithmetic, achieving balanced performance across both axes. However, they do not match Merkle-LWE’s extreme memory savings, nor do they incur its computational overhead.
The annotated trade-off ratio of ~2.1× computation per memory unit quantifies the cost of Merkle-LWE’s memory efficiency. For every kilobyte of memory saved, the scheme incurs approximately twice the number of CPU cycles. This ratio is consistent across operations and reflects a predictable, phase-specific reallocation of resource pressure. Importantly, Merkle-LWE also achieves a 16.3% reduction in memory traffic, indicating that its elevated computation does not translate into excessive data movement. This is a critical advantage in embedded systems, where bandwidth constraints and energy efficiency are often more limiting than raw cycle budgets [
19,
20]. The figure thus validates Merkle-LWE’s suitability for flash-constrained environments, where memory is scarce but computation is relatively abundant and manageable.
In summary, the figure encapsulates the architectural ethos of Merkle-LWE: a deliberate and quantifiable trade-off between memory and computation. By shifting cryptographic state from static storage to dynamic generation, the scheme achieves unparalleled compactness at the cost of increased CPU cycles. This trade-off is structurally coherent, operationally bounded, and well-aligned with the constraints of embedded platforms. The scatter plot not only confirms Merkle-LWE’s design goals but also situates it within the broader landscape of PQC, offering a compelling alternative for resource-limited deployments.
Figure 23 provides a scheme-level comparison of average peak RAM usage and average CPU cycle counts across four post-quantum KEMs: Merkle-LWE, Traditional LWE, Kyber, and NTRU. Error bars indicate variability across the three cryptographic phases (key generation, encapsulation, and decapsulation), offering a holistic view of each scheme’s operational profile. The dashed trend line illustrates a general inverse relationship: schemes with higher RAM usage tend to require fewer CPU cycles, while those optimized for memory efficiency incur greater computational overhead. This visualization situates Merkle-LWE firmly in the “low memory, high computation” regime, contrasting sharply with Traditional LWE’s “high memory, low computation” profile, and highlighting Kyber and NTRU as balanced designs.
Merkle-LWE’s average RAM usage remains exceptionally low, clustered around 1–2 KB, while its average cycle count approaches 4.5 million. This reflects the scheme’s architectural decision to eliminate bulky storage of matrices and error vectors, instead regenerating them dynamically via PRNG expansion and sparse polynomial arithmetic. The error bars reveal moderate variability across operations: key generation dominates the absolute cycle count, while encapsulation and decapsulation show the largest relative overhead owing to Merkle path verification and sparse secret reconstruction (Figure 20). Despite this variability, the scheme’s memory footprint remains consistently compact, validating its suitability for flash-constrained embedded platforms [
15]. The computational overhead is thus not incidental but structurally embedded, representing the cost of achieving extreme memory efficiency.
Traditional LWE, by contrast, averages 265 KB of RAM usage with cycle counts around 2.3 million, reflecting its reliance on dense storage and sequential access patterns. Its error bars are narrow, indicating stable performance across operations, but the high memory footprint renders it impractical for constrained environments. Kyber and NTRU occupy the middle ground, with average RAM usage between 3–8 KB and cycle counts ranging from 0.3–1.2 million. Their error bars are relatively balanced, suggesting consistent efficiency across phases. These schemes exemplify a design philosophy that balances memory and computation, avoiding extremes in either dimension. However, they do not achieve Merkle-LWE’s dramatic memory savings, nor do they incur its computational penalties, situating them as pragmatic choices for general-purpose deployment [
5,
42].
In summary, the figure confirms the structural trade-off inherent in Merkle-LWE: a ~2.1× increase in computation per unit of memory saved, consistent with prior analyses. While this places the scheme at the computationally intensive end of the spectrum, its deterministic overheads and bounded variability ensure predictable performance. The scatter plot thus reinforces Merkle-LWE’s design rationale: by reallocating resource pressure from static storage to runtime computation, it achieves unparalleled compactness without exceeding embedded tolerances. This trade-off positions Merkle-LWE as a specialized solution for environments where memory scarcity is the dominant constraint, complementing balanced schemes like Kyber and NTRU in the broader post-quantum cryptographic landscape.
7.9. Protocol-Level Impact Analysis for IoT Handshakes
The total handshake time for a key exchange is dominated by the sum of encapsulation and decapsulation cycles, plus network round-trip time (RTT). Using the cycle counts from Figure 20 (Level 3: encapsulation = 809,636 cycles; decapsulation = 814,588 cycles) and a representative Cortex-M4 clock frequency of 168 MHz [23], the cryptographic computation time is:
$$t_{\text{crypto}} = \frac{(809{,}636 + 814{,}588)\ \text{cycles}}{168 \times 10^{6}\ \text{cycles/s}} \approx 9.7\ \text{ms}$$
Adding a conservative RTT estimate of 50 ms for low-power wireless links (e.g., IEEE 802.15.4) yields a total handshake duration of approximately 60 ms. This is comparable to Kyber-768 on the same platform (~55 ms, derived from pqm4 benchmarks [
23]), despite Merkle-LWE’s higher cycle count, because the absolute cycle difference (~0.5 M cycles) translates to only ~3 ms at 168 MHz. For battery-powered sensors that initiate handshakes infrequently (e.g., hourly or daily), this marginal increase is negligible relative to sleep/wake overheads and application logic.
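The latency estimate follows from dividing cycle counts by the clock frequency. The sketch below reproduces that arithmetic with the values stated in the text (168 MHz clock, 50 ms RTT); the helper name `cycles_to_ms` is an illustrative choice.

```python
CLOCK_HZ = 168_000_000    # representative Cortex-M4 clock from the text
ENCAPS_CYCLES = 809_636   # Figure 20, Level 3
DECAPS_CYCLES = 814_588   # Figure 20, Level 3
RTT_MS = 50.0             # RTT estimate for low-power wireless links


def cycles_to_ms(cycles: int, clock_hz: int = CLOCK_HZ) -> float:
    """Convert a cycle count to milliseconds at the given clock frequency."""
    return 1e3 * cycles / clock_hz


crypto_ms = cycles_to_ms(ENCAPS_CYCLES + DECAPS_CYCLES)
handshake_ms = crypto_ms + RTT_MS
gap_ms = cycles_to_ms(500_000)  # the ~0.5 M cycle difference quoted in the text

print(f"crypto: {crypto_ms:.2f} ms, handshake: {handshake_ms:.1f} ms, gap: {gap_ms:.2f} ms")
```

Because the RTT term is about five times larger than the computation term, differences of a few hundred thousand cycles shift the total by only a few milliseconds.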
The ciphertext size for Merkle-LWE Level 3 is 160 bytes (
Table 1), compared to 1088 bytes for Kyber-768 [
13]. Although Merkle-LWE’s ciphertext includes a 448-byte Merkle authentication path (
Figure 6), the total payload remains 85% smaller than Kyber’s. In a DTLS handshake, where the ClientKeyExchange message carries the ciphertext, this reduction directly lowers transmission time and energy. Using the energy model from
Section 7.8 (0.1 µJ per 64-byte memory access [
19,
20]), the bandwidth savings translate to ~1.4 µJ less energy per handshake for radio transmission—a meaningful gain for energy-constrained devices.
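The size reduction and the energy estimate can be reproduced from the quoted figures. The sketch below applies the text's 0.1 µJ per 64-byte access model to the difference in ciphertext size; treating the byte difference as a fractional number of 64-byte accesses is an assumption of this sketch.

```python
MERKLE_CT_BYTES = 160         # Merkle-LWE Level 3 ciphertext (Table 1)
KYBER_CT_BYTES = 1088         # Kyber-768 ciphertext
ENERGY_PER_ACCESS_UJ = 0.1    # energy model quoted in the text
ACCESS_BYTES = 64             # bytes per access in that model

saved_bytes = KYBER_CT_BYTES - MERKLE_CT_BYTES
reduction_pct = 100.0 * saved_bytes / KYBER_CT_BYTES
# Assumption: partial 64-byte blocks are charged proportionally.
energy_saved_uj = saved_bytes / ACCESS_BYTES * ENERGY_PER_ACCESS_UJ

print(f"{saved_bytes} bytes saved ({reduction_pct:.1f}%), about {energy_saved_uj:.2f} uJ")
```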
Peak RAM usage determines how many concurrent sessions a device can support.
Figure 11 shows Merkle-LWE’s peak RAM is 14.3 KB for key generation and 10.2 KB for decapsulation. On a Cortex-M4 with 32 KB of available RAM for cryptographic operations [
14,
16], this permits two concurrent handshakes (2 × 14.3 KB = 28.6 KB < 32 KB; a third would exceed the budget). Kyber-768, with peak RAM of ~8 KB [
23], permits ~4 concurrent sessions. While Merkle-LWE supports fewer concurrent sessions, most embedded IoT deployments are single-session or low-concurrency by design (e.g., sensor-to-gateway links), making this limitation acceptable for the target use case. For high-concurrency scenarios (e.g., IoT gateways), the scheme can be deployed selectively for memory-critical endpoints while using Kyber for aggregation points.
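The session counts follow from integer division of the RAM budget by the per-handshake peak. A minimal sketch, using the 32 KB budget and peak figures from the text (the function name is illustrative):

```python
import math


def max_sessions(budget_kb: float, peak_kb: float) -> int:
    """Number of handshakes whose peak RAM fits in the budget at the same time."""
    return math.floor(budget_kb / peak_kb)


RAM_BUDGET_KB = 32.0
merkle_sessions = max_sessions(RAM_BUDGET_KB, 14.3)  # worst-case phase: key generation
kyber_sessions = max_sessions(RAM_BUDGET_KB, 8.0)

print(merkle_sessions, kyber_sessions)
```

This sizes every session by its worst-case phase. A scheduler that never overlaps two key generations could fit more, but that would be a design assumption beyond what the text states.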
The trade-offs favour Merkle-LWE in three representative IoT contexts:
Infrequent, latency-tolerant handshakes: Environmental sensors that exchange keys hourly can absorb the ~3 ms computational overhead without impacting application responsiveness.
Bandwidth-constrained links: LoRaWAN or NB-IoT uplinks benefit from the 85% ciphertext size reduction, extending battery life and reducing airtime costs.
Flash-constrained firmware: Devices with <256 KB flash may struggle to fit Kyber’s 1.2 KB public key alongside application code when little flash remains; Merkle-LWE’s 96-byte public key fits comfortably.
Conversely, Merkle-LWE is less suitable for:
High-frequency, low-latency handshakes: Real-time control loops requiring sub-10 ms key exchange may not tolerate the ~10 ms cryptographic computation time.
High-concurrency gateways: Aggregation nodes handling dozens of simultaneous sessions benefit more from Kyber’s balanced profile.
In summary, the protocol-level analysis confirms that Merkle-LWE’s overheads are acceptable—and often advantageous—for the embedded IoT scenarios it targets. The scheme’s memory efficiency enables deployment on devices otherwise excluded from PQC adoption, while its computational cost remains bounded and predictable within latency budgets typical of low-power wireless protocols.
7.10. Energy Consumption
Figure 24 illustrates the relationship between energy consumption (measured in microjoules) and memory traffic (measured in kilobytes) across four post-quantum KEMs: Merkle-LWE, Traditional LWE, Kyber, and NTRU. Each scheme is represented by three operational phases (key generation, encapsulation, and decapsulation), highlighting the variability of energy demands across different workloads. The plotted trend line reveals a moderate positive correlation (correlation coefficient: 0.673) between memory traffic and energy consumption, consistent with memory access patterns being a major driver of energy usage in IoT devices [
19,
20]. This finding underscores the importance of minimizing memory traffic in resource-constrained environments, where energy efficiency is often more critical than raw computational throughput.
Merkle-LWE occupies a distinctive position in the scatter plot: despite incurring higher computational costs, its energy consumption remains comparatively efficient due to its low memory traffic profile. By replacing bulk sequential memory loads with PRNG expansion and sparse secret reconstruction, Merkle-LWE reduces the number of high-energy memory transactions. This design choice shifts the energy burden from memory access to computation, which is less costly in terms of energy per operation on embedded microcontrollers. For example, while Merkle-LWE’s decapsulation phase requires roughly 0.8 million CPU cycles, its energy footprint is moderated by the fact that these cycles involve lightweight arithmetic and hash computations rather than expensive cache misses or DRAM accesses. The annotation “Merkle-LWE: Energy Efficient Despite Computation” captures this structural trade-off, validating the scheme’s suitability for battery-powered IoT devices [
20].
In contrast, Traditional LWE demonstrates the opposite profile: high memory traffic directly translates into elevated energy consumption. Its reliance on dense matrix storage and sequential reads results in frequent large-scale memory transfers, which dominate the energy budget. The scatter plot highlights this with data points in the “High Memory Traffic, High Energy Consumption” region, confirming that storage-heavy designs are poorly aligned with the energy constraints of IoT platforms. Kyber and NTRU, situated in the “Low Memory Traffic, Low Energy Consumption” region, achieve balanced efficiency by combining compact representations with streamlined arithmetic. Their energy footprints are consistently lower than both Merkle-LWE and Traditional LWE, reflecting their design philosophy of balancing memory and computation rather than optimizing one dimension at the expense of the other.
In summary, the figure validates that energy consumption in IoT cryptographic workloads is primarily driven by memory traffic rather than raw computation. Merkle-LWE exemplifies a memory-first design that achieves energy efficiency by minimizing traffic, even at the cost of increased CPU cycles. Traditional LWE, by contrast, demonstrates the energy penalties of storage-heavy architectures, while Kyber and NTRU highlight the benefits of balanced approaches. The correlation between memory traffic and energy consumption confirms that optimizing access patterns is the most effective strategy for reducing energy costs in embedded cryptography. Merkle-LWE’s ability to achieve low traffic and bounded energy usage reinforces its viability for secure IoT deployments, where energy efficiency is paramount for long-term sustainability.
Figure 25 provides a component-level breakdown of energy consumption for Merkle-LWE KEM and Traditional LWE KEM, measured in microjoules (µJ). The analysis spans five categories—memory access, computation, hash operations, PRNG operations, and base power—offering a fine-grained view of how architectural choices translate into energy costs. This decomposition is particularly relevant for IoT and embedded platforms, where energy efficiency is often the decisive factor in cryptographic adoption [
14].
The most striking difference lies in hash operations, where Merkle-LWE consumes only 383 µJ, compared to Traditional LWE’s 4160 µJ. This reduction of roughly 91% reflects Merkle-LWE’s reliance on compact Merkle tree commitments rather than dense hash-based matrix verification. Traditional LWE requires repeated hashing of large vectors and matrices to ensure correctness, leading to substantial energy overhead. Merkle-LWE, by contrast, limits hashing to path verification and root commitment, which are structurally lightweight. This result underscores the efficiency of Merkle-LWE’s hybrid design: by shifting correctness checks into sparse and tree-based structures, it dramatically reduces the energy footprint of hashing while maintaining cryptographic integrity [
31,
35].
In terms of PRNG operations, both schemes exhibit comparable energy costs, with Merkle-LWE consuming 4352 µJ and Traditional LWE consuming 4112 µJ. The slight increase in Merkle-LWE reflects its heavier reliance on ChaCha20 for matrix row generation and sparse secret expansion [
49]. While ChaCha20 is computationally efficient and constant-time, its iterative expansion requires sustained energy input. Nevertheless, the difference remains modest, suggesting that PRNG overhead is not a dominant factor in the overall energy profile. This finding validates the choice of ChaCha20 as a secure and energy-tolerant generator for embedded cryptography.
The computation and memory access categories reveal complementary trade-offs. Merkle-LWE incurs higher computational energy (345 µJ vs. 237 µJ) due to its dynamic matrix regeneration and sparse polynomial arithmetic. However, it achieves a significant reduction in memory access energy (30 µJ vs. 79 µJ), reflecting its avoidance of bulk sequential loads. Traditional LWE’s dense storage model requires frequent large-scale memory transfers, which are disproportionately expensive in energy terms [
19]. Merkle-LWE’s design shifts this burden into computation, which is less energy-intensive per operation on microcontrollers. This redistribution aligns with the broader observation that minimizing memory traffic is the most effective strategy for reducing energy consumption in IoT devices.
Finally, base power consumption is slightly higher for Merkle-LWE (900 µJ vs. 750 µJ), reflecting longer active runtimes due to its increased computational load. However, this overhead is modest compared to the dramatic savings achieved in hash and memory access categories. The overall profile confirms that Merkle-LWE’s energy efficiency is structurally coherent: while computation and PRNG expansion raise baseline costs, reductions in memory traffic and hashing more than offset these increases. The scheme’s energy footprint remains bounded and predictable, supporting its deployment in battery-powered environments where long-term sustainability is critical.
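The claim that the savings outweigh the increases can be checked by summing the five categories reported for Figure 25. The totals computed below are derived here from those per-category values and are not figures reported in the text; the dictionary layout is illustrative.

```python
# Per-category energy in microjoules, as reported for Figure 25.
MERKLE_UJ = {"memory": 30, "computation": 345, "hash": 383, "prng": 4352, "base": 900}
TRADITIONAL_UJ = {"memory": 79, "computation": 237, "hash": 4160, "prng": 4112, "base": 750}

# Positive delta = Merkle-LWE spends more; negative = Merkle-LWE spends less.
delta_uj = {k: MERKLE_UJ[k] - TRADITIONAL_UJ[k] for k in MERKLE_UJ}
increases_uj = sum(d for d in delta_uj.values() if d > 0)
savings_uj = -sum(d for d in delta_uj.values() if d < 0)

merkle_total_uj = sum(MERKLE_UJ.values())
traditional_total_uj = sum(TRADITIONAL_UJ.values())
net_saving_pct = 100.0 * (traditional_total_uj - merkle_total_uj) / traditional_total_uj
hash_reduction_pct = 100.0 * (TRADITIONAL_UJ["hash"] - MERKLE_UJ["hash"]) / TRADITIONAL_UJ["hash"]

print(f"increases: {increases_uj} uJ, savings: {savings_uj} uJ")
print(f"totals: {merkle_total_uj} vs {traditional_total_uj} uJ ({net_saving_pct:.1f}% lower)")
```

Nearly all of the net saving comes from the hash category. The memory-access saving is small in absolute terms next to the PRNG and base-power terms.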
In conclusion, the figure validates that Merkle-LWE achieves energy efficiency by strategically reallocating resource pressure. Its design reduces the energy spent on hashing and memory access, while accepting modest increases in computation, PRNG, and base-power costs. This balance ensures that Merkle-LWE remains viable for IoT platforms, where energy constraints dominate system design. The breakdown confirms that Merkle-LWE’s memory-first philosophy not only minimizes storage but also delivers tangible energy benefits, reinforcing its suitability for secure and sustainable embedded cryptography.
7.11. Correctness and Reliability Validation
Figure 26 presents a logarithmic plot of failure probability against the number of trials, with a red dashed line denoting the acceptable failure threshold. The observed results, represented by vertical bars, consistently remain below this threshold across all tested ranges. Most notably, no decryption failures were recorded within the experimental dataset, as annotated by the figure. This outcome provides strong empirical evidence of the correctness of the Merkle-LWE KEM implementation, demonstrating that its design achieves reliable decryption under repeated stress testing. The absence of failures even at high trial counts indicates that the scheme’s probabilistic components, such as sparse error sampling and Merkle path verification, are structurally sound and do not introduce instability into the decryption process [
24].
The reliability observed can be attributed to several architectural features. First, the deterministic regeneration of matrix rows from seeds ensures that public key material is reproduced consistently, eliminating discrepancies that could otherwise lead to decryption mismatches [
29]. Second, the sparse secret representation, while introducing random access patterns, is carefully bounded by fixed sparsity levels and deterministic indexing, ensuring that reconstruction during decapsulation is exact [
57]. Third, Merkle tree commitments provide cryptographic guarantees of correctness: each path verification ensures that the reconstructed secret aligns with the committed root, preventing silent errors from propagating [
35]. Together, these mechanisms form a layered correctness model, where redundancy in verification compensates for potential weaknesses in any single component.
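The third mechanism, path verification against a committed root, can be illustrated with a generic binary Merkle tree. This is a textbook sketch, not the scheme's actual tree layout: it uses SHA3-256 from Python's standard library, assumes a power-of-two number of leaves, and all function names are hypothetical.

```python
import hashlib
import hmac


def h(data: bytes) -> bytes:
    """Node hash; SHA3-256 gives 32-byte digests."""
    return hashlib.sha3_256(data).digest()


def build_tree(leaves: list[bytes]) -> list[list[bytes]]:
    """Return every level of the tree, leaf hashes first (power-of-two leaf count)."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def auth_path(levels: list[list[bytes]], index: int) -> list[bytes]:
    """Collect the sibling hash at each level, from the leaf up to below the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])  # sibling differs in the lowest bit
        index >>= 1
    return path


def verify(leaf: bytes, index: int, path: list[bytes], root: bytes) -> bool:
    """Recompute the root from a leaf and its path, then compare in constant time."""
    node = h(leaf)
    for sibling in path:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index >>= 1
    return hmac.compare_digest(node, root)


demo_leaves = [bytes([i]) for i in range(8)]
demo_levels = build_tree(demo_leaves)
demo_root = demo_levels[-1][0]
demo_path = auth_path(demo_levels, 5)
demo_ok = verify(demo_leaves[5], 5, demo_path, demo_root)
demo_tampered = verify(b"tampered", 5, demo_path, demo_root)
```

The verifier needs only the root and one sibling hash per level, so its working memory grows with the tree depth rather than with the number of committed values.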
From a reliability perspective, the absence of failures across the tested range confirms that Merkle-LWE achieves robustness comparable to, or exceeding, traditional LWE-based schemes. The logarithmic scaling of the plot emphasizes that even as the number of trials increases, the observed failure probability remains effectively zero. This suggests that the scheme’s correctness is not merely a statistical artifact of limited testing but a structural property of its design. The validation is particularly significant for embedded and IoT deployments, where reliability is paramount: decryption failures in such contexts could lead to session drops, authentication errors, or denial of service. By demonstrating correctness under extensive testing, Merkle-LWE establishes itself as a dependable candidate for PQC in constrained environments, balancing efficiency with uncompromised reliability.
Figure 27 compares the observed decryption failure rates of Merkle-LWE KEM, Traditional LWE KEM, and representative NIST PQC schemes. The y-axis is plotted on a logarithmic scale, with a red dashed line marking the acceptable failure threshold. The results reveal a clear distinction between the schemes: Traditional LWE exhibits a failure rate of 0.0001, which exceeds the acceptable threshold, while both Merkle-LWE and NIST PQC schemes report failure rates below 0.0003, remaining within acceptable bounds. This comparative analysis underscores the reliability advantages of Merkle-LWE and modern PQC candidates over traditional lattice-based constructions [
24].
The elevated failure rate of Traditional LWE can be attributed to its reliance on dense error vectors and direct decryption without auxiliary correctness checks. In practice, small deviations in error distribution or rounding during polynomial arithmetic can accumulate, leading to decryption mismatches. Without structural redundancy or verification mechanisms, these errors manifest as observable failures. Merkle-LWE avoids this pitfall by embedding correctness guarantees into its architecture: sparse error vectors are deterministically generated, and Merkle tree commitments enforce consistency between transmitted ciphertexts and reconstructed secrets [
35]. Similarly, NIST PQC schemes such as Kyber and NTRU employ carefully tuned noise distributions and parameter sets, ensuring that decryption remains robust even under adversarial or noisy conditions [
5,
42].
The comparative results highlight that Merkle-LWE achieves correctness and reliability on par with NIST PQC candidates, despite its unconventional memory-first design. By shifting verification into cryptographic commitments and sparse reconstructions, the scheme ensures that decryption failures are structurally suppressed. The reported failure rates, which remain below the acceptable threshold, confirm that Merkle-LWE’s correctness is not compromised by its efficiency-oriented trade-offs. This validation is particularly significant for embedded and IoT deployments, where reliability is paramount: even rare decryption failures can disrupt communication protocols, authentication flows, or secure key exchanges. The figure thus demonstrates that Merkle-LWE balances efficiency with robustness, offering a dependable alternative to both traditional and standardized PQC schemes.
In conclusion, the comparative correctness analysis confirms that Merkle-LWE achieves reliability within acceptable thresholds, outperforming Traditional LWE and aligning with the robustness of NIST PQC candidates. The observed results validate the scheme’s architectural choices—sparse error representation, deterministic regeneration, and Merkle-based verification—as effective mechanisms for suppressing decryption failures. This reliability, combined with its memory efficiency, reinforces Merkle-LWE’s suitability for constrained environments, ensuring secure and error-free operation in real-world post-quantum deployments.
Figure 28 presents a statistical significance analysis of decryption correctness, plotting the upper bound on the failure rate at 95% confidence against the number of trials. Both axes are logarithmic, which makes the shrinking upper bound visible as trial counts increase and illustrates the statistical principle that larger sample sizes yield stronger confidence in low observed failure rates. The red dashed line marks the acceptable failure threshold, while the orange vertical line indicates that approximately 3,000,000 trials are required to statistically confirm correctness at this confidence level. The figure thus provides a framework for validating reliability beyond empirical observation, so that correctness claims are supported by statistical guarantees [61].
The trend line shows that with modest trial counts, the confidence interval upper bound remains relatively high, reflecting the uncertainty inherent in small sample sizes. As the number of trials increases, the upper bound decreases sharply, converging toward the acceptable threshold. This behaviour confirms that the absence of observed failures in limited testing cannot alone establish reliability; statistical validation requires enough trials to bring the upper bound below the threshold. For Merkle-LWE, the figure demonstrates that while no failures were observed empirically, formal validation of correctness at the acceptable threshold necessitates millions of trials. This requirement is consistent with cryptographic standards, which demand rigorous statistical assurance to account for rare but potentially catastrophic errors [24].
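The relationship between trial count and upper bound can be sketched numerically. The snippet below is a minimal illustration, assuming the plotted bound is the exact one-sided binomial bound for zero observed failures (often approximated by the "rule of three", 3/n at 95% confidence); the function names are hypothetical and the construction of the bound is not confirmed by this section.

```python
import math


def zero_failure_upper_bound(trials: int, confidence: float = 0.95) -> float:
    """Upper bound on the failure probability after `trials` independent
    decryptions with zero observed failures.

    Assumption: failures are i.i.d. Bernoulli events, so the bound p solves
    (1 - p) ** trials = 1 - confidence.
    """
    alpha = 1.0 - confidence
    # -expm1(x) equals 1 - exp(x) and stays accurate for very small bounds.
    return -math.expm1(math.log(alpha) / trials)


def trials_needed(threshold: float, confidence: float = 0.95) -> int:
    """Smallest failure-free trial count whose upper bound is <= threshold."""
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log1p(-threshold))


if __name__ == "__main__":
    for n in (1_000, 100_000, 3_000_000):
        print(f"{n:>9} trials -> upper bound {zero_failure_upper_bound(n):.3e}")
```

Under this model, 3,000,000 failure-free trials give an upper bound just below one in a million, which matches the order of magnitude of the trial count quoted for the figure.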
The analysis also underscores the robustness of Merkle-LWE’s architectural design. Sparse secret reconstruction, deterministic matrix generation, and Merkle path verification collectively ensure that decryption failures are structurally suppressed. The statistical framework confirms that these mechanisms not only prevent failures in practice but also withstand scrutiny under confidence-based validation. By quantifying the number of trials required for assurance, the figure bridges empirical testing with formal reliability guarantees, providing a roadmap for future large-scale validation. This is particularly relevant for embedded and IoT deployments, where correctness must be guaranteed under continuous operation and adversarial conditions. The ability to demonstrate reliability both empirically and statistically reinforces Merkle-LWE’s suitability as a dependable post-quantum cryptographic scheme.
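To make the role of Merkle path verification concrete, the following is a generic sketch of how an authentication path is checked against a committed root. It assumes SHA-256, domain-separation prefixes for leaves and internal nodes, and duplication of the last node on odd levels; all of these choices and all function names are illustrative and are not taken from the Merkle-LWE specification.

```python
import hashlib
import hmac


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return every level of the tree, from hashed leaves up to the root."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            # Illustrative padding rule: duplicate the last node.
            level = level + [level[-1]]
            levels[-1] = level
        level = [_h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def auth_path(levels, index):
    """Collect the sibling hash at each level for the leaf at `index`."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])
        index >>= 1
    return path


def verify_path(leaf: bytes, index: int, path, root: bytes) -> bool:
    """Recompute the root from a leaf and its siblings, then compare."""
    node = _h(b"\x00" + leaf)
    for sibling in path:
        if index & 1:
            node = _h(b"\x01" + sibling + node)
        else:
            node = _h(b"\x01" + node + sibling)
        index >>= 1
    # Constant-time comparison of the recomputed and committed roots.
    return hmac.compare_digest(node, root)
```

A verifier that holds only the root can therefore detect any inconsistency in a reconstructed leaf using a number of hashes logarithmic in the leaf count, which is the property the text relies on.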
In conclusion, the experimental evaluation confirms that Merkle-LWE achieves unprecedented memory efficiency without compromising concrete security. The “traditional LWE” baseline serves an analytical purpose, namely isolating the impact of individual architectural choices, while the primary comparisons against Kyber and NTRU, conducted under identical security levels and parameter sets drawn from standardized specifications [5, 13, 41, 42, 43], demonstrate that Merkle-LWE’s gains are genuine and not artifacts of weakened parameters. Parameter alignment across lattice dimension, modulus, and module rank validates that the scheme’s compactness arises from structural representation rather than security margin erosion. These results position Merkle-LWE as a viable alternative for deeply constrained embedded platforms where static storage is the dominant bottleneck, complementing rather than replacing balanced schemes like Kyber in the broader post-quantum ecosystem.
Correctness evaluation indicates that reliable operation is not solely empirical but supported by statistical reasoning. While large-scale testing is required to formally validate extremely low failure probabilities, the absence of observed failures across extensive trials, together with structural safeguards embedded in the design, strongly supports the robustness of the construction. This dual perspective—empirical validation and statistical assurance—reinforces confidence in the correctness of Merkle-LWE in practical deployment scenarios, particularly in environments where reliability and deterministic behavior are critical. Taken together, the experimental and analytical results confirm that Merkle-LWE achieves its design goal: enabling quantum-resistant key exchange on deeply memory-constrained embedded platforms, with protocol-level overheads that are acceptable for the latency and concurrency profiles typical of low-power IoT deployments.
8. Concluding Remarks and Future Work
The comprehensive evaluation of Merkle-LWE KEM presented throughout this study highlights both its distinctive architectural philosophy and its practical implications for post-quantum cryptography in constrained environments. At its core, Merkle-LWE represents a deliberate reallocation of resource pressure: it sacrifices computational simplicity in order to achieve dramatic reductions in storage and memory traffic. This design choice is not incidental but structurally embedded, reflecting a memory-first approach that directly addresses the limitations of embedded and IoT platforms, where flash and RAM are scarce resources but computational cycles are comparatively abundant. The results consistently validate this philosophy. Storage usage is reduced by more than ninety-nine percent relative to traditional LWE schemes, memory traffic is lowered by over sixteen percent, and correctness is preserved without observable decryption failures. These gains, however, are accompanied by a doubling of CPU cycle counts, elevated L1 cache miss rates in certain operations, and modest increases in base power consumption. Yet the trade-offs are predictable, bounded, and phase-specific, ensuring that the scheme remains viable for real-world deployment. Crucially, the protocol-level analysis in Section 7.9 demonstrates that these overheads translate to only marginal increases in handshake duration (approximately 3 ms) and substantial bandwidth savings (approximately 85% smaller ciphertexts), aligning with the operational profiles of low-power IoT protocols such as DTLS 1.3. For infrequent, latency-tolerant handshakes on flash-constrained devices, Merkle-LWE offers a compelling alternative to balanced schemes like Kyber, enabling post-quantum security on platforms previously excluded from adoption.
The analyses of computational cost reveal that Merkle-LWE’s overhead is concentrated in matrix operations and PRNG expansion, both of which are directly tied to its memory-saving design. By regenerating matrix rows dynamically from seeds and expanding sparse secrets on demand, the scheme eliminates bulky storage but incurs additional cycles. Component-level breakdowns confirm that while Merkle-LWE introduces new operations such as Merkle tree traversal and bit packing, their energy and cycle costs remain modest compared to the dominant matrix and PRNG workloads. Cache behaviour studies further demonstrate that sequential components such as hash computations and bit packing maintain high locality and low miss rates, while sparse secret access introduces inefficiencies due to random indexing. Importantly, these inefficiencies remain localized to specific phases, with no L2 cache misses observed, confirming that the working sets fit comfortably within embedded cache hierarchies. The overall cache hit rates in key generation and encapsulation remain high, ensuring that the majority of runtime execution benefits from cache acceleration.
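The storage-for-computation trade described above can be illustrated with a short sketch of seed-based row regeneration combined with a sparse secret. It assumes SHAKE-128 as the expansion function, an illustrative modulus and row length, and a secret stored as a map from index to coefficient; none of these names or parameters come from the Merkle-LWE specification.

```python
import hashlib

Q = 3329  # illustrative modulus (assumption, not the paper's parameter)
N = 256   # illustrative row length (assumption)


def regenerate_row(seed: bytes, row: int, n: int = N, q: int = Q) -> list[int]:
    """Rebuild one matrix row from a short seed instead of storing it."""
    xof_input = seed + row.to_bytes(4, "little")
    limit = (65536 // q) * q  # rejection bound that avoids modulo bias
    coeffs: list[int] = []
    length = 4 * n
    while len(coeffs) < n:
        # SHAKE output is prefix-consistent, so retrying with a longer
        # digest yields the same coefficients plus additional ones.
        stream = hashlib.shake_128(xof_input).digest(length)
        coeffs = []
        for i in range(0, len(stream) - 1, 2):
            v = int.from_bytes(stream[i:i + 2], "little")
            if v < limit:
                coeffs.append(v % q)
                if len(coeffs) == n:
                    break
        length *= 2
    return coeffs


def sparse_dot(row_coeffs: list[int], sparse_secret: dict[int, int], q: int = Q) -> int:
    """Inner product of a dense row with a secret stored as {index: value}."""
    return sum(row_coeffs[i] * v for i, v in sparse_secret.items()) % q
```

Only the seed and the nonzero secret positions need to be kept in memory; each row costs one XOF call to rebuild, which is where the additional cycles discussed above would come from in a design of this kind. The index-driven lookups in `sparse_dot` are also the sort of random access pattern that the cache analysis associates with higher miss rates.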
Energy consumption analyses reinforce the centrality of memory traffic as the dominant factor in embedded cryptographic workloads. The scatter plots and component breakdowns consistently show that schemes with high memory traffic incur elevated energy costs, while those that minimize memory access achieve efficiency even in the presence of increased computation. Merkle-LWE exemplifies this principle: by reducing memory traffic through PRNG expansion and sparse representation, it achieves energy efficiency despite its higher cycle counts. The component-level breakdown reveals dramatic reductions in the energy footprint of hashing and memory access, offsetting modest increases in computation and PRNG operations. This redistribution of energy costs aligns with the broader observation that computation is less energy-intensive per operation than memory transactions on microcontrollers. Consequently, Merkle-LWE achieves a balanced energy profile that supports sustainable deployment in battery-powered IoT devices, where energy efficiency is paramount for long-term operation.
Correctness and reliability validation further strengthen the case for Merkle-LWE. Empirical testing revealed no decryption failures within the tested range, and comparative analyses confirmed that Merkle-LWE achieves reliability on par with NIST PQC candidates while outperforming traditional LWE, which exhibited failure rates above acceptable thresholds. Statistical significance analysis demonstrated that millions of trials are required to formally validate correctness at the confidence level, but the absence of observed failures and the structural safeguards embedded in the design strongly indicate robustness. Sparse secret reconstruction, deterministic matrix generation, and Merkle path verification collectively ensure that decryption failures are structurally suppressed, providing both empirical and statistical assurance of reliability. This dual validation—empirical and statistical—ensures that Merkle-LWE can be confidently deployed in environments where correctness and reliability are paramount, bridging the gap between experimental assurance and formal cryptographic standards.
Taken together, these findings position Merkle-LWE as a compelling candidate for PQC in constrained environments. Its memory-first design philosophy directly addresses the limitations of embedded platforms, where storage and bandwidth are scarce resources. While the scheme does not aim to outperform traditional or standardized PQC candidates in raw speed, it offers a unique balance of compactness, predictability, and reliability. This balance is particularly valuable in IoT deployments, where secure communication must coexist with strict energy budgets and limited hardware capabilities. The trade-off ratio of approximately two times computation per unit of memory saved is consistent across operations and reflects a predictable, phase-specific overhead. Importantly, the scheme’s computational costs are bounded and constant-time, which mitigates timing-based side-channel attacks.
Looking forward, several avenues for future research remain open. Optimization of computational overhead is a critical direction. While Merkle-LWE’s cycle counts are predictable and bounded, further work is needed to reduce the cost of matrix regeneration and sparse secret handling. Techniques such as layout-aware scheduling, cache-conscious indexing, and hardware acceleration for PRNG expansion could mitigate the observed overheads without compromising memory efficiency. Exploring lightweight hash functions or hybrid verification strategies may also reduce the energy footprint of Merkle path traversal. Broader benchmarking across diverse hardware platforms is essential. The current analyses focus on embedded microcontrollers, but future work should extend to heterogeneous environments, including FPGAs, GPUs, and specialized cryptographic accelerators. Such studies would clarify the scalability of Merkle-LWE’s design and identify platform-specific optimizations. In particular, GPU-based parallelization of matrix operations and Merkle tree traversal could offset computational costs, while FPGA implementations may enable hardware-level compression of sparse secrets.
Integration with standardized PQC frameworks warrants exploration. While Merkle-LWE demonstrates correctness and reliability comparable to NIST PQC candidates, its unconventional design raises questions about interoperability and standardization. Future work should investigate hybrid schemes that combine Merkle-LWE’s memory efficiency with the balanced performance of Kyber or NTRU, potentially yielding designs that optimize across multiple resource dimensions. Comparative studies of protocol-level integration, including TLS and VPN frameworks, would further validate Merkle-LWE’s applicability in real-world deployments. Side-channel and fault tolerance analyses remain critical. While Merkle-LWE’s constant-time execution mitigates timing attacks, its sparse and tree-like access patterns may introduce new side-channel vectors, particularly in cache-based adversarial models. Future research should rigorously evaluate these risks and develop countermeasures, such as randomized access scheduling or hardware-assisted masking. Similarly, fault injection resilience must be tested, ensuring that Merkle path verification and sparse secret reconstruction remain robust under adversarial conditions.
Finally, long-term empirical validation is necessary to confirm statistical reliability. While millions of trials are required to formally validate correctness at the threshold, sustained testing across diverse workloads and adversarial scenarios will provide stronger assurance. Establishing open benchmarking frameworks and reproducible datasets would enable the broader research community to validate and refine Merkle-LWE’s reliability claims, fostering transparency and collaboration in PQC research. In conclusion, Merkle-LWE KEM represents a bold reimagining of lattice-based cryptography, one that prioritizes memory efficiency without compromising correctness or reliability. Its design philosophy—trading computation for storage—aligns with the realities of embedded and IoT platforms, where memory scarcity is the dominant constraint. While challenges remain in optimizing computational overhead and ensuring side-channel resilience, the scheme’s empirical and statistical validation confirms its viability as a secure and efficient post-quantum solution. Future work will refine, extend, and integrate Merkle-LWE into broader cryptographic ecosystems, ensuring that it contributes meaningfully to the ongoing evolution of secure communication in the quantum era.