Article

CALM: Continual Associative Learning Model via Sparse Distributed Memory

The Artificial Intelligence Research Center of Novosibirsk State University, 630090 Novosibirsk, Russia
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Technologies 2025, 13(12), 587; https://doi.org/10.3390/technologies13120587
Submission received: 5 November 2025 / Revised: 10 December 2025 / Accepted: 10 December 2025 / Published: 13 December 2025
(This article belongs to the Special Issue Collaborative Robotics and Human-AI Interactions)

Abstract

Sparse Distributed Memory (SDM) provides a biologically inspired mechanism for associative and online learning. Transformer architectures, despite exceptional inference performance, remain static and vulnerable to catastrophic forgetting. This work introduces the Continual Associative Learning Model (CALM), a conceptual framework that defines the theoretical basis and integration logic for a cognitive model seeking to establish continual, lifelong adaptation without retraining by combining an SDM system with lightweight dual-transformer modules. The architecture proposes an always-online associative memory for episodic storage (System 1) together with a pair of asynchronous transformers that consolidate experience in the background, enabling uninterrupted reasoning and gradual model evolution (System 2). The framework remains compatible with standard transformer benchmarks, establishing a shared evaluation basis for both reasoning accuracy and continual learning stability. Preliminary experiments using the SDMPreMark benchmark evaluate algorithmic behavior across multiple synthetic sets, confirming a critical radius-threshold phenomenon in SDM recall. These results represent a deterministic characterization of SDM dynamics at the component level, preceding integration at the model level with transformer-based semantic tasks. The CALM framework provides a reproducible foundation for studying continual memory and associative learning in hybrid transformer architectures, although future work should involve experiments with non-synthetic, high-load data to confirm scalable behavior under high interference.

1. Introduction

Modern large language models (LLMs) built on the transformer architecture have revolutionized natural language understanding and generation. However, transformers lack persistent memory and continual learning capabilities and are vulnerable to catastrophic forgetting. The reason for this is the encoding of information through dense vector spaces and positional embeddings. While powerful for static inference, they lack inherent memory and must be retrained or fine-tuned to learn new data. This situation has led to the exploration of additional methods to integrate online learning capabilities and dynamic memory processing into transformers. Since generalization and reasoning are possible only via stored associations, memory is interpretable as a component of intelligence.
Although Retrieval-Augmented Generation (RAG) has been used as a complementary retriever, it does not internalize retrieved content and therefore does not satisfy declarative representation, which involves acquisition, consolidation (including online consolidation), and long-term retrieval. Episodic memory as a sub-component follows a similar workflow of encoding (acquisition), consolidation (storage), and recall (retrieval).
Explorations of declarative augmentation via episodic memory incorporated the study of Kanerva’s Sparse Distributed Memory (SDM) framework, based on high-dimensional, associative, and distributed memory principles. Kanerva’s framework in 1988 [1] established SDM as both a model of human long-term memory and a practical associative memory system. SDM represents data in high-dimensional binary spaces where memory is spread across many storage locations rather than being localized. Addresses are compared via Hamming distance, and memory retrieval uses weighted averaging from nearby locations. This approach yields robustness to noise and partial input [1,2].
These principles directly motivate SDM’s sparse binary representation and distributed storage mechanisms. The Continual Associative Learning Model (CALM) framework is introduced to fill the gap in the integration of lifelong learning into AI applications. Integrating SDM with a transformer was seen as an opportunity to explore online learning with associative memory in various application spaces. This hybrid network merges an SDM module (S1) that is always online with a pair of asynchronous lightweight transformer modules for serving and training (S2). The serving transformer is active (wake) and is periodically replaced by a forked shadow transformer that performs training and consolidation of information.
This study is theoretical in nature and presents a conceptual modeling framework rather than a fully implemented system. CALM is presented as a theoretical architecture that integrates SDM with dual transformer modules to explore the feasibility of continual associative learning. Accordingly, the contributions of this manuscript lie in the formulation of the conceptual model, theoretical reasoning, and synthetic validation experiments designed to characterize the behavior of the proposed memory system.
The research goals were set as follows:
1.
Establish an evaluation system to configure optimal parameters for SDM devices.
2.
Validate the characteristic behavior of SDM (S1) by synthetic tests from the benchmark.
3.
Define and validate the preliminary architecture containing the SDM module (S1) and the dual-transformer module (S2).

2. Related Work

2.1. SDM Foundations and Neuroscientific Principles

Subsequent research has developed sophisticated algorithms for high-dimensional pattern storage and retrieval [3,4], and early work in 1989 established practical parameters for memory operations [5]. The study of the foundations of SDM reveals how biological neural circuits inspired novel computational approaches, enabling modern applications such as continual learning, neuromorphic computing, and artificial intelligence (AI) systems. The biological inspiration came from the cerebellar cortex models of Marr and Albus, in which the cerebellum was proposed as an associative learning device operating through sparsely distributed neural computations [6,7].
The fundamental difference between transformer and SDM representations lies in their vector encoding. Transformers typically operate on dense floating-point vectors (e.g., 768 or 1024 dimensions in BERT/GPT), requiring substantial compute for each token [8]. A dense floating-point vector may look like the following:
[0.12, -0.89, 1.4, 0.03, -0.55, 0.6, 0.0, 1.2]
In contrast, SDM leverages sparse binary vectors, often of size n = 1000 or more, where each bit is either 0 or 1. These high-dimensional binary vectors facilitate associative recall by proximity (Hamming distance) without the need for exact matches or matrix multiplications. Interference is handled by multiple patterns stored at overlapping locations that form associative links. This characteristic allows memory access in O ( 1 ) time per hard location, significantly reducing the computational load for memory access while providing theoretical capacity that can scale exponentially with dimension for sparse patterns. A sparse binary vector may look like the following:
[0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...] (3/512 bits active)
The sparsity enables efficient bitwise operations and dramatically reduces memory requirements for embedded systems. Episodic memory thus becomes achievable as an inherent feature of observer objects such as sensors and IoT devices. These characteristics of SDM highlight the efficiency required from S1, which runs the initialization of explicit memory. The correspondence to the cerebellar cortex provides the primary neural inspiration, with anatomical mappings from mossy fibers to address inputs, granule cells to hard locations, parallel fibers to distributed storage connections, and Purkinje cells to output neurons [6,9]. This biological architecture exhibits sparse activation patterns (only a small percentage of granule cells active simultaneously) and massive parallelism enabling fast computation, as well as distributed storage with graceful degradation [10,11].
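To make the efficiency claim concrete, the short Python sketch below computes a Hamming distance with a single XOR and popcount on bit-packed vectors; the 512-bit width and the ten flipped bits are illustrative values, not parameters taken from this work.

```python
import random

def hamming_distance(x: int, y: int) -> int:
    """Hamming distance via one XOR and a popcount on bit-packed vectors (Python 3.10+)."""
    return (x ^ y).bit_count()

# Illustrative 512-bit binary vectors packed into Python integers.
random.seed(0)
n_bits = 512
a = random.getrandbits(n_bits)
b = a ^ sum(1 << i for i in random.sample(range(n_bits), 10))  # flip 10 bits of a

print(hamming_distance(a, b))   # 10: only the flipped bits differ
```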

2.2. Neuromorphic Implementations

The sparse coding principles of widespread neuroscience research have influenced SDM development. SDM has been extended through cortical coding models mapping mini- and macro-columnar structures [12] as well as recent continual-learning implementations [11]. Human hippocampus studies by Rutishauser et al., 2006 [13] found direct evidence of sparse distributed coding in episodic memory, with only a small percentage of neurons in the hippocampus responding to any given memory target. Each neuron contributes to representations of only a few memories, providing “the most efficient way for neurons in hippocampus to encode episodic memories rapidly” [14]. Cortical column theories have demonstrated connections between SDM principles and cortical organization through research by Rinkus (2010) [12], showing macrocolumns functioning as sparse distributed coding fields. Minicolumns (20 L2/3 pyramidal cells) act as winner-take-all competitive modules enforcing sparseness, with sparse distributed codes consisting of 70 active L2/3 cells from much larger populations.
Neuromorphic hardware implementations have achieved remarkable success on the SpiNNaker platform, with Steve Furber’s lab incorporating N-of-M rank codes into SDM variants for biologically plausible implementation [15,16,17]. The SpiNNaker2 chip (2024) supports 153 ARM cores with 19 MB of on-chip SRAM, dedicated ML and neuromorphic accelerators, and a 10× increase in neural simulation capacity per watt [18,19]. Recent work demonstrates the first neuromorphic language model using Event-based Gated Recurrent Units (EGRUs), achieving LSTM-equivalent performance with significant energy advantages [20]. Dendrocentric learning has also been proposed, based on shifting from synaptic weights to dendritic ordering for radically lower-power learning on neuromorphic substrates [21]. In another study, XNOR-Net [22] demonstrated the practical advantages of binary representations, achieving 32× memory savings through binary weight approximations and a 58× speedup in convolutional operations by using primarily bitwise operations rather than high-precision floating-point arithmetic. These binary approximations enable state-of-the-art deep networks to run on CPUs in real time rather than requiring GPU acceleration, making them suitable for portable devices with limited computational resources. The success of XNOR-Net validates the fundamental principle underlying SDM that sparse binary representations can maintain accuracy while providing substantial computational and memory advantages over dense floating-point approaches. The achieved efficiency not only enables SDM to complement transformer models with declarative memory, but also provides a specialized method to deploy lightweight edge AI with episodic memory capabilities.
Continual learning breakthroughs emerged when Bricken et al. (2023) [11] demonstrated that “Sparse Distributed Memory is a Continual Learner.” Their modified Multi-Layered Perceptron exhibits strong continual learning capabilities without catastrophic forgetting, translating biological neural circuit principles into artificial networks [11]. The approach achieves continual learning performance comparable to biological systems through sparse activation patterns and biologically inspired connectivity. In another study, Vdovychenko and Tulchinsky [23] integrated compressive sensing theory with SDM (CS-SDM) to address classical limitations. This GPU-based implementation uses compressive sensing principles to achieve significantly better capacity and denoising capabilities compared to classical designs, particularly for semantic storage with enhanced noise tolerance [23,24]. In-memory computing implementations reported by Karunaratne et al. (2020) [3] achieved breakthrough performance using 760,000 phase-change memory devices that perform analog in-memory computing. These systems demonstrate software-equivalent accuracy for language classification, news classification, and hand gesture recognition while providing massive parallel processing with a significant reduction in data movement and energy consumption [3].
Recent work in continual learning has explored Elastic Weight Consolidation (EWC), replay-based methods, synaptic intelligence (SI), and gradient episodic memory (GEM) [25]. EWC mitigates catastrophic forgetting by selectively reducing plasticity on parameters deemed important for previously learned tasks, preserving prior knowledge through Fisher-information-weighted regularization [26].
Replay-based methods address forgetting by storing past experiences in episodic buffers and interleaving them with new data, stabilizing learning even under non-stationary task sequences [27]. SI introduces intelligent synapses that accumulate task-relevant importance measures during learning, enabling the network to rapidly incorporate new information while preserving previously acquired knowledge by penalizing disruptive parameter changes [28]. GEM approaches the problem from a constrained optimization perspective: each new gradient update is projected to avoid increasing the loss on a stored episodic memory, thereby preventing catastrophic forgetting while enabling positive backward transfer [25]. Although effective, these approaches rely on parameter-level regularization or explicit storage and replay of past data. In contrast, SDM enables continual learning without parameter updates by relying on high-dimensional associative storage, which provides modularity, fault tolerance, and stable memory access independent of the task sequence, and can be combined with different binarization methods.
The biological foundations established several key principles: metabolic efficiency through sparse coding, fault tolerance via the distributed storage, scalability through high-dimensional spaces allowing virtually unlimited memory capacity, computational speed through parallel processing, and generalization through similar representations for similar inputs supporting learning transfer.
Furthermore, sparse coding in neuroscience reflects the observation that biological neural networks exhibit sparse activation patterns: only a small percentage of neurons are simultaneously active in cortical areas [29,30,31]. Sparse codes maximize information capacity and energy efficiency [32,33], lateral inhibition mechanisms maintain sparsity constraints [32,34], and population vector decoding enables robust pattern completion [35,36].
It should be noted that choosing suitable encoding methods for the SDM offers a myriad of options and combinations for different behavior in memory processing. For example, locality-sensitive hashing (LSH) is applicable for fast approximate retrieval [37], while iterative shrinkage–thresholding (ISTA) is suitable for bio-aligned sparsity [38]. Further, semantic hashing can be applied if content-based addressing is preferred [39], while learned quantization is more focused on memory efficiency [40]. Section 3.2 discusses these options in more detail.

2.3. Multimodal and Memory-Augmented Learning

Recent progress in multimodal representation learning has been driven by large-scale vision–language pre-training, where models learn joint embeddings from billions of image–text pairs collected from the web. CLIP [41] demonstrated that contrastive pre-training on 400M image caption pairs yields highly transferable visual representations usable for zero-shot recognition across a wide range of tasks. Subsequent works have scaled both data and architectures: ALIGN [42] showed that even noisy, weakly filtered alt-text corpora can support state-of-the-art multimodal alignment, and Flamingo [43] introduced a cross-attentional architecture capable of few-shot reasoning over image sequences by interfacing a visual encoder with a frozen large language model. These approaches provide strong evidence that large-scale cross-modal alignment can serve as a powerful substrate for downstream reasoning, recognition, and retrieval tasks. New frameworks have further advanced the modularity and efficiency of multimodal pre-training. BLIP [44] and BLIP-2 [45] introduced bootstrapped captioning and lightweight Querying Transformers to bridge frozen image encoders with frozen language models, reducing the number of trainable parameters while achieving state-of-the-art results in vision–language understanding and generation.
PaLI [46] extends this paradigm to the multilingual setting with a 10B image–text corpus, while Gato [47] demonstrates that a unified transformer policy can operate across modalities within a single network that includes text, images, proprioception, and control. These multimodal systems highlight a trend toward scalable, versatile architectures capable of integrating heterogeneous sensory inputs and supporting flexible task interfaces, forming a backdrop for models that aim to unify perception, language, and embodied reasoning.
Although the current implementation of CALM focuses on language inputs, the principles of SDM are readily extendable to multimodal sensor fusion. Prior studies in multimodal identity recognition and temporal estimation illustrate the relevance of high-dimensional, sparsely activated, and distributed representations for integrating heterogeneous inputs. For example, Zhao et al. [48] employ a Lightweight Bilateral Network with separate spatial and semantic paths and selective feature fusion for Mura detection on micro-OLED displays, analogous to distributed and sparse encoding across memory locations. Yu et al. [49] leverage multi-task learning with attention mechanisms to separate identity and temporal features from infrared hand heat traces, reducing interference between tasks in a manner similar to SDM’s non-destructive associative storage. Similarly, Yu et al. [50] use a Deep Soft-Threshold Feature Separation network for sparsification while dynamically balancing feature contributions for infrared handprint recognition and time estimation. This echoes SDM’s sparse activation and robustness to partial overlap. Together, these approaches demonstrate that principles of sparse, high-dimensional, and distributed representations can naturally support multimodal integration while maintaining robustness and enabling incremental or continual learning.
Modern approaches to continual learning, memory integration, sparsity, and efficient model adaptation reveal a converging research trend toward more adaptive and energy-efficient AI systems resistant to catastrophic forgetting, while parameter-efficient continual adaptation methods aim to update large models through lightweight modules such as LoRA mixtures and routing networks [51]. Surveys on continual pre-training and parameter-efficient continual fine-tuning highlight that distribution shift, warm-up scheduling, and interference remain central challenges [52,53]. Memory-augmented architectures, spanning replay buffers, synthetic replay, transformer-integrated working memory, and neuromorphic spike-based rewiring, provide mechanisms for long-range context retention, compression-guided sparsity, and test-time learning [54,55,56,57,58].
Broader lifelong learning studies in robotics and latent-factor models further reinforce the importance of shared time-scale memory for maintaining stable performance across evolving tasks, based on invariant structures, sparse representations, generative replay, and multi-timescale memory [59,60]. Overall, the existing literature emphasizes a transition from monolithic static models toward modular, sparse, and memory-centric intelligence systems capable of sustained continual learning.

2.4. CALM Positioning Within Continual Learning LLMs

Recent advances in LLMs have explored continual learning through a combination of parameter-efficient fine-tuning, episodic replay, and memory-augmented architectures [51,52,53]. Techniques such as LoRA mixtures and routing networks enable lightweight adaptation of large models without full parameter updates, while replay buffers, synthetic replay, and transformer-integrated working memory provide mechanisms for long-range context retention [54,55,56]. Despite their effectiveness, these methods often require careful scheduling, distribution alignment, and selective parameter updates to mitigate catastrophic forgetting, and they typically operate at the level of dense floating-point representations.
In contrast, CALM leverages Sparse Distributed Memory (SDM) for associative replay, enabling high-dimensional, sparse, and non-destructive storage of experiences [11,23]. By storing memories at overlapping locations in a distributed fashion, CALM allows efficient recall without explicit parameter updates, providing robust continual learning capabilities even under non-stationary input distributions. Unlike conventional replay buffers or parameter-regularization approaches (e.g., EWC, SI, GEM) [25,26,28], SDM-based memory scales naturally with the number of stored patterns and supports rapid access through simple bitwise operations.
CALM’s hybrid-memory integration further distinguishes it from recent memory-augmented LLMs. While many architectures extend transformers with explicit memory modules or episodic caches, these typically require dense read/write operations and careful memory management. By contrast, CALM uses SDM as a declarative, associative memory layer that complements transformer reasoning, allowing seamless storage and retrieval of past experiences without interfering with ongoing computation. This design is particularly advantageous for multimodal, edge, or low-resource deployments where efficiency and fault tolerance are critical [20,21].
Taken together, CALM situates itself within the modern LLM ecosystem as a memory-centric, sparse, and scalable continual learner. It bridges the gap between biologically inspired SDM principles and transformer-based reasoning, providing a framework capable of incremental learning, multimodal integration, and energy-efficient memory operations. By explicitly contrasting CALM’s associative memory with conventional replay- and parameter-based methods, the framework is positioned as a complementary, modular approach to the challenges of sustaining large model performance over evolving data streams.

3. Materials and Methods

The primary objectives of this study are (1) to build an evaluation system for analyzing SDM behavior and selecting optimal parameters, (2) to validate SDM’s associative dynamics through the SDMPreMark synthetic benchmark, and (3) to define and examine a preliminary CALM architecture that integrates SDM (S1) with dual-transformer modules (S2) in a continual-learning loop. In particular, this study focuses on establishing a theoretical framework applicable to future work involving empirical experiments.
The methodological approach contains the evaluation methods for SDM operations according to the existing literature (Section 3.1), encoding methods for the SDM (Section 3.2), the SDM benchmark design for synthetic testing of SDM characteristics (Section 3.3), the mapping system for S1 to explore feature integrations from existing biological mechanisms (Section 3.4), the CALM integration methods for the architecture (Section 3.5), and the CALM architecture that applies the integration methods (Section 3.6).

3.1. SDM Evaluation Methodology

To facilitate understanding of SDM-based memory, transformer reasoning modules, and integration mechanisms, concepts and definitions are summarized in Table 1 containing the core terminology used in the subsequent sections.
Unlike von Neumann architectures, SDM implements content-addressable memory based on Hamming distance and distributed storage. This section summarizes the key mathematical operations; full derivations follow Kanerva’s original formulation [1,10]. Hamming distance is defined as H(x, y) and is used for SDM addressing of binary vectors x, y ∈ {0, 1}^n; in bipolar form x, y ∈ {−1, +1}^n, it simplifies as shown in Equation (1):
H(x, y) = \sum_{i=1}^{n} |x_i - y_i| = \sum_{i=1}^{n} (x_i \oplus y_i), \qquad H(x, y) = \frac{n - x \cdot y}{2}
where n is the vector dimensionality. Memory locations are activated when H(A_m, x) ≤ r, or equivalently A_m · x ≥ n − 2r. The write operation stores each pattern w in all active locations within radius r:
C := C + y\,w^{\top},
where y is the binary activation mask (y_m = 1 if H(A_m, x) ≤ r). Weighted writes use distance-dependent gains α(H) that decrease with distance. The read operation aggregates all active counters and applies thresholding:
z = \operatorname{sign}(C^{\top} y),
which reconstructs stored patterns even from incomplete inputs. Retrieval fidelity depends on the signal-to-noise ratio ρ and the resulting bit accuracy ϕ , given by
\rho = \frac{\mu}{\sigma}, \qquad \mu = pM, \qquad \sigma^{2} \approx pM\left[1 + pT\left(1 + p^{2}M\right)\right], \qquad \phi \approx \Phi(\rho)
where p is the activation probability of a memory location, M is the number of hard locations, T is the number of stored patterns, μ and σ² are the signal and noise components, respectively, and Φ denotes the standard normal cumulative distribution function (CDF). The access radius is obtained from the probability p of activating a location and its optimal value, following Kanerva’s rule:
p = (2MT)^{-1/3}, \qquad r \approx 0.4\,n,
where M is the number of hard locations and T the number of stored patterns. The radius controls the trade-off between generalization and interference. Address generation uses uniform random N-bit addresses stored in a matrix [16,61]:
A \in \{-1, +1\}^{M \times N}
with each row representing one hard location’s address in a random or structured address matrix. Jaeckel’s Selected-Coordinate Design optimizes this through sparse address matrices with k ≈ 10 non-zero coordinates per address and the activation condition:
A_m \cdot x = k,
which provides an exact match on the selected coordinates. The hyperplane design applies k ones per hard address for skewed data distributions. Weighted retrieval applies similarity-based weighting functions such as
w_i(Q) = \exp\left(-\alpha\, H(A_i, Q)\right), \quad \text{for } H(A_i, Q) \le r,
and yield smooth recall probabilities:
P_{\text{retrieval}}(\mathrm{BER}) = \sum_{k=0}^{r} \binom{n}{k}\, \mathrm{BER}^{k} \left(1 - \mathrm{BER}\right)^{n-k}
where w_i(Q) is the weight assigned to location i for query Q, H(A_i, Q) denotes the Hamming distance between the query and memory address A_i, r is the access radius that defines the activation threshold, and α is a decay coefficient controlling distance sensitivity. In the recall probability expression, n is the dimensionality of the binary vector space, k indexes the number of bit errors, and BER (bit error rate) defines the fraction of corrupted bits in the query relative to the stored address. Further methods involve the location activity probability P_activate and the expected number of activated locations E[K] (Equation (10)):
P_{\text{activate}} = P\left(H(a_i, p) \le r\right) = \sum_{k=0}^{r} \binom{d}{k}\, 2^{-d}, \qquad E[K] = m \cdot P_{\text{activate}}
where H(a_i, p) is the Hamming distance between address a_i and probe p, m is the total number of memory locations, d is the vector dimensionality, r is the access radius, and k is the summation index over bit differences. The interference coefficient γ (patterns per location) and the quality degradation due to interference Q(γ) are defined as (Equation (11)):
\gamma = \frac{N_{\text{total}}}{E[K]}, \qquad Q(\gamma) = \frac{1}{1 + \beta\, \gamma^{2}}
where N_total is the total number of stored patterns and β is the interference sensitivity constant that determines how strongly overlapping pattern storage affects recall quality. Higher β values correspond to higher sensitivity and faster degradation of the retrieval accuracy as γ increases. The Shannon-type capacity bound C_Shannon defines the theoretical upper bound on SDM storage capacity per dimension for interference-free storage, while the empirical capacity C_empirical measures the actual storage efficiency, i.e., the number of unique patterns stored per location before recall errors dominate (Equation (12)):
C_{\text{Shannon}} = \frac{2^{d} \cdot \sum_{k=0}^{r} \binom{d}{k}}{m \cdot d}, \qquad C_{\text{empirical}} = \frac{N_{\text{succ}}}{m}
where N_succ is the number of successfully stored patterns. Having established the theoretical foundations for SDM evaluation, Section 3.3 examines how this evaluation methodology is converted into a benchmark design for real-life performance tests.
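As a compact illustration of the operations summarized above, the following Python sketch implements Hamming-radius activation together with the write and read rules of Equations (2) and (3) on a counter matrix; the dimensionality, number of hard locations, access radius, and bipolar encoding are illustrative choices rather than configurations evaluated in this study.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, r = 256, 2000, 110                  # dimensionality, hard locations, access radius

A = rng.choice([-1, 1], size=(M, n))      # bipolar hard-location addresses
C = np.zeros((M, n), dtype=np.int64)      # counter matrix

def activation_mask(x):
    """Locations within Hamming radius r of x, using H(A_m, x) = (n - A_m . x) / 2."""
    hamming = (n - A @ x) // 2
    return (hamming <= r).astype(np.int64)

def write(x, w):
    """Store bipolar word w at all locations activated by address x: C := C + y w^T."""
    global C
    y = activation_mask(x)
    C += np.outer(y, w)

def read(x):
    """Recall by aggregating active counters and thresholding: z = sign(C^T y)."""
    y = activation_mask(x)
    return np.sign(C.T @ y)

pattern = rng.choice([-1, 1], size=n)
write(pattern, pattern)                   # autoassociative storage

noisy = pattern.copy()
noisy[rng.choice(n, size=20, replace=False)] *= -1   # corrupt 20 of 256 bits
print("bit accuracy after noisy recall:", np.mean(read(noisy) == pattern))
```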

3.2. Encoding Techniques

The encoding techniques used with SDM can be organized into five functional categories based on the computational role they play: sparsification, semantic encoding, locality-preserving hashing, quantization and compression, and stabilization and preconditioning.
Sparsification methods enforce low-activity representations and include mechanisms such as iterative shrinkage–thresholding (ISTA) [62,63], Fast ISTA (FISTA) [62], k-winners-take-all, and biologically inspired lateral-inhibition thresholding [38]. Sparsification refers to the process of transforming high-dimensional neural activity patterns into representations in which only a small subset of units remain active. This reduces interference, improves energy efficiency, and enables robust pattern separation, resembling principles in cortical circuits where lateral inhibition and adaptive firing thresholds maintain low average activity levels. Iterative thresholding methods include ISTA/FISTA, allowing convergence to the sparse solutions efficiently, while biologically inspired mechanisms can provide real-time competition among the units for activation [38].
In the context of SDM, sparsification acts as a critical processing step by converting dense sensory or latent vectors into stable, noise-tolerant sparse codes. Only a small set of components remain active, and their selective activation provides a distinctive address for high-dimensional associative memory. Effective sparsification not only increases SDM capacity but also enhances generalization, as sparse codes reduce overlap between unrelated patterns while preserving relevant structure for pattern completion. Methods such as iterative thresholding, k-winners-take-all, sparse autoencoders, and spiking-inspired threshold dynamics all instantiate this principle by enforcing competition and selective activation [38,62].
Semantic encoding methods derive structured latent codes that reflect conceptual similarity, such as semantic hashing [39], VQ-VAE latent quantization, or discrete autoencoder bottlenecks. Semantic hashing ensures that semantically similar inputs are mapped to nearby binary codes, facilitating efficient approximate retrieval from SDM.
Locality-preserving hashing methods include LSH, SimHash, and sparse random projections, which are used to ensure that geometrically similar patterns map to nearby addresses in SDM’s high-dimensional space, therefore enabling efficient approximate retrieval [37,64]. Recent work on lattice-based LSH has established optimal constructions that minimize query time and space while preserving theoretical guarantees.
Quantization and compression methods discretize or compress encodings using learned quantization, vector quantization, product quantization, or binary/ternary neural networks, providing memory efficiency and hardware compatibility [40,65,66]. Product quantization decomposes high-dimensional vectors into subspaces and encodes them separately, achieving high compression ratios without significant loss in retrieval performance.
Stabilization and preconditioning methods include Hopfield cleanup dynamics, whitening transforms, and manifold encoders. These methods prepare representations to be more robust and consistent before insertion into SDM.
Together, these five categories cover the major design dimensions for constructing modern SDM-compatible encoders that balance biological plausibility, computational efficiency, and semantic richness.
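As a concrete instance of the sparsification category, the short sketch below applies a k-winners-take-all rule to a dense embedding to obtain a sparse binary code; the 1024-dimensional input and the 3% activity target are illustrative assumptions, not parameters prescribed by CALM.

```python
import numpy as np

def k_wta_binarize(dense: np.ndarray, active_fraction: float = 0.03) -> np.ndarray:
    """k-winners-take-all sparsification: keep the k largest-magnitude
    components active (bit = 1) and silence the rest (bit = 0)."""
    k = max(1, int(round(active_fraction * dense.size)))
    code = np.zeros(dense.size, dtype=np.uint8)
    winners = np.argpartition(np.abs(dense), -k)[-k:]
    code[winners] = 1
    return code

rng = np.random.default_rng(42)
dense_embedding = rng.normal(size=1024)       # stand-in for a transformer embedding
sparse_code = k_wta_binarize(dense_embedding, active_fraction=0.03)
print(sparse_code.sum(), "of", sparse_code.size, "bits active")
```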

3.3. SDM Benchmark Design

LLM benchmarks assume continuous token embeddings, cosine similarity and dot products on continuous vectors, gradient-based learning with continuous optimization, and probabilistic next-token prediction. The differences in memory architecture in terms of address space (learned embeddings versus random binary addresses), storage (parameter matrices vs. hard locations with binary counters), and retrieval (attention mechanisms vs. Hamming distance-based associative recall) create distinct challenges when applying LLM benchmarks to SDM-based processes.
This study focused on implementing SDMPreMark as the initial component-level validation for optimal parameters. It provides device-specific analysis to evaluate and optimize the selection of suitable SDM configurations, fingerprinting optimal parameter combinations (vector_dim, num_locations, access_radius, reinforce, match_ratio) across different hardware environments. This benchmark framework tests 1470 unique configurations, exploring the trade-offs between memory capacity, retrieval accuracy, and computational efficiency. Key parameter ranges include parameter-sweep coverage of vector dimension, memory locations, access radius factor, and reinforcement cycles. The parameters are listed in Table 2.
Vector dimension impacts memory footprint and computational complexity; larger dimensions establish more precise representations but require proportionally more resources. The access radius is calculated as r = max(1, d × α), where d is the vector dimension and α is the radius factor, controlling the specificity–generalization trade-off in pattern matching. The number of memory locations determines storage capacity and potential interference patterns, where higher counts provide greater capacity at the cost of increased search complexity. Reinforcement cycles control pattern strengthening, with more cycles improving storage reliability but increasing training overhead.
The benchmark generates key performance indicators including match ratio (retrieval accuracy ranging 0–1), execution duration (computational efficiency in seconds), and sparsity ratio (input pattern density). Additional metrics capture input pattern characteristics (ones count), recalled pattern fidelity (recalled ones count), and derived measures such as the radius factor for systematic analysis. These metrics enable the evaluation of performance across accuracy, efficiency, and pattern handling dimensions. The benchmark aims to identify parameter combinations that achieve high retrieval accuracy with minimal computational overhead for specific hardware environments.
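A minimal sketch of the kind of parameter sweep SDMPreMark performs is shown below; the parameter grid, the single-pattern store-and-recall routine, and the printed fields are illustrative assumptions rather than the published benchmark implementation.

```python
import itertools
import time
import numpy as np

def run_config(vector_dim, num_locations, radius_factor, reinforce, rng):
    """Store one random pattern with `reinforce` write cycles, recall it,
    and return (match_ratio, duration). Illustrative single-pattern routine."""
    access_radius = max(1, int(vector_dim * radius_factor))
    addresses = rng.integers(0, 2, size=(num_locations, vector_dim), dtype=np.int8)
    counters = np.zeros((num_locations, vector_dim), dtype=np.int32)
    pattern = rng.integers(0, 2, size=vector_dim, dtype=np.int8)

    start = time.perf_counter()
    distances = np.count_nonzero(addresses != pattern, axis=1)
    active = distances <= access_radius
    for _ in range(reinforce):                                # reinforcement cycles
        counters[active] += np.where(pattern == 1, 1, -1).astype(np.int32)
    if active.any():
        recalled = (counters[active].sum(axis=0) > 0).astype(np.int8)
    else:
        recalled = np.zeros_like(pattern)
    duration = time.perf_counter() - start

    return float(np.mean(recalled == pattern)), duration

rng = np.random.default_rng(0)
for dim, locs, rf, cycles in itertools.product([128, 256], [1000, 3000], [0.1, 0.2, 0.4], [1, 10]):
    match_ratio, seconds = run_config(dim, locs, rf, cycles, rng)
    print(f"vector_dim={dim} num_locations={locs} radius_factor={rf} "
          f"reinforce={cycles} match_ratio={match_ratio:.2f} duration={seconds*1000:.1f} ms")
```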
Further optimization of SDM could involve a benchmark for evaluating episodic memory capacity, pattern completion accuracy, and interference resistance in long-context scenarios using binary-distributed representations. Another benchmark design could be used for assessing temporal reasoning capabilities through binary vector binding/unbinding operations and chronological sequence reconstruction in high-dimensional space.

3.4. System 1 Methodology

According to Kahneman [67], S1 in the biological context involves a plethora of mapped mechanisms applied for fast thinking. The most prevalent of these is associative activation, the core feature of the SDM method. SDM also achieves stereotyping, as category vectors naturally serve as prototypes via pattern matching. SDM further integrates mechanisms for the availability heuristic, as retrieval probability scales with memory frequency and recency. Similarity-based recall functions provide classification by resemblance and prototype matching, emerging as the representativeness heuristic. Since SDM limits the local view, it models partial-context reasoning, resembling the mechanism of “What You See Is All There Is” (WYSIATI).
Questions were raised as to whether other mechanisms could be beneficial and implementable for CALM. The easiest mechanisms for integration were estimated to be substitution, anchoring, cognitive ease, halo effect, and framing effect.
Substitution simplifies difficult questions and favors intuition. Its use in the SDM framework adds a fallback mechanism under uncertainty with incomplete data. Substitution requires the addition of a heuristic controller that falls back to the nearest simpler query under high uncertainty. The risk of oversimplification (overly confident answers) is apparent, and therefore the mechanism should be used as probabilistic feedback triggered only at low confidence levels and marked as heuristic inference.
Anchoring provides stability and context persistence across iterative queries and prevents erratic jumps in associative recall. Anchoring could operate by maintaining an initial retrieval with decay per distance or time. The risk of path fixation may occur during early states, requiring controller decay by using S2 to reset the anchor periodically.
Cognitive ease in SDM is the confidence estimator that signals internal coherence through fast retrieval and low-conflict responses. Retrieval fluency drives the confidence signal for cognitive ease. Overrating effortless associations, however, may cause deeper associations to be ignored. Consequently, this mechanism should be used only to weight certainty estimates, not as a rule of logical validity or decision-making.
The halo effect (the influence of an overall impression that generalizes all traits positively) may be used to propagate affective coherence, such as a consistent emotional stance or tone, across related memory clusters. In practice, this could manifest as empathy, narrative consistency, and user alignment. The mechanism spreads valence to nearby vectors for tone propagation. The risk of the halo effect is the spread of undesired bias toward overly positive or negative framing. Therefore, it should be applied only locally and allowed to decay over distance. An additional module may be required to neutralize overgeneralization.
For stylistic purposes, the framing effect can bring more sensitivity to linguistic context, helping the system detect user framing and adapt tone based on keyword frames. Acting as a stylistic modulation for the encoder, it could provide an additional record of framing tags separate from factual memory.
Mechanisms deemed unsuitable for feature integration include confirmation bias, the law of small numbers, and overconfidence. Confirmation bias is problematic due to its rejection of new or contradictory data, introducing the risk of rigidity and an echo-chamber effect; these are not desirable features for an adaptive, learning model. The law of small numbers (generalization based on small sample sizes) is not a desirable behavior either; it may, however, manifest itself to a degree when the adaptive memory radius is tuned. Likewise, overconfidence can occur as a natural manifestation of SDM parameters. Consequently, these methods act more as emergent lag indicators of underlying SDM parameters than as driving mechanisms for memory processing.
Additionally, while intensity matching, the affect heuristic, and causal thinking were estimated to be unsuitable for direct integration into S1, they become applicable when integrated with S2. Intensity matching would enable reasoning about degree and magnitude to form analogies, emotional scales, and prioritization, but requires scalar extensions to binary SDM. The affect heuristic is critical for motivational steering, empathy, and adaptivity, but it is difficult to ground safely without human-level affect. Furthermore, although causal thinking with predictive reasoning is useful for planning and behavior modeling, its implementation requires temporal reasoning beyond the SDM module.
Mechanisms of S1 in the context of SDM are represented in four categories: inherent, intuitive, emergent, and causal. Mirroring cognitive architectures, the inherent module acts as reflexive neural wiring; the intuitive module applies heuristic reasoning; the emergent module displays adaptive biases; and the causal module provides reflective cognition with abstractions (the S2 interface). This also clarifies the following evolutionary design priorities:
1.
Establish inherent S1 mechanisms;
2.
Integrate intuitive S1 mechanisms;
3.
Implement controller to detect emergent S1 mechanisms;
4.
Integrate the S2 dual transformers with S1 (the SDM and controller).

3.5. CALM Integration

The CALM framework provides the integration system for the SDM-based S1 with dual-transformer reasoning modules (S2A and S2B) to enable continual, online adaptation. The capability of the integration to support lifelong continual learning is assessed by three conditions and a verification roadmap.
(1) Theoretical Sufficiency. The CALM architecture satisfies the three canonical operations required for continual learning:
E(x_t) \xrightarrow{\text{encode}} \text{SDM}, \qquad R(C_t) \xrightarrow{\text{recall}} \text{Transformer}, \qquad U(\hat{x}_t) \xrightarrow{\text{consolidate}} \text{SDM}
where E denotes episodic encoding, R denotes associative retrieval, which provides contextual recall to the active transformer, and U denotes consolidation (i.e., the process of integrating new or updated representations from transformer outputs back into SDM, updating episodic memory traces (binary vector and metadata) while preserving previously stored information).
Together, these three transformations form a closed-loop continual learning cycle:
(\text{Encode} \rightarrow \text{Recall} \rightarrow \text{Consolidate})_t \;\Longrightarrow\; (\text{Adaptive Memory Update})_{t+1}
This cycle ensures that every interaction (user input, environmental stimulus, or transformer output) is internalized into the SDM and available for subsequent reasoning. The retrieved episodic memory from SDM is consumed by the transformer modules, which store semantic memory and provide structured knowledge to guide reasoning and decision-making.
(2) Empirical Plausibility. Results from the SDMPreMark benchmark establish a deterministic critical radius threshold (r ≈ 0.4–0.6 of the vector dimensionality) defining stable associative recall. Above this threshold, SDM recall becomes error-free (match_ratio = 1.0) across all tested dimensions, satisfying the stability condition required for consistent transformer context generation. This property guarantees that the memory subsystem feeding the reasoning layers operates deterministically while being noise-tolerant.
(3) Provability Condition. Functional provability of CALM is defined through measurable improvement of the active transformer following SDM-based consolidation. Let B_t denote a shadow transformer (S2B), which is trained asynchronously on replay samples M_{t−r:t} retrieved from SDM memory. These replay samples constitute the episodic memory retrieved from SDM for shadow transformer training. The currently active transformer (S2A), A_t, is only updated to A_{t+1} = B_t if the shadow model demonstrates a measurable improvement in task performance (Equation (14)):
\Delta L_t = L(A_t) - L(B_t) > \epsilon
where L is a task-specific loss (e.g., validation loss, accuracy, or user metric) and ϵ is a minimal improvement threshold.
This measurable criterion provides a falsifiable condition for continual learning: each update cycle must yield a quantifiable improvement in the active model’s performance without external retraining [26].
An atomic swap refers to this conditional update or replacement of the active (S2A) and secondary (S2B) models based on performance. The update occurs only when the shadow model achieves a quantifiable performance gain, ensuring consistency and avoiding partial or unstable state updates.
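A minimal sketch of this conditional update is given below; the SwapController class, the dictionary-based model stubs, the validation-loss evaluator, and the value of ε are all illustrative assumptions standing in for the task-specific loss L and the actual transformer modules.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SwapController:
    """Atomic swap of the active transformer (S2A) with the shadow transformer (S2B)
    when the improvement Delta L_t = L(A_t) - L(B_t) exceeds epsilon (Equation (14))."""
    evaluate: Callable[[Any], float]   # task-specific loss L (lower is better)
    epsilon: float = 0.01

    def maybe_swap(self, active_model: Any, shadow_model: Any) -> Any:
        delta_l = self.evaluate(active_model) - self.evaluate(shadow_model)
        if delta_l > self.epsilon:
            return shadow_model        # promote shadow to active (atomic swap)
        return active_model            # otherwise keep the current active model

# Illustrative usage with stubbed models and a fake validation loss.
controller = SwapController(evaluate=lambda model: model["val_loss"], epsilon=0.01)
active = {"name": "S2A", "val_loss": 0.42}
shadow = {"name": "S2B", "val_loss": 0.37}
active = controller.maybe_swap(active, shadow)
print("active model:", active["name"])   # S2B, since 0.42 - 0.37 > 0.01
```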
(4) Verification Roadmap. The provability of CALM can be established through a progressive evaluation sequence. The SDMPreMark benchmark is applied to confirm associative stability and computational efficiency (S1 validation). The next verification component, the CALMark benchmark, measures ΔL before and after atomic model swaps to verify memory-driven consolidation (S2 validation). The third verification component involves continual reasoning tests employing long-term retention and adaptation benchmarks (e.g., GLUE, SQuAD) to validate stability over time.

3.6. CALM Architecture

Building on the limitations of conventional transformer systems and the continual learning capabilities of Sparse Distributed Memory (SDM), the proposed CALM framework integrates a dual-system architecture inspired by cognitive processes described in Section 3.5. Figure 1 depicts this hybrid design, which was motivated by three key goals:
  • Continual Adaptation: SDM allows rapid incorporation of new data without retraining.
  • Memory Grounding: The system maintains an episodic trace of events, enabling structured recall and S1 operations.
  • Biological Plausibility: The modular memory structure aligns with the distributed intelligence of the cortical columns and sensorimotor integration.
S1 encapsulates online associative memory (SDM), storing new experiences as sparse binary vectors and providing rapid associative recall via Hamming proximity. It also provides episodic context for active models. S2 comprises an Active Transformer (S2A) serving user interactions and a Shadow Transformer (S2B) training asynchronously using SDM replay samples and fine-tuning adapters (PEFT/LoRA). The shadow model replaces the active one atomically when a validated improvement occurs. The controller and version manager monitor metrics, stability, and loss, and also manage swaps and archive model versions for rollback or analysis. The model operates in three concurrent processes:
1.
Encoding and Acquisition: New input is encoded as high-dimensional binary vectors and stored in SDM.
2.
Retrieval and Reasoning: SDM provides associative recall to the Active Transformer, which generates responses and updates memory into the SDM.
3.
Consolidation: The Shadow Transformer fine-tunes asynchronously on SDM replay data, ensuring gradual learning without disrupting live performance.
This architecture enables continual adaptation, episodic recall, and stable long-term reasoning, bridging the gap between static transformer inference and dynamic memory-based cognition. Internally, CALM’s distributed SDM implementation mirrors cortical organization through modular associative units analogous to cortical columns. For local learning, each SDM module independently encodes local sensory or contextual input, enabling online adaptation without global retraining. Reference frames are implemented via spatial and temporal patterns represented as high-dimensional binary vectors, preserving relative relationships across inputs. A voting mechanism uses overlapping modules that collaborate through similarity-based activation to resolve ambiguous or noisy inputs. Higher-level SDM layers integrate outputs from lower layers to form increasingly abstract episodic representations, providing hierarchical integration.
Figure 1. The pipeline of the CALM architecture.
Transformer-to-binary encoding serves as a bridge between dense transformer embeddings and sparse SDM representations. Floating-point vectors are thresholded to produce a binary encoding, ensuring compatibility between the systems while preserving semantic content:
v_{\text{dense}} = \text{BERT}(\text{tokenize}(s)) \in \mathbb{R}^{d_{\text{model}}}
b_i = \begin{cases} 1, & \text{if } v_i > \theta \\ 0, & \text{otherwise} \end{cases}
Here, BERT denotes a bidirectional transformer encoder that maps a tokenized input sequence s into a dense embedding space of dimensionality d_model. The vector v_dense represents the continuous embedding output of the transformer, b_i is the i-th bit of the resulting binary vector after thresholding, and θ is the activation threshold controlling sparsity. Alternative mapping schemes such as LSH or learned quantization may be applied to balance sparsity and information retention.
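A minimal sketch of this dense-to-binary bridge is shown below, assuming a percentile-based choice of the threshold θ that targets a fixed sparsity; the random stand-in for the encoder output and the 5% activity target are illustrative, and any sentence-level transformer embedding could be substituted for v_dense.

```python
import numpy as np

def binarize(v_dense: np.ndarray, sparsity: float = 0.05) -> np.ndarray:
    """Threshold a dense embedding into a binary SDM address (Equations (15)-(16)).
    theta is chosen as the (1 - sparsity) quantile so that roughly `sparsity`
    of the bits are set to 1."""
    theta = np.quantile(v_dense, 1.0 - sparsity)
    return (v_dense > theta).astype(np.uint8)

# Stand-in for v_dense = BERT(tokenize(s)); a real encoder output would be used here.
rng = np.random.default_rng(7)
v_dense = rng.normal(size=768)

b = binarize(v_dense, sparsity=0.05)
print(f"{b.sum()} of {b.size} bits active ({100 * b.mean():.1f}%)")
```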
The SDM maintains a fixed set of hard memory locations uniformly distributed in high-dimensional space. Stored patterns activate subsets of these locations within a Hamming radius r. Fixed locations simplify hardware access, though future work may explore dynamic expansion for scalability. An example of SDM association storage between high-dimensional addresses and their corresponding data is given in Table 3.
Memory retrieval operates by computing Hamming distances between query and stored addresses as well as aggregating all locations within a threshold. SDM tolerates noisy or partial inputs through distributed representation and weighted recall, recovering nearest stored patterns even under corruption (Equation (17)).
\mathrm{BER} = \frac{\text{corrupted\_bits}}{d}, \qquad P_{\text{retrieval}}(\mathrm{BER}) = \sum_{k=0}^{r} \binom{d}{k}\, \mathrm{BER}^{k} \left(1 - \mathrm{BER}\right)^{d-k},
where d denotes the dimensionality of the binary address vectors, r is the Hamming distance threshold (access radius), and k indexes the number of bit errors tolerated during retrieval. P_retrieval thus expresses the probability of successful recall under a Bernoulli noise model with bit error rate BER.
SDM tolerates noisy or partial inputs due to its distributed representation and weighted retrieval. For instance, a corrupted binary query averages activations over nearby locations, recovering the closest stored pattern. This contrasts with transformers that degrade under corrupted inputs unless retrained. Experimental results demonstrate this behavior (Section 4).
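For intuition, the short calculation below evaluates Equation (17) for an illustrative configuration (d = 256, r = 110) at several bit error rates; these numbers are a direct evaluation of the Bernoulli noise model, not benchmark results.

```python
from math import comb

def p_retrieval(d: int, r: int, ber: float) -> float:
    """Probability that a query corrupted with bit error rate `ber`
    still falls within Hamming radius r of its stored address (Equation (17))."""
    return sum(comb(d, k) * ber**k * (1 - ber)**(d - k) for k in range(r + 1))

d, r = 256, 110          # illustrative dimensionality and access radius
for ber in (0.05, 0.20, 0.40, 0.50):
    print(f"BER={ber:.2f}  P_retrieval={p_retrieval(d, r, ber):.4f}")
```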
The CALM architecture unites the reasoning power of transformers with the continual memory and robustness of SDM. S1 provides online associative learning and recall, while S2 performs symbolic reasoning and long-term consolidation through dual-transformer models. This design supports continual adaptation, structured episodic recall, and biologically plausible memory dynamics, particularly suited for systems operating in dynamic or noisy environments, such as embedded agents and real-time anomaly detection tasks, where continual learning and long-term memory stability are critical. Algorithm 1 presents the core online learning loop procedure for the architecture.
Each incoming input x_t is first encoded into a high-dimensional sparse vector v_t, which serves as the System 1 representation. SDM performs an associative retrieval over memory M using a fixed Hamming radius r, and the retrieved neighborhood N_t is passed to the lightweight S2 model f_Θ to produce an interpretation ŷ_t. The new pattern is then written back into SDM, enabling continual learning without overwriting previous memories. Every K steps, the System 2 parameters Θ are updated using a small replay batch drawn from SDM, ensuring stability without requiring full retraining. No new conceptual parameters are introduced beyond the existing components already defined in the architecture: the SDM radius r, the System 2 parameters Θ, the memory M, and the replay interval K.
Algorithm 1 S1–S2 Online Learning Loop with SDM Internals and Atomic Swap.
procedure Online.Step(x_t)
    v_t ← Encode(x_t)                                ▹ Episodic encoding: Equation (13), E(x_t)
    N_t ← SDM.Retrieve(M, v_t, r)                    ▹ Associative recall: Equation (13), R(C_t); uses radius r from the empirical SDM threshold
    ŷ_t ← f_Θ(x_t, N_t)                              ▹ Lightweight S2 reasoning (transformer)
    SDM.Write(M, v_t, ŷ_t)                           ▹ Consolidation: Equation (13), U(x̂_t)
    if t mod K = 0 then
        B ← SDM.Replay(M, K)                         ▹ Construct replay batch M_{t−r:t} for the shadow transformer
        Θ_shadow ← Θ − η ∇_Θ L(f_Θ(B))               ▹ Asynchronous shadow S2 update using the replay batch
        ΔL_t ← L(Θ) − L(Θ_shadow)                    ▹ Equation (14): measurable improvement
        if ΔL_t > ε then
            Θ ← Θ_shadow                             ▹ Atomic swap: update active S2 only if improvement exceeds the threshold
        end if
    end if
end procedure

procedure SDM.Retrieve(M, v_t, r)
    N_t ← [ ]
    for all a_i ∈ M do
        d_i ← HammingDistance(v_t, a_i)
        if d_i ≤ r then
            w_i ← 1 − d_i / r                        ▹ Radius-based weight; see Section 3.1, Equations (5) and (10)
            N_t.append((a_i, w_i))
        end if
    end for
    return N_t                                       ▹ Weighted aggregate retrieval; cf. Equations (3) and (4)
end procedure

procedure SDM.Write(M, v_t, y_t)
    for all (a_i, w_i) within radius r of v_t do
        M[a_i] ← M[a_i] + w_i · y_t                  ▹ Weighted accumulation; preserves prior memory and handles interference; see Equations (2) and (11)
    end for
end procedure

procedure SDM.Replay(M, K)
    B ← select K patterns emphasizing novelty/recency  ▹ Replay batch construction; cf. Table 1 and Section 3.1
    return B
end procedure

4. Preliminary Results

The SDMPreMark benchmark was executed as a controlled in silico experiment to map parameter sensitivities in high-dimensional associative memory operations. Each configuration was deterministically evaluated via the SDM memory test routine, providing reproducible metrics (match ratio, sparsity ratio, runtime).
Inspection of results focuses on the threshold behavior of the critical radius, vector dimension scaling, computational efficiency, selection of the optimal radius, capacity, and analytical resource constraints. Although these experiments were limited to synthetic vectors and did not involve learned datasets or empirical high-load degradation tests, they qualify as an algorithmic characterization of SDM dynamics.
Critical Radius Threshold Behavior: The most striking finding is the existence of a sharp performance threshold around radius factor 0.2 (corresponding to access radius values of 6, 12, 25, 51, 102, 204, 307, 399, 460, 614, 798, and 921 for vector dimensions 32, 64, 128, 256, 512, and 1024, respectively). This transition corresponds to the empirical critical radius function r_critical (Equation (5)) and the performance transition function P_success (Equation (10)), presented together in Equation (18).
r_{\text{critical}}(d) = \begin{cases} 0.375\,d, & \text{if } d \le 64 \\ 0.40\,d, & \text{if } 64 < d \le 256 \\ 0.60\,d, & \text{if } d > 256 \end{cases}, \qquad P_{\text{success}}(r, d) = \begin{cases} 0.4\text{–}0.6, & \text{if } r < r_{\text{critical}}(d) \\ 1.0, & \text{if } r \ge r_{\text{critical}}(d) \end{cases}
Below this threshold, match ratios consistently remain under 0.7, while configurations at or above radius factor 0.2 achieve perfect recall (match_ratio = 1.0) across most parameter combinations. This threshold represents a fundamental transition point where the SDM system shifts from under-activation to reliable pattern retrieval. However, future work should also include high-load experiments to verify whether the full match ratio persists.
Vector Dimension Scaling: Performance remains remarkably stable across vector dimensions from 32 to 1024 bits once the critical radius threshold is exceeded. Lower dimensions (32–128) show slightly more variability in the sub-threshold region, while higher dimensions (512–1024) demonstrate more consistent behavior. Notably, the unified memory architecture of the ARM-based system-on-a-chip (SoC) appears to handle even the largest vector dimensions efficiently, with execution times scaling predictably rather than experiencing memory bottlenecks.
Computational Efficiency Patterns: Empirical latency model T e m p i r i c a l was derived from time complexity equations to extend theoretical latency scaling). Execution duration scales linearly with both memory locations and reinforcement cycles ( R 2 = 0.98 ). The relationship follows: T 0.07 × locations 1000 + 0.03 ms. The binary nature of retrieval success patterns creates distinct operating regimes, as shown in Table 4. Latency was calculated according to the empirical relation (Equation (19)):
T_{\mathrm{empirical}}(d, m) = 0.07 \cdot \frac{m}{1000} + 0.03\ \text{ms}, \qquad E_{\mathrm{bit}} = \alpha_{\mathrm{switching}} \cdot V_{dd}^{2} + \beta_{\mathrm{leakage}} \cdot V_{dd} \quad (19)
Configurations with 500–1000 memory locations complete in under 0.1 s for most parameter combinations, while 8000 locations require 1–5 s. The impact of reinforcement cycles is particularly pronounced, with 100 cycles taking approximately 10× longer than single-cycle operations. This suggests that, on the ARM-SoC platform, configurations with 1000–3000 memory locations and 10–30 reinforcement cycles provide an optimal balance between capacity and responsiveness.
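A short sketch of the latency screening implied by Equation (19) follows; the per-operation model depends only on the number of memory locations, and any additional multiplier for reinforcement cycles (reported above as roughly 10× for 100 cycles) would be applied on top of it.

def t_empirical_ms(locations: int) -> float:
    # Empirical per-operation latency in milliseconds, Equation (19).
    return 0.07 * locations / 1000 + 0.03

# Estimate per-operation latency for the benchmark memory sizes.
for m in (500, 1000, 3000, 5000, 8000):
    print(f"{m:5d} locations -> {t_empirical_ms(m):.3f} ms")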
Optimal Radius Selection: To minimize retrieval errors, the optimal radius r* is obtained by balancing miss and interference probabilities. Equation (20) gives the optimal radius r* together with the miss probability P_miss(r), which is derived from Equation (10).
r^{*} = \arg\min_{r}\big[\alpha \cdot P_{\mathrm{miss}}(r) + \beta \cdot P_{\mathrm{interference}}(r)\big], \qquad P_{\mathrm{miss}}(r) = P(\mathrm{idle}) = \left(1 - \frac{1}{2^{d}}\sum_{k=0}^{r} \binom{d}{k}\right)^{m} \quad (20)
Capacity: Following Equation (12), practical efficiency was calculated using the capacity utilization factor α_capacity (Equation (21)). Increasing the number of memory locations has minimal effect on the match ratio but a significant impact on processing time (Table 5).
\alpha_{\mathrm{capacity}} = \frac{C_{\mathrm{empirical}}}{C_{\mathrm{Shannon}}} \approx 0.3\text{--}0.7 \quad (21)
The performance characteristics reveal that the ARM-SoC favors moderate memory sizes (1000–5000 locations) with radius factors between 0.4 and 0.6 for optimal efficiency. Configurations exceeding these parameters show diminishing returns in accuracy while significantly increasing computational overhead. These findings establish a clear device-specific configuration profile that exploits the ARM-SoC's strengths in parallel processing while respecting its thermal and power constraints.
Analytical Resource Constraints: To quantify the stability of associative recall under concurrent storage, the interference probability P i n t e r f e r e n c e (Equation (22)) was derived from the expected number of activated locations E [ K ] . This metric captures how overlapping address activations degrade recall fidelity as memory utilization increases.
P_{\mathrm{interference}}(r) = 1 - \exp\!\left(-\frac{N \cdot E[K]}{m}\right) \quad (22)
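Combining Equations (20) and (22), the access radius can be selected numerically by scanning candidate values and minimizing the weighted miss/interference objective. In the sketch below, the weights alpha and beta, the stored-pattern count n_patterns, and the use of the binomial tail (via SciPy) to obtain both P_miss and the expected activation count E[K] are illustrative assumptions.

import numpy as np
from scipy.stats import binom

def optimal_radius(d: int, m: int, n_patterns: int, alpha: float = 1.0, beta: float = 1.0) -> int:
    # Scan radii and minimize alpha * P_miss + beta * P_interference (Equations (20) and (22)).
    best_r, best_cost = 0, float("inf")
    for r in range(d + 1):
        p_act = binom.cdf(r, d, 0.5)                  # P(a random location lies within radius r of a query)
        p_miss = (1.0 - p_act) ** m                   # no location activated, Equation (20)
        e_k = m * p_act                               # expected number of activated locations E[K]
        p_intf = 1.0 - np.exp(-n_patterns * e_k / m)  # Equation (22)
        cost = alpha * p_miss + beta * p_intf
        if cost < best_cost:
            best_r, best_cost = r, cost
    return best_r

print(optimal_radius(d=256, m=1000, n_patterns=100))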
Memory requirements M_total and M_memory (Equation (23)) were evaluated to estimate storage scalability (Equation (12)):
M_{\mathrm{total}} = m \cdot \big(d_{\mathrm{addr}} + d_{\mathrm{data}} + \log_{2}(N_{\mathrm{max}})\big), \qquad M_{\mathrm{memory}}(d, m) = m \cdot (d + p) \quad (23)
Equation (24) gives the processing power (operations per second) P_ops and the time complexities (T_write, T_read) according to Equation (1):
P_{\mathrm{ops}} = \frac{f_{\mathrm{cpu}}}{d \cdot m}, \qquad T_{\mathrm{write}}(d, m) = O(m \cdot d), \qquad T_{\mathrm{read}}(d, m, k) = O(m \cdot d + k \cdot \log k) \quad (24)
These formulations were used only to validate observed synthetic scalability trends.
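As a worked illustration, the resource formulas in Equations (23) and (24) can be evaluated directly; the payload width, maximum write count, and CPU frequency in the example call are assumed values, not measurements from the benchmark.

import math

def memory_bits_total(m: int, d_addr: int, d_data: int, n_max: int) -> int:
    # M_total from Equation (23): address bits, data counters, and a write counter per location.
    return m * (d_addr + d_data + math.ceil(math.log2(n_max)))

def ops_per_second(f_cpu_hz: float, d: int, m: int) -> float:
    # P_ops from Equation (24): one full pass over m locations of d bits per operation.
    return f_cpu_hz / (d * m)

# Example: 1024-bit vectors, 5000 locations, up to 10**6 writes, 3 GHz core (assumed values).
print(memory_bits_total(5000, 1024, 1024, 10**6) / 8e6, "MB")
print(ops_per_second(3e9, 1024, 5000), "operations per second")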

5. Discussion and Future Work

The results indicate that SDM-based architectures are potentially viable for real-time applications, highlighting the importance of controlling sparsity in the encoding pipeline. Based on the current findings, we offer three practical recommendations, with caveats regarding hardware-specific behavior and semantic fidelity.
Firstly, the current evaluation reveals that a 50% bit activation is too dense for efficient SDM operation, leading to larger required radii and reduced specificity. Sparse encoding targeting 2–5% bit activation may improve pattern separation and retrieval efficiency, although semantic fidelity may degrade due to aggressive binarization, especially for complex or overlapping patterns.
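A minimal top-k binarization sketch illustrates how a dense embedding could be driven to the suggested 2–5% activation before SDM storage; this is one possible sparsifier under stated assumptions, not the encoding pipeline evaluated above, and the Gaussian vector stands in for a transformer embedding.

import numpy as np

def sparsify_topk(embedding: np.ndarray, target_density: float = 0.03) -> np.ndarray:
    # Keep the k largest components as 1s so that roughly target_density bits are active.
    d = embedding.shape[0]
    k = max(1, int(round(target_density * d)))
    code = np.zeros(d, dtype=np.int8)
    code[np.argsort(embedding)[-k:]] = 1
    return code

rng = np.random.default_rng(0)
dense = rng.normal(size=1024)      # stand-in for a dense transformer embedding
code = sparsify_topk(dense, 0.03)
print(code.sum() / code.size)      # ~0.03 activation ratio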
Secondly, the observed sharp transition between failure and success suggests that SDM performance is highly sensitive to the access radius, challenging assumptions of gradual performance degradation. The empirical critical radius appears to grow faster than linearly with vector dimension. Achieving reliable binary performance may therefore require adaptive radius schemes that adjust the radius dynamically based on vector dimension and content. It should be noted that these conclusions are derived from SDMPreMark, an S1-level synthetic benchmark, which cannot fully capture semantic effects in real-world datasets.
Thirdly, capacity optimization for embedded deployment (500–1000 locations) provides a reasonable trade-off between latency and accuracy on the ARM-SoC platform. Larger capacities increase computational cost disproportionately, although they may improve robustness in more complex or distributed scenarios.

5.1. Convergence and Generalization

The mathematical conditions underlying SDM stability, convergence, and generalization were summarized in Section 3.1. The derived components included (i) the provability constraints for reliable recall (Equation (14)), (ii) bit-level reconstruction accuracy and noise tolerance (Equation (4)), (iii) the optimal access radius governing the generalization–interference trade-off (Equation (5)), and (iv) the canonical high-dimensional addressing formulation used for all experiments (Equation (13)). Together, these equations define the theoretical operating regime in which SDM converges to attractors (stable memory patterns) and maintains generalizable retrieval behavior. These principles were also incorporated into the SDMPreMark benchmark (Section 3.3) and into the CALM architecture for continual learning (Section 3.6).
Future work will extend this theoretical framework to formal convergence proofs for the full CALM architecture. In particular, the aim should be to derive explicit bounds on convergence rates of the SDM update dynamics under continual input streams. In addition, it is necessary to characterize generalization through high-dimensional concentration bounds that explicitly model correlations among address patterns. Furthermore, development of a full fixed-point analysis is required to identify conditions for stable attractor formation against drifting under incremental updates. These actions should establish a more complete understanding of convergence and generalization for process-based, SDM-driven continual learning.

5.2. High-Load Degradation Considerations

High memory load degrades recall performance due to interference and capacity saturation. Using the previously defined interference probability P i n t e r f e r e n c e ( r ) (Equation (22)) and optimal radius selection r (Equation (20)), SDM behavior can be estimated under high-load conditions:
  • Recall vs. memory size (T/M): As the number of stored locations m increases, the probability of interference rises, potentially reducing match ratios if the access radius r is not adjusted dynamically [1,33].
  • Capacity saturation curves: The empirical capacity factor α c a p a c i t y (Equation (21)) suggests that match ratios remain high up to a threshold memory size, beyond which efficiency decreases due to overlapping activations.
  • Dynamic radius adaptation: To maintain high recall under increased load, the optimal radius r* should scale with memory utilization, balancing miss and interference probabilities as described in Equation (20) [1,68]. Parameter tuning in high-dimensional vector spaces for associative memory retrieval can apply radius/threshold adaptation; a minimal adaptation loop is sketched after this list. Adapting the access radius and reinforcement strategies in this way is expected to accommodate high memory load at least partially.
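A schematic adaptation loop for the dynamic radius adjustment described above is sketched below; the 0.9 match-ratio target, the 0.5 load split between miss-dominated and interference-dominated regimes, and the step size are illustrative assumptions rather than tuned values.

def adapt_radius(r: int, match_ratio: float, load: float, r_min: int, r_max: int, step: int = 2) -> int:
    # Nudge the access radius toward the reliable regime as memory utilization changes.
    if match_ratio < 0.9 and load < 0.5:
        return min(r + step, r_max)   # miss-dominated regime: widen activation
    if match_ratio < 0.9 and load >= 0.5:
        return max(r - step, r_min)   # interference-dominated regime: tighten activation
    return r                          # recall is stable: keep the current radius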

5.3. Limitations and Future Work

The preliminary results were derived from synthetic vectors under deterministic configurations, which limits their generalizability. While match ratios of 1.0 were observed under low-load conditions, further work is needed to evaluate SDM performance under high memory load, concurrent storage, and real-world semantic datasets. To this end, future studies will systematically investigate the impact of SDM parameters (access radius, sparsity, encoding schemes, and reinforcement cycles) on catastrophic forgetting, interference, and energy efficiency across sequential learning benchmarks. These additions will provide a more complete understanding of practical deployment constraints and performance characteristics.
Future work could explore meta-cognitive AI assistants that generate progress mirrors, contextual nudges, and associative links aligned with user goals. Extending this to Virtual Cities (VCs), AI agents could maintain local SDM-based reasoning while interacting with IoT environments, gradually developing more sophisticated models of socio-technical ecosystems [69,70]. Furthermore, the investigation of SDM updates using block-wise memory allocation could involve polynomial proof-of-memory approaches from blockchain research [71]. Such a framework would allow decentralized agents and edge environments to exchange verified memory blocks that contain episodic data annotated with metadata and associative gain certificates. Instead of raw computation, consensus would be based on memory formation (episodic data) and recall performance (associative gain). However, semantic loss from binarization, network latency, and heterogeneous hardware could limit practical deployment.
To evaluate CALM in realistic continual learning scenarios, future work will aim to integrate the SDM-based architecture with pre-trained LLM backbones (BERT and/or GPT) to compare SDM-based replay against conventional continual learning strategies such as experience replay, EWC, and LoRA-style fine-tuning. Experiments should aim to quantify task retention, catastrophic forgetting, and generalization across sequential learning benchmarks, including the CALMark benchmark, which measures ΔL before and after atomic model swaps to verify memory-driven consolidation (System-2 validation). CALMark can also incorporate long-term continual reasoning tests employing GLUE and SQuAD, as well as extended long-context evaluation tasks, to validate stability over time.
In parallel, systematic studies should explore how SDM parameters affect recall accuracy, interference, and energy efficiency under high memory load. These studies will provide empirical guidelines for optimizing SDM configurations in long-horizon continual learning and multimodal reasoning scenarios.

6. Conclusions

Sparse Distributed Memory offers a promising path toward more adaptive, efficient, and human-like AI systems. By embracing distributed, continual memory and biologically inspired computation, interactions with AI can move beyond the static nature of current models. Empirical analysis with synthetic vectors identified a critical radius threshold (r ≈ 0.4–0.6) defining stable associative recall and demonstrated near-linear computational scaling. The SDMPreMark benchmark provided an initial phase to codify the evaluation methodology and to discover optimal parameters per host under low interference; further experiments with non-synthetic data under high load are required to confirm the associative stability and computational efficiency required from S1. Further methods such as binary autoencoders or semantic hashing could be applied in future work.
The architecture for a hybrid LLM scheme with episodic memory capacities was introduced, comprising S1 (SDM) and S2-A/B (dual transformers). An initial design of S1 was presented to map feature clusters for further behavioral optimization. The next step involves adding intuitive models to the SDM, integrating the lightweight transformer pair, and synchronizing its operations with the SDM modules already exercised in SDMPreMark. Emergent mechanisms should be trackable, and causal linkages between S1 and S2 should be established. Verification will include memory-driven consolidation and atomic swaps, along with continual reasoning tests (e.g., GLUE, SQuAD) to validate stability over time.
While preliminary results are deterministic and limited to synthetic benchmarks, future work (as detailed in Section 5.3) will extend evaluation to semantic datasets, multimodal integration, and high-load conditions, addressing binarization challenges and real-world deployment considerations.

Author Contributions

Conceptualization, A.N. and J.R.; methodology, J.R.; software, J.R.; validation, A.N.; formal analysis, A.N.; investigation, A.N. and J.R.; resources, J.R.; data curation, J.R.; writing—original draft preparation, A.N. and J.R.; writing—review and editing, A.N. and J.R.; visualization, J.R.; supervision, A.N.; project administration, A.N.; funding acquisition, A.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant for research centers, provided by the Ministry of Economic Development of the Russian Federation in accordance with the subsidy agreement with the Novosibirsk State University dated 17 April 2025 No. 139-15-2025-006: IGK 000000C313925P3S0002.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The benchmark implementation for synthetic testing of SDM features, together with example scripts to reproduce the synthetic experiments, is available upon publication at https://github.com/toxom/CALM (accessed on 5 November 2025).

Acknowledgments

The authors would like to express their gratitude to the Artificial Intelligence Research Center of Novosibirsk State University for its support of this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
BERT: Bidirectional Encoder Representations from Transformers
CALM: Continual Associative Learning Model
EGRU: Event-based Gated Recurrent Units
EWC: Elastic Weight Consolidation
FISTA: Fast Iterative Shrinkage-Thresholding Algorithm
GEM: Gradient Episodic Memory
IoT: Internet of Things
ISTA: Iterative Shrinkage–Thresholding Algorithm
LLM: Large Language Model
LSH: Locality-Sensitive Hashing
RAG: Retrieval-Augmented Generation
S1: System 1 framework
S2: System 2 framework
SDM: Sparse Distributed Memory
SI: Synaptic Intelligence
SoC: System-on-a-Chip
T/M: Time per Memory location
VC: Virtual Cities
WYSIATI: What You See Is All There Is

References

  1. Kanerva, P. Sparse Distributed Memory; MIT Press: Cambridge, MA, USA, 1988. [Google Scholar]
  2. Kanerva, P. Self-Propagating Search: A Unified Theory of Memory (Address Decoding, Cerebellum). Ph.D. Thesis, Stanford University, Stanford, CA, USA, 1984. [Google Scholar]
  3. Karunaratne, G.; Le Gallo, M.; Cherubini, G.; Benini, L.; Rahimi, A.; Sebastian, A. In-memory hyperdimensional computing. Nat. Electron. 2020, 3, 327–337. [Google Scholar] [CrossRef]
  4. Keeler, J.D. Capacity for patterns and sequences in Kanerva’s SDM as compared to other associative memory models. In Proceedings of the Neural Information Processing Systems, Denver, CO, USA, 1 November 1987; American Institute of Physics: College Park, MD, USA, 1988. [Google Scholar]
  5. Flynn, M.J.; Kanerva, P.; Bhadkamkar, N. Sparse Distributed Memory: Principles and Operation; Number RIACS-TR-89-53; NASA: Washington, DC, USA, 1989. [Google Scholar]
  6. Marr, D. A theory of cerebellar cortex. J. Physiol. 1969, 202, 437–470. [Google Scholar] [CrossRef] [PubMed]
  7. Albus, J.S. Mechanisms of planning and problem solving in the brain. Math. Biosci. 1979, 45, 247–293. [Google Scholar] [CrossRef]
  8. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar] [CrossRef]
  9. Kawato, M.; Ohmae, S.; Hoang, H.; Sanger, T. 50 Years Since the Marr, Ito, and Albus Models of the Cerebellum. Neuroscience 2021, 462, 151–174. [Google Scholar] [CrossRef]
  10. Kanerva, P. Sparse Distributed Memory and Related Models; NASA Contractor Report NASA-CR-190553; NASA Ames Research Center: Moffett Field, CA, USA, 1992.
  11. Bricken, T.; Davies, X.; Singh, D.; Krotov, D.; Kreiman, G. Sparse Distributed Memory is a Continual Learner. arXiv 2023, arXiv:2303.11934. [Google Scholar] [CrossRef]
  12. Rinkus, G.J. A cortical sparse distributed coding model linking mini- and macrocolumn-scale functionality. Front. Neuroanat. 2010, 4, 17. [Google Scholar] [CrossRef]
  13. Rutishauser, U.; Mamelak, A.N.; Schuman, E.M. Single-Trial Learning of Novel Stimuli by Individual Neurons of the Human Hippocampus–Amygdala Complex. Neuron 2006, 49, 805–813. [Google Scholar] [CrossRef]
  14. Wixted, J.T.; Squire, L.R.; Jang, Y.; Papesh, M.H.; Goldinger, S.D.; Kuhn, J.R.; Smith, K.A.; Treiman, D.M.; Steinmetz, P.N. Sparse and distributed coding of episodic memory in neurons of the human hippocampus. Proc. Natl. Acad. Sci. USA 2014, 111, 9621–9626. [Google Scholar] [CrossRef]
  15. Snaider, J.; Franklin, S.; Strain, S.; George, E.O. Integer sparse distributed memory: Analysis and results. Neural Netw. 2013, 46, 144–153. [Google Scholar] [CrossRef]
  16. Furber, S.B.; John Bainbridge, W.; Mike Cumpstey, J.; Temple, S. Sparse distributed memory using N-Codes. Neural Netw. 2004, 17, 1437–1451. [Google Scholar] [CrossRef]
  17. Peres, L.; Rhodes, O. Parallelization of Neural Processing on Neuromorphic Hardware. Front. Neurosci. 2022, 16, 867027. [Google Scholar] [CrossRef] [PubMed]
  18. Gonzalez, H.A.; Huang, J.; Kelber, F.; Nazeer, K.K.; Langer, T.; Liu, C.; Lohrmann, M.; Rostami, A.; Schöne, M.; Vogginger, B.; et al. SpiNNaker2: A Large-Scale Neuromorphic System for Event-Based and Asynchronous Machine Learning. arXiv 2024, arXiv:2401.04491. [Google Scholar] [CrossRef]
  19. Liu, C.; Bellec, G.; Vogginger, B.; Kappel, D.; Partzsch, J.; Neumärker, F.; Höppner, S.; Maass, W.; Furber, S.B.; Legenstein, R.; et al. Memory-Efficient Deep Learning on a SpiNNaker 2 Prototype. Front. Neurosci. 2018, 12, 840. [Google Scholar] [CrossRef] [PubMed]
  20. Nazeer, K.K.; Schöne, M.; Mukherji, R.; Mayr, C.; Kappel, D.; Subramoney, A. Language Modeling on a SpiNNaker 2 Neuromorphic Chip. arXiv 2023, arXiv:2312.09084. [Google Scholar] [CrossRef]
  21. Boahen, K. Dendrocentric learning for synthetic intelligence. Nature 2022, 612, 43–50. [Google Scholar] [CrossRef]
  22. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. arXiv 2016, arXiv:1603.05279. [Google Scholar] [CrossRef]
  23. Vdovychenko, R.; Tulchinsky, V. Sparse Distributed Memory for Sparse Distributed Data. In Intelligent Systems and Applications; Arai, K., Ed.; Springer International Publishing: Cham, Switzerland, 2022; pp. 74–81. [Google Scholar] [CrossRef]
  24. Vdovychenko, R.; Tulchinsky, V. Sparse Distributed Memory for Binary Sparse Distributed Representations. In Proceedings of the 2022 7th International Conference on Machine Learning Technologies, ICMLT ’22, Rome, Italy, 11–13 March 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 266–270. [Google Scholar] [CrossRef]
  25. Lopez-Paz, D.; Ranzato, M.A. Gradient Episodic Memory for Continual Learning. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  26. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef]
  27. Rolnick, D.; Ahuja, A.; Schwarz, J.; Lillicrap, T.P.; Wayne, G. Experience Replay for Continual Learning. arXiv 2019, arXiv:1811.11682. [Google Scholar] [CrossRef]
  28. Zenke, F.; Poole, B.; Ganguli, S. Continual Learning Through Synaptic Intelligence. In Proceedings of the 34th International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 3987–3995. [Google Scholar]
  29. Graham, D.; Field, D. 3.14—Sparse Coding in the Neocortex. Evol. Nerv. Syst. 2007, 3, 181–187. [Google Scholar] [CrossRef]
  30. Beyeler, M.; Rounds, E.L.; Carlson, K.D.; Dutt, N.; Krichmar, J.L. Neural correlates of sparse coding and dimensionality reduction. PLoS Comput. Biol. 2019, 15, e1006908. [Google Scholar] [CrossRef] [PubMed]
  31. Jääskeläinen, I.P.; Glerean, E.; Klucharev, V.; Shestakova, A.; Ahveninen, J. Do sparse brain activity patterns underlie human cognition? NeuroImage 2022, 263, 119633. [Google Scholar] [CrossRef] [PubMed]
  32. Drix, D.; Hafner, V.V.; Schmuker, M. Sparse coding with a somato-dendritic rule. Neural Netw. 2020, 131, 37–49. [Google Scholar] [CrossRef] [PubMed]
  33. Olshausen, B.A.; Field, D.J. Sparse coding of sensory inputs. Curr. Opin. Neurobiol. 2004, 14, 481–487. [Google Scholar] [CrossRef]
  34. Le, N.D.H. Sparse Code Formation with Linear Inhibition. arXiv 2015, arXiv:1503.04115. [Google Scholar] [CrossRef]
  35. Panzeri, S.; Moroni, M.; Safaai, H.; Harvey, C.D. The structures and functions of correlations in neural population codes. Nat. Rev. Neurosci. 2022, 23, 551–567. [Google Scholar] [CrossRef]
  36. Chaisanguanthum, K.S.; Lisberger, S.G. A Neurally Efficient Implementation of Sensory Population Decoding. J. Neurosci. 2011, 31, 4868–4877. [Google Scholar] [CrossRef]
  37. Chandrasekaran, K.; Dadush, D.; Gandikota, V.; Grigorescu, E. Lattice-based Locality Sensitive Hashing is Optimal. arXiv 2017, arXiv:1712.08558. [Google Scholar] [CrossRef]
  38. Gregor, K.; LeCun, Y. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, Haifa, Israel, 21–24 June 2010; Omnipress: Madison, WI, USA, 2010; pp. 399–406. [Google Scholar]
  39. Salakhutdinov, R.; Hinton, G. Semantic hashing. Int. J. Approx. Reason. 2009, 50, 969–978. [Google Scholar] [CrossRef]
  40. Zhang, D.; Yang, J.; Ye, D.; Hua, G. LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11212, pp. 373–390. [Google Scholar] [CrossRef]
  41. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  42. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
  43. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
  44. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
  45. Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the 40th International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
  46. Chen, X.; Wang, X.; Changpinyo, S.; Piergiovanni, A.J.; Padlewski, P.; Salz, D.; Goodman, S.; Grycner, A.; Mustafa, B.; Beyer, L.; et al. PaLI: A Jointly-Scaled Multilingual Language-Image Model. arXiv 2023, arXiv:2209.06794. [Google Scholar] [CrossRef]
  47. Reed, S.; Zolna, K.; Parisotto, E.; Colmenarejo, S.G.; Novikov, A.; Barth-Maron, G.; Gimenez, M.; Sulsky, Y.; Kay, J.; Springenberg, J.T.; et al. A Generalist Agent. arXiv 2022, arXiv:2205.06175. [Google Scholar] [CrossRef]
  48. Zhao, G.; Lin, Y.; Lu, Y.; Chen, Z.; Guo, W. Lightweight bilateral network of Mura detection on micro-OLED displays. Measurement 2025, 255, 117937. [Google Scholar] [CrossRef]
  49. Yu, X.; Liang, X.; Zhou, Z.; Zhang, B. Multi-task learning for hand heat trace time estimation and identity recognition. Expert Syst. Appl. 2024, 255, 124551. [Google Scholar] [CrossRef]
  50. Yu, X.; Liang, X.; Zhou, Z.; Zhang, B.; Xue, H. Deep soft threshold feature separation network for infrared handprint identity recognition and time estimation. Infrared Phys. Technol. 2024, 138, 105223. [Google Scholar] [CrossRef]
  51. Li, J.; Wang, Q.; Wang, Z.; Zhang, Y.; Mao, Z. ELDER: Enhancing Lifelong Model Editing with Mixture-of-LoRA. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 24440–24448. [Google Scholar] [CrossRef]
  52. Gupta, K.; Thérien, B.; Ibrahim, A.; Richter, M.L.; Anthony, Q.; Belilovsky, E.; Rish, I.; Lesort, T. Continual Pre-Training of Large Language Models: How to (re)warm your model? arXiv 2023, arXiv:2308.04014. [Google Scholar] [CrossRef]
  53. Coleman, E.N.; Quarantiello, L.; Liu, Z.; Yang, Q.; Mukherjee, S.; Hurtado, J.; Lomonaco, V. Parameter-Efficient Continual Fine-Tuning: A Survey. arXiv 2025, arXiv:2504.13822. [Google Scholar] [CrossRef]
  54. Liu, R.; Mozafari, B. Transformer with Memory Replay. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 22 February–1 March 2022; Volume 36, pp. 7567–7575. [Google Scholar] [CrossRef]
  55. Resta, M.; Bacciu, D. Self-generated Replay Memories for Continual Neural Machine Translation. arXiv 2024, arXiv:2403.13130. [Google Scholar] [CrossRef]
  56. Sagirova, A.; Burtsev, M. Extending Transformer Decoder with Working Memory for Sequence to Sequence Tasks. In Proceedings of the Advances in Neural Computation, Machine Learning, and Cognitive Research V; Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y., Klimov, V.V., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 253–260. [Google Scholar] [CrossRef]
  57. Shen, J.; Xu, Q.; Pan, G.; Chen, B. Improving the Sparse Structure Learning of Spiking Neural Networks from the View of Compression Efficiency. arXiv 2025, arXiv:2502.13572. [Google Scholar] [CrossRef]
  58. Omidi, P.; Huang, X.; Laborieux, A.; Nikpour, B.; Shi, T.; Eshaghi, A. Memory-Augmented Transformers: A Systematic Review from Neuroscience Principles to Enhanced Model Architectures. arXiv 2025, arXiv:2508.10824. [Google Scholar] [CrossRef]
  59. Yang, Y.; Huang, J.; Hu, D. Lifelong learning with Shared and Private Latent Representations learned through synaptic intelligence. Neural Netw. 2023, 163, 165–177. [Google Scholar] [CrossRef]
  60. Yue, W. Towards General Purpose Robots at Scale: Lifelong Learning and Learning to Use Memory. arXiv 2024, arXiv:2501.10395. [Google Scholar] [CrossRef]
  61. Aleksander, I.; Stonham, T.J. Guide to pattern recognition using random-access memories. IEE J. Comput. Digit. Tech. 1979, 2, 29–40. [Google Scholar] [CrossRef]
  62. Chen, X.; Liu, J.; Wang, Z.; Yin, W. Theoretical Linear Convergence of Unfolded ISTA and its Practical Weights and Thresholds. arXiv 2018, arXiv:1808.10038. [Google Scholar] [CrossRef]
  63. Kamilov, U.S.; Mansour, H. Learning optimal nonlinearities for iterative thresholding algorithms. IEEE Signal Process. Lett. 2016, 23, 747–751. [Google Scholar] [CrossRef]
  64. O’Donnell, R.; Wu, Y.; Zhou, Y. Optimal Lower Bounds for Locality-Sensitive Hashing (Except When q is Tiny). ACM Trans. Comput. Theory 2014, 6, 5:1–5:13. [Google Scholar] [CrossRef]
  65. Chen, T.; Li, L.; Sun, Y. Differentiable Product Quantization for End-to-End Embedding Compression. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 12–18 July 2020; Daumé, H., III, Singh, A., Eds.; Proceedings of Machine Learning Research (PMLR): Cambridge, MA, USA, 2020; Volume 119, pp. 1617–1626. [Google Scholar]
  66. Ge, T.; He, K.; Ke, Q.; Sun, J. Optimized Product Quantization. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 744–755. [Google Scholar] [CrossRef]
  67. Kahneman, D.; Frederic, S. 2—Representativeness Revisited: Attribute Substitution in Intuitive Judgment. In Heuristics and Biases The Psychology of Intuitive Judgment; Cambridge University Press: Cambridge, MA, USA, 2002. [Google Scholar] [CrossRef]
  68. Plate, T. Holographic reduced representations. IEEE Trans. Neural Netw. 1995, 6, 623–641. [Google Scholar] [CrossRef]
  69. Nechesov, A.; Dorokhov, I.; Ruponen, J. Virtual Cities: From Digital Twins to Autonomous AI Societies. IEEE Access 2025, 13, 13866–13903. [Google Scholar] [CrossRef]
  70. Ruponen, J.; Dorokhov, I.; Barykin, S.E.; Sergeev, S.; Nechesov, A. Metaverse Architectures: Hypernetwork and Blockchain Synergy. In Proceedings of the MathAI, Sochi, Russia, 24 March 2025. [Google Scholar]
  71. Dorokhov, I.; Ruponen, J.; Shutsky, R.; Nechesov, A. Time-Exact Multi-Blockchain Architectures for Trustworthy Multi-Agent Systems. In Proceedings of the MathAI, Sochi, Russia, 24 March 2025. [Google Scholar]
Table 1. Core terminology.
Term | Definition
Episodic Memory | SDM-encoded event trace (binary vector + metadata)
Semantic Memory | Transformer parameters and embeddings representing structured knowledge
Consolidation | Shadow Transformer fine-tuning on SDM replay to update active transformer
Atomic Swap | Replacement of the active transformer (S2A) with shadow transformer (S2B) using version control guarantees
Replay | Retrieval of historical SDM events for training or consolidation
Active Transformer (S2A) | Transformer module serving real-time user interactions and receiving SDM recall as context
Shadow Transformer (S2B) | Transformer module trained asynchronously using SDM replay, replacing S2A upon validation
Access Radius (r) | Hamming distance threshold defining which SDM memory locations are activated
Binary Encoding | Conversion of dense transformer embeddings into sparse high-dimensional binary vectors for SDM storage
Sparsification | Process enforcing low-activity representations, reducing interference and increasing memory capacity
Semantic Encoding | Structured latent code reflecting conceptual similarity for SDM retrieval
Locality-Preserving Hashing | Mapping method ensuring geometrically similar input patterns activate nearby SDM locations
Quantization | Discretization or compression of vectors to reduce storage and memory footprint
Stabilization/Preconditioning | Transformations that make SDM representations robust, noise-tolerant, and well-conditioned
Match Ratio | Retrieval accuracy metric in SDM benchmarks
BER (Bit Error Rate) | Fraction of corrupted bits in a query compared to stored SDM pattern
Voting Mechanism | Aggregation method in SDM modules to resolve noisy or ambiguous activations
Hierarchical Integration | Combining outputs from multiple SDM modules to form higher-level episodic representations
CALM | Cognitive architecture integrating SDM (System 1) with dual transformers (System 2) for continual learning
System 1/System 2 | SDM-based associative memory vs. transformer-based reasoning modules
Table 2. SDM benchmark parameter ranges.
Parameter | Values | Rationale
Vector Dimension | 32, 64, 128, 256, 512, 1024 | Tests scaling from embedded to server deployment
Memory Locations | 500, 1K, 3K, 5K, 8K | Evaluates capacity vs. interference trade-offs
Access Radius Factor | 0.05, 0.1, 0.2, 0.4, 0.6, 0.78, 0.9 | Explores specificity vs. generalization spectrum
Reinforcement Cycles | 1, 5, 10, 15, 30, 50, 100 | Standard strengthening for reliable storage
Table 3. Example of SDM memory contents.
Address Vector (Binary) | Payload | Confidence
001010001000010100… | “person detected” | 0.87
111000000101100010… | “car detected” | 0.92
000100100000000001… | “no motion” | 0.95
Table 4. SDM operating regimes by access radius.
Regime | Radius Range | Match Ratio | Characteristics
Under-activation | r < 0.4 d | 0.4–0.6 | No successful retrieval, recalled_ones = 0
Transition Zone | r ≈ 0.4 d | 0.6–0.9 | Unstable, configuration-dependent
Over-activation | r > 0.4 d | 1.0 | Perfect recall, full pattern recovery
Table 5. Memory scaling effects (1024-bit vectors, radius = 614).
Locations | Match Ratio | Latency (ms) | Efficiency
500 | 1.0 | 0.090 | High
1000 | 1.0 | 0.188 | Medium
3000 | 1.0 | 0.564 | Medium
5000 | 1.0 | 0.942 | Low
8000 | 1.0 | 1.501 | Low
