The Capacity for Correlated Semantic Memories in the Cortex

Boboeva, Vezha; Brasselet, Romain; Treves, Alessandro

doi:10.3390/e20110824


by Vezha Boboeva ¹, Romain Brasselet ¹ and Alessandro Treves ¹,²,*

¹ Cognitive Neuroscience, SISSA - International School for Advanced Studies, Via Bonomea 265, 34136 Trieste, Italy

² Kavli Institute for Systems Neuroscience/Centre for Neural Computation, Norwegian University of Science and Technology, 7491 Trondheim, Norway

* Author to whom correspondence should be addressed.

Entropy 2018, 20(11), 824; https://doi.org/10.3390/e20110824

Submission received: 14 August 2018 / Revised: 11 October 2018 / Accepted: 23 October 2018 / Published: 26 October 2018

(This article belongs to the Special Issue Statistical Mechanics of Neural Networks)


Abstract

A statistical analysis of semantic memory should reflect the complex, multifactorial structure of the relations among its items. Still, a dominant paradigm in the study of semantic memory has been the idea that the mental representation of concepts is structured along a simple branching tree spanned by superordinate and subordinate categories. We propose a generative model of item representation with correlations that overcomes the limitations of a tree structure. The items are generated through “factors” that represent semantic features or real-world attributes. The correlation between items has its source in the extent to which items share such factors and the strength of such factors: if many factors are balanced, correlations are overall low; whereas if a few factors dominate, they become strong. Our model allows for correlations that are neither trivial nor hierarchical, but may reproduce the general spectrum of correlations present in a dataset of nouns. We find that such correlations reduce the storage capacity of a Potts network to a limited extent, so that the number of concepts that can be stored and retrieved in a large, human-scale cortical network may still be of order 10⁷, as originally estimated without correlations. When this storage capacity is exceeded, however, retrieval fails completely only for balanced factors; above a critical degree of imbalance, a phase transition leads to a regime where the network still extracts considerable information about the cued item, even if not recovering its detailed representation: partial categorization seems to emerge spontaneously as a consequence of the dominance of particular factors, rather than being imposed ad hoc. We argue this to be a relevant model of semantic memory resilience in Tulving’s remember/know paradigms.

Keywords: Potts network; attractor neural networks; autoassociative memory; cortex; semantic memory

1. Introduction

One of the most fascinating aspects of the human brain is its ability to ascribe significance to and recognize meaning in objects and events, and more generally to make sense of the world. Semantic memory, comprising our acquired knowledge about the world, can be imagined to reflect, in its statistical structure, the complex, distributed, polycentric structure of the neocortex where it resides. In contrast, the much simpler network structure of the hippocampus, in particular of its CA3 field, where episodic memories have long been thought to be at least initially represented by unique patterns of neural activity, may lead to the limited set of outcomes of episodic memory retrieval: either the pattern is retrieved, or it is not. In the first case, subjects remember what happened in the episode; in the second they do not, although they may still know many of the elements in the episode, likely because they reconstruct them with input from semantic memory. This is the basis for remember/know paradigms [1] that assess hippocampal contribution to memory retrieval. But how can the statistical structure of the memory representations themselves be characterized?

1.1. Correlations

In the case of episodic representations in the hippocampus, one straightforward hypothesis about their statistics is that they are largely uncorrelated: each representation is set up, e.g., in CA3, independently of other representations already stored there, under the influence of the Dentate Gyrus [2]. Then the representations are roughly at the same distance from each other in activity space, i.e., they are ametric: relations of being closer or farther away, or in the middle between another pair, lose their meaning. This may seem at odds with the best studied neural representations in the hippocampus, spatial representations in rodents, which reflect the continuity of space, where being close or distant is clearly defined. As soon as we move, however, from the representation of different locations in the same restricted spatial context to the representation of different contexts, the phenomenon of global remapping suggests that the notion of ametric representations is relevant. Indeed, it has been observed that even very similar spatial contexts are represented in rat CA3 by completely different, essentially uncorrelated representations [3]. Correspondingly, a measure of metric content has been shown to “increase” in human subjects who can rely less, due to incipient Alzheimer's disease, on their hippocampal representations [4]. If hippocampal representations can be said to be ametric, what is the nature of the metricity observed, by contrast, in semantic representations in the neocortex?

Direct access to individual semantic representations through single unit recordings is of course very limited, and not just in the human brain, because of their very distributed nature. Multi-voxel pattern analyses from fMRI are consistent with a complex web of correlations [5,6], but their resolution is limited and so is the characterization of the statistical properties of those correlations. A simple alternative, however, is to assess the nature of the correlations among the semantic items themselves, rather than probing their representations in the brain. This can be done by utilizing any of a number of databases, where a set of semantic items have been described in terms of the features or attributes people associate with them. As a simple toy example, we took the $p = 60$ nouns used in a recent fMRI study [7]. We computed the pairwise correlation between these nouns, as measured by a set of intermediate or surrogate features, such as the co-occurrence with a set of verbs within a sentence in their corpus.

In Figure 1a, we report the correlation matrix of the nouns, from which it can be seen that the pattern of correlations cannot be described by any simple schema. One way of thinking about this organization, ubiquitous in the semantic literature, is to think of concepts as organized in a hierarchy, or a tree [8,9]. Such models, in their descriptive and generative formulations, are dramatic oversimplifications that ignore important features of the data, such as the prevalence of concepts intermediate between other concepts (see Section 2.2.1). As an example, in Figure 1c, we report the correlation that an extreme hierarchical model would see in such data. Unsurprisingly, it is made of clusters with large within-cluster correlation and no between-cluster correlation. It is apparent, from the comparison with Figure 1a, that this hierarchical model fails to represent off-block values with high correlation.

A less dramatic simplification consists in considering that the correlations between individual concepts belonging to distinct clusters can be well approximated by the mean correlation between clusters. Such a simplification yields Figure 1b. To quantify the validity of this simplification, one can measure to what extent the distance relations between the concepts match the fully hierarchical limit case [10]. This index, called the ultrametric content (see Appendix C), can be computed once correlations are translated into a measure of distance (Figure 2). The “soft” hierarchical structure of the matrix in Figure 1b yields an ultrametric content index of 0.61, to be compared with the value 0.5 for the original data. The fully hierarchical matrix, Figure 1c, has an ultrametric content of 1. On the other hand, ametric, independently generated representations, as observed in CA3, have an ultrametric content close to 0. Semantic relations, we can conclude, are complex, and the ultrametric content of 0.5 is as far from the purely ultrametric limit as from the trivial ametric limit.

However, it is the statistical independence of memory patterns that had made available most of the mathematically sophisticated analyses of autoassociative networks. While these analyses have been successfully used to describe the CA3 circuit, it does not seem like they can be applied to semantic memory, which has in the shared structure between memories its raison d’être. Still, in exploring variants of the Hopfield model, which initially featured uncorrelated patterns, the challenge of storing correlated patterns was eventually taken up. One of the earliest attempts to introduce correlations was through an algorithm that arranged patterns on a tree [11,12], in which the upper nodes correspond to classes of items and the lower nodes, each branching from a single upper node, correspond to exemplars of a class. Subsequently it was found that beyond the storage capacity, initial states highly correlated with one individual pattern evolve to the corresponding class, while even the class categorization is lost at a higher critical loading [13]. In [14], it was proposed that such a scheme could function as a model for prosopagnosia, an impairment in visual recognition in which the patient can correctly recognize the category of faces but is unable to recognize individual faces.

However, can such a simple scheme be relevant to describe semantic memory, too? A tree-like structure, while possibly suited to capture a specific cognitive impairment, does not account for the complex relations of semantic memory. When dealing with the meaning of a concept, one typically accesses not only its identity and class membership, but also the stronger or weaker relations to other concepts, which span many dimensions and are not only contingent on common human experience but also on personal experience [15]. As such, the complexity of semantic relations [16] can be argued to require a more sophisticated description than the one provided by an approximate tree-like model.

Valuable attempts to go beyond both uncorrelated memories and simple branching trees, for example within the parallel-distributed processing (PDP) framework [17], have remained largely data driven, focused on computer simulations that could qualitatively reproduce results in agreement with patterns of deficit seen in the neuropsychological literature [18,19,20]. No mathematical framework, however, has been proposed for theoretical questions of a quantitative scope. Such a theoretical perspective is necessary if one wants to approach the question of semantic memory in a more principled way. For example, what is the reliability and generalizability of such results from the small networks used in simulations to large-scale cortical networks such as those of the human brain?

1.2. Connectivity

The analytical tools allowing for a complete analysis have been applied to fully connected or else very sparsely connected networks, in which the average connectivity between the units vanishes. These models have been thoroughly analyzed and scaling relations have been found for the storage capacity as a function of the mean connectivity and the coding sparsity in the network. Remarkably, the same scaling relation holds, when coding is sparse, for both limit cases of full connectivity and extremely sparse connectivity. Does it mean that it holds also for any connectivity in between, including realistic models of cortical connectivity?

From the point of view of plausibility, such studies of randomly wired networks fall short of describing some features of the anatomy of cortical connectivity. For example, it has been shown [21] that in layers II and III of mouse visual cortex the probability of connection falls from 50–80 percent for directly adjacent neurons to 0–15 percent at a distance of 500 micrometers. Building on such observations, the properties of an autoassociative network of threshold-linear units whose synaptic connectivity is spatially structured has also been investigated [22]. Other experiments, however, have shown that at a larger scale cortical connectivity is not randomly distributed, not even after allowing for a distance-dependent parameter. For example, it has been shown that in the prefrontal cortex of monkeys, patches of a hundred microns make connections to and from other discrete patches of cortex of the same size [23]. A patch is connected to about 15–20 other patches in its proximity via grey matter connections, and to at least 15–20 more distant patches connected via white matter connections.

Braitenberg and Schuz have elegantly synthesized this dual local and global characteristic of the cortex in terms of the A and B systems (referring to apical and basal dendrites [24]). They suggest regarding the whole cortex as a memory machine, in which the B systems encode a set of memories as local attractors and the A-system encodes global attractors, by virtue of long-range connections, Figure 3a. Variant models of associative memory networks that implement this separation of scale between dense local connectivity and sparse long-range connectivity have been studied [25,26,27,28,29]. This study is in line with such an approach, in that it aims at describing each patch of, say, the human cortex, a functional voxel of a few mm³, comprising some 10⁵ neurons, as one local network interacting through the B system, whose activity is coarsely subsumed into a Potts unit. A Potts unit has multiple activity states, akin to a capsule of the kind recently introduced in deep learning networks [30]. The Potts network, aimed at describing the cortex, or a large part of it, is comprised of N such units, constituting the A-system, Figure 3b. We refer to [31] for a detailed analysis of the approximate thermodynamic and dynamic equivalence of the full multi-modular model and the Potts network. We do not dwell on the correspondence here, but use it to discuss correlations in the Potts framework.

2. Results

2.1. The Potts Network

The Potts neural network is a generalization of Hopfield’s binary autoassociative network [32]. A Potts unit can be either in the quiescent state or in one of the S equivalent active states. By convention, we label these states with numbers from 0 to S, where $k = 0$ indicates the quiescent state and $k = 1, \dots, S$ the active ones, representing the possible local attractors (see Figure 3). Due to stochastic fluctuations, a unit can be, with a non-vanishing probability, in several of the $S + 1$ states, so that the activity of unit i in state k is denoted by $σ_{i}^{k}$, a variable in the interval $[0, 1]$. By network state, or configuration, we refer instead to the collection of local states assigned to all units, $\{ξ_{i}\}$, each of which is an integer from 0 to S, and where $i \in \{1, \dots, N\}$, N being the number of units in the network.

Couplings $J_{i j}$ between states of distinct units represent the strength with which connected units influence each other. In the case of the Hopfield network, the couplings $J_{i j}$ are just scalars. In the Potts network, the matrices $J_{i j}^{k l}$ express the strength of the coupling between unit i in state k and unit j in state l.

Of crucial importance in the definition of the network model is the learning rule, which prescribes how the couplings depend on a given training dataset. In the present model, the training dataset consists of a number p of network configurations $\{ξ_{i}^{μ}\}$ that we call patterns. Throughout this paper we consider patterns that are sparse: every unit in each pattern is taken to be in the quiescent state with probability $1 - a$, with the remaining probability a shared uniformly by the S active states:

\begin{cases} P (ξ_{i}^{μ} = 0) = 1 - a \\ P (ξ_{i}^{μ} = k) = \tilde{a} \equiv a / S, \quad k = 1, \dots, S \end{cases}

(1)
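As an illustration of Equation (1), the following minimal sketch draws one sparse Potts pattern and checks the resulting state frequencies. The function name `sample_pattern`, the seed and the parameter values are illustrative assumptions, not taken from the paper.

```python
import random

def sample_pattern(n_units, a, S, rng):
    """Draw one sparse Potts pattern following Equation (1)."""
    # Each unit is quiescent (state 0) with probability 1 - a;
    # otherwise it takes one of the S active states uniformly.
    return [rng.randint(1, S) if rng.random() < a else 0 for _ in range(n_units)]

rng = random.Random(0)        # fixed seed, for reproducibility only
N, a, S = 20000, 0.25, 5      # illustrative values (assumption)
xi = sample_pattern(N, a, S, rng)

frac_active = sum(1 for s in xi if s != 0) / N   # should be close to a
frac_state_1 = xi.count(1) / N                   # should be close to a / S
print(frac_active, frac_state_1)
```

With these values, the two printed fractions should fall near a = 0.25 and a/S = 0.05, up to sampling fluctuations.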

The way the patterns $\bar{ξ}^{μ}$ are generated, i.e., their probability distribution, has effects on the “retrieval properties” of the network, i.e., the ability to retrieve with good accuracy one of the training patterns, when it is partially cued. A quantitative measure of this ability of the network is the storage capacity, the number of patterns the network is able to store and retrieve, relative to the number of synaptic connections per unit.

The learning rule according to which the patterns are used to build the synaptic connections between units is a Potts-adapted version of Hebbian learning

c_{i j} J_{i j}^{k l} = \frac{c_{i j}}{c_{m} a (1 - \frac{a}{S})} \sum_{μ = 1}^{p} (δ_{ξ_{i}^{μ} k} - \frac{a}{S}) (δ_{ξ_{j}^{μ} l} - \frac{a}{S}) (1 - δ_{k 0}) (1 - δ_{l 0}),

(2)

where the factor $c_{i j}$ denotes the $(i, j)$-th entry of the adjacency matrix of the connectivity graph, equal to 1 if an edge exists from j to i and 0 otherwise. The constant $c_{m}$ is the average total degree of this graph, i.e., the average number of connections at a given node, so that $\langle c_{i j} \rangle = c_{m} / N$. The Kronecker $δ$-function is 1 when its two indices are equal and 0 if they are different. The subtraction of the mean activity by state, $a / S$, ensures a higher storage capacity, as initially shown for the Hopfield network in [33] and for the Potts neural network in [34].
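The learning rule of Equation (2) can be sketched for a single pair of connected units as follows. This is a minimal sketch assuming $c_{i j} = 1$; the helper name `hebbian_coupling` and the toy patterns are hypothetical and serve only to make the arithmetic explicit.

```python
def hebbian_coupling(patterns, i, j, k, l, a, S, c_m):
    """Coupling J_ij^{kl} of Equation (2) for a connected pair (c_ij = 1)."""
    if k == 0 or l == 0:
        # The factors (1 - delta_k0)(1 - delta_l0) exclude the quiescent state.
        return 0.0
    a_tilde = a / S
    total = 0.0
    for xi in patterns:
        dev_i = (1.0 if xi[i] == k else 0.0) - a_tilde   # delta_{xi_i, k} - a/S
        dev_j = (1.0 if xi[j] == l else 0.0) - a_tilde   # delta_{xi_j, l} - a/S
        total += dev_i * dev_j
    return total / (c_m * a * (1.0 - a_tilde))

# Toy data (assumption): p = 2 patterns over N = 2 units, S = 2 active states.
patterns = [[1, 1], [0, 2]]
J_01_11 = hebbian_coupling(patterns, 0, 1, 1, 1, a=0.5, S=2, c_m=1)
J_10_11 = hebbian_coupling(patterns, 1, 0, 1, 1, a=0.5, S=2, c_m=1)
print(J_01_11, J_10_11)
```

In this toy case the two patterns contribute 0.5625 and 0.0625 to the sum, which is then divided by the normalization 0.375. Swapping the units together with their states leaves the sum unchanged, so whenever both directions of a connection exist the couplings are symmetric.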

The fully connected network, in which $c_{i j} = 1$ for all pairs $(i, j)$, allows for a full-fledged analytic approach, by means of techniques borrowed from spin glass physics [35]. It has been shown, as reviewed in [36], that such connectivity ensures that each of the p stored configurations, if they are not too many, becomes a stable state, or an attractor, of the energy function

H = - \frac{1}{2} \sum_{i, j \neq i}^{N} \sum_{k, l = 1}^{S} J_{i j}^{k l} σ_{i}^{k} σ_{j}^{l} + U \sum_{i}^{N} \sum_{k}^{S} σ_{i}^{k},

(3)

where the activation function is given by the Boltzmann distribution with inverse temperature $β$

σ_{i}^{k} = \frac{e^{β h_{i}^{k}}}{e^{β U} + \sum_{l = 1}^{S} e^{β h_{i}^{l}}},

(4)

where U is a threshold for activation. The field received by unit i in state k is determined by the activity of all the Potts units and writes

h_{i}^{k} = \sum_{j} \sum_{l} c_{i j} J_{i j}^{k l} σ_{j}^{l} - U (1 - δ_{k, 0}),

(5)

where the coupling strength between two states of two different units, $J_{i j}^{k l}$, is given by Equation (2). From Equation (4), it follows that $\sum_{k = 0}^{S} σ_{i}^{k} = 1$ at all times.

A more biologically plausible case is that of diluted networks, where the number of connections per unit $c_{m}$ is less than N. In this paper we consider random dilution (RD), in which

P (c_{i j}, c_{j i}) = P (c_{i j}) P (c_{j i}),

(6)

with

P (c_{i j}) = \frac{c_{m}}{N} δ (c_{i j} - 1) + (1 - \frac{c_{m}}{N}) δ (c_{i j}) .

(7)
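Random dilution as in Equations (6) and (7) amounts to drawing every directed connection independently. A minimal sketch follows, in which the function name `random_dilution` and the exclusion of self-connections are assumptions made for illustration.

```python
import random

def random_dilution(n_units, c_m, rng):
    """Adjacency matrix with independent entries c_ij = 1 w.p. c_m / N (Equation (7))."""
    prob = c_m / n_units
    # c_ij and c_ji are drawn independently (Equation (6)), so the graph is directed.
    # Self-connections are excluded here; this is an assumption of the sketch.
    return [[1 if i != j and rng.random() < prob else 0 for j in range(n_units)]
            for i in range(n_units)]

rng = random.Random(1)
N, c_m = 400, 40              # illustrative values (assumption)
c = random_dilution(N, c_m, rng)

mean_inputs = sum(map(sum, c)) / N   # average number of inputs per unit
print(mean_inputs)
```

The average number of inputs per unit should come out close to $c_{m}$, slightly below it because the diagonal is left empty.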

2.2. Generating Correlated Representations

The initial studies of the capacity of the Potts network [31,34] featured patterns that were uncorrelated. Uncorrelated patterns are generated by assigning Potts states to different units in different patterns independently. This means that the p patterns $\{\bar{ξ}^{μ}\}$ are generated according to a probability distribution which is factorized into p identical ones

P ({\bar{ξ}}^{1} \dots {\bar{ξ}}^{p}) = P ({\bar{ξ}}^{1}) \cdot \dots \cdot P ({\bar{ξ}}^{p}) .

(8)

In turn, units in each pattern are also independent and identically distributed

P ({\bar{ξ}}^{μ}) \equiv P (ξ_{1}^{μ} \dots ξ_{N}^{μ}) = P (ξ_{1}^{μ}) \cdot \dots \cdot P (ξ_{N}^{μ}) .

(9)

In this simple uncorrelated scheme, it is possible to compute the correlation between any two patterns $μ \neq ν$ as measured by the fraction of active units in one pattern that are co-active in the other and in the same state

C_{a s}^{μ ν} = \frac{1}{N a} \sum_{i = 1}^{N} δ_{ξ_{i}^{μ}, ξ_{i}^{ν}} (1 - δ_{ξ_{i}^{ν}, 0}) .

(10)

The distribution of this correlation measure is straightforward and given by a binomial

N a C_{as} \sim B (N, \frac{a^{2}}{S}),

(11)

where $B (k; N, \mathit{pr}) \equiv \binom{N}{k} \mathit{pr}^{k} (1 - \mathit{pr})^{N - k}$; that is, $\langle C_{as} \rangle = \tilde{a}$. Once patterns become correlated, however, there is no straightforward way to compute this measure, and we resort to simulations, as reported in Section 2.2.5.
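The uncorrelated baseline of Equations (10) and (11) can be checked numerically. The sketch below draws independent patterns according to Equation (1) and averages the pairwise correlation; all names and parameter values are illustrative assumptions.

```python
import random
from itertools import combinations

def sample_pattern(n_units, a, S, rng):
    """Independent sparse Potts pattern, as in Equation (1)."""
    return [rng.randint(1, S) if rng.random() < a else 0 for _ in range(n_units)]

def pattern_correlation(xi_mu, xi_nu, a):
    """C_as of Equation (10): units active in both patterns and in the same state."""
    n_units = len(xi_mu)
    shared = sum(1 for s, t in zip(xi_mu, xi_nu) if s == t and t != 0)
    return shared / (n_units * a)

rng = random.Random(2)
N, a, S, p = 2000, 0.25, 5, 30      # illustrative values (assumption)
patterns = [sample_pattern(N, a, S, rng) for _ in range(p)]

corrs = [pattern_correlation(x, y, a) for x, y in combinations(patterns, 2)]
mean_corr = sum(corrs) / len(corrs)
print(mean_corr, a / S)             # the two numbers should be close
```

For independent patterns the mean of the measured correlations should sit near $\tilde{a} = a / S$, which is the reference value against which correlated patterns can be compared.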

2.2.1. Single Parents and Ultrametrically Correlated Children

The interest in ultrametrically organized patterns was largely due to the discovery of an ultrametric hierarchy of the free energy minima in the formal solution of the Sherrington-Kirkpatrick model of a spin glass [35]. In particular, the Hopfield model of neural networks was extended to allow for the storage and retrieval of hierarchically correlated patterns [12]. In this study [12], a set of random patterns, which we can call “parents”, are characterized by independent units, active with probability a

P (ξ_{i}^{π}) = a δ (ξ_{i}^{π} - 1) + (1 - a) δ (ξ_{i}^{π}),

(12)

where $ξ_{i}^{π}$ denotes the activity of unit i of parent $π$ and $0 < a < 1$ is the sparsity of the parents. In the next step, “child” patterns are drawn from the following distribution

P (ξ_{i}^{π μ}) = \{a + b (ξ_{i}^{π} - a)\} δ (ξ_{i}^{π μ} - 1) + \{1 - a - b (ξ_{i}^{π} - a)\} δ (ξ_{i}^{π μ}),

(13)

where $ξ_{i}^{π μ}$ denotes the activity of unit i of child $μ$ branching from parent $π$. The parameter $0 < b < 1$ sets the degree to which children are biased toward their (single) parent. For $b = 0$, child patterns become uncorrelated, with no dependence on the parent, while for $b = 1$ the child patterns become identical to their single parent. Given the distributions above, we can compute the average activity of parent and child patterns (since the state of each unit i is drawn identically from the same distribution, in the following we can drop this index)

\langle ξ^{π} \rangle = a

(14)

\langle ξ^{π μ} \rangle = a,

(15)

as well as child-parent correlations

\langle ξ^{π μ} ξ^{π'} \rangle = \begin{cases} a^{2} + a (1 - a) b & π = π' \\ a^{2} & π \neq π' . \end{cases}

(16)

As expected, if $b > 0$, children have higher similarity to their own parent ($π = π'$) than to a parent of another branch ($π \neq π'$). We can also compute the correlation between two children of the same parent ($π = π'$) and that of two children belonging to distinct parents ($π \neq π'$)

\langle ξ^{π μ} ξ^{π' μ'} \rangle = \begin{cases} a^{2} + a (1 - a) b^{2} & π = π' \\ a^{2} & π \neq π' . \end{cases}

(17)

It trivially follows that

〈 ξ^{π μ} ξ^{π μ^{'}} 〉 - 〈 ξ^{π μ} ξ^{π^{'} μ^{'}} 〉 = a (1 - a) b^{2} .

(18)

This is one of the characteristics of this algorithm: it is possible to define a distance d such that three patterns ($x, y, z = ξ^{π μ}, ξ^{π μ'}, ξ^{π' μ'}$) at the same level of the hierarchy satisfy the strong triangle inequality, $d (x, z) \leq \max (d (x, y), d (y, z))$, together with its permutations of $x, y, z$. As illustrated in Figure 4a, triplets of patterns can only be in one of two triangle relations: equilateral, or isosceles with two long edges. In other words, an ultrametric space has no node intermediate between any two nodes (Figure 4b).

From the point of view of semantics, this is an implausible situation. If one considers superordinate categories as the single archetypal parent from which all concepts descend, it becomes clear that such an ultrametric structure is unsuitable in describing all the semantic relations in which the ultrametric inequality is not satisfied: for example when a concept finds itself “in between” two other concepts. On the other hand, the very meaning of a concept can be thought of as the set of features that are associated with it. It may then be more sensible to consider the features characterizing a concept as its building blocks, hence its parents. In the following, we describe an algorithm, first sketched in [37], in which each child pattern (concept) is generated from multiple parents (features), a random subset of the total group of parents relevant to it.
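The single-parent scheme of Equations (12) and (13) is simple enough to check numerically against Equations (16) and (17). The following is a minimal sketch with binary units; the function names, the seed and the parameter values are assumptions made for illustration.

```python
import random

def make_parent(n_units, a, rng):
    """Parent with independent units, active with probability a (Equation (12))."""
    return [1 if rng.random() < a else 0 for _ in range(n_units)]

def make_child(parent, a, b, rng):
    """Child biased toward its single parent with strength b (Equation (13))."""
    return [1 if rng.random() < a + b * (x - a) else 0 for x in parent]

def overlap(x, y):
    """Empirical estimate of the average product of two patterns."""
    return sum(s * t for s, t in zip(x, y)) / len(x)

rng = random.Random(3)
N, a, b = 100000, 0.2, 0.6          # illustrative values (assumption)
parent_1, parent_2 = make_parent(N, a, rng), make_parent(N, a, rng)
child_1a = make_child(parent_1, a, b, rng)
child_1b = make_child(parent_1, a, b, rng)
child_2a = make_child(parent_2, a, b, rng)

with_parent = overlap(child_1a, parent_1)    # Eq. (16): a^2 + a(1-a)b
siblings = overlap(child_1a, child_1b)       # Eq. (17): a^2 + a(1-a)b^2
cousins_a = overlap(child_1a, child_2a)      # Eq. (17): a^2
cousins_b = overlap(child_1b, child_2a)      # Eq. (17): a^2
print(with_parent, siblings, cousins_a, cousins_b)
```

The two sibling-to-cousin overlaps agree with each other and are lower than the sibling overlap. For any distance that decreases with overlap, this gives the isosceles triangle with two long edges described above.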

2.2.2. Multiple Parents and Non-Trivially Organized Children

How can we incorporate a plausible featural description into our model of semantic memory? One may consider features as the parents from which child concepts are derived. We can then map quantities such as the number of features, their sharedness, and their dominance to appropriate parameters in our model.

Our simple version of the multi-parent pattern-generation algorithm works in three stages. In the first stage, a set of $Π$ random patterns is generated to act as parents. In the second stage, each of the $Π$ parents is assigned to $p_{\mathrm{par}}$ randomly chosen children. Then, each “child” pattern is generated: each pattern, receiving the influence (or input) of its parents, aligns itself, unit by unit, in the direction of the largest input. In the third and final stage, the fraction a of the units with the largest inputs is set as active in each child pattern. A schematic representation can be seen in Figure 5b, in contrast with the schematic representation of the single-parent algorithm in Figure 5a.

2.2.3. The Algorithm Operating on Simple Binary Units

Each parent is assigned $p_{\mathrm{par}}$ children out of a total of p. The probability that a given child has $n_{p}$ parents, out of a total pool of $Π$, is given by a binomial whose success probability is the prolificity $f = p_{\mathrm{par}} / p$

P (n_{p}) = \binom{Π}{n_{p}} f^{n_{p}} (1 - f)^{Π - n_{p}}

(19)
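The assignment of parents to children behind Equation (19) can be sketched as follows. The helper name `assign_parents` and the parameter values are assumptions; the check compares the empirical mean and variance of the number of parents per child with the binomial values $Π f$ and $Π f (1 - f)$.

```python
import random

def assign_parents(n_parents, p, p_par, rng):
    """Assign each parent to p_par distinct, randomly chosen children."""
    parents_of = [[] for _ in range(p)]
    for pi in range(n_parents):
        for mu in rng.sample(range(p), p_par):
            parents_of[mu].append(pi)
    return parents_of

rng = random.Random(4)
n_parents, p, p_par = 100, 1000, 50     # Pi, p and p_par (illustrative assumption)
f = p_par / p                           # prolificity

parents_of = assign_parents(n_parents, p, p_par, rng)
counts = [len(ps) for ps in parents_of]             # n_p for every child
mean_np = sum(counts) / p
var_np = sum((c - mean_np) ** 2 for c in counts) / p
print(mean_np, n_parents * f)                       # mean of the binomial
print(var_np, n_parents * f * (1 - f))              # variance of the binomial
```

Since every parent includes a given child with probability f, independently of the other parents, the number of parents of that child follows the binomial of Equation (19). The empirical mean equals $Π f$ exactly here because the total number of assignments is fixed.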

The algorithm draws, for the input $x_{i}^{π \to μ}$ from unit i of parent $π$ to unit i of pattern $μ$, a uniformly distributed random number in the interval $(0, 1]$ with probability $a_{p}$ and zero with probability $1 - a_{p}$, such that we can write

P (x_{i}^{π \to μ}) = a_{p} U_{(0, 1]} (x_{i}^{π \to μ}) + (1 - a_{p}) δ (x_{i}^{π \to μ}),

(20)

where $a_{p}$, which we can call the extent of the input from one parent, is analogous to the a parameter in Equation (12); indeed, if $a_{p} \sim 0$, a child pattern is very unlikely to receive, on a particular unit, the contribution from one of its parents. On the other hand, if $a_{p} \sim 1$ then all parents influencing a child contribute to its input, whichever the unit. $U_{(0, 1]}$ denotes the uniform distribution, such that the input from parents is graded, unlike in the single-parent scheme of Section 2.2.1.

Here, we have made the choice of non-sparse parents, but sparse input from parents, aimed at decorrelating units while conserving correlations between patterns. This choice will prove crucial in Section 2.3.1, where statistical independence between units will lead to a vanishing mean noise, using only a simple covariance rule. For $S = 1$, this means that the patterns generated by the algorithm are uncorrelated; the importance of having non-sparse parents with sparse input from them becomes apparent when dealing with more than one Potts state. Nevertheless, in this section, we consider $S = 1$, before treating genuine Potts units.

The main difference with respect to the single-parent algorithm is that now one must compute the total field $h_{i}^{μ}$ that a unit i of pattern $μ$ receives from all parents

h_{i}^{μ} = \sum_{π = 1}^{Π} x_{i}^{π \to μ} I_{Ω_{μ}} (π) + ϵ,

(21)

where $Ω_{μ}$ is the set of all parents acting on pattern $μ$, with $| Ω_{μ} | = n_{p} (μ)$, and $I_{Ω_{μ}} (π)$ is the indicator function that is 1 if parent $π$ is assigned to pattern $μ$ and 0 otherwise. $ϵ$ is a small random input ($ϵ \ll 1$) allowing for some input, even when $a_{p} \ll 1$. The fields of all units of all patterns have the same distribution. In Appendix A, the full derivation of the probability distribution for the field $h_{i}^{μ}$ is reported. Such a distribution has a non-trivial expression and, to our knowledge, it can only be evaluated numerically. However, a simple analytic expression can be given for the mean and standard deviation of $h_{i}^{μ}$

〈 h 〉 = n_{p} \frac{a_{p}}{2},

(22)

σ_{h} = \sqrt{n_{p} a_{p} (\frac{1}{3} - \frac{a_{p}}{4})},

(23)

as shown in Figure 6b. In Figure 6a, we see that these analytical results closely match those from an implementation of the algorithm.

As a last step, within a given pattern, the fraction a of units whose fields lie above a threshold $h_{m}$ are set to be active. The threshold $h_{m}$ is then implicitly given in terms of the cumulative distribution function

P (h^{'} < h_{m} | n_{p}) = 1 - a .

(24)

For any given child pattern $μ$ with number of parents $n_{p}$, we can now define the probability that unit i will be active, given the field that it receives

P (ξ_{i}^{μ} = 1 | h_{i}^{μ}) = Θ (h_{i}^{μ} - h_{m}) .

(25)
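The binary version of the algorithm (Equations (20) to (25)) can be sketched for a single child with a fixed number of parents. This is a minimal sketch under stated assumptions: the names `unit_field` and `binary_child` are hypothetical, the scale `eps` of the small random input is arbitrary, and the threshold $h_{m}$ is taken empirically as the smallest field among the active units rather than from the cumulative distribution of Appendix A.

```python
import math
import random

def unit_field(n_p, a_p, rng, eps=1e-6):
    """Field of Equation (21) for one unit of a child with n_p parents."""
    h = eps * rng.random()              # small random input (scale is an assumption)
    for _ in range(n_p):
        if rng.random() < a_p:          # this parent contributes with probability a_p
            h += 1.0 - rng.random()     # graded input, uniform on (0, 1]
    return h

def binary_child(n_units, n_p, a_p, a, rng):
    """Activate the fraction a of units with the largest fields (Equation (25))."""
    fields = [unit_field(n_p, a_p, rng) for _ in range(n_units)]
    n_active = round(a * n_units)
    ranked = sorted(range(n_units), key=lambda i: fields[i], reverse=True)
    h_m = fields[ranked[n_active - 1]]  # empirical threshold
    child = [0] * n_units
    for i in ranked[:n_active]:
        child[i] = 1
    return child, fields, h_m

rng = random.Random(5)
N, n_p, a_p, a = 50000, 10, 0.4, 0.25   # illustrative values (assumption)
child, fields, h_m = binary_child(N, n_p, a_p, a, rng)

mean_h = sum(fields) / N
std_h = math.sqrt(sum((h - mean_h) ** 2 for h in fields) / N)
print(mean_h, n_p * a_p / 2)                                  # Equation (22)
print(std_h, math.sqrt(n_p * a_p * (1 / 3 - a_p / 4)))        # Equation (23)
```

The empirical mean and standard deviation of the fields should agree with Equations (22) and (23), since each input has mean $a_{p} / 2$ and variance $a_{p} (1/3 - a_{p}/4)$ and the $n_{p}$ inputs are independent.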

2.2.4. The Algorithm Operating on Genuine Potts Units

With genuine Potts states, the main difference with respect to the previous case is that the input from a parent $π$ to the field of its child patterns can go, on a given unit, to any one of the S states, with equal probability. This means that only a subset $Ω_{i, k}$ of the parents will contribute to state k of unit i. We denote the number of parents in the subset as $| Ω_{i, k} | = n_{i}^{k}$. The joint distribution of the number of parents by state is

P (n_{i}^{1}, \dots, n_{i}^{S}) = \frac{n_{p}!}{S^{n_{p}} \prod_{k = 1}^{S} n_{i}^{k}!},

(26)

such that the constraint $\sum_{k = 1}^{S} n_{i}^{k} = n_{p}$ is satisfied. We can then write the field of unit i in state k of pattern $μ$

h_{i, k}^{μ} = \sum_{π = 1}^{Π} x_{i, k}^{π \to μ} I_{Ω_{i, k}^{μ}} (π) + ϵ .

(27)

Then, the algorithm selects, unit by unit, the state receiving the maximal input. Following some calculations shown in Appendix B, we can compute the distribution of the fields H for those states having received maximal input (Figure 7a). We can then compute, exactly as before, the threshold above which the unit becomes activated

P (H^{'} < H_{m} | n_{p}) = \int_{- \infty}^{H_{m}} P (H^{'} | n_{p}) \, d H^{'} = 1 - a .

(28)

Having obtained the minimal field $H_{m}$ required to activate a unit (Figure 7b), we now need only the distribution of the field given the number of parents in that state, $P (h^{k} | n^{k})$, which is none other than Equation (A8) (replacing $n_{p}$ with $n^{k}$). We finally get to the distribution of activity across units and states, given the field received

P (ξ_{i k}^{μ} = 1 | h_{i k}^{μ}) = Θ (h_{i k}^{μ} - H_{m}) .

(29)

Given the algorithm just described, the main mechanism determining the state of a unit in a given pattern is how many of the parents affecting a child are in the same state. If parents are all aligned, the unit receives a higher field in a single state, making it more likely to become activated. On the other hand, lower alignment between parents results in the field received by a child unit being spread among the different states, making it less probable for the child unit to find itself among those with maximal fields, as given by Equation (A18).
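The full procedure with Potts units can be sketched as below. It assumes one reading of the text: each parent carries a random active state on every unit (non-sparse parents), and the input magnitudes are drawn independently for every parent, child and unit. The function name `generate_children`, the tie-breaking scale `eps` and all parameter values are hypothetical.

```python
import random

def generate_children(n_units, S, n_parents, p, p_par, a_p, a, rng, eps=1e-6):
    """Multi-parent generation of Potts patterns (sketch of Section 2.2.4)."""
    # Non-sparse parents: every unit of every parent points to one active state.
    parents = [[rng.randint(1, S) for _ in range(n_units)] for _ in range(n_parents)]
    # Each parent is assigned to p_par randomly chosen children.
    parents_of = [[] for _ in range(p)]
    for pi in range(n_parents):
        for mu in rng.sample(range(p), p_par):
            parents_of[mu].append(pi)
    n_active = round(a * n_units)
    children = []
    for mu in range(p):
        best_state, best_field = [], []
        for i in range(n_units):
            # fields[k - 1] is the field of state k (Equation (27)); eps breaks ties.
            fields = [eps * rng.random() for _ in range(S)]
            for pi in parents_of[mu]:
                if rng.random() < a_p:                    # sparse input from the parent
                    fields[parents[pi][i] - 1] += 1.0 - rng.random()
            k = max(range(S), key=lambda s: fields[s])    # state with maximal input
            best_state.append(k + 1)
            best_field.append(fields[k])
        # Only the fraction a of units with the largest maximal fields become active.
        ranked = sorted(range(n_units), key=lambda i: best_field[i], reverse=True)
        child = [0] * n_units
        for i in ranked[:n_active]:
            child[i] = best_state[i]
        children.append(child)
    return parents, parents_of, children

rng = random.Random(6)
parents, parents_of, children = generate_children(
    n_units=500, S=5, n_parents=20, p=10, p_par=4, a_p=0.4, a=0.25, rng=rng)
print([sum(1 for s in child if s != 0) for child in children])
```

Every child ends up with the same number of active units, fixed by the sparsity a. In the degenerate case of a single parent with $a_{p} = 1$, all active units of its children copy the state of the parent, which illustrates how shared parents align the active states of different children.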

2.2.5. Resulting Patterns and Their Correlations

In the previous section, we have described the mechanism through which individual child patterns are generated. At this level, in order to determine whether or not a unit of a pattern is active, the only relevant parameters are the number of parents and their degree of alignment in Potts space. From the point of view of an individual child pattern then, all parents are equivalent and can be considered as independent and identically distributed, a property exploited above. In this section, we turn to the correlations between patterns. Are they dominated by the number of parents that a pair of child patterns have in common? Is this a plausible model for semantic memory?

Patterns generated by the algorithm sample different active states uniformly, such that Equation (1) still holds, though the joint distribution $P (\bar{ξ}^{1} \dots \bar{ξ}^{p})$ is no longer factorizable, as it was in Equation (8). In Section 2.2.3 we discussed how the activity of different units is still approximately uncorrelated. We can see this by computing, analogously to the correlation between patterns, Equation (10), the correlation between units as the fraction of patterns in which two units are co-active and in the same state

C_{i j} = \frac{1}{p a} \sum_{μ}^{p} δ_{ξ_{i}^{μ}, ξ_{j}^{μ}} (1 - δ_{ξ_{i}^{μ}, 0}) .

(30)
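As an illustration of Equation (30), the following sketch computes the unit-by-unit correlation matrix from a set of patterns stored as an integer array. The array layout (one pattern per row, 0 for the quiescent state) and the function name `unit_correlations` are assumptions made for this example.

```python
import numpy as np

def unit_correlations(xi, a):
    """Correlation between units, Eq. (30).

    xi: (p, N) integer array; xi[mu, i] = 0 if unit i is quiescent in
        pattern mu, otherwise its active state in 1..S.
    a:  sparsity used for the normalization.
    Returns the (N, N) matrix C_ij.
    """
    p, N = xi.shape
    S = int(xi.max())
    C = np.zeros((N, N))
    for k in range(1, S + 1):
        onehot = (xi == k).astype(float)   # (p, N): unit in state k
        C += onehot.T @ onehot             # patterns where i and j share state k
    return C / (p * a)

# Two patterns over three units: units 0 and 1 share state 1 in the first pattern.
example = np.array([[1, 1, 0],
                    [2, 1, 0]])
C = unit_correlations(example, a=0.5)
```

Applying the same function to the transposed array, with the normalization changed from p a to N a, would give the analogous quantity between patterns, under the assumption that it has the same form as Equation (30).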

In Figure 8 we can see the distributions of

C_{μ ν}

and

C_{i j}

for nine different combinations of the extent

a_{p}

and prolificity f parameters. The distributions are very sensitive to the specific values of the parameters. For low values of

a_{p}

and f, pairs of Potts units have uncorrelated activity when averaged across patterns, in the sense that the distribution

C_{i j}

has zero covariance. Pairs of patterns, instead, are correlated with a distribution

C_{μ ν}

of non-zero covariance, that is positively skewed. Low values of

a_{p}

and high values of f result in the distribution of the pattern correlations becoming more and more normal, while high values of

a_{p}

and low values of f result in a normally distributed correlation between units and a highly skewed multimodal distribution between patterns.

To assess these observations more systematically, in Figure 9a we can see boxplots of the

C_{μ ν}

distributions for different values of

a_{p}

keeping

f = 0.05

fixed. While the mean correlation is unaffected by increasing

a_{p}

, the standard deviation and the skewness increase. In Figure 9b, conversely, we can see boxplots of

C_{μ ν}

distributions for different values of f keeping

a_{p} = 0.4

fixed. It can be seen that increasing f increases the mean correlation between patterns. The effects observed can be understood intuitively because of the different roles that these parameters play in the algorithm. The extent

a_{p}

is the parameter that increases the probability that a child unit receives input from a parent, increasing the overall similarity of a child to its parents. This means that those children that have a larger number of shared parents will be more similar and more strongly correlated, giving rise to the larger values in the distribution. The prolificity f, on the other hand, is the ratio of the pool of children affected by one parent to the total number of children. Increasing this ratio leads to an increase in

〈 n_{p} 〉

, the mean number of parents, such that children tend to share more parents. It can be seen in Figure 10b, in which pairs of patterns are decomposed into different distributions sharing an increasing number of parents (0–5 shared parents), that for a pair of patterns, a higher number of shared parents leads to a higher mean correlation. The number of such pairs is markedly smaller, as can be seen on the left axis of Figure 10a (plotted on a logarithmic scale), but if f is high enough, this effect is enough to increase the overall mean correlation between all patterns. The two parameters

a_{p}

and f therefore play different roles in generating the correlations.

2.2.6. The Ultrametric Limit

It is interesting to note a limit case of the algorithm. For low prolificity, if e.g.,

{〈 n_{p} 〉}_{μ} = Π f \sim 1

as in Figure 8 (left column, i.e.,

f = 0.01

,

Π = 150

), on average most children will have a single parent, which effectively produces ultrametric patterns. Indeed, for these parameters, since the number of total parents

Π = 150

is smaller than the total number of children generated,

p = 1000

, several children share a given single parent. The mean value of their correlation with all other children, however, at

a / S

, is the same as the mean correlation between uncorrelated patterns. Note that the distribution is multimodal. The values forming the second mode of the distribution express the correlation between children belonging to the same (single) parent.

2.2.7. The Random Limit

Another limit is the random or limited-parent-influence limit, in which

a_{p} ≪ 1

(effectively, the top row in Figure 8). In this case, most units will not receive input from their respective parents, regardless of how many they are, and the unit will align itself in the direction of a random Potts state given by the input

ϵ

. In this way, it is possible to parametrically generate patterns ranging from independent (

a_{p} ≪ 1

) to ultrametric (

a_{p} = 1

,

f Π = 1

), from the top row to the left column of Figure 8, but also to enter the area of complexity to the bottom right, where correlations might begin to resemble plausible semantic relations.

2.2.8. Semantic Dominance

Returning to the correlation observed among the nouns we considered in Section 1.1 as our toy example, how important is, in the set of words, each individual feature? We can quantify it through a simple measure of semantic “dominance”, by simply summing the feature weights of all nouns

s_{j} = \sum_{i}^{N} w_{i j}

.

In Figure 11, we report the summed weights of the

M = 50

features across all the nouns considered, sorted and plotted on a semi-logarithmic scale. Even given the very small dataset used, it can be approximated to a good extent by an exponential law. The suboptimal fit may conceivably be the result of limited and unbalanced sampling. Indeed, the words, the nouns or the verbs were not chosen with comparable frequency. This measure is therefore only approximate, as an aggregate measure of dominance. Our measure is related to the measure called “semantic relevance” used by Sartori and colleagues [38] as well as to the “semantic differential” used by Osgood [39]. The difference with the latter measure, however, is that ours is cumulative across all of the nouns and derived from co-occurrence statistics in a corpus, while the semantic differential refers to a scale in which individuals rate the connotative meaning of objects, events, and concepts.
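The dominance measure and the exponential approximation mentioned above can be sketched as follows. The weight matrix used here is synthetic, not the dataset of nouns, and the function names are hypothetical; the fit is an ordinary least-squares line through the logarithm of the sorted dominances, which is one simple way to read a slope off a semi-logarithmic plot.

```python
import numpy as np

def dominance(weights):
    """Summed feature weights s_j = sum_i w_ij, sorted in decreasing order.

    weights: (N_nouns, M_features) array of non-negative weights w_ij.
    """
    s = weights.sum(axis=0)
    return np.sort(s)[::-1]

def fit_exponential_rate(s_sorted):
    """Decay rate of s versus rank, from a linear fit of log(s)."""
    ranks = np.arange(1, len(s_sorted) + 1)
    slope, _intercept = np.polyfit(ranks, np.log(s_sorted), 1)
    return -slope

# Synthetic weights with an exact exponential profile (illustrative only).
M = 50
profile = np.exp(-0.1 * np.arange(1, M + 1))
weights = np.ones((3, 1)) * profile[None, :]
rate = fit_exponential_rate(dominance(weights))
```

On real data the fitted rate is only a rough summary, for the sampling reasons discussed above.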

To take into account this observation, we consider a more refined model in which the parents in our algorithm (the features), ranked from 1 to

Π

, have their input strengths damped exponentially with a dominance rate

ζ

, such that Equation (27) is revised in the following way

h_{i, k}^{μ} = \sum_{π = 1}^{Π} x_{i, k}^{π \to μ} I_{Ω_{i, k}^{μ}} (π) exp (- ζ π) + ϵ,

(31)

where

x_{i, k}^{π \to μ}

is the input from parent

π

to child pattern

μ

,

Ω_{i, k}^{μ}

is the set of all parents acting on pattern

μ

and

I_{Ω_{i, k}^{μ}} (π)

is the indicator function that is 1 if parent

π

is assigned to pattern

μ

and 0 otherwise. The limit

ζ \to 0

corresponds to the algorithm described in the previous sections, such that we recover Equation (27). In this way, we introduce a parameter,

ζ

, which can be related to the slope seen in dominance distributions observed in real data, such as the one in Figure 11.
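A minimal sketch of Equation (31) follows. It assumes that the parent inputs are precomputed and stored in an array, and that the indicator function is given as one boolean per parent for the child pattern under consideration; both choices, as well as the name `damped_field`, are hypothetical simplifications for illustration.

```python
import numpy as np

def damped_field(x, assigned, zeta, eps):
    """Field on one child pattern with exponentially damped parents, Eq. (31).

    x:        (Pi, N, S) array, input from each parent to each unit and state.
    assigned: (Pi,) boolean array, True if the parent acts on this child.
    zeta:     dominance rate; zeta = 0 gives all parents equal strength.
    eps:      (N, S) array, small random input.
    Returns the (N, S) array of fields.
    """
    Pi = x.shape[0]
    ranks = np.arange(1, Pi + 1)                   # parents ranked 1..Pi
    strength = assigned * np.exp(-zeta * ranks)    # indicator times damping
    return np.tensordot(strength, x, axes=1) + eps # sum over parents

x = np.ones((3, 2, 2))
assigned = np.array([True, False, True])
h_flat = damped_field(x, assigned, zeta=0.0, eps=np.zeros((2, 2)))
h_damped = damped_field(x, assigned, zeta=1.0, eps=np.zeros((2, 2)))
```

With zeta equal to zero, every assigned parent contributes with the same strength, which is the undamped case; with larger zeta, the lowest-ranked parents dominate the sum.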

In Figure 12 we can see a schematic representation of this new algorithm. In contrast to the extent

a_{p}

, the parameter

ζ

, though also affecting the strength of input, plays a different role, as it affects the global strength with which each parent affects its children, leading to variability of input across patterns. A high value of

ζ

contributes to highly unbalanced input from parents influencing a child pattern, such that units tend to align each with the most powerful parent, or the most dominant feature.

How are the correlations affected by the dominance

ζ

? In Figure 13 we report the distributions for three different values of the dominance

ζ

and prolificity f. While for low values of

ζ

, i.e., parents homogeneous in their strengths, the correlation between patterns is unaffected (see Figure 8), increasing

ζ

we see the emergence of a tail of highly correlated patterns. For small f, this has the effect of smearing the bi-modal distribution, while for larger f, the already existing tail becomes fatter. This effect can also be seen more summarily in Figure 14. Parents’ dominance reinforces children correlations such that in the regime of low dominance, we call the resulting patterns weakly correlated, while in the regime of high dominance, we call the resulting patterns strongly correlated.

2.3. Storage Capacity of the Potts Network with Correlated Patterns

Having defined an algorithm which generates correlated patterns, we can turn to study the storage capacity of the Potts network and how it is affected by the correlations. We have carried out numerical simulations [34] with the learning rule in Equation (2), and have observed that the storage capacity is diminished in the case of correlated patterns, a result that has been obtained analytically by others [40,41], albeit for different sources of correlations.

2.3.1. Self-Consistent Signal to Noise Analysis

In [31], we have discussed the application of the self-consistent signal to noise analysis (SCSNA) to the Potts network with uncorrelated patterns. In this section we extend this analysis to the case of correlated patterns encoded by a Potts network with diluted random connectivity (Equation (7)) and obtain estimates of the storage capacity accounting for correlations. In the expression of the field, Equation (5), we identify two terms which are commonly referred to as signal and noise, with respectively non-vanishing and vanishing averages. While the signal, the contribution from the condensed pattern (that we label as

μ = 1

in Equation (2)), is what pushes the activity of the unit such that the network configuration converges to an attractor, the noise, or the crosstalk from all of the other patterns, is what deflects the network away from the cued memory pattern. Denoting

v_{ξ_{i}^{μ} k} = (δ_{ξ_{i}^{μ} k} - \tilde{a}) (1 - δ_{k 0})

, the noise term reads

n_{i}^{k} \propto \sum_{μ > 1}^{p} \sum_{j (\neq i)}^{N} \sum_{l} v_{ξ_{i}^{μ} k} v_{ξ_{j}^{μ} l} σ_{j}^{l},

(32)

that is, the contribution to the weights

J_{i j}^{k l}

by all non-condensed patterns (

μ > 1

). By virtue of the subtraction of the mean activity

\tilde{a}

from the post-synaptic factor, the noise has vanishing average:

{〈 n_{i}^{k} 〉}_{P (ξ)} \propto \sum_{μ > 1}^{p} \sum_{j (\neq i)}^{N} \sum_{l} 〈 v_{ξ_{i}^{μ}, k} 〉 〈 v_{ξ_{j}^{μ}, l} σ_{j}^{l} 〉 = 0 .

(33)

The variance of the noise can be approximately written in the following way:

〈 {(n_{i}^{k})}^{2} 〉 \propto \sum_{μ > 1}^{p} \sum_{j (\neq i) = 1}^{N} \sum_{l} \sum_{μ^{'} > 1}^{p} \sum_{j^{'} (\neq i) = 1}^{N} \sum_{l^{'}} c_{i j} c_{i j^{'}} 〈 v_{ξ_{i}^{μ}, k} v_{ξ_{i}^{μ^{'}}, k} 〉 〈 v_{ξ_{j}^{μ}, l} v_{ξ_{j^{'}}^{μ^{'}}, l^{'}} σ_{j}^{l} σ_{j^{'}}^{l^{'}} 〉,

(34)

where statistical independence between units is implicitly used. While in the case of uncorrelated patterns, all terms but

μ = μ^{'}, j = j^{'}

and

l = l^{'}

vanish, with correlated patterns this is not the case. Now, the additional terms

μ \neq μ^{'}, j = j^{'}

and

l = l^{'}

must be considered. Given the statistical independence of units, however, all other terms are zero. Having identified the non-zero terms, we can proceed with the capacity analysis. We can express the field, Equation (5), using the overlap parameter

h_{i}^{k} = v_{ξ_{i}^{1} k} m_{i}^{1} + \sum_{μ > 1} v_{ξ_{i}^{μ} k} m_{i}^{μ} - U (1 - δ_{k 0}),

(35)

where we define the local overlap

m_{i}^{ν}

as

m_{i}^{ν} = \frac{1}{c_{m} a (1 - \tilde{a})} \sum_{j} \sum_{l} c_{i j} v_{ξ_{j}^{ν} l} σ_{j}^{l} .

(36)

At the root of the SCSNA [22,42,43] is the assumption that the noise term itself can be expressed as the sum of two terms, one proportional to the activity of unit i and the other a Gaussian random variable,

\sum_{μ > 1} v_{ξ_{i}^{μ}, k} m_{i}^{μ} = γ_{i}^{k} σ_{i}^{k} + \sum_{n = 1}^{S} v_{n, k} ρ_{i}^{n} z_{i}^{n};

(37)

z_{i}^{n}

are standard Gaussian variables, and

γ_{i}^{k}

and

ρ_{i}^{n}

are positive constants to be determined self-consistently. The first term, proportional to

σ_{i}^{k}

, represents the noise resulting from the activity of unit i on itself, after having reverberated in the loops of the network; the second term contains the noise which propagates from units other than i. The activation function reads

σ_{i}^{k} = \frac{e^{β h_{i}^{k}}}{\sum_{l} e^{β h_{i}^{l}}} \equiv F^{k} ({\{y_{i}^{l} + γ_{i}^{l} σ_{i}^{l}\}}_{l}),

(38)

where

y_{i}^{l} = v_{ξ_{i}^{1}, l} m_{i}^{1} + \sum_{n} v_{n, l} ρ_{i}^{n} z_{i}^{n} - U (1 - δ_{l, 0})

. The activity

σ_{i}^{k}

is then determined self-consistently as the solution of Equation (38)

σ_{i}^{k} = G^{k} ({\{y_{i}^{l}\}}_{l}),

(39)

where

G^{k}

are functions solving Equation (38) for

σ_{i}^{k}

. However, Equation (38) cannot be solved explicitly. Instead we make the assumption that

{σ_{i}^{l}}

enters the fields

{h_{i}^{l}}

only through their mean value

〈 σ_{i}^{l} 〉

, so that we write

G^{k} ({\{y_{i}^{l}\}}_{l}) ≃ F^{k} ({\{y_{i}^{l} + γ_{i}^{l} 〈 σ_{i}^{l} 〉\}}_{l}) .

(40)
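To make the self-consistency in Equation (38) concrete, the following sketch solves it numerically for a single unit by damped fixed-point iteration. This is only an illustration of what solving for the activity means; it is not the route taken in the analysis, which relies on the approximation of Equation (40). The function names, the damping factor and the stopping tolerance are assumptions of this example.

```python
import numpy as np

def softmax(h, beta):
    """Activation F^k of Eq. (38): normalized exponentials of the fields."""
    z = beta * (h - h.max())          # shift by the maximum for stability
    e = np.exp(z)
    return e / e.sum()

def solve_activity(y, gamma, beta, n_iter=500, tol=1e-12):
    """Solve sigma = F(y + gamma * sigma) for one unit by damped iteration.

    y:     (S + 1,) array of fields without the self-coupling term.
    gamma: self-coupling coefficient.
    beta:  inverse temperature.
    """
    sigma = softmax(y, beta)
    for _ in range(n_iter):
        new = softmax(y + gamma * sigma, beta)
        if np.max(np.abs(new - sigma)) < tol:
            return new
        sigma = 0.5 * sigma + 0.5 * new   # damped update
    return sigma

y = np.array([0.0, 1.0, 0.2])
sigma = solve_activity(y, gamma=0.1, beta=2.0)
```

For small values of the product of beta and gamma the iteration converges quickly; for large values the map need not be a contraction and the sketch offers no convergence guarantee.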

The coefficients in the SCSNA ansatz, Equation (37),

γ_{i}^{k} = γ

and

ρ_{i}^{k} = ρ^{k}

are found to be

γ = \frac{α}{2 S} \frac{c_{m}}{N} \frac{Ω}{1 - Ω},

(41)

{(ρ^{n})}^{2} = \frac{α P_{n}}{S (1 - \tilde{a})} q \{1 + \frac{p \bar{C_{a s}}}{a (1 - \tilde{a})} (\bar{C_{a s}} - \tilde{a})\} \{1 + 2 \frac{c_{m}}{N} Ψ + \frac{c_{m}}{N} Ψ^{2}\} .

(42)

where

P_{n}

refers to the distribution of the patterns (Equation (1)), where

α = p / c_{m}

as before, and where

C_{a s}

, defined in Section 2.2.5, is the fraction of units that are in the same Potts state in two different patterns, normalized by a. Note the second term in the first curly brackets that scales with

p^{2} / c_{m}

and is proportional to

\bar{C_{a s}} - \tilde{a}

, the covariance between patterns. This term originates from the additional non-zero terms in the sum in Equation (34) due to correlations between patterns. When uncorrelated patterns are considered, such that

\bar{C_{a s}} = \tilde{a}

, it becomes zero. In this calculation, we assume that correlations between the v operators of order higher than the second are negligible. As a consequence, the only quantities involved are their covariances. This approximation corresponds to the assumption that the v operators are normally distributed. Following the same procedure reported in [31],

Ω

, q and

Ψ

are found to be

Ω = 〈\frac{1}{N S} \sum_{j} \sum_{l} \frac{\partial G_{j}^{l}}{\partial y^{l}}〉,

(43)

q = 〈\frac{1}{N a} \sum_{j} \sum_{l} {(G_{j}^{l})}^{2}〉,

(44)

Ψ = \frac{Ω}{1 - Ω},

(45)

where

〈 \cdot 〉

indicates the average over all patterns. The mean field received by a unit is then

\begin{matrix} H_{k}^{ξ} & = v_{ξ, k} m + \frac{α}{2 S} \frac{c_{m}}{N} Ψ (1 - δ_{k, 0}) - U (1 - δ_{k, 0}) \\ & + \sum_{n = 0}^{S} v_{n, k} z^{n} \sqrt{\frac{α P_{n}}{S (1 - \tilde{a})} q \{1 + 2 \frac{c_{m}}{N} Ψ + \frac{c_{m}}{N} Ψ^{2}\} \{1 + \frac{p \bar{C_{a s}}}{a (1 - \tilde{a})} (\bar{C_{a s}} - \tilde{a})\}} . \end{matrix}

(46)
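The correlation-dependent factor that multiplies the noise variance in Equations (42) and (46) can be evaluated directly. The sketch below takes the mean activity per state to be a / S, consistent with the value of the mean correlation between uncorrelated patterns quoted in Section 2.2.6; the function name and the numerical values are illustrative assumptions.

```python
def correlation_factor(p, a, S, c_as):
    """Factor 1 + p * C_as * (C_as - a_tilde) / (a * (1 - a_tilde)).

    p:    number of stored patterns.
    a:    sparsity.
    S:    number of active Potts states.
    c_as: mean fraction of units in the same active state in two patterns.
    """
    a_tilde = a / S                       # assumption: a_tilde = a / S
    return 1.0 + p * c_as * (c_as - a_tilde) / (a * (1.0 - a_tilde))

# Uncorrelated patterns: c_as equals a_tilde, so the factor reduces to 1.
factor_uncorrelated = correlation_factor(p=1000, a=0.25, S=5, c_as=0.05)
# A small excess correlation is amplified by the number of patterns p.
factor_correlated = correlation_factor(p=1000, a=0.25, S=5, c_as=0.06)
```

Because the excess correlation is multiplied by p, even a small excess above the uncorrelated value inflates the noise once many patterns are stored.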

Taking the average over the non-condensed patterns (the average over the Gaussian noise z), followed by the average over the condensed pattern

μ = 1

(denoted by

{〈 \cdot 〉}_{ξ}

), in the limit

β \to \infty

, we get the self-consistent equations satisfied by the order parameters

m = \frac{1}{a (1 - \tilde{a})} {〈\int D^{S} z \sum_{l (\neq 0)} v_{ξ, l} \prod_{n (\neq l)} Θ (H_{l}^{ξ} - H_{n}^{ξ})〉}_{ξ},

(47)

q = \frac{1}{a} {〈\int D^{S} z \sum_{l (\neq 0)} \prod_{n (\neq l)} Θ (H_{l}^{ξ} - H_{n}^{ξ})〉}_{ξ},

(48)

\begin{matrix} Ω & = \frac{1}{\tilde{a} \sqrt{α q \{1 + 2 \frac{c_{m}}{N} Ψ + \frac{c_{m}}{N} Ψ^{2}\} \{1 + \frac{p \bar{C_{a s}}}{a (1 - \tilde{a})} (\bar{C_{a s}} - \tilde{a})\}}} \\ & \cdot {〈\int D^{S} z \sum_{l (\neq 0)} \sum_{k} \sqrt{\frac{P_{k}}{S (1 - \tilde{a})}} v_{k l} z^{k} \prod_{n (\neq l)} Θ (H_{l}^{ξ} - H_{n}^{ξ})〉}_{ξ} . \end{matrix}

(49)

The averaging in Equations (47)–(49) can be performed analytically and we refer to [31,34] for their expressions. The storage capacity

α_{c}

is defined as the maximal

α

for which the set of Equations (47)–(49) admits a solution with finite overlap m.

2.3.2. Numerical Solutions of Mean-Field Equations and Simulations

In Figure 15a we solve the set of self-consistent Equations (47)–(49) for different values of the sparsity a and

c_{m} / N

for the simpler case of uncorrelated patterns. In Figure 15b we can see the agreement of the former solutions for

c_{m} / N = 0.1

with simulations. We can also see, in the same figure, the mean-field solutions as well as simulations for correlated patterns, with the values of

\bar{C_{a s}}

obtained from simulations of the algorithm. For lower values of the sparsity, the solution to the mean-field equations over-estimates the capacity compared to what we obtain through the simulations, possibly because the mean-field treatment does not account for the fluctuations in the correlations obtained through the algorithm, but only the increase in the excess mean correlation. For higher values of the sparsity, the agreement is better presumably because the correlations produced by the algorithm become dominated by the mean.

In Figure 16 we show the storage capacity for correlated patterns, for different values of the correlation parameters

a_{p}

and f. As can be seen in Figure 16a, increasing either extent of influence or prolificity, whatever the sparsity, is detrimental to the capacity. In Figure 16b, instead, we can see the capacity as a function of the number of Potts states S. For

S = 1

, as the algorithm produces uncorrelated patterns, the capacity remains the same, regardless of the correlation parameters. For higher values of S, on the other hand, the capacity decreases with increasing values of the correlation parameters f and

a_{p}

. The behavior of the capacity as a function of

c_{m}

, shown in the simulations of Figure 16c, which have been carried out with random dilution of the connectivity (see Equation (7)), shows the same strong dependence on correlations. The decrease in capacity brought on by the correlations is due to the increased variance of the noise, discussed in the previous section.

2.3.3. The Effect of Correlation Parameters f and $a_{p}$

In Figure 17 we see the storage capacity as a function of the different correlation parameters f and

a_{p}

. We can see that increasing each of these parameters decreases capacity, albeit in a different manner. The dependence of

α_{c}

on the prolificity f can be seen in Figure 17a:

α_{c}

decreases dramatically with increasing f, and goes to zero for very high values of f, in which children are each affected by a large number of parents. This result makes sense in light of the fact that f affects the mean correlation between children, as shown in Figure 9b.

In contrast,

α_{c}

decreases almost linearly with increasing the extent of parent input

a_{p}

, as shown in Figure 17b, but does not go to zero for the highest possible value of

a_{p} = 1

. As we saw in Section 2.2.5,

a_{p}

affects the degree to which children are similar to each of their individual parents. Increasing this parameter increases the similarity between those children receiving input from the same parents, increasing their overall similarity and therefore decreasing their discriminability. In terms of the effect on the correlation distribution, in Figure 9a it can be seen that with increasing

a_{p}

, there is an increase in the fluctuations in the correlations, making them more positively skewed.

2.3.4. Correlated Retrieval

In the previous section we saw that correlations decrease the storage capacity of the network. In particular, in terms of the dominance parameter

ζ

, what is the effect of correlations on memory retrieval? Which configurations of activity does the network settle into? We carried out simulations with correlated patterns for different values of

ζ

. We cued each pattern and, at the end of retrieval dynamics, we computed the overlap of the network configuration with all patterns and all parents. We then computed

〈 m_{c u e} 〉

as the mean overlap, over all simulations, of the network configuration with the cued pattern. Similarly we computed

〈 m_{c o r r} 〉

as the mean overlap of the network configuration with another pattern, the highest among the children excluding the one cued. Finally we computed

〈 m_{f a c t} 〉

as the mean overlap of the network configuration with a parent, the highest among all parents. We plot these values as a function of increasing loading

α

in Figure 18.
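The three quantities just described can be sketched as follows. The overlap used here is a global, fully connected analog of the local overlap of Equation (36), with the mean activity per state taken to be a / S; parents are assumed to be stored in the same integer format as the children. These choices, and the function names, are assumptions made for illustration.

```python
import numpy as np

def overlap(xi, sigma, a, S):
    """Overlap between a Potts pattern and a network configuration.

    xi:    (N,) integer array, 0 = quiescent, 1..S = active state.
    sigma: (N, S + 1) array of activities, column 0 being the quiescent state.
    """
    a_tilde = a / S                           # assumption: a_tilde = a / S
    onehot = np.eye(S + 1)[xi][:, 1:]         # delta_{xi_j, l} for l = 1..S
    v = onehot - a_tilde                      # v operator on active states
    return float(np.sum(v * sigma[:, 1:]) / (len(xi) * a * (1.0 - a_tilde)))

def retrieval_summary(children, parents, sigma, cued, a, S):
    """Return (m_cue, m_corr, m_fact) for one retrieval trial."""
    m_children = np.array([overlap(xi, sigma, a, S) for xi in children])
    m_parents = np.array([overlap(xi, sigma, a, S) for xi in parents])
    m_cue = m_children[cued]                       # overlap with the cued pattern
    m_corr = np.delete(m_children, cued).max()     # best other child
    m_fact = m_parents.max()                       # best parent
    return m_cue, m_corr, m_fact

N, S, a = 20, 2, 0.2
children = np.zeros((2, N), dtype=int)
children[0, :4] = [1, 1, 2, 2]
children[1, :6] = [1, 1, 0, 0, 2, 2]
parents = np.zeros((1, N), dtype=int)
parents[0, :4] = [1, 1, 1, 1]
sigma = np.eye(S + 1)[children[0]]            # network sits exactly on child 0
m_cue, m_corr, m_fact = retrieval_summary(children, parents, sigma, 0, a, S)
```

Averaging these three values over cued patterns and simulations gives quantities of the kind plotted against the loading.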

We can see that for low values of

ζ

, the fall of

〈 m_{c u e} 〉

is accompanied by only a modest increase in both

〈 m_{c o r r} 〉

and

〈 m_{f a c t} 〉

, that we call respectively correlated and factor retrieval. Increasing

ζ

, here by two orders of magnitude, we observe two stages: a first (partial) fall in

〈 m_{c u e} 〉

, accompanied with a moderate increase in

〈 m_{c o r r} 〉

and

〈 m_{f a c t} 〉

; and then a second fall, after which

〈 m_{c o r r} 〉

and

〈 m_{f a c t} 〉

exceed the former. Increasing

ζ

by another two orders of magnitude, cued retrieval is restored to more than its initial value (with

ζ

small) beyond which we observe factor retrieval.

This effect is summarized in Figure 19, the storage capacity

α_{c}

as a function of

ζ

, where it can be seen that the capacity displays a trough at intermediate values of the dominance

ζ

. At lower values of

ζ

, well below the trough, the basins of attraction are rather well-separated (as we can anticipate from the clustering analysis of the next section, summarized in Figure 20a). The large barriers mean that each individual pattern can be retrieved relatively well.

Increasing

ζ

, up to where the capacity reaches its minimum value, patterns start to become increasingly clustered, and the barriers between them smaller. In this regime of intermediate clustering, the network can get confused during retrieval, effectively stabilizing into another stable state. This decrease in the capacity is however accompanied by correlated and, to a greater extent, factor retrieval: the network is not able to distinguish between individual patterns but recognizes the cluster the cued pattern belongs to. Such behavior has been found in previous models exploring correlations, and in particular in [44], where an ultrametric organization of the patterns is not a limit outcome of the model as in ours (see Figure 20) but is the object of study itself. In the latter study, the temperature is the control parameter, below a certain critical value of which the network can distinguish individual patterns, and above which the network can recall clusters, but not the individual patterns within each cluster.

As

ζ

is increased even further, the capacity starts to increase, eventually saturating at a given constant value. This increase of the capacity is due to the elimination of clusters, as the input from the weaker parents becomes comparable to the input from a randomly chosen state

ϵ

in Equation (31). Those clusters of patterns whose strongest parent is one of the weaker parents (in absolute terms) are eliminated one by one until there is only a single cluster left, corresponding to patterns belonging to the first (and strongest) parent. Moreover, that the storage capacity at such high values of

ζ

is higher than that at very low values of

ζ

is probably due to the fact that at high

ζ

, those patterns not belonging to the single cluster are randomly correlated with one another, while patterns at very low

ζ

are weakly correlated with one another. Such high values of

ζ

are, however, hardly plausible in terms of semantic organization, and hence outside the scope of our interest.

2.3.5. Residual Information: Memory Beyond Capacity

We can further corroborate the findings in the previous section through the mutual information between the pattern cued c and the configuration in which the network settles r

I (c, r) = \sum_{k, l = 0}^{S} C^{k l} (c, r) {log}_{2} (\frac{C^{k l} (c, r)}{C^{k} (c) C^{l} (r)}),

(50)

where

C^{k l} (c, r) = \frac{1}{N} \sum_{i = 1}^{N} δ_{ξ_{i}^{c}, k} σ_{i}^{l}

,

C^{k} (c) = \frac{1}{N} \sum_{i = 1}^{N} δ_{ξ_{i}^{c}, k}

, and

C^{l} (r) = \frac{1}{N} \sum_{i = 1}^{N} σ_{i}^{l}

. The maximum value of this quantity is attained when the cued pattern is also the one retrieved:

c = r

. In this case the mutual information can reach up to

I (c) = \sum_{k = 0}^{S} C^{k} (c) {log}_{2} (\frac{1}{C^{k} (c)}) = \{- (1 - a) {log}_{2} (1 - a) + a {log}_{2} (S / a)\},

(51)

that we recognize to be the entropy of the cued pattern. In Figure 18 we can see the mutual information as a function of the loading

α

for different values of the parameter

ζ

, averaged across cued retrieval of many patterns. For small values of

α

, the mutual information does not depend on

ζ

: its value at this plateau corresponds to the entropy. For small

ζ

, the mutual information has a sharp fall-off upon increasing

α

, falling to approximately zero. For the intermediate value of

ζ

reported, it displays a step-like behavior, but ultimately stabilizes to a constant non-zero value. Increasing

ζ

even further, it again has a fall-off, at a higher value of the storage load, qualitatively following the behavior of the overlap in the same figure.
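Equations (50) and (51) can be checked numerically with a short sketch. It assumes that the cued pattern is an integer array and that the network configuration is an array of activities over the S + 1 states whose rows sum to one; the function names are hypothetical.

```python
import numpy as np

def mutual_information(cue, sigma):
    """Mutual information of Eq. (50), in bits.

    cue:   (N,) integer array, 0 = quiescent, 1..S = active state.
    sigma: (N, S + 1) array of activities, rows summing to 1.
    """
    N, n_states = sigma.shape
    cue_onehot = np.eye(n_states)[cue]          # delta_{xi_i^c, k}
    joint = cue_onehot.T @ sigma / N            # C^{kl}(c, r)
    pc = joint.sum(axis=1, keepdims=True)       # C^k(c)
    pr = joint.sum(axis=0, keepdims=True)       # C^l(r)
    mask = joint > 0                            # skip empty cells: 0 log 0 = 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pc @ pr)[mask])))

def pattern_entropy(a, S):
    """Entropy of a pattern with sparsity a and S equiprobable states, Eq. (51)."""
    return -(1.0 - a) * np.log2(1.0 - a) + a * np.log2(S / a)

# Perfect retrieval of a pattern with a = 0.2 and S = 4 over N = 100 units.
cue = np.repeat([0, 1, 2, 3, 4], [80, 5, 5, 5, 5])
sigma_perfect = np.eye(5)[cue]
info = mutual_information(cue, sigma_perfect)
```

For perfect retrieval the mutual information coincides with the entropy of the cued pattern, while a configuration that carries no information about the cue, such as one that is identical across units, gives zero.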

The most interesting observation in Figure 18, however, is the residual information: the roughly constant value that remains after capacity collapse, in a range of intermediate values of

ζ

. In Figure 19a, this residual information is plotted as a function of the dominance rate

ζ

, and it can be seen that it increases sharply at approximately

ζ ≃ 0.01

(reaching a value, given these parameters, of order 0.7 × 10⁻³ bits per connection, some five times below the entropy of the stored pattern, the initial plateau in Figure 18). It then decreases again at approximately

ζ ≃ 0.5

. This effect is reminiscent of a phase transition with control parameter

ζ

, where the information plays the role of the order parameter. Below

ζ ≃ 0.01

and above

ζ ≃ 0.5

, once the capacity is exceeded, there is no more retrievable information about the cued pattern. Within the range

0.01 ≲ ζ ≲ 0.5

however, the network retrieves some information about the cued pattern. In Figure 19b we plot, as a phase diagram, the residual information as a function of

ζ

in the x-axis and f in the y-axis. One sees that a non-vanishing residual information requires, essentially, intermediate values of

ζ

and sufficiently large values of f. In terms of either parameter, the region with non-vanishing residual information spans more than one order of magnitude. This intermediate regime can be argued to form the basis of semantic resilience in our model.

2.3.6. Residual Memory Interpreted through Cluster Analysis

Although it was argued in the introduction that the presence of clusters is only one component of the metric relations embedding semantic memories, it is nevertheless instructive to interpret the residual memory phase expressed by our model in terms of cluster analysis. Figure 20 shows the outcome of applying a standard clustering algorithm to patterns generated at salient locations of the phase diagram in Figure 19b. For low dominance (

f = 0.2

and

ζ = 0.001

, Figure 20a, i.e., weak correlations), the algorithm is essentially unable to identify clusters: all patterns seem to be at roughly equal distance from each other. For higher dominance (

ζ ≃ 0.1

) the clustering structure emerges, as indicated by the expanding white area to the left. In the high dominance region (

ζ ≃ 1

, i.e., strong correlations), a few parents dominate the rest and the extracted clusters are clearer, accompanied with a concomitant increase in the residual information. For even larger

ζ

values there is only one cluster (not shown here), effectively, and no residual information above

α_{c}

, which returns to a value above that for weak correlations, as patterns are again effectively random against the backdrop of that single parent, which acts as a biased probability distribution. We can conclude, therefore, that the residual information largely expresses resilient memory associated with the distinction between clusters, whereas within-cluster distinctions are lost above the capacity limit.

An interpretation of these results can be framed in terms of categorical perception [45]. Whenever we learn categories, it becomes more difficult to distinguish two patterns within a category than two patterns straddling the boundary between two categories, even when the two pairs are at the same physical distance. Within-category discrimination is reduced while between-category discrimination is enhanced. In addition to this, there exists a perceptual warping within the category. All categories acquire an internal structure: how well each pattern fits in the category can be measured with a goodness metric [46,47] and the element with the largest goodness is called the prototype. Whenever a pattern is observed, it is perceived as closer to the prototype of the category than it really is. This effect is called the perceptual magnet [48] and has been observed in many contexts, see [49] for a review.

The results we obtain here, although pertaining to a model of semantic memory not of perception, and a model that transcends the simple-minded notion of category, are reminiscent of these phenomena. For moderate values of

ζ

, the exact cued pattern may be lost but still the final pattern belongs with it in some cluster. The fact that the final pattern is correlated with the strong parent pattern is akin to the perceptual magnet if we think of the parent pattern as embodying the cluster, while the child patterns are distortions of it. We recently gave an account of perceptual phenomena using as central feature the similarity between patterns [15] and showed that the existence of the structure of a category and the perceptual effects are natural consequences of the similarity structure of the patterns within a category. In both approaches, taking into account the similarity (or correlation) structure of a set of patterns leads, in some regimes, to a similar phenomenology. However, this connection remains qualitative, in that our pattern-generation algorithm does not produce well-defined categories, but a more complex set of metric relations among patterns, where clusters emerge as one component if a few parents dominate.

2.3.7. Residual Memory Rides on Fine Differences in Ultrametric Content

A complementary perspective is that afforded by our measure of ultrametric content, which is derived from a measure of distance between patterns, applied to all triplets of patterns. A suitable distance measure, for Potts patterns, can be

D_{μ ν} = C_{a 0} + C_{0 a} + 2 C_{a d},

(52)

where similar to the quantity we introduced in Section 2.2 (

C_{a s}

),

C_{a d}

is the fraction of co-active units which are in different states in the two patterns

μ

and

ν

.

C_{0 a}

is the number of units quiescent in

μ

and active in

ν

and conversely

C_{a 0}

is the number of units active in

μ

and quiescent in

ν

. In the dominance-prolificity region we have been considering, this distance measure yields the values seen in Table 1 for the ultrametric content index (see Appendix C).
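The distance of Equation (52) can be sketched for a pair of patterns as follows. All three terms are normalized here by the number of units, which is an assumption of this example, as is the function name `potts_distance`.

```python
import numpy as np

def potts_distance(xi_mu, xi_nu):
    """Distance D = C_a0 + C_0a + 2 C_ad between two Potts patterns, Eq. (52).

    xi_mu, xi_nu: (N,) integer arrays, 0 = quiescent, 1..S = active state.
    """
    N = len(xi_mu)
    act_mu, act_nu = xi_mu > 0, xi_nu > 0
    c_a0 = np.sum(act_mu & ~act_nu) / N                   # active in mu only
    c_0a = np.sum(~act_mu & act_nu) / N                   # active in nu only
    c_ad = np.sum(act_mu & act_nu & (xi_mu != xi_nu)) / N # co-active, different state
    return float(c_a0 + c_0a + 2.0 * c_ad)

xi_mu = np.array([1, 2, 0, 0])
xi_nu = np.array([1, 3, 2, 0])
d = potts_distance(xi_mu, xi_nu)
```

The measure is symmetric in the two patterns and vanishes for identical patterns, so it can be applied to all triplets of patterns when estimating ultrametric content.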

Interestingly, a completely different measure of distance, similar instead to the one extracted from the feature-based norms in the toy example reported in Section 1.1, yields values very close to these, within a few percent, when applied to patterns generated with our algorithm. We can see, therefore, that the emergence of residual memory does correlate with increased ultrametric content, but not in a simple one-to-one correspondence; and the putative phase transition to non-zero residual information occurs in the midst of a relatively minor increase in the values of the ultrametric content index.

A tentative conclusion is that semantic resilience, at least as crudely modelled by a Potts network, requires a degree of clustering or ultrametric structure, which in the pattern-generation model reflects sufficient values of prolificity and dominance, but is still an emergent phenomenon. Quantitative differences in the parameters produce what tends to be an all-or-none difference in semantic resilience. Yet another form, possibly, of analog-to-digital transform produced by a neural circuit.

3. Discussion: A New Model for the Extraction of Semantic Structure

In recent years, the Potts network has been proposed as an effective model of a cortical network organized with distinct local and global connections. Several aspects of the model have been studied in quantitative detail, under many simplifying assumptions, including that of uncorrelated patterns. However, it can be argued that such patterns are irrelevant for the study of semantic memory. The various feats of “mind-reading”, achieved with fMRI studies (e.g., [7,50,51]), indicate that correlations between memories are not simply a nuisance that degrades memory capacity, but express the core ability of the cortex as a machine for encoding structured information. We have attempted to make theoretical progress in this direction by designing a plausible algorithm to generate patterns. In our algorithm, the patterns are generated by the contribution of multiple factors, which can be considered as semantic category generators (except that categories overlap and have loose boundaries), or else features in a somewhat wider sense, that carry information on the statistical co-occurrence of attributes. Through competition, those attributes concur, each with its relative strength (parametrized by a_{p}, f and ζ), to construct the statistical structure of the memories.

We have further studied the storage capacity of the network as a function of both network parameters and correlation parameters through the SCSNA analysis as well as extensive simulations. We find that with a Hebbian rule for the storage of patterns, the network can store and retrieve fewer correlated patterns, though still of order c_{m} S^{2} / a (with weakly correlated patterns), yielding ∼10⁷ with human cortical parameters [31]. Other prescriptions for learning, enhancing capacity, may be explored and studied, and we leave such studies for future investigations. Of the correlation parameters, the effect of the dominance ζ is particularly interesting. The limit ζ ≃ 0 corresponds to a situation in which all parents are on equal footing, while the opposite limit corresponds to only a handful dominating the rest. For intermediate values of the dominance ζ, we observe correlated retrieval, in that with the decrease in successful retrieval, the fraction of trials in which another, correlated pattern is retrieved increases; a closer look, however, indicates that the phenomenon is linked to the retrieval of the factors, i.e., factor retrieval. In terms of the mutual information between the cued pattern and the final configuration of the network, after retrieval dynamics, we observe that in an intermediate regime of ζ, after the capacity limit has been reached, it does not go to zero. We call this the residual information. The residual information displays a non-monotonic dependence on the dominance ζ. Such a non-trivial behavior is reminiscent of a phase transition, in which the residual information is the order parameter and ζ is the control parameter. The residual information has an interesting interpretation: it can be thought of as the information pertaining to the gross, core semantic component of the memories, after the fine details have been compromised. Note that 1 / ζ is a measure of the number of parents/factors/attributes that effectively dominate semantic space.

Taken at face value, the diminished capacity of the Potts network accompanied by the emergence of residual information suggests that this ability for generalization comes at the cost of losing the resolution with which we can retrieve the individual memories. However, this result as such is incomplete, and must be considered also in relation to the differential role of other memory structures and in particular the hippocampus in retrieval. For example, in humans, it has been shown that the ventral hippocampus projects directly to the medial prefrontal cortex, providing an immediate route for representations to reach the prefrontal cortex, suggesting a model of bidirectional hippocampus/prefrontal cortex interactions that support context-dependent memory retrieval [52]. Several studies have attempted to dissociate between the contributions of the hippocampus and the cortex in human memory retrieval. In particular, in [4,53] it was found that putatively different access modes to information stored in long-term memory in a remember/know paradigm lead to different distributions of classification errors of different groups with memory disorders. An information derived measure, the metric content, quantifying the concentration of errors was computed: high levels of metric content are indicative of a strong dependence on perceived relations among the set of stimuli, and therefore of a relatively preferred semantic access mode, while low levels (and similar correct performance), suggest a preferential episodic access mode. It was found that compared with normal controls, the metric content index was increased in patients with Alzheimer’s disease, decreased in patients with herpes encephalitis, and unvaried in patients with damage to the prefrontal cortex. Moreover, a significant correlation between the metric content and measures quantifying episodic and semantic retrieval mode in the remember/know paradigm introduced by Tulving [54] was found. 
If we think of the access modes, to a first approximation, as reflecting a stronger reliance on specific memory structures, the distribution of errors may then be a window into understanding their relative contributions. Within this larger picture, a cortical impairment as modelled for example by a Potts network with reduced connectivity may be somewhat mitigated by a complementary episodic mode of access, supported by other structures.

Our finding of semantic resilience, as characterized by the residual information, has an interesting interpretation also in relation to the findings in the neuropsychology of semantic dementia. A typical finding is that the finer-grained or “subordinate” aspects of such patients’ knowledge are more susceptible to damage than the more “ordinate” aspects, for example in naming tasks. Moreover, the naming errors that such patients make tend to change in time from “circumlocutions to category coordinates to superordinate labels” [55]. It has been argued that such a finding is in favor of tree-like models of semantic knowledge, in which the mental representation of a concept occupies nodes in a branching tree, where the origin of the tree corresponds to its most general and the periphery to its most selective designation. Subsequent research, however, pointed to findings that could not be explained through a tree model, such as verification latency [56] or typicality effects [47], or others which question where in the tree to store concepts that belong to more than one category [57]. In our account, instead, semantic resilience is an outcome that emerges naturally through higher values of the dominance parameter ζ, in which finer-grained or subordinate features of the concepts are overtaken by ordinate features, which then become the only retrievable ones, as shown through the cluster analysis in Figure 20. Crucially though, such clusters emerge only when our dominance parameter becomes large enough, and they are neither well-defined nor designed to have strict boundaries. The non-trivial behavior of the residual information, i.e., its phase transition with the dominance parameter ζ, cannot be predicted from a qualitative model encoding uncorrelated patterns. Our model offers one plausible way in which such resilience emerges.

Finally, our account may have implications for the question of how the cortex extracts and encodes the general statistical structure of the ensemble of stimuli that it receives. There is a well-established view of the cortex as a slow memory system that uses overlapping distributed representations to represent the general statistical structure of the environment. It has been suggested that the interaction between the hippocampus and the cortex is a crucial element in the consolidation of memories. The general idea is that memories are first stored in the hippocampal system via synaptic changes and that these support the reinstatement of recent memories in the neocortex. Neocortical synapses are slightly modified on each reinstatement and the gradual, neocortical changes accumulating over time encode remote memory. This division of labor would allow the hippocampus to rapidly encode new episodic items without disrupting semantic memories, and the cortex to slowly integrate them in a structured fashion into such memories.

This view is consistent with evidence that damage to the hippocampal system results in recent memory disruption but leaves remote memory intact, but it does not really specify what makes the consolidation process gradual or slow [58]. Early modelling attempts typically resorted to backpropagation to account for the structured learning of the cortex [59]: consolidation would then be slow, because backpropagation is effective with low learning rates. However, backpropagation has been widely criticized on the basis that it lacks a plausible biological mechanism. While hippocampal learning in these accounts was taken to fit the framework of learning unrelated patterns of activity, it had remained unclear how to model neocortical learning. Our account offers an alternative framework for neocortical learning, in which semantic structure is extracted progressively from the statistics of generating features and encoded in the cortex via Hebbian learning and is resilient, i.e., it is preserved when the storage capacity for “episodic details” is exceeded.

Author Contributions

V.B. and A.T. conceived and designed the study, which was performed primarily by V.B. and A.T., with contributions by R.B.; V.B. and A.T. wrote the paper, with input from R.B.

Funding

This research was supported by Human Frontier Science Program grant RGP0057/2016.

Acknowledgments

The authors would like to thank Alberto Pezzotta for very fruitful suggestions. Discussions with Michelangelo Naim and Chol-Jun Kang are gratefully acknowledged.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Calculation of the Probability Distribution of the Field for S = 1

In this section we outline the main steps in deriving Equation (25). The distribution for h_{μ} can be computed by making use of the probability generating function

G(s) \equiv \mathcal{L}\{P(x = ξ_{i}^{π})\} = \int_{0}^{\infty} dx \, P(x) \, e^{-s x} = \frac{a_{p}}{s} (1 - e^{-s}) + (1 - a_{p}).

(A1)

Since the ξ_{i}^{π} are identically and independently distributed for all π, we can use the following property

P(h_{μ} | n_{p}) = \mathcal{L}^{-1}\{G(s)^{n_{p}}\}.

(A2)

The number of parents as well as the total field received by a pattern are i.i.d., so we drop the index μ for h_{μ}. We can compute the conditional distribution of the field received by a unit, given the number of parents, as

P(h | n_{p}) = \lim_{γ \to \infty} \frac{1}{2 π i} \int_{c - i γ}^{c + i γ} ds \left[1 - a_{p} + \frac{a_{p}}{s} (1 - e^{-s})\right]^{n_{p}} e^{s h}.

(A3)

Using the binomial theorem,

\left[1 - a_{p} + \frac{a_{p}}{s} (1 - e^{-s})\right]^{n_{p}} = \sum_{k=0}^{n_{p}} \binom{n_{p}}{k} (1 - a_{p})^{n_{p} - k} \frac{a_{p}^{k}}{s^{k}} \sum_{j=0}^{k} (-1)^{j} \binom{k}{j} e^{-s j},

(A4)

we obtain

P(h | n_{p}) = \lim_{γ \to \infty} \frac{1}{2 π i} \int_{c - i γ}^{c + i γ} \sum_{k=0}^{n_{p}} \sum_{j=0}^{k} (-1)^{j} \binom{n_{p}}{k} \binom{k}{j} (1 - a_{p})^{n_{p} - k} \frac{a_{p}^{k}}{s^{k}} e^{s (h - j)} ds.

(A5)

We can carry out the integral term by term to find

I(k = 0) = δ(h),

(A6)

I(k \geq 1) = \frac{(h - j)^{k - 1}}{(k - 1)!} Θ(h - j).

(A7)

The distribution of the field h for a given number of parents n_{p} is then

P(h | n_{p}) = (1 - a_{p})^{n_{p}} δ(h) + \sum_{k=1}^{n_{p}} \sum_{j=0}^{k} \frac{(-1)^{j} n_{p}! \, a_{p}^{k} (1 - a_{p})^{n_{p} - k}}{(n_{p} - k)! (k - j)! j! (k - 1)!} (h - j)^{k - 1} Θ(h - j).

(A8)

The first term in this equation expresses the fact that the only way to get zero field is if all n_{p} parents contribute zero field, and this occurs with probability (1 - a_{p})^{n_{p}}. For a given pattern μ, with n_{p} parents, the field of each unit is distributed according to Figure 6. The cumulative distribution function reads

P(h' < h | n_{p}) = \int_{-\infty}^{h} dh' \, P(h' | n_{p})

(A9)

= (1 - a_{p})^{n_{p}} Θ(h) + \sum_{k=1}^{n_{p}} \sum_{j=0}^{k} \frac{(-1)^{j} n_{p}! \, a_{p}^{k} (1 - a_{p})^{n_{p} - k}}{(n_{p} - k)! (k - j)! j! k!} (h - j)^{k} Θ(h - j).

(A10)

The minimal threshold h_{m} is implicitly given by the cumulative probability

P(h' < h_{m} | n_{p}) = 1 - a.

(A11)
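As a numerical illustration of Equations (A10) and (A11), the following Python sketch evaluates the cumulative distribution by direct summation and locates the threshold h_m by bisection. The function names, the bisection tolerance and the parameter values in the example are assumptions made here, and the alternating sums should only be expected to be numerically reliable for small n_p.

```python
from math import factorial

def field_cdf(h, n_p, a_p):
    """Cumulative distribution of the field for S = 1, Eq. (A10)."""
    if h < 0:
        return 0.0
    # Theta(h) term: all n_p parents contribute zero field
    total = (1.0 - a_p) ** n_p
    for k in range(1, n_p + 1):
        for j in range(0, k + 1):
            if h <= j:
                continue  # Theta(h - j) vanishes
            coeff = ((-1) ** j * factorial(n_p)
                     * a_p ** k * (1.0 - a_p) ** (n_p - k)
                     / (factorial(n_p - k) * factorial(k - j)
                        * factorial(j) * factorial(k)))
            total += coeff * (h - j) ** k
    return total

def minimal_threshold(n_p, a_p, a, tol=1e-10):
    """Solve field_cdf(h_m) = 1 - a for h_m by bisection, Eq. (A11)."""
    target = 1.0 - a
    lo, hi = 0.0, float(n_p)  # the field cannot exceed n_p
    if field_cdf(lo, n_p, a_p) >= target:
        return 0.0  # the weight at zero field already reaches the target
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if field_cdf(mid, n_p, a_p) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative values only: a single parent, a_p = 0.5, sparsity a = 0.25
print(minimal_threshold(1, 0.5, 0.25))
```

For a single parent the cumulative distribution reduces to 1 - a_p + a_p h on the unit interval, which gives a simple check of the general expression.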

Appendix B. Calculation of the Probability Distribution of the Field for S = 2

To derive Equation (29), we start with the joint distribution of the number of parents by state

P(\hat{n}^{1} = n^{1}, \dots, \hat{n}^{S} = n^{S}) = \frac{n_{p}!}{S^{n_{p}} \prod_{k=1}^{S} n^{k}!}.

(A12)

Note that we define the field to be identically distributed across states. The probability that the fields of all states are below that of the first is given by

P(h = h^{1}) = \int_{0}^{h} P(h^{1}, \dots, h^{S}) \prod_{k=2}^{S} dh^{k}.

(A13)

The probability distribution of the maximal field is given by S times the one above

P(h_{\max}) = S \int_{0}^{h^{1}} P(h^{1}, \dots, h^{S}) \prod_{k=2}^{S} dh^{k}.

(A14)

The joint distribution of the fields across states reads

P(h^{1}, \dots, h^{S} | n_{p}) = \frac{n_{p}!}{S^{n_{p}}} \prod_{k=1}^{S} \sum_{n^{k}=1}^{n_{p}} \frac{P(h^{k} | n^{k})}{n^{k}!} δ_{n_{p}, \sum_{k=0}^{S} n^{k}},

(A15)

where the constraint n_{p} = \sum_{k=0}^{S} n^{k} has been included through the Kronecker delta. P(h^{k} | n^{k}) is given by Equation (A8), replacing n_{p} with n^{k}. We then have

P(h^{1}, \dots, h^{S} | n_{p}) = \frac{n_{p}!}{S^{n_{p}}} \prod_{k=1}^{S} \sum_{n^{k}=1}^{n_{p}} \left\{ \frac{(1 - a_{p})^{n^{k}}}{n^{k}!} δ(h^{k}) \right.

(A16)

\left. + \sum_{i=1}^{n^{k}} \sum_{j=0}^{i} (-1)^{j} \frac{a_{p}^{i} (1 - a_{p})^{n^{k} - i}}{(n^{k} - i)! (i - j)! j!} \frac{(h^{k} - j)^{i - 1}}{(i - 1)!} Θ(h^{k} - j) \right\} δ_{n_{p}, \sum_{k=0}^{S} n^{k}}.

(A17)

For S = 1 all contributions go to a single state, so we automatically have n^{1} = n_{p}; the first sum then disappears and we fall back onto Equation (A8). For S = 2 we have, denoting the maximal field by H,

\begin{aligned} P(H | n_{p}) = \frac{n_{p}!}{2^{n_{p} - 1}} \sum_{n^{1}=1}^{n_{p}} & \left\{ \frac{(1 - a_{p})^{n^{1}}}{n^{1}!} δ(H) + \sum_{i=1}^{n^{1}} \sum_{j=0}^{i} \frac{(-1)^{j} a_{p}^{i} (1 - a_{p})^{n^{1} - i}}{(n^{1} - i)! (i - j)! j! (i - 1)!} (H - j)^{i - 1} Θ(H - j) \right\} \\ \times & \left\{ \frac{(1 - a_{p})^{n_{p} - n^{1}}}{(n_{p} - n^{1})!} Θ(H) + \sum_{i'=1}^{n_{p} - n^{1}} \sum_{j'=0}^{i'} \frac{(-1)^{j'} a_{p}^{i'} (1 - a_{p})^{n_{p} - n^{1} - i'}}{(n_{p} - n^{1} - i')! (i' - j')! j'! i'!} (H - j')^{i'} Θ(H - j') \right\}, \end{aligned}

(A18)

where we drop the indices denoting the units (they are drawn from the same distribution). Note that the state does not appear in this expression because it is the distribution for the state that receives maximal input, regardless of which one it is. The μ dependence is through n_{p} = n_{p}(μ). We then get the minimal threshold for activation H_{m} implicitly in terms of the cumulative distribution

P(H' < H_{m} | n_{p}) = \int_{-\infty}^{H_{m}} P(H' | n_{p}) \, dH' = 1 - a.

(A19)

We can compute it to find

\begin{aligned} P(H' < H | n_{p}) = \frac{n_{p}!}{2^{n_{p} - 1}} \sum_{n^{1}=1}^{n_{p}} \Bigg\{ & \frac{(1 - a_{p})^{n_{p}}}{n^{1}! (n_{p} - n^{1})! \, 2} \\ & + \frac{(1 - a_{p})^{n_{p} - n^{1}}}{(n_{p} - n^{1})!} \sum_{i=1}^{n^{1}} \sum_{j=0}^{i} \frac{(-1)^{j} a_{p}^{i} (1 - a_{p})^{n^{1} - i}}{(n^{1} - i)! (i - j)! j! i!} (H - j)^{i} Θ(H - j) \\ & + \sum_{i'=1}^{n_{p} - n^{1}} \sum_{j'=0}^{i'} \frac{(-1)^{j'} a_{p}^{i'} (1 - a_{p})^{n_{p} - n^{1} - i'}}{(n_{p} - n^{1} - i')! (i' - j')! j'! i'!} \sum_{i=1}^{n^{1}} \sum_{j=0}^{i} \frac{(-1)^{j} a_{p}^{i} (1 - a_{p})^{n^{1} - i}}{(n^{1} - i)! (i - j)! j! (i - 1)!} I(H, i, i', j, j') \Bigg\}, \end{aligned}

(A20)

where, with j^{*} = \max\{j, j'\},

\begin{aligned} I(H, i, i', j = j') &= \frac{(H - j)^{i + i'}}{i + i'} Θ(H - j), \\ I(H, i, i', j \neq j') &= \begin{cases} i'! (i - 1)! \left[ \sum_{q=0}^{i'} (-1)^{q} \frac{(H - j)^{i + q} (H - j')^{i' - q}}{(i + q)! (i' - q)!} Θ(H - j^{*}) + (-1)^{i' + 1} \frac{(j^{*} - j)^{i + i'}}{(i + i')!} \right], & i - 1 \geq i', \\ i'! (i - 1)! \left[ \sum_{q=0}^{i - 1} (-1)^{q} \frac{(H - j)^{i - q - 1} (H - j')^{i' + q + 1}}{(i - q - 1)! (i' + q + 1)!} Θ(H - j^{*}) + (-1)^{i} \frac{(j^{*} - j)^{i + i'}}{(i + i')!} \right], & i - 1 < i'. \end{cases} \end{aligned}

(A21)
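The closed-form expressions of this appendix can be cross-checked by sampling. The Python sketch below assumes the generative picture suggested by Equations (A1) and (A12): each of the n_p parents picks one of the S states uniformly at random and, with probability a_p, adds to it a field drawn uniformly from [0, 1]. The threshold of Equation (A19) is then estimated as the empirical (1 - a) quantile of the maximal field. Function names, sample size and seed are illustrative choices, not part of the original analysis.

```python
import numpy as np

def sample_max_field(n_p, a_p, n_states, n_samples, rng):
    """Sample the maximal field over states for units with n_p parents.

    Assumed model: every parent chooses a state uniformly and, with
    probability a_p, contributes a uniform [0, 1] field to that state.
    """
    states = rng.integers(0, n_states, size=(n_samples, n_p))
    is_active = rng.random((n_samples, n_p)) < a_p
    contrib = rng.random((n_samples, n_p)) * is_active
    fields = np.zeros((n_samples, n_states))
    for s in range(n_states):
        # total field received by state s from the parents pointing to it
        fields[:, s] = np.sum(contrib * (states == s), axis=1)
    return fields.max(axis=1)

def empirical_threshold(n_p, a_p, n_states, a, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the threshold, as the (1 - a) quantile."""
    rng = np.random.default_rng(seed)
    h_max = sample_max_field(n_p, a_p, n_states, n_samples, rng)
    return float(np.quantile(h_max, 1.0 - a))

# Illustrative values only: six parents, a_p = 0.5, sparsity a = 0.25
print(empirical_threshold(6, 0.5, 1, 0.25))  # S = 1
print(empirical_threshold(6, 0.5, 2, 0.25))  # S = 2
```

Since the maximal field over states can never exceed the summed field, the estimated threshold for S = 2 should fall below the one for S = 1 at equal n_p, a_p and a.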

Appendix C. Ultrametric Content

A possible characterization of the correlations between the memory patterns is in terms of a distance. A quasi-distance measure can be derived from the correlation following the same procedure as in [10]. We first define a so-called “confusion” matrix

P(μ | ν) = \frac{C_{μ ν}}{\sum_{μ' = 0}^{p} C_{μ' ν}},

(A22)

where C_{μ ν} is an element of the correlation matrix and where P, the confusion matrix, is obtained by normalizing each element of the correlation matrix appropriately. Next, we symmetrize the above function to obtain

d(μ, ν) = - \log \left( \frac{P(ν | μ) P(μ | ν)}{P(μ | μ) P(ν | ν)} \right),

(A23)

a quasi-distance, in the sense that it satisfies only the reflexive and symmetric properties, d(μ, μ) = 0 and d(μ, ν) = d(ν, μ). The triangular inequality d(μ, ρ) \leq d(μ, ν) + d(ν, ρ) does not necessarily hold. It can be made to hold by raising d to a sufficiently small power, d \to d^{1 / p}, called the “trivialization” of d, as explained in detail in [10]. Using this procedure, distances between triplets of patterns \{μ, ν, ρ\} can be computed. If we denote by d_{min} the edge of minimal length, d_{max} the edge of maximal length and d_{med} the edge of intermediate length, then we can plot, in a two-dimensional graph, the ratios δ_{1} = d_{min} / d_{max} and δ_{2} = d_{med} / d_{max}. Triplets that satisfy the triangular inequality lie above the line δ_{1} = 1 - δ_{2}, while triplets that satisfy the ultrametric inequality lie on the vertical line where δ_{2} = 1. Among these, triplets that are equilateral triangles lie at the point δ_{1} = δ_{2} = 1. To measure the overall closeness of the cloud of triplets to the fully ultrametric limit one can define the ultrametric content

λ_{um} = \left\langle \frac{\log δ_{1} - \log δ_{2}}{\log δ_{1} + \log δ_{2}} \right\rangle,

(A24)

where \langle \cdot \rangle denotes the mean over all triplets. This quantity does not depend on the trivialization of d and it ranges from 0 (for triplets forming isosceles triangles with two short sides) to 1 (for a fully ultrametric set: equilateral triangles and isosceles triangles with two long sides).
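A minimal Python sketch of the procedure, from correlation matrix to ultrametric content, is given below. It assumes a correlation matrix with strictly positive entries, skips the trivialization step (which leaves the ultrametric content unchanged), assigns the value 1 to equilateral triplets, for which Equation (A24) is formally 0/0, and skips triplets containing a zero distance. These conventions, the function names and the example matrices are choices made for this illustration.

```python
import itertools
import numpy as np

def quasi_distance(corr):
    """Quasi-distance of Eq. (A23) from a positive correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    # confusion matrix P(mu | nu): normalize every column nu, Eq. (A22)
    conf = corr / corr.sum(axis=0, keepdims=True)
    diag = np.diag(conf)
    ratio = (conf * conf.T) / np.outer(diag, diag)
    return -np.log(ratio)

def ultrametric_content(dist):
    """Mean over triplets of the index in Eq. (A24)."""
    dist = np.asarray(dist, dtype=float)
    values = []
    for i, j, k in itertools.combinations(range(dist.shape[0]), 3):
        d_min, d_med, d_max = sorted((dist[i, j], dist[i, k], dist[j, k]))
        if d_min <= 0.0:
            continue  # degenerate triplet, skipped by assumption
        log1 = np.log(d_min / d_max)
        log2 = np.log(d_med / d_max)
        if log1 + log2 == 0.0:
            values.append(1.0)  # equilateral triangle, counted as ultrametric
        else:
            values.append((log1 - log2) / (log1 + log2))
    return float(np.mean(values))

def triangle(a, b, c):
    """Distance matrix of three points with edge lengths a, b, c."""
    return np.array([[0.0, a, b], [a, 0.0, c], [b, c, 0.0]])

two_long = triangle(1.0, 2.0, 2.0)    # isosceles with two long sides
two_short = triangle(1.0, 1.0, 1.5)   # isosceles with two short sides
print(ultrametric_content(two_long), ultrametric_content(two_short))
```

Applying `ultrametric_content` to `quasi_distance(C)` for a correlation matrix `C` gives the index for a full set of patterns; the two toy triangles reproduce the limiting values quoted in the text.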

References

1. Yonelinas, A.P. The nature of recollection and familiarity: A review of 30 years of research. J. Mem. Lang. 2002, 46, 441–517.
2. Treves, A.; Rolls, E.T. Computational constraints suggest the need for two distinct input systems to the hippocampal CA3 network. Hippocampus 1992, 2, 189–199.
3. Alme, C.B.; Miao, C.; Jezek, K.; Treves, A.; Moser, E.I.; Moser, M.B. Place cells in the hippocampus: Eleven maps for eleven rooms. Proc. Natl. Acad. Sci. USA 2014, 111, 18428–18435.
4. Lauro-Grotto, R.; Ciaramelli, E.; Piccini, C.; Treves, A. Differential impact of brain damage on the access mode to memory representations: An information theoretic approach. Eur. J. Neurosci. 2007, 26, 2702–2712.
5. Huth, A.G.; Nishimoto, S.; Vu, A.T.; Gallant, J.L. A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron 2012, 76, 1210–1224.
6. Huth, A.G.; de Heer, W.A.; Griffiths, T.L.; Theunissen, F.E.; Gallant, J.L. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 2016, 532, 453–458.
7. Mitchell, T.M.; Shinkareva, S.V.; Carlson, A.; Chang, K.M.; Malave, V.L.; Mason, R.A.; Just, M.A. Predicting human brain activity associated with the meanings of nouns. Science 2008, 320, 1191–1195.
8. Collins, A.M.; Quillian, M.R. Retrieval time from semantic memory. J. Verbal Learn. Verbal Behav. 1969, 8, 240–247.
9. Warrington, E.K. The selective impairment of semantic memory. Q. J. Exp. Psychol. 1975, 27, 635–657.
10. Treves, A. On the perceptual structure of face space. BioSystems 1997, 40, 189–196.
11. Parga, N.; Virasoro, M.A. The ultrametric organization of memories in a neural network. J. Phys. 1986, 47, 1857–1864.
12. Gutfreund, H. Neural networks with hierarchically correlated patterns. Phys. Rev. A 1988, 37, 570–577.
13. Franz, S.; Amit, D.J.; Virasoro, M.A. Prosopagnosia in high capacity neural networks storing uncorrelated classes. J. Phys. 1990, 51, 387–408.
14. Virasoro, M.A. Categorization in neural networks and prosopagnosia. Phys. Rep. 1989, 184, 301–306.
15. Brasselet, R.; Arleo, A. Category Structure and Categorical Perception Jointly Explained by Similarity-Based Information Theory. Entropy 2018, 20, 527.
16. Shallice, T.; Cooper, R. The Organisation of Mind; Oxford University Press: Oxford, UK, 2011.
17. Rumelhart, D.E.; Hinton, G.E.; McClelland, J.L. A general framework for parallel distributed processing. Parallel Distrib. Process. Explor. Microstruct. Cognit. 1986, 1, 45–76.
18. Farah, M.J.; McClelland, J.L. A computational model of semantic memory impairment: Modality specificity and emergent category specificity. J. Exp. Psychol. Gen. 1991, 120, 339.
19. Plaut, D.C. Semantic and associative priming in a distributed attractor network. In Proceedings of the 17th Annual Conference of the Cognitive Science Society, Hillsdale, NJ, USA, 22–25 July 1995; Volume 17, pp. 37–42.
20. Rogers, T.T.; Lambon Ralph, M.A.; Garrard, P.; Bozeat, S.; McClelland, J.L.; Hodges, J.R.; Patterson, K. Structure and deterioration of semantic memory: A neuropsychological and computational investigation. Psychol. Rev. 2004, 111, 205.
21. Hellwig, B. A quantitative analysis of the local connectivity between pyramidal neurons in layers 2/3 of the rat visual cortex. Biol. Cybern. 2000, 82, 111–121.
22. Roudi, Y.; Treves, A. An associative network with spatially organized connectivity. J. Stat. Mech. Theory Exp. 2004, 2004, P07010.
23. Pucak, M.L.; Levitt, J.B.; Lund, J.S.; Lewis, D.A. Patterns of intrinsic and associational circuitry in monkey prefrontal cortex. J. Comp. Neurol. 1996, 376, 614–630.
24. Braitenberg, V.; Schüz, A. Anatomy of the Cortex: Statistics and Geometry; Springer Science & Business Media: Berlin, Germany, 1991; Volume 18.
25. O’Kane, D.; Treves, A. Short-and long-range connections in autoassociative memory. J. Phys. A Math. Gen. 1992, 25, 5055.
26. O’Kane, D.; Treves, A. Why the simplest notion of neocortex as an autoassociative memory would not work. Netw. Comput. Neural Syst. 1992, 3, 379–384.
27. Mari, C.F.; Treves, A. Modeling neocortical areas with a modular neural network. Biosystems 1998, 48, 47–55.
28. Johansson, C.; Rehn, M.; Lansner, A. Attractor neural networks with patchy connectivity. Neurocomputing 2006, 69, 627–633.
29. Dubreuil, A.M.; Brunel, N. Storing structured sparse memories in a multi-modular cortical network model. J. Comput. Neurosci. 2016, 40, 157–175.
30. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic routing between capsules. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 3859–3869.
31. Naim, M.; Boboeva, V.; Kang, C.J.; Treves, A. Reducing a cortical network to a Potts model yields storage capacity estimates. J. Stat. Mech. Theory Exp. 2018, 2018, 043304.
32. Hopfield, J.J. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 1982, 79, 2554–2558.
33. Tsodyks, M.V.; Feigel’Man, M.V. The enhanced storage capacity in neural networks with low activity level. EPL (Europhys. Lett.) 1988, 6, 101.
34. Kropff, E.; Treves, A. The storage capacity of Potts models for semantic memory retrieval. J. Stat. Mech. Theory Exp. 2005, 2005, P08010.
35. Mézard, M.; Parisi, G.; Virasoro, M.A. Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications; World Scientific Publishing Co Inc.: Singapore, 1987; Volume 9.
36. Amit, D.J. Modeling Brain Function: The World of Attractor Neural Networks; Cambridge University Press: Cambridge, UK, 1992.
37. Treves, A. Frontal latching networks: A possible neural basis for infinite recursion. Cognit. Neuropsychol. 2005, 22, 276–291.
38. Sartori, G.; Lombardi, L. Semantic relevance and semantic disorders. J. Cognit. Neurosci. 2004, 16, 439–452.
39. Osgood, C.E. Semantic differential technique in the comparative study of cultures. Am. Anthropol. 1964, 66, 171–200.
40. Löwe, M. On the storage capacity of Hopfield models with correlated patterns. Ann. Appl. Probab. 1998, 8, 1216–1250.
41. Engel, A. Storage capacity for hierarchically correlated patterns. J. Phys. A Math. Gen. 1990, 23, L285.
42. Shiino, M.; Fukai, T. Self-consistent signal-to-noise analysis of the statistical behavior of analog neural networks and enhancement of the storage capacity. Phys. Rev. E 1993, 48, 867.
43. Kropff, E. Full solution for the storage of correlated memories in an autoassociative memory. Comput. Model. Behav. Neurosci. Closing Gap Neurophysiol. Behav. 2009, 2, 225.
44. Tamarit, F.A.; Curado, E.M. Pair-correlated patterns in Hopfield model of neural networks. J. Stat. Phys. 1991, 62, 473–480.
45. Liberman, A.M.; Harris, K.S.; Hoffman, H.S.; Griffith, B.C. The discrimination of speech sounds within and across phoneme boundaries. J. Exp. Psychol. 1957, 54, 358–368.
46. Rips, L.J.; Shoben, E.J.; Smith, E.E. Semantic distance and the verification of semantic relations. J. Verbal Learn. Verbal Behav. 1973, 12, 1–20.
47. Rosch, E.H.; Mervis, C.B. Family resemblances: Studies in the internal structure of categories. Cognit. Psychol. 1975, 7, 573–605.
48. Kuhl, P.K. Human adults and human infants show a “perceptual magnet effect” for the prototypes of speech categories, monkeys do not. Percept. Psychophys. 1991, 50, 93–107.
49. Feldman, N.H.; Griffiths, T.L.; Morgan, J.L. The influence of categories on perception: Explaining the perceptual magnet effect as optimal statistical inference. Psychol. Rev. 2009, 116, 752.
50. Haxby, J.V.; Gobbini, M.I.; Furey, M.L.; Ishai, A.; Schouten, J.L.; Pietrini, P. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 2001, 293, 2425–2430.
51. Norman, K.A.; Polyn, S.M.; Detre, G.J.; Haxby, J.V. Beyond mind-reading: Multi-voxel pattern analysis of fMRI data. Trends Cognit. Sci. 2006, 10, 424–430.
52. Preston, A.R.; Eichenbaum, H. Interplay of hippocampus and prefrontal cortex in memory. Curr. Biol. 2013, 23, R764–R773.
53. Ciaramelli, E.; Lauro-Grotto, R.; Treves, A. Dissociating episodic from semantic access mode by mutual information measures: Evidence from aging and Alzheimer’s disease. J. Physiol. Paris 2006, 100, 142–153.
54. Tulving, E. Episodic memory: From mind to brain. Annu. Rev. Psychol. 2002, 53, 1–25.
55. Garrard, P.; Perry, R.; Hodges, J.R. Disorders of semantic memory. J. Neurol. Neurosurg. Psychiatry 1997, 62, 431.
56. Conrad, C. Cognitive Economy in Semantic Memory; American Psychological Association: Washington, DC, USA, 1972.
57. Spivey, M.; Joanisse, M.; McRae, K. The Cambridge Handbook of Psycholinguistics; Cambridge University Press: Cambridge, UK, 2012.
58. Kitamura, T.; Ogawa, S.K.; Roy, D.S.; Okuyama, T.; Morrissey, M.D.; Smith, L.M.; Redondo, R.L.; Tonegawa, S. Engrams and circuits crucial for systems consolidation of a memory. Science 2017, 356, 73–78.
59. McClelland, J.L.; McNaughton, B.L.; O’Reilly, R.C. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 1995, 102, 419.

Figure 1. (a) Original correlation matrix. (b) Correlation matrix obtained by replacing each within-block entry with the mean correlation value of that noun cluster and each off-block entry with the mean correlation between clusters. The clusters are obtained through the application of a standard clustering algorithm to the original correlation matrix (a). (c) Strictly ultrametric correlation matrix obtained by again replacing each within-block entry with the mean value within that block, and now each off-block entry with the overall off-block mean.

Figure 2. (a) Two-dimensional logarithmic density plot of the ratio between the intermediate and the longest edge vs. the ratio between the shortest and the longest edge, in the triangles created by extracting quasi-distances for the triplets of nouns taken from Figure 1a. The triplets are scattered, with an ultrametric content of 0.5. (b) Same as (a), with triplets taken from Figure 1b. The triplets have less scatter and yield an ultrametric content of 0.61. (c) Same as (a), with triplets taken from Figure 1c. Here triplets constitute isosceles triangles with two long sides, as can be seen from the alignment of the ratios along the vertical line d_{med} = d_{max}. The ultrametric content (see Appendix C) is exactly 1. In all three panels, the dashed red line corresponds to the line of constant ultrametric content index.

Figure 3. (a) The Potts network, here intended as a model of semantic memory, is a coarse description of the cortex in terms of local patches of dense connectivity, which store activity patterns corresponding to local attractors. Each patch is a small local network characterized by high connectivity; diluted connections are instead present between units of different patches. The configuration of the individual patch is assumed to converge to a local attractor, synthetically captured by a Potts state. Each Potts unit, depicted in (b), can be in any of S states, where green, orange, blue and red represent the active states ($S = 4$). The white circle at the center corresponds to the quiescent state, aimed at capturing a situation of no retrieval of the underlying local network.

Figure 4. (a) A tree, adapted from [35]. Nodes 1, 2, 3, 4, 5, and 6 are at the same level of the hierarchy. If we consider the nodes 1, 3 and 6, they are each at a distance of 2 from each other, the distance being defined as the distance to the nearest common branching point. If we consider nodes 3, 4 and 5, then they are each at a distance of 1 from each other, such that we again get an equilateral triangle. If we consider 1, 2 and 3, then $d_{12} = 1$ while $d_{13} = d_{23} = 2$, such that we get an isosceles triangle with two long edges. One alternative, an isosceles triangle with two short edges, is impossible to realize: there are no intermediate points between 1 and 3 or 2 and 3, as indicated in red in (b).
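
The distances quoted in this caption can be checked with a short sketch. The encoding of the tree below is an assumption chosen to be consistent with the distances stated in the caption (leaves 1 and 2 under one branch, 3, 4 and 5 under a second, 6 under a third); the branch names and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical encoding of the tree in panel (a): each leaf maps to its chain
# of ancestors, nearest first. Branch names "A", "B", "C" are illustrative.
ANCESTORS = {
    1: ["A", "root"], 2: ["A", "root"],
    3: ["B", "root"], 4: ["B", "root"], 5: ["B", "root"],
    6: ["C", "root"],
}

def tree_distance(i, j, ancestors=ANCESTORS):
    """Number of levels up to the nearest common branching point of leaves i and j."""
    if i == j:
        return 0
    for level, (a, b) in enumerate(zip(ancestors[i], ancestors[j]), start=1):
        if a == b:
            return level
    raise ValueError("leaves share no ancestor")

def triangle_type(i, j, k):
    """Classify the triangle spanned by three leaves from its sorted edge lengths."""
    d_min, d_med, d_max = sorted(
        tree_distance(a, b) for a, b in combinations((i, j, k), 2)
    )
    if d_min == d_max:
        return "equilateral"
    if d_med == d_max:
        return "isosceles, two long edges"
    return "not ultrametric"

print(triangle_type(1, 3, 6))
print(triangle_type(3, 4, 5))
print(triangle_type(1, 2, 3))
```

Looping over all triplets of leaves never produces the "not ultrametric" case for this tree, which is the point made in the caption: an isosceles triangle with two short edges cannot be realized.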

Figure 5. (a) The workings of a hierarchical algorithm with 3 parents and 3 child patterns per parent. The different lines of squares/circles correspond to the different units in parents/children. Colors correspond to active Potts states while white denotes the quiescent state; here $S = 3$. (b) The workings of the multi-parent algorithm with $Π = 3$ parents and $p_{par} = 3$ child patterns per parent and 5 total child patterns. Black arrows and their thickness denote strength of input. The main difference with the hierarchical algorithm is that each child pattern can receive input from multiple parents. If each parent is to represent a feature and each child a concept, the algorithm entails the generation of a concept from multiple features.

Figure 6. (a) Solid lines correspond to the analytical distributions of the field, Equation (A8); in blue is the distribution of the fields produced by a simulation of the algorithm for $n_{p} = 15$. The parameters are $N = 2000$, $S = 1$, $a_{p} = 0.4$, $n_{p} = 15 \dots 50$ and $Π = 100$. (b) The mean and standard deviation of the field as a function of the number of parents.

Figure 7. (a) Distribution of the maximal fields for $S = 2$ and $n_{p} = 30$. In blue is the distribution of the fields produced by the algorithm and the black line is Equation (A18). (b) The x-axis orders patterns by their number of parents and the y-axis shows the fields of the units in each pattern. Red points correspond to units that are set to quiescent and green to those that are activated. The boundary between the green and the red corresponds to $h_{m}$, the minimum field required for a unit to be set to active. Parameters are $N = 2000$, $S = 2$, $a_{p} = 0.4$ and $Π = 100$.

Figure 8. Probability density function of correlations between units (in red) and between patterns (in green) for three different values of both $a_{p}$ and f, the latter yielding in this case an average of 1.5, 4.5 and 7.5 parents per pattern. The black vertical line corresponds to the average correlation with uncorrelated patterns distributed independently according to Equation (1). The parameters are $S = 5$, $a = 0.3$, and $Π = 150$. The algorithm produces correlations between patterns with high variability relative to the correlation between units, in line with ideas about semantic memory. Note that the algorithm is sensitive to the parameters and their values strongly affect the correlation between patterns.

Figure 9. (a) Boxplots of $C^{μν}$ for different values of $a_{p}$, with $f = 0.05$ fixed. (b) Boxplots of $C^{μν}$ for different values of f with $a_{p} = 0.4$ fixed. The parameters $a_{p}$ and f play different roles in generating the correlations. Increasing the extent $a_{p}$ of the input that children receive from each parent increases the overall similarity of those children having shared parents, as evidenced by the increasing skewness of the distributions. In contrast, increasing the prolificity f leads to an increase in the mean number of shared parents, such that all children are more correlated, as shown by the shift in the overall distribution. The black horizontal line corresponds to the average correlation with uncorrelated patterns distributed according to Equation (1). Other parameters are $a = 0.3$, $S = 5$ and $Π = 150$.

Figure 10. (a) Another visualization of the correlation distribution of Figure 8, with $f = 0.05$, $a_{p} = 0.4$ and $Π = 150$, decomposed into the distribution for each number of shared parents. (b) Fraction of pairs of patterns (left y-axis, note the logarithmic scale) and mean correlation between those pairs (right y-axis, linear scale) as a function of number of shared parents. The red horizontal line corresponds to the average correlation with uncorrelated patterns distributed according to Equation (1). Pairs of patterns having more shared parents are markedly fewer, although on average more correlated, so they do not affect much the overall mean correlation.

Figure 11. The x-axis lists all of the features used to compute the correlation between the nouns in the toy example of Section 1.1, sorted according to their summed weights across all nouns (reported on a semi-logarithmic y-axis). The exponent of the fit, $ζ = 0.078$, indicates that the semantics of this particular set of nouns is effectively dominated by a set of order $1 / ζ ≃ 10$ features.

Figure 12. One sample representation of parent-child relations. The squares on the top row represent parents, while the circles at the bottom row represent children. Black lines represent input from the parents to the children. The strength with which each parent affects its children is proportional to $\exp(-ζ π)$, where $π$ indexes the parents, as explained in the text. For illustration, there are $Π = 10$ parents, $p_{par} = 5$ children per parent and $p = 50$ total children.
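
To make the role of the dominance rate concrete, the exponential weighting $\exp(-ζ π)$ can be sketched numerically. This is only an illustration under stated assumptions: parents are indexed from 0, the strengths are normalized to sum to 1, and the function names are hypothetical; none of these choices is taken from the paper.

```python
import numpy as np

def parent_strengths(zeta, n_parents):
    """Relative strengths exp(-zeta * pi) for pi = 0, ..., n_parents - 1.

    Normalized to sum to 1 (an assumption made here for easy comparison).
    """
    pi = np.arange(n_parents)
    w = np.exp(-zeta * pi)
    return w / w.sum()

def share_of_strongest(zeta, n_parents, k):
    """Fraction of the total strength carried by the k strongest parents."""
    return parent_strengths(zeta, n_parents)[:k].sum()

# With 150 parents, compare a nearly balanced case with a dominated one.
balanced = share_of_strongest(0.001, 150, 10)
dominated = share_of_strongest(0.1, 150, 10)
print(f"share of the 10 strongest parents: {balanced:.3f} vs {dominated:.3f}")
```

For small $ζ$ the 10 strongest parents carry little more than their proportional share of 10/150, whereas for $ζ = 0.1$ the first $1/ζ = 10$ parents carry a fraction close to $1 - e^{-1}$ of the total, which is the sense in which a set of order $1/ζ$ factors dominates.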

Figure 13. Probability density function of correlations between units (in red) and between patterns (in green) for three different values of the dominance rate $ζ$ and prolificity f, keeping $a_{p} = 0.4$ constant. For the low value of $ζ = 0.001$, this figure reproduces the middle panel of Figure 8. For higher values of $ζ$, where the parents become highly heterogeneous, we see the emergence of large correlations.

Figure 14. $C^{μν}$ for different values of $ζ$, with $a_{p} = 0.4$ and $f = 0.05$ fixed. Other parameters are $a = 0.3$, $S = 5$ and $Π = 150$.

Figure 15. (a) Storage capacity as a function of the sparsity a for different values of the dilution parameter $c_{m} / N$, for uncorrelated patterns ($C_{as} = \tilde{a}$), obtained through solutions of the mean-field equations. (b) Storage capacity for uncorrelated patterns (in black) and for correlated patterns (in green). Dots correspond to simulations, while the dash-dotted lines correspond to solutions of the mean-field equations. For uncorrelated patterns this is the same curve as in panel (a) with $c_{m} / N = 0.1$. It is apparent that the mean-field treatment yields better results for uncorrelated patterns; for correlated patterns, it over-estimates the storage capacity. Parameters are $N = 2000$, $c_{m} = 200$, $S = 5$, $U = 0.5$ and $β = 200$.

Figure 16. (a) Storage capacity $α_{c}$ as a function of the sparsity a for different values of the correlation parameters $a_{p}$ and f. The storage capacity is defined as the critical storage at which half of all cued patterns are retrieved with overlap of $0.7$ and above. Increasing either $a_{p}$ or f is generally detrimental to the capacity. (b) $α_{c}$ as a function of the number of Potts states S, which shows that the superlinear increase derived analytically in [34] for randomly correlated patterns (the black curve) is only really approached, within this limited S range, with patterns that are very close to randomly correlated (the orange curve). (c) $α_{c}$ as a function of the connectivity $c_{m}$ for random dilution (see Equation (7)). The capacity decreases as a function of increasing connectivity. When not explicitly varied, parameters are $N = 2000$, $c_{m} = 200$, $a = 0.1$, $S = 5$, $U = 0.5$, $β = 200$, $ζ = 10^{-6}$ and $Π = 150$.
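
The retrieval criterion stated in this caption (half of all cued patterns retrieved with overlap of 0.7 and above) can be turned into a simple estimate of the capacity from simulation output. The sketch below is one possible reading, not the authors' procedure: it assumes a grid of increasing storage loads, reports the largest grid value still satisfying the criterion without interpolating, and uses hypothetical names and made-up overlap values.

```python
import numpy as np

def critical_load(loads, overlaps, overlap_threshold=0.7, fraction=0.5):
    """Largest load at which at least `fraction` of cued patterns are retrieved.

    loads: storage loads in increasing order.
    overlaps: one sequence of final overlaps per load (one value per cued pattern).
    Returns None if the criterion is not met even at the smallest load.
    """
    alpha_c = None
    for alpha, m in zip(loads, overlaps):
        retrieved = np.mean(np.asarray(m) >= overlap_threshold)
        if retrieved >= fraction:
            alpha_c = alpha
        else:
            break  # stop at the first load where retrieval falls below the criterion
    return alpha_c

# Made-up overlaps for four cued patterns at three loads (illustrative only).
toy_loads = [0.1, 0.2, 0.3]
toy_overlaps = [
    [0.90, 0.95, 0.80, 0.99],  # all four retrieved
    [0.90, 0.75, 0.20, 0.10],  # exactly half retrieved
    [0.10, 0.20, 0.80, 0.30],  # one in four retrieved
]
alpha_c = critical_load(toy_loads, toy_overlaps)
```

A finer grid of loads, or interpolation between the two loads that bracket the 50% crossing, would sharpen the estimate; both are design choices left open here.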

Figure 17. Storage capacity curves as a function of different correlation parameters. (a) $α_{c}$ as a function of f. The full lines correspond to simulations while the dashed line corresponds to solutions of the mean-field equations. It can be seen that, similar to Figure 16a, the over-estimation of the capacity through the SCSNA approach also holds for other values of f. (b) $α_{c}$ as a function of $a_{p}$. When not explicitly varied, the correlation parameters are $a_{p} = 0.4$, $f = 0.05$, $ζ = 10^{-6}$ and $Π = 150$. Network parameters are $N = 2000$, $c_{m} = 200$, $a = 0.1$, $S = 5$.

Figure 18. Mean overlap (left y-axis) and mutual information per connection (dashed black curve, right y-axis) as a function of the storage load $α$, for different values of the dominance $ζ$. For low values of $ζ$, the information falls abruptly at a value of the storage load $α$, and similarly for very high values of $ζ$, while for intermediate values we observe a more gradual decay, starting at lower values of the storage load. For intermediate values of $ζ$, however, the information does not go to zero, but rather saturates at a certain value. We call this residual information. In Figure 19, we plot this residual information as a function of $ζ$. Network parameters are $N = 2000$, $c_{m} = 200$, $S = 5$, $a = 0.1$, $U = 0.5$, $β = 200$. Correlation parameters are $a_{p} = 0.4$, $f = 0.05$ and $Π = 150$.

Figure 19. (a) Storage capacity (left y-axis) and residual mutual information between cued memory and configuration retrieved, after capacity collapse (right y-axis), as a function of $ζ$. The storage capacity displays a trough for intermediate values of the dominance $ζ$ that is due to an increased clustering of the patterns and the inability of the network to retrieve each one of them with relatively high precision. The apparent increase of the capacity, for very high values of the parameter $ζ$, is instead due to such clusters vanishing altogether, one by one, as the inputs from weaker parents are dominated by small input to a random state (see Equation (31)). The residual mutual information corroborates the results from the storage capacity: it is only for intermediate values of $ζ$ that the network can retrieve some information after capacity collapse. Network parameters are $N = 2000$, $c_{m} = 200$, $S = 5$, $a = 0.1$, $U = 0.5$, $β = 200$. Correlation parameters are $a_{p} = 0.4$, $f = 0.05$ and $Π = 150$. (b) Phase diagram of the residual information, as a function of the dominance $ζ$ on the x-axis and prolificity f on the y-axis, giving a fuller picture of the phase transition in (a). Note that the transition to non-zero residual mutual information occurs at higher values of $ζ$ with increasing f. The black horizontal line, plotted for clarity, corresponds to the value of f used in (a). The three white dots correspond to three points for which a cluster analysis is reported in Figure 20.

Figure 20. Cluster analysis applied to patterns generated by the algorithm, for three different values of the dominance parameter $ζ$, chosen at salient points of the phase diagram in Figure 19. With increasing $ζ$, the patterns generated by the algorithm become more and more clustered, as the strongest parent of each pattern comes to dominate its activity.

Table 1. Ultrametric content computed for distances of triplets of patterns generated by the algorithm, for six different parameter values of the prolificity and the dominance. An increased ultrametric content reflects an increased clustering in the correlations between patterns. For $f = 0.2$ and $ζ = 0.1$, the patterns yield values of the ultrametric content index close to that obtained from the nouns (∼0.5). The corresponding clustering structure of the patterns can be seen in Figure 20d.

$f$ \ $ζ$	0.001	0.02	0.1	1.0	10
0.05	0.395	0.429	0.435	0.416	0.397
0.2	0.389	0.404	0.507	0.507	0.507

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
