Abstract
The study of the neural code aims at deciphering how the nervous system maps external stimuli into neural activity—the encoding phase—and subsequently transforms such activity into adequate responses to the original stimuli—the decoding phase. Several information-theoretical methods have been proposed to assess the relevance of individual response features, as for example, the spike count of a given neuron, or the amount of correlation in the activity of two cells. These methods work under the premise that the relevance of a feature is reflected in the information loss that is induced by eliminating the feature from the response. The alternative methods differ in the procedure by which the tested feature is removed, and in the algorithm with which the lost information is calculated. Here we compare these methods, and show that more often than not, each method assigns a different relevance to the tested feature. We demonstrate that the differences are both quantitative and qualitative, and connect them with the method employed to remove the tested feature, as well as with the procedure to calculate the lost information. By studying a collection of carefully designed examples, and working on analytic derivations, we identify the conditions under which the relevance of features diagnosed by different methods can be ranked, or sometimes even equated. The condition for equality involves both the amount and the type of information contributed by the tested feature. We conclude that the quest for relevant response features is more delicate than previously thought, and may yield multiple answers depending on methodological subtleties.
1. Introduction
Understanding the neural code involves, among other things, identifying the relevant response features that participate in the representation of information. Different studies have proposed several candidates, for example, the spiking rate [1,2], the response latency [3], the temporal organisation of spikes [4], the amount of synchrony in a given brain area [5], the amount of correlation between the activity of different neurons [6], or the phase of the local field potential at the time of spiking [7], to cite a few. One way of evaluating the relevance of each candidate feature is to assess how much information is lost by ignoring that feature. This strategy involves the comparison of the mutual information between the stimulus and the so-called full response (a collection of response features including the tested one) and the same information calculated with a reduced response, obtained by dropping the tested feature from the full response. If the tested feature is relevant, the information encoded by the reduced response should be smaller than that of the full response.
The procedure is fairly straightforward when the response features are defined in terms of variables that take definite values in each stimulus presentation, as for example, the spike count C fired in a fixed time window, or the latency L between the stimulus and the first spike. The full response in this case is a two-component vector $(C, L)$, the value of which is uniquely defined for each stimulus presentation—let us assume that in this example, C is never equal to 0, so L is always well defined. The reduced response is a one-component vector, either C or L, depending on whether we are evaluating the relevance of the latency or the spike count, respectively. If the latency or the spike count are relevant, then the information encoded by C or L, respectively, should be smaller than that of the pair $(C, L)$. Throughout this paper, we often use C and L as examples of response features that take a precise value in each trial, to contrast with other features that are only defined in the whole collection of trials, as discussed below.
The method becomes more controversial when applied to response properties that can only be defined across multiple stimulus presentations, as for example, the amount of correlation in the activity of two or more neurons, or the temporal precision of the elicited spikes. These properties cannot be calculated from single responses, so more sophisticated methods are required to delete the tested feature. There are several alternative procedures to perform such a deletion, and also several ways in which the lost information can be calculated. Interestingly, the lost information depends markedly on the chosen method, implying that the so-called relevance of a given feature is a subtle concept that needs to be specified precisely. When assessing the relevance of noise correlations, two different sets of strategies have been proposed by the seminal works of Nirenberg et al. [8] and Schneidman et al. [9]. The first proposal evaluated the role of noise correlations in decoding the information represented in neural activity, whereas the second evaluated their role in the amount of encoded information. Quite surprisingly, the contribution of correlations to the decoded information was shown to sometimes exceed the amount of encoded information [9], seemingly contradicting the intuitive idea that the encoded information constitutes an upper bound to the decoded information. The apparent inconsistency between the two measures has not been observed in later extensions of the technique, where the relevance of other response aspects was evaluated, such as spike-time precision, spike counts or spike onsets. Moreover, it has even been argued that the inconsistency was exclusively observed when assessing the role of noise correlations [10,11,12,13].
In this paper, the different methods used in the literature to delete a given response feature are distinguished for the first time, and the implications of each method are discussed and compared. We show that the data processing inequality, which states that the decoded information cannot surpass the encoded information, can only be invoked with some, but not all, deletion procedures. The distinction between such procedures allows us to identify the conditions in which the decoded information can exceed the encoded information, and to demonstrate that there was no logical inconsistency in previous studies. We also show explicit examples where the decoded information surpasses the encoded information when assessing the role of response aspects other than noise correlations. In order to explain why such behaviours have not been identified until now, we scrutinise the arguments given in the literature to claim that only noise correlations could exhibit this syndrome. We conclude that although the measures employed to assess the relevance of individual response features initially distinguished clearly between the relevance for encoding and the relevance for decoding, this distinction was eventually lost in later modifications of the measures. By diagnosing the confusion, we prove that the response features for which the decoded information can surpass the encoded information are indeed not restricted to noise correlations.
More generally, we discuss a wide collection of strategies employed to assess the relevance of individual response features, ranging from encoding-oriented to decoding-oriented ones. This distinction is related to the way the tested feature contributes to the performance of decoders, which can be mismatched or not. The relevance of the tested feature obtained with some of the measures is always bounded by the relevance obtained with another measure. Yet, not all measures can be ordered hierarchically. There are examples where the relevance of a feature obtained with one method may surpass or be surpassed by the relevance obtained with another, depending on the specific values taken by the prior stimulus probability and the conditional response probabilities. We analyse a collection of carefully chosen examples to identify the cases where this is so. In certain restricted conditions, however, the hierarchy, or even the equality, can be ensured. Here we establish these conditions by means of analytic reasoning, and discuss their implications in terms of the amount and type of information encoded by the tested feature.
We also present examples in which the measures to assess the relevance of a given feature can be used to extract qualitative knowledge about the type of information encoded by the feature. In other words, we assess not only how much information is encoded by an individual feature, but also what kind of information is provided, with respect to individual stimulus attributes. Again, we prove that the type of encoded information depends on the method employed to assess it.
Finally, given that one important property of measures of relevance hinges on whether they represent the operation of matched or mismatched decoders, we also explore the consequences of operating mismatched decoders on noisy responses, instead of real responses. We conclude that it may be possible to improve the performance of a mismatched decoder by adding noise. From the theoretical point of view, this observation underscores the fact that the conditions for optimality for matched decoders need not hold for mismatched decoders. From the practical perspective, our results open new opportunities for potentially simpler, more efficient and more resilient decoding algorithms.
In Section 2.1, we establish the notation, and we introduce some of the key concepts that will be used throughout the paper. These concepts are employed in Section 2.2 to determine the cases where the data-processing inequality can be ensured. In Section 2.3 we introduce nine measures of feature relevance that were previously defined in the literature, and briefly discuss their meaning, similarities and discrepancies. A numeric exploration of a set of carefully chosen examples is employed in Section 2.4 to detect the pairs of measures for which no general hierarchical order exists. In Section 2.5 we discuss the consequences of employing measures that are conceptually linked to matched or mismatched decoders. Later, in Section 2.6, we explore the way in which different measures of feature relevance ascribe different qualitative meanings to the type of information encoded by the tested feature. In Section 2.7 we discuss the conditions under which encoding-oriented measures provide the same amount of information as their decoding-oriented counterparts, and also the conditions under which the equality extends to the content of that information. Then, in Section 2.8, we observe that sometimes, mismatched decoders may improve their performance when operating upon noisy responses. We discuss the relations of our work with other approaches and with the limited-sampling problem in Section 3, and we close with a summary of the main results of the paper in Section 4.
2. Results
2.1. Definitions
2.1.1. Statistical Notation
When no risk of ambiguity arises, we here employ the standard abbreviated notation of statistical inference [14], denoting random variables with letters in upper case, and their values, in lower case. For example, the symbol $P(x|y)$ always denotes the conditional probability of the random variable X taking the value x given that the random variable Y takes the value y. This notation may lead to confusion or be inappropriate, for example, when the random variable X takes the value u given that the random variable Y takes the value v. In those cases, we explicitly indicate the random variables and their values, as for example $P_{X|Y}(u|v)$.
In the study of the neural code, the relevant random variables are the stimulus S and the response generated by the nervous system. In this paper, we discuss the statistics of the true responses observed experimentally, and compare them with a theoretical model that describes how responses would be, if the encoding strategy were different. To differentiate these two situations, we employ the variable $R$ for the experimental responses (the real ones), and $\tilde{R}$ for the surrogate responses (the fictitious ones). The associated conditional probability distributions are $P_{R|S}(r|s)$ and $P_{\tilde{R}|S}(\tilde{r}|s)$, which are often abbreviated as $P(r|s)$ and $P(\tilde{r}|s)$, respectively. Once these distributions are known, and given the prior stimulus probabilities $P(s)$, the joint probabilities $P(s,r)$ and $P(s,\tilde{r})$ can be deduced, as well as the marginals $P(r)$ and $P(\tilde{r})$. When interpreting the abbreviated notation, readers should keep in mind that $P(r|s)$ governs the variable $R$, and $P(\tilde{r}|s)$, the variable $\tilde{R}$. If a statement is made about a distribution P or a response variable that carries no tilde, the argument is intended for both the real and the surrogate distributions or variables.
2.1.2. Encoding
The process of converting stimuli S into neural responses (e.g., spike-trains, local-field potentials, electroencephalographic or other brain signals, etc.) is called “encoding” [9,15]. The encoding process is typically noisy, in the sense that repeated presentations of the same stimulus may yield different neural responses, and is characterised by the joint probability distribution $P(s,r)$. The associated marginal probabilities are $P(s) = \sum_{r} P(s,r)$ and $P(r) = \sum_{s} P(s,r)$, from which the conditional response probability $P(r|s) = P(s,r)/P(s)$, and the posterior stimulus probability $P(s|r) = P(s,r)/P(r)$ can be defined.
The mutual information that $R$ contains about S is
$$I(S;R) = \sum_{s,r} P(s,r)\,\log_{2}\frac{P(s,r)}{P(s)\,P(r)}. \qquad (1)$$
More generally, the mutual information about S contained in any random variable X, including but not limited to $R$ or $\tilde{R}$, can be computed using the above formula with $R$ replaced by X. For compactness, we denote $I(S;X)$ as $I_{X}$ unless ambiguity arises.
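To make the computation concrete, the following minimal sketch evaluates Equation (1) from a tabulated joint distribution. The probability values are arbitrary placeholders, not data from any experiment discussed here.

```python
import numpy as np

def mutual_information(p_joint):
    """I(S;X) in bits, from a joint probability table p_joint[s, x]."""
    p_s = p_joint.sum(axis=1, keepdims=True)   # marginal P(s)
    p_x = p_joint.sum(axis=0, keepdims=True)   # marginal P(x)
    mask = p_joint > 0                         # convention: 0 log 0 = 0
    return float(np.sum(p_joint[mask] *
                        np.log2(p_joint[mask] / (p_s @ p_x)[mask])))

# Two stimuli (rows) and three responses (columns).
p_joint = np.array([[0.30, 0.15, 0.05],
                    [0.05, 0.15, 0.30]])
print(mutual_information(p_joint))             # information encoded in the response
```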
2.1.3. Data Processing Inequalities
When the response $\tilde{R}$ is a post-processed version of the response $R$, the joint probability distribution can be written as $P(s,r,\tilde{r}) = P(s)\,P(r|s)\,P(\tilde{r}|r)$. This decomposition implies that, given $R$, the variable $\tilde{R}$ is conditionally independent of S. In these circumstances, the information about S contained in $\tilde{R}$ cannot exceed the information about S contained in $R$ [16]. In addition, the accuracy of the optimal decoder operating on $\tilde{R}$ cannot exceed the accuracy of the optimal decoder operating on $R$ [17]. These results constitute the data processing inequalities.
2.1.4. Decoding
The process of transforming responses into estimated stimuli is called “decoding” [9,15]. More precisely, a decoder is a mapping defined by a function $\hat{s} = D(r)$. The inverse of this function is $D^{-1}$, and when D is not injective, $D^{-1}$ is a multi-valued mapping. The joint probability of the presented and estimated stimuli, also called “confusion matrix” [12], is
$$P(s,\hat{s}) = \sum_{r\,:\,D(r)=\hat{s}} P(s,r), \qquad (2)$$
where the sum runs over all responses that are mapped onto $\hat{s}$ by D. The information that $\hat{S}$ preserves about S is $I_{\hat{S}}$, and can be calculated from the confusion matrix of Equation (2). The decoding accuracy above chance level is here defined as the probability of correct decoding in excess of the chance level (Equation (3)).
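A minimal sketch of the confusion matrix of Equation (2) follows. Since the exact normalisation of Equation (3) is not reproduced here, the sketch assumes one common convention, namely the probability of correct decoding minus the prior probability of the most frequent stimulus; the decoder and all probability values are hypothetical placeholders.

```python
import numpy as np

def confusion_matrix(p_joint, decoder):
    """Equation (2): P(s, s_hat), where decoder[r] = s_hat for each response r."""
    n_s = p_joint.shape[0]
    conf = np.zeros((n_s, n_s))
    for r, s_hat in enumerate(decoder):
        conf[:, s_hat] += p_joint[:, r]        # sum over all r mapped onto s_hat
    return conf

def accuracy_above_chance(conf, p_s):
    """Assumed form: correct-decoding probability minus the best blind guess."""
    return np.trace(conf) - p_s.max()

p_joint = np.array([[0.30, 0.15, 0.05],
                    [0.05, 0.15, 0.30]])
decoder = [0, 0, 1]                            # an arbitrary mapping D(r)
conf = confusion_matrix(p_joint, decoder)
print(accuracy_above_chance(conf, p_joint.sum(axis=1)))
```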
2.1.5. Optimal Decoding
Although all mappings D are formally admissible as decoders, not all are useful. The aim of a decoder is to make a good guess of the external stimulus S from the neural response $R$. It is therefore important to be able to construct decoders that make good guesses, or at least, as good as the mapping from stimuli to responses allows. Optimal decoders (also called Bayesian or maximum-a-posteriori decoders, as well as ideal homunculus, or observer, among other names) are defined as [18,19]
$$D_{\mathrm{opt}}(r) = \operatorname*{argmax}_{s}\, P(s|r). \qquad (4)$$
This mapping selects, for each response $r$, the stimulus that most likely generated $r$. It is optimal in the sense that no other decoding algorithm yields a confusion matrix with higher decoding accuracy. Equation (4) depends on $P(s|r)$, so the decoder cannot be defined before knowing the functional shape of the joint probability distribution between stimuli and responses. The process of estimating $P(s,r)$ from real data, and the subsequent insertion of the obtained distribution in Equation (4), is called the training of the decoder. The word “training” makes reference to a gradual process, originally stemming from a computational strategy employed to estimate the distribution progressively, while the data were being gathered. However, in this paper we do not discuss estimation strategies from limited samples, so for us, “training a decoder” is equivalent to constructing a decoder from Equation (4).
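A minimal sketch of Equation (4): since $P(s|r) = P(s,r)/P(r)$ and the denominator does not depend on s, maximising the posterior amounts to maximising the joint probability column by column.

```python
import numpy as np

def map_decoder(p_joint):
    """Equation (4): for each response r, select argmax_s P(s|r)."""
    return p_joint.argmax(axis=0)              # ties broken at the lowest index

p_joint = np.array([[0.30, 0.15, 0.05],
                    [0.05, 0.15, 0.30]])
print(map_decoder(p_joint))                    # one decoded stimulus per response
```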
2.1.6. Extensions of Optimal Decoding
The study of Ince et al. [20] introduced the concept of ranked decoding, in which each response is mapped onto a list $\hat{\mathbf{s}} = (\hat{s}_1, \ldots, \hat{s}_K)$ of K stimuli ordered according to their posterior probabilities, so that $P(\hat{s}_1|r) \geq \ldots \geq P(\hat{s}_K|r)$ (with $K \leq N$, and $N$ the total number of stimuli in the experiment). Ranked decoding can provide useful models for intermediate stages in the decision pathway, and the information loss induced by ranked decoding was computed recently [17]. The joint probability associated with ranked decoding is
$$P(s,\hat{\mathbf{s}}) = \sum_{r\,:\,D^{\mathrm{rank}}(r)=\hat{\mathbf{s}}} P(s,r), \qquad (5)$$
where the sum runs over all response vectors that produce the same ranking $\hat{\mathbf{s}}$. Although $P(s,\hat{\mathbf{s}})$ can be used to compute the information between S and $\hat{\mathbf{S}}$, it cannot be used to compute the decoding accuracy above chance level, because the support of $\hat{\mathbf{S}}$ (i.e., the set of stimulus lists) is not contained in the support of S (i.e., the set of stimuli).
2.1.7. Approximations to Optimal Decoding
For given probabilities $P(s)$ and $P(r|s)$, Equation (4) defines a mapping between each response $r$ and a candidate stimulus $\hat{s}$. In the study of the neural code, scientists often wonder what would happen if responses were not governed by the experimentally recorded distribution $P(r|s)$, but by some other surrogate distribution $P(\tilde{r}|s)$. If we replace $P(r|s)$ by $P(\tilde{r}|s)$ in Equation (4), we define a new decoding algorithm
$$\tilde{D}(x) = \operatorname*{argmax}_{s}\, P(s)\,P_{\tilde{R}|S}(x|s), \qquad (6)$$
which, as discussed below, may or may not be optimal, depending on how the decoder is used.
2.1.8. Two Different Decoding Strategies
One alternative, here referred to as “decoding method $\Delta_{R}$”, is that, for each response $r$ obtained experimentally, one decodes a stimulus $\hat{s} = \tilde{D}(r)$ using the new mapping of Equation (6). In this case, the chain $S \to R \to \hat{S}$ gives rise to the confusion matrix
$$P(s,\hat{s}) = \sum_{r\,:\,\tilde{D}(r)=\hat{s}} P(s)\,P(r|s), \qquad (7)$$
where the sum runs over all response vectors that are mapped onto $\hat{s}$ by the new decoding algorithm $\tilde{D}$, and the probability appearing on the right-hand side is the real one, since responses are generated experimentally. It is easy to see that in this case, the decoding accuracy of the new algorithm is suboptimal, since responses are generated with the original distribution $P(r|s)$, and for that distribution, the optimal decoder is given by Equation (4). In the literature, training a decoder with a probability $P(\tilde{r}|s)$ and then operating it on variables that are generated with $P(r|s)$ is called mismatched decoding. In what follows, information values calculated from the distribution of Equation (7) are noted as $I^{\tilde{R}}_{R}$.
A second alternative, “decoding method $\Delta_{\tilde{R}}$,” is that, for each stimulus s, a surrogate response is drawn using the new distribution $P(\tilde{r}|s)$. If the sampled value is $\tilde{r}$, the stimulus $\hat{s} = \tilde{D}(\tilde{r})$ is decoded. In this case, the confusion matrix is
$$P(s,\hat{s}) = \sum_{\tilde{r}\,:\,\tilde{D}(\tilde{r})=\hat{s}} P(s)\,P(\tilde{r}|s), \qquad (8)$$
where, as before, the sum runs over all response vectors that are mapped onto $\hat{s}$ by the decoding algorithm $\tilde{D}$, but now the probability appearing on the right-hand side is the surrogate one, since responses are not generated experimentally. In this case, there is no mismatch between the construction and the operation of the decoder, and $\tilde{D}$ is optimal, in the sense that no other algorithm decodes $\tilde{R}$ with higher decoding accuracy. One should bear in mind, however, that the surrogate responses are not the responses observed experimentally, that they may well take values in a response set that does not coincide with the set of real responses, and that $\tilde{R}$ is not necessarily obtained by transforming the real response $R$ with a stimulus-independent mapping (see below). In what follows, information values calculated from the distribution of Equation (8) are noted as $I^{\tilde{R}}_{\tilde{R}}$. Methods $\Delta_{R}$ and $\Delta_{\tilde{R}}$ can easily be extended to encompass ranked decoding as well, mutatis mutandis.
The two alternative decoding methods yield two different decoding accuracies. To distinguish them, we use the notation $A^{\tilde{R}}_{Y}$. The superscript indicates the variable whose probability distribution is used to construct the decoder in Equation (4), and consequently, determines the set of responses that contribute to the sums of Equations (7) and (8). The subscript indicates the variable upon which the decoder is applied, and whose probability distribution is summed on the right-hand side of Equations (7) and (8). That is, $A^{\tilde{R}}_{Y}$ is computed through Equation (3) with
$$P(s,\hat{s}) = \sum_{y\,:\,\tilde{D}(y)=\hat{s}} P(s)\,P_{Y}(y|s), \qquad (9)$$
so that $A^{\tilde{R}}_{R}$ and $A^{\tilde{R}}_{\tilde{R}}$ correspond to decoding methods $\Delta_{R}$ and $\Delta_{\tilde{R}}$, respectively.
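The sketch below contrasts the two methods: a single decoder, trained on a surrogate distribution as in Equation (6), is applied either to responses drawn from the real distribution (method $\Delta_{R}$, Equation (7)) or to responses drawn from the surrogate distribution (method $\Delta_{\tilde{R}}$, Equation (8)). All distributions are hypothetical placeholders.

```python
import numpy as np

def confusion(p_cond, p_s, decoder):
    """Confusion matrix P(s, s_hat) when responses are governed by p_cond[s, r]."""
    conf = np.zeros((len(p_s), len(p_s)))
    for r, s_hat in enumerate(decoder):
        conf[:, s_hat] += p_s * p_cond[:, r]
    return conf

p_s = np.array([0.5, 0.5])
p_real = np.array([[0.6, 0.3, 0.1],            # P(r|s): experimental responses
                   [0.1, 0.3, 0.6]])
p_surr = np.array([[0.5, 0.4, 0.1],            # P(r~|s): surrogate model
                   [0.2, 0.3, 0.5]])

# Decoder of Equation (6), trained on the surrogate distribution.
dec = (p_s[:, None] * p_surr).argmax(axis=0)

conf_mismatched = confusion(p_real, p_s, dec)  # method Delta_R,  Equation (7)
conf_matched    = confusion(p_surr, p_s, dec)  # method Delta_R~, Equation (8)
print(np.trace(conf_mismatched), np.trace(conf_matched))
```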
2.2. The Applicability of the Data-Processing Inequality
Assessing the relevance of a response feature typically involves a subtraction $\Delta = I - I'$, where $I$ and $I'$ represent the mutual information between stimuli and a set of response features containing or not containing the tested feature, respectively. The magnitude of $\Delta$ is often interpreted as the information provided by the tested feature. This interpretation requires $\Delta$ to be positive since, intuitively, one would imagine that removing a response feature cannot increase the encoded information. As shown below, a formal proof of this intuition may or may not be possible invoking the data processing inequality (see Section 2.1.3 and reference [16]), depending on the method used to eliminate the tested feature. As a consequence, there are cases in which $\Delta$ is indeed negative (see below). In these cases, the tested feature is detrimental to information encoding [9].
2.2.1. Reduced Representations
There are several procedures by which the tested feature can be removed from the response. The validity of the data-processing inequalities (see definition in Section 2.1.3) depends on the chosen procedure. In order to specify the conditions in which the inequalities hold, we here introduce the concept of reduced representations. When the response feature under evaluation is removed from $R$ by a deterministic mapping $\tilde{r} = f(r)$, we call the obtained variable $\tilde{R}$ a reduced representation of $R$. A required condition for a mapping to be a reduced representation is that the function f be stimulus-independent, that is, that the value of $\tilde{r}$ be conditionally independent of s once r is given. Mathematically, this means that $P(\tilde{r}|r,s) = P(\tilde{r}|r)$. If the mapping f and the conditional response distribution $P(r|s)$ are known, the distribution $P(\tilde{r}|s)$ can be derived using standard methods. The data processing inequality ensures that for all reduced representations, $I_{\tilde{R}} \leq I_{R}$.
Reduced representations are usually employed when the response feature whose relevance is to be assessed takes a definite value in each trial, as happens, for example, with the number of spikes in a fixed time window, the latency of the firing response, or the activity of a specific neuron in a larger population of neurons. In these cases it is easy to construct $\tilde{R}$ simply by dropping the tested feature from $R$, or by fixing its value with some deterministic rule, as sketched below.
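For illustration, the following sketch derives the reduced conditional distribution $P(\tilde{r}|s)$ induced by a deterministic, stimulus-independent mapping f, here chosen to drop the spike count C from responses of the form (C, L); all numbers are hypothetical.

```python
import numpy as np

responses = [(1, 1), (1, 2), (2, 1), (2, 2)]   # full responses r = (c, l)
p_cond = np.array([[0.4, 0.1, 0.4, 0.1],       # P(r|s) for two stimuli
                   [0.1, 0.4, 0.1, 0.4]])

f = lambda r: r[1]                             # reduced representation: keep L only
reduced = sorted({f(r) for r in responses})    # the set of values of r~

# P(r~|s) accumulates P(r|s) over all r with f(r) = r~.
p_reduced = np.zeros((p_cond.shape[0], len(reduced)))
for j, r in enumerate(responses):
    p_reduced[:, reduced.index(f(r))] += p_cond[:, j]
print(p_reduced)                               # rows still sum to one
```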
Reduced representations can also be used in other cases, for example, when the relevance of the temporal accuracy of the response is assessed. This feature does not take a specific value in each trial; only by comparing multiple trials can the response accuracy be determined. A widely used strategy is to represent spike trains with temporal bins of increasing duration, and to evaluate how the amount of information decreases as the representation becomes coarser. A sequence of surrogate responses is thereby defined, by progressively disregarding the fine temporal precision with which spike trains were recorded (Figure 1).
Figure 1.
Assessing the relevance of response accuracy by varying the duration of the temporal bin. (a) Hypothetical intracellular recording of the spike patterns elicited by a single neuron after presenting in alternation two visual stimuli, □ and ◯, each of which triggers two possible responses displayed in columns 1 and 3 for □, and 2 and 4 for ◯. Stimulus probabilities and conditional response probabilities are arbitrary. Time is discretized in bins of 5 ms. The responses are recorded within 30 ms time windows after stimulus onset. Spikes are fired with latencies that are uniformly distributed between 0 and 10 ms after the onset of □, and between 20 and 30 ms after the onset of ◯. Responses are represented by counting the number of spikes within consecutive time bins of size 5, 10 and 15 ms starting from stimulus onset, thereby yielding discrete-time sequences $\tilde{R}_{5}$, $\tilde{R}_{10}$ and $\tilde{R}_{15}$, respectively; (b) Same as (a), but with stimuli producing two different types of response patterns composed of 2 or 3 spikes.
Several studies have reported an information that decreases monotonically with the duration $\Delta t$ of the time bin (for example [21,22,23]). If there is a specific temporal scale in which spike-time precision is relevant—the alleged argument goes—a sudden drop in $I_{\tilde{R}_{\Delta t}}$ appears at the relevant scale. It should be noted, however, that the data processing inequality does not ensure that $I_{\tilde{R}_{\Delta t}}$ be a monotonically decreasing function of $\Delta t$. In the example of Figure 1, representations $\tilde{R}_{10}$ and $\tilde{R}_{15}$ are defined with long temporal bins, the durations of which are integer multiples of the bin used for $\tilde{R}_{5}$. Hence, $\tilde{R}_{10}$ and $\tilde{R}_{15}$ are reduced representations of $\tilde{R}_{5}$, and the data processing inequality does indeed guarantee that $I_{\tilde{R}_{10}} \leq I_{\tilde{R}_{5}}$ and $I_{\tilde{R}_{15}} \leq I_{\tilde{R}_{5}}$. However, $\tilde{R}_{15}$ is not a reduced representation of $\tilde{R}_{10}$, so there is no reason why $I_{\tilde{R}_{15}}$ should be smaller than $I_{\tilde{R}_{10}}$, and indeed, Figure 1b shows an example where it is not. The representation constructed with bins of intermediate duration, namely 10 ms, does not distinguish between the two stimuli, whereas those of shorter and longer duration, 5 and 15 ms, do. A similar effect can be observed in the experimental data (freely available online) of Lefebvre et al. [24], when analysed with bins of sizes 5, 10 and 15 ms in windows of total duration 60 ms. Although these examples are rare, they demonstrate that there is no theoretical substantiation to the expectation that $I_{\tilde{R}_{\Delta t}}$ drop monotonically with increasing $\Delta t$.
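The binning construction of Figure 1 can be sketched as follows, with hypothetical spike times. The 10 ms and 15 ms representations are both obtained from the 5 ms one by adding adjacent bins, whereas neither can be obtained from the other, which is why no inequality relates their informations.

```python
import numpy as np

def bin_spikes(spike_times_ms, bin_ms, window_ms=30):
    """Spike counts in consecutive bins of bin_ms, aligned to stimulus onset."""
    edges = np.arange(0, window_ms + bin_ms, bin_ms)
    counts, _ = np.histogram(spike_times_ms, bins=edges)
    return tuple(counts)

train = [3.0, 12.0, 26.0]                      # hypothetical spike times (ms)
r5, r10, r15 = (bin_spikes(train, b) for b in (5, 10, 15))
print(r5, r10, r15)
# r10 = (r5[0]+r5[1], r5[2]+r5[3], r5[4]+r5[5]), and analogously for r15,
# but r15 cannot be computed from r10, since 15 is not a multiple of 10.
```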
2.2.2. Stochastically Reduced Representations
When the response feature under evaluation is removed from the response variable $R$ by a stochastic mapping, the obtained variable $\tilde{R}$ is called a stochastically reduced representation of $R$. A required condition for a mapping to be a stochastically reduced representation is that the probability distribution of each $\tilde{r}$ be dependent on r, but conditionally independent of s. In these circumstances, the data processing inequality ensures that $I_{\tilde{R}} \leq I_{R}$. If the statistical properties of the noisy components of the mapping are known, as well as the conditional response probability distribution $P(r|s)$, the distribution $P(\tilde{r}|s)$ can be derived using standard methods. Formally, stochastic representations are obtained through stimulus-independent stochastic functions of the original representation $R$. After observing that $R$ adopted the value r, these functions produce a single value for $\tilde{R}$, chosen with transition probabilities $P(\tilde{r}|r)$ such that
$$P(\tilde{r}|s) = \sum_{r} P(\tilde{r}|r)\,P(r|s). \qquad (10)$$
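In matrix form, Equation (10) is a product between the conditional response probabilities and the transition probabilities, as in the following sketch with placeholder values.

```python
import numpy as np

T = np.array([[0.8, 0.2, 0.0],                 # T[r, r~] = P(r~|r); rows sum to one
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
p_cond = np.array([[0.6, 0.3, 0.1],            # P(r|s)
                   [0.1, 0.3, 0.6]])

p_surr = p_cond @ T                            # Equation (10): P(r~|s)
print(p_surr, p_surr.sum(axis=1))              # rows still sum to one
```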
To illustrate the utility of stochastically reduced representations, we discuss their role in providing alternative strategies when assessing the relevance of spike-timing precision, not by changing the bin size as in Figure 1, but by randomly manipulating the responses, as illustrated in Figure 2.
Figure 2.
Examples of stochastic codes. Alternative ways of assessing the relevance of spike-timing precision. (a) Stochastic function (arrows on the left) modeling the encoding process. The elicited response is turned into a surrogate response with a transition probability given by Equation (11). This function turns $R$ into a stochastic representation by shuffling spikes and silences within consecutive bins aligned to stimulus onset; (b) Responses in panel (a) are transformed by a stochastic function with transition probabilities given by Equation (12), which introduces jitter uniformly distributed within windows centered at each spike; (c) Responses in panel (a) are transformed by a stochastic function with transition probabilities given by Equation (13), which models the inability to distinguish responses with spikes occurring in adjacent bins, or equivalently, separated by distances below a threshold (see [25,26] for further remarks on these distances). Notice that $\tilde{R}$ samples the same response set as $R$.
The method of Figure 2a yields the same information and response accuracy as the method producing $\tilde{R}_{15}$ in Figure 1. Each method yields responses that can be related to the responses of the other method through a stimulus-independent deterministic or stochastic function. Both methods suffer from the same drawback: They treat spikes differently depending on their location within the 15 ms time window. Indeed, both methods preserve the distinction between two spikes located in different windows, but not within the same window, even if the separation between the spikes is the same. The mapping illustrated in Figure 2a has the transition probabilities given in Equation (11), where rows enumerate the elements of the ordered set from which $r$ is sampled, and columns enumerate the elements of the ordered set from which $\tilde{r}$ is sampled.
A third method, jittering, consists in shuffling the recorded spikes within time windows centered at each spike (Figure 2b). The responses generated by this method need not be obtainable from the responses generated by the mappings of Figure 2a or Figure 1 through stimulus-independent stochastic functions. Still, the method of Figure 2b inherently yields a stochastic code, and, unlike the methods discussed previously, treats all spikes in the same manner. The mapping illustrated in Figure 2b has the transition probabilities given in Equation (12), where rows enumerate the elements of the ordered set from which $r$ is sampled, and columns enumerate the elements of the ordered set from which $\tilde{r}$ is sampled.
As a fourth example, consider the effect of response discrimination, as studied in the seminal work of Victor and Purpura [25]. There, two responses were considered indistinguishable when some measure of distance between the responses was less than a predefined threshold. However, neural responses were transformed through a method based on cross-validation that is not guaranteed to be stimulus-independent. Hence, depending on the case, this fourth method may or may not be a stochastically reduced representation. The case chosen in Figure 2c is a successful example, and the associated matrix of transition probabilities is given in Equation (13), where rows and columns enumerate the elements of the ordered set from which both $r$ and $\tilde{r}$ are sampled.
Other methods exist which merge indistinguishable responses, thereby yielding reduced representations. These methods, however, are limited to notions of similarity that are transitive, a condition not fulfilled, for example, by those based on Euclidean distance, edit distance, or by the case of Figure 2c.
Stochastically reduced representations include reduced representations as limiting cases. Indeed, when for each $r$ there is a single $\tilde{r}$ such that $P(\tilde{r}|r) = 1$, stochastic representations become reduced representations (Figure 3). The possibility to include stochasticity, however, broadens the range of alternatives. Consider for example the hypothetical experiment in Figure 3a, in which the neural responses can be completely characterized by the first-spike latencies (L) and the spike counts (C). The importance of C can be studied, for example, by using a reduced code that replaces all C-values with a constant (Figure 3b). In this case, the transition probabilities are those of Equation (14), where rows enumerate the elements of the ordered set from which $r$ is sampled, and columns enumerate the elements of the ordered set from which $\tilde{r}$ is sampled.
Figure 3.
Stochastically reduced representations include and generalize deterministically reduced representations. (a) Analogous description to Figure 1a, but with responses characterized using a representation based on the first-spike latency (L) and the spike count (C); (b) Deterministic transformation (arrows) of $R$ in panel (a) into a reduced code $\tilde{R}$, which ignores the additional information carried by C by considering it constant and equal to unity. This reduced code can also be reinterpreted as a stochastic code with transition probabilities defined by Equation (14); (c) The additional information carried by C is here ignored by shuffling the values of C across all trials with the same L, thereby turning $R$ in panel (a) into a stochastic code with transition probabilities defined by Equation (15); (d) The additional information carried by C is here ignored by replacing the actual value of C by one chosen with some possibly L-dependent probability distribution (Equation (16)).
Another alternative is to assess the relevance of C by means of a stochastic code that shuffles the values of C across all responses with the same L (Figure 3c). In this case, the transition probabilities are those of Equation (15), where rows enumerate the elements of the ordered set from which $r$ is sampled, and columns enumerate the elements of the ordered set from which $\tilde{r}$ is sampled. The parameter a is arbitrary, as long as it lies in $[0,1]$. We use the notation $\bar{a} = 1 - a$.
A third option is to use a stochastic code that preserves the original value of L but chooses the value of C from some possibly L-dependent probability distribution (Figure 3d), for which the transition probabilities are those of Equation (16), where rows enumerate the elements of the ordered set from which $r$ is sampled, and columns enumerate the elements of the ordered set from which $\tilde{r}$ is sampled. The parameters c and d are arbitrary, as long as they lie in $[0,1]$; and we have used the notation $\bar{x} = 1 - x$ for any number x.
2.2.3. Modification of the Conditional Response Probability Distribution
When the response feature under evaluation is removed by altering the real conditional response probability distribution $P(r|s)$, and transforming it into a surrogate distribution $P(\tilde{r}|s)$, the obtained response model is here said to implement a probabilistic removal of the tested feature. Probabilistic removals are usually employed when assessing the relevance of correlations between neurons in a population, since correlations are not a variable that can be deleted from each individual response. For example, if $r = (c_1, \ldots, c_n)$ represents the spike counts of n different neurons, the real distribution $P(r|s)$ is replaced by a new distribution in which all neurons are conditionally independent, that is,
$$P_{\mathrm{ni}}(\tilde{r}|s) = \prod_{i=1}^{n} P(c_i|s), \qquad (17)$$
where, following the notation introduced previously [17], the generic subscript “$\tilde{R}$” was replaced by “$\mathrm{ni}$” to indicate “noise-independent”.
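A minimal sketch of the probabilistic removal of Equation (17), for a hypothetical pair of binary response aspects: the surrogate conditional distribution is the product of the marginals, so noise correlations are removed while the marginals are preserved.

```python
import numpy as np

# P(r|s) for responses r = (c1, c2); axes are [stimulus, c1, c2].
p_cond = np.array([[0.40, 0.10, 0.10, 0.40],
                   [0.10, 0.40, 0.40, 0.10]]).reshape(2, 2, 2)

p_c1 = p_cond.sum(axis=2)                      # marginal P(c1|s)
p_c2 = p_cond.sum(axis=1)                      # marginal P(c2|s)
p_ni = np.einsum('si,sj->sij', p_c1, p_c2)     # Equation (17): product of marginals
print(p_ni.reshape(2, 4))                      # correlations gone, marginals kept
```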
The probabilistic removal of a response feature may or may not be describable in terms of a deterministically or a stochastically reduced representation. In other words, there may or may not exist a mapping $r \to \tilde{r}$, or equivalently, a matrix of transition probabilities $P(\tilde{r}|r)$, that captures the replacement of $P(r|s)$ by $P(\tilde{r}|s)$. It is important to assess whether such a matrix exists, since the data processing inequality is only guaranteed to hold with reduced representations, stochastic or not. If no reduced representation can capture the effect of a probabilistic removal, the data processing inequality may not hold, and $I_{\tilde{R}}$ may well be larger than $I_{R}$.
In order to determine whether a stochastically reduced representation exists, the first step is to discern whether Equation (10) constitutes a compatible or an incompatible linear system for the matrix elements $P(\tilde{r}|r)$. If the system is incompatible, there is no solution. In the compatible case, which is often indeterminate, a solution entirely composed of non-negative numbers that sum up to unity in each row is required. Given enough time and computational power, the problem can always be solved in the framework of linear programming [27]. In practical cases, however, the search is often hampered by the curse of dimensionality. To facilitate the labour, here we list a few necessary (though not sufficient) conditions that must be fulfilled for the mapping to exist. If any of the following properties fails to hold, Equation (10) has no solution, so there is no need to begin a search.
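As a sketch of the linear-programming approach, the snippet below poses Equation (10), together with non-negativity and unit row sums, as a feasibility problem with a null objective, using scipy.optimize.linprog. It assumes small response sets; as noted above, realistic dimensionalities may make the search impractical.

```python
import numpy as np
from scipy.optimize import linprog

def find_transition_matrix(p_real, p_surr):
    """Search for T[r, r~] = P(r~|r) >= 0, rows summing to one, such that
    p_real[s] @ T == p_surr[s] for every stimulus s (Equation (10))."""
    n_s, n_r = p_real.shape
    n_t = p_surr.shape[1]
    a_eq, b_eq = [], []
    for s in range(n_s):                       # one constraint per (s, r~) pair
        for j in range(n_t):
            row = np.zeros(n_r * n_t)
            row[j::n_t] = p_real[s]            # coefficient of T[r, j] is P(r|s)
            a_eq.append(row)
            b_eq.append(p_surr[s, j])
    for r in range(n_r):                       # each row of T sums to one
        row = np.zeros(n_r * n_t)
        row[r * n_t:(r + 1) * n_t] = 1.0
        a_eq.append(row)
        b_eq.append(1.0)
    res = linprog(np.zeros(n_r * n_t), A_eq=np.array(a_eq),
                  b_eq=np.array(b_eq), bounds=(0, 1))
    return res.x.reshape(n_r, n_t) if res.success else None

p_real = np.array([[0.6, 0.3, 0.1],
                   [0.1, 0.3, 0.6]])
p_surr = p_real @ np.array([[0.8, 0.2, 0.0],   # feasible by construction
                            [0.1, 0.8, 0.1],
                            [0.0, 0.2, 0.8]])
print(find_transition_matrix(p_real, p_surr))  # None means: no such mapping exists
```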
Property 1.
Let $Q(s)$ be a probability distribution defined on the set of stimuli, which may or may not be equal to the actual distribution with which stimuli appear in the experiment under study. For any stimulus s, the inequality $i_{\tilde{R}}(s) \leq i_{R}(s)$ between stimulus-specific informations [28,29] must hold, where these quantities are defined in Equation (18).
Proof.
If we multiply both sides of the inequality by $Q(s)$ and sum over all stimuli, we obtain an inequality between the mutual informations computed with the prior Q. If $Q(s) = P(s)$, this result reduces to the data-processing inequality $I_{\tilde{R}} \leq I_{R}$. □
Property 2.
If $P(\tilde{r}|r)$ exists, then $P(\tilde{r}|r) = 0$ whenever $P(\tilde{r}|s) = 0$ and $P(r|s) > 0$ for at least some s.
Proof.
Suppose that $P(\tilde{r}|r) > 0$ when $P(r|s) > 0$ for some s. Then, Equation (10) yields $P(\tilde{r}|s) > 0$, contradicting the hypothesis that $P(\tilde{r}|s) = 0$. Hence, $P(\tilde{r}|r)$ must vanish. □
For example, in Figure 4a, we decorrelate first-spike latencies (L) and spike counts (C) by replacing the true conditional distribution $P(r|s)$ (left panel) by its noise-independent version defined in Equation (17) (middle panel). Before searching for a mapping $P(\tilde{r}|r)$, we verify that the necessary conditions hold. For several choices of $Q(◯)$ and $Q(□)$, one may confirm that $i_{\tilde{R}}(◯) \leq i_{R}(◯)$, as well as $i_{\tilde{R}}(□) \leq i_{R}(□)$ (Property 1). These results motivate the search for a solution of Equation (10) for $P(\tilde{r}|r)$. The transition probability must be zero at least wherever the premises of Property 2 are met. One possible solution is given in Equation (19), where each response is defined by a vector $(l, c)$, and rows and columns enumerate the elements of the ordered sets from which $r$ and $\tilde{r}$ are sampled, respectively; the remaining parameters in Equation (19) are shorthands for combinations of the probabilities defined above.
Figure 4.
Relation between probabilistic removal and stochastic codes. (a) Cartesian coordinates depicting: on the left, responses of a neuron for which L and C are positively correlated when elicited by ◯, and negatively correlated when elicited by □; in the middle, the surrogate responses that would occur should L and C be noise independent; and on the right, a stimulus-independent stochastic function that turns $R$ into $\tilde{R}$ with transition probabilities given by Equation (19); (b) Same description as in (a), but with L and C noise independent given □, and with the stochastic function depicted on the right turning $P(r|s)$ into $P_{\mathrm{ni}}(\tilde{r}|s)$ for ◯ but not for □.
However, stochastically reduced representations are not always guaranteed to exist. For example, in Figure 4b, it is easy to verify that the condition $i_{\tilde{R}}(◯) > i_{R}(◯)$ holds for any choice of $Q$. Therefore, no stochastic mapping can transform $R$ into $\tilde{R}$ in such a way that $P(r|s)$ is converted into $P_{\mathrm{ni}}(\tilde{r}|s)$. Schneidman et al. [9] employed an analogous example, but involving different neurons instead of response aspects. The two examples of Figure 4 motivate the following theorem:
Theorem 1.
No deterministic mapping $\tilde{r} = f(r)$ exists transforming the conditional probability $P(r|s)$ into its noise-independent version $P_{\mathrm{ni}}(\tilde{r}|s)$ defined in Equation (17). Stochastic mappings may or may not exist, depending on the conditional probability $P(r|s)$.
Proof.
See Appendix B.2. □
In addition, when a stochastic mapping exists, the values of the probabilities $P(\tilde{r}|r)$ may well depend on the discarded response aspect, as well as on the preserved response aspects. We mention this fact because, when assessing the relevance of noise correlations, the marginals $P(c_i|s)$ suffice for us to write down the surrogate distribution $P_{\mathrm{ni}}(\tilde{r}|s)$, with no need to know the full distribution $P(r|s)$ containing the noise correlations. One could have hoped that perhaps also the mapping $P(\tilde{r}|r)$ (assuming that such a mapping exists) could be calculated with no knowledge of the noise correlations. This is, however, not always true, as stated in the theorem below. Two experiments with the same marginals and different amounts of noise correlations may require different mappings to eliminate the noise correlations, as illustrated in the example of Figure 5. More formally:
Theorem 2.
The transition probabilities $P(\tilde{r}|r)$ of stochastic codes that ignore noise correlations may depend both on the marginal likelihoods (preserved at the output of the mapping), and on the noise correlations (eliminated at the output of the mapping).
Proof.
See Appendix B.3. □
Figure 5.
Stochastically reduced representations that ignore noise correlations may depend on them. (a) Cartesian coordinates representing a hypothetical experiment in which two different stimuli, □ and ◯, elicit single-neuron responses ($R$) that are completely characterized by their first-spike latency (L) and spike count (C). Both L and C are noise independent; (b) Cartesian coordinates representing a hypothetical experiment with the same marginal probabilities $P(l|s)$ and $P(c|s)$ as in panel (a), with one among many possible types of noise correlations between L and C; (c) Stimulus-independent stochastic function transforming the noise-correlated responses of panel (b) into the noise-independent responses of panel (a). The transition probabilities are given in Equation (20), and they bear an explicit dependence on the amount of noise correlations.
The solution of Equation (10) for the example of Figure 5 is
where each response is defined by a vector , and rows and columns enumerate the elements of the ordered sets and from where and are sampled, respectively. In Equation (20), ; and . The fact that the matrix in Equation (20) bears an explicit dependence on these parameters–and not only on and –implies that the transformation between and depends on the amount of noise correlations in .
2.3. Multiple Measures to Assess the Relevance of a Specific Response Feature
The importance of a specific response feature has been previously quantified in many ways (see [17,30] and references therein), which have oftentimes led to heated debates about their merits and drawbacks [9,11,12,17,31,32,33]. Here we consider several measures, to underscore the diversity of the meanings with which the relevance of a given feature has been assessed so far; they are mathematically defined in Equations (21)–(29).
Equations (22)–(24) are based on matched decoders, that is, decoders operating on responses governed by the same probability distribution involved in their construction (method $\Delta_{\tilde{R}}$). Instead, Equations (25)–(28) are based on the operation of mismatched decoders (method $\Delta_{R}$). Each measure of Equations (21)–(24) has one or two homologous measures in Equations (25)–(29), as illustrated in Figure 6.
Figure 6.
Relations between the measures defined in Equations (21)–(29). The four measures on the left are either encoding-oriented ($\Delta I$, on a pink background), or half-way between encoding- and decoding-oriented (the last three, gray background). The five measures on the right are all decoding-oriented (light-blue background). Each measure on the left has a conceptually related measure on the right on the same line, except for $\Delta I$, which has two associated decoding-oriented measures: $\Delta I^{D}$ and $\Delta I^{DL}$. The distinction between the measures on pink and on gray backgrounds relies on the fact that $\Delta I$ does not involve a decoding process. Instead, the gray measures decode a stimulus (or rank the stimuli) with decoding method $\Delta_{\tilde{R}}$. This decoding is not meant to be applicable to real experiments, since (as opposed to the truly decoding-oriented measures on the right, which operate with method $\Delta_{R}$) the decoding is applied to the surrogate responses $\tilde{R}$, not the real ones $R$.
We here describe the measures briefly, and refer the interested reader to the original papers.
In Equation (21), $I_{R}$ and $I_{\tilde{R}}$ are the mutual informations between the set of stimuli and a set of responses governed by the distributions $P(r|s)$ and $P(\tilde{r}|s)$, respectively. Thus, $\Delta I = I_{R} - I_{\tilde{R}}$ is the simplest way in which the information encoded by the true responses can be compared with that of the surrogate responses. This comparison has been employed for more than six decades in neuroscience [34,35] to study, for example, the encoding of different stimulus features in spike counts, in synchronous spikes, and in other forms of spike patterns, both in single neurons and populations (see [30] and references therein).
The measure $\Delta I^{D}$ defined in Equation (25) was introduced by Nirenberg et al. [8] to study the role of noise correlations, and was later extended to arbitrary deterministic mappings [10,12,13]. Here we use the supra-script D to indicate that the measure is the “divergence” (in the Kullback-Leibler sense) between the posterior stimulus distributions calculated with the real and the surrogate responses, respectively. In [10], Nirenberg and Latham argued that the important feature of $\Delta I^{D}$ is that it represents the information loss of a mismatched decoder trained with $P(\tilde{r}|s)$ but operated on the real responses, sampled from $P(r|s)$. Before long, Schneidman et al. [9] noticed that $\Delta I^{D}$ can exceed $I_{R}$. The interpretation of $\Delta I^{D}$ as a measure of information loss would imply that decoders trained with surrogate responses can lose more information than the one encoded by the real response. In fact, $\Delta I^{D}$ tends to infinity if the surrogate posterior probability of some stimulus vanishes while the real one does not. In the limit, $\Delta I^{D}$ becomes undefined when both posteriors vanish together. To avoid this peculiar behavior, Latham and Nirenberg generalized the theoretical framework used to derive $\Delta I^{D}$ [11], giving rise to the measure $\Delta I^{DL}$ of Equation (26). Here, the supra-script DL makes reference to “Divergence Lowest”, since the measure was presented as the lowest possible information loss of a decoder trained with $P(\tilde{r}|s)$. In the definition of $\Delta I^{DL}$, the parameter $\lambda$ is a real scalar. The distribution entering $\Delta I^{DL}$ was defined by Latham and Nirenberg [11] as proportional to the prior $P(s)$ times the $\lambda$-th power of the surrogate likelihood. This definition has several problems, as discussed in [11,17,36,37,38,39]. In Appendix B.1 we demonstrate a theorem that resolves the issues appearing in previous definitions, and justifies the definition employed in Equation (26).
From the conceptual point of view, $\Delta I^{DL}$ represents the information loss of a mismatched decoder trained with $P(\tilde{r}|s)$ and operated on $R$. Latham and Nirenberg [11] showed that, unlike what happens with $\Delta I^{D}$, it is possible to demonstrate that $\Delta I^{DL} \leq I_{R}$. Hence, $\Delta I^{DL}$ never yields a tested feature encoding more information than the full response. The proof in [11] ignored a few specific cases that we discuss in Theorem A1 of Appendix B.1. Still, even in those additional cases, the inequality holds.
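A numerical sketch of the comparison: $\Delta I$ is computed as the difference between the informations encoded by the real and the surrogate responses, and $\Delta I^{D}$ is taken to be the average Kullback-Leibler divergence between the real and the surrogate posterior stimulus distributions, in line with the description above; the exact form of Equation (25) may differ, and all probability values are placeholders.

```python
import numpy as np

def info(p_joint):
    """Mutual information (bits) of a joint table p_joint[s, r]."""
    p_s = p_joint.sum(1, keepdims=True)
    p_r = p_joint.sum(0, keepdims=True)
    m = p_joint > 0
    return np.sum(p_joint[m] * np.log2(p_joint[m] / (p_s @ p_r)[m]))

def delta_i_d(p_joint, p_joint_surr):
    """Average KL divergence between real and surrogate posteriors P(s|r)."""
    post = p_joint / p_joint.sum(0, keepdims=True)
    post_surr = p_joint_surr / p_joint_surr.sum(0, keepdims=True)
    m = p_joint > 0
    return np.sum(p_joint[m] * np.log2(post[m] / post_surr[m]))

p_s = np.array([0.5, 0.5])
p_real = np.array([[0.6, 0.3, 0.1],
                   [0.1, 0.3, 0.6]])
T = np.array([[0.8, 0.2, 0.0],                 # stimulus-independent stochastic map
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
joint = p_s[:, None] * p_real
joint_surr = p_s[:, None] * (p_real @ T)
print(info(joint) - info(joint_surr),          # Delta I
      delta_i_d(joint, joint_surr))            # Delta I^D (assumed form)
```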
In Equations (22) and (23), $\hat{\mathbf{s}}$ and $\hat{s}$ denote a sorted stimulus list and the most-likely stimulus, respectively, both decoded by evaluating Equation (6) (or its ranked version) on a response sampled from the surrogate distribution (method $\Delta_{\tilde{R}}$). Estimating mutual informations using decoders can be traced back at least to Gochin et al. [40], and comparing the estimations of two decoders that take different response features into account, at least to Warland et al. [41].
The measures $\Delta\tilde{I}^{L}$ and $\Delta\tilde{I}^{B}$ are paired with $\Delta I^{L}$ and $\Delta I^{B}$, respectively, since the latter are obtained from the former when replacing the decoding method from $\Delta_{\tilde{R}}$ to $\Delta_{R}$. The measure $\Delta I^{L}$ was introduced by Ince et al. [20], and quantifies the difference between the information in $R$, and the one in the output of decoders that, after observing a variable sampled with distribution $P(r|s)$ (method $\Delta_{R}$), produce a stimulus list sorted according to the surrogate posterior probabilities. The supra-script L indicates “List of Stimuli”. Similarly, $\Delta I^{B}$ quantifies the difference between the information encoded in $R$ and that encoded in the output of a decoder trained by inserting $P(\tilde{r}|s)$ into Equation (6), and operated on $R$ sampled with distribution $P(r|s)$ (method $\Delta_{R}$). The supra-script B stands for the “Bayesian” nature of the involved decoder. The use of these measures can be traced back at least to Nirenberg et al. [8], although in that case, decoders were restricted to be linear. The measure of Equation (22) is new, and we have introduced it here as the homologue of $\Delta I^{L}$. When the number of stimuli is two, $\Delta I^{L} = \Delta I^{B}$, since selecting the optimal stimulus is (as a computation) in one-to-one correspondence with ranking the two candidate stimuli.
The accuracy loss $\Delta\tilde{A}$ defined in Equation (24) entails the comparison between the performance of two decoders, one trained with $P(r|s)$ and applied on $R$, and one trained with $P(\tilde{r}|s)$ and applied on $\tilde{R}$. Such comparisons also have a long history in neuroscience [42,43] (see [9,12] for further discussion). The accuracy loss $\Delta A^{B}$ also compares two decoders. The first one is the same as for $\Delta\tilde{A}$, but the second is trained with $P(\tilde{r}|s)$ and applied on $R$.
The measures $\Delta I^{D}$, $\Delta I^{L}$, $\Delta I^{B}$ and $\Delta A^{B}$ are undefined if the actual responses are not contained in the set of surrogate responses. In other words, a decoder constructed with $P(\tilde{r}|s)$ does not know what output to produce when evaluated on a response $r$ for which $P(\tilde{r}{=}r|s) = 0$ for all stimuli. This situation never happens when evaluating the relevance of noise correlations with $P_{\mathrm{ni}}$, but it may well be encountered in more general situations, as for example, in Figure 3b.
2.4. Relating the Values Obtained with Different Measures
If a mapping $P(\tilde{r}|r)$ exists transforming $R$ into $\tilde{R}$, we may use the decoding procedure of Equation (6) to construct the transformation chain $S \to R \to \tilde{R} \to \hat{\mathbf{s}} \to \hat{s}$ [17,44]. Consequently, $\Delta I$, $\Delta\tilde{I}^{L}$ and $\Delta\tilde{I}^{B}$ can be interpreted as accumulated information losses after the first, second and third transformations, respectively, and $\Delta\tilde{A}$, as the accuracy loss after the first transformation. The data processing theorems (Section 2.1.3) ensure that these measures are never negative. This property, however, cannot be guaranteed in the absence of a reduced transformation $P(\tilde{r}|r)$, stochastic or deterministic. Indeed, in the example of Figure 4b, if both stimuli are equiprobable, and both responses associated with ◯ are equiprobable, then $\Delta I$ turns out to be negative, implying that the surrogate responses encode more information about the stimulus than the original, experimental responses. Removing the correlations between spike count and latency hence increases the information, so correlations can be concluded to be detrimental to information encoding.
Irrespective of whether a (deterministic or stochastic) mapping $P(\tilde{r}|r)$ exists, the data processing inequality guarantees that $\Delta\tilde{I}^{B} \geq \Delta\tilde{I}^{L} \geq \Delta I$, since $\hat{s}$ is a deterministic function of $\hat{\mathbf{s}}$, and $\hat{\mathbf{s}}$ is a deterministic function of $\tilde{r}$. The inequality holds irrespective of the sign of each measure.
All decoder-oriented measures are guaranteed to be non-negative. The very definitions of $\Delta I^{D}$ and of $\Delta I^{DL}$ imply that they cannot be negative, since they are both Kullback-Leibler divergences between two probability distributions. The sequence of reduced transformations $R \to \hat{\mathbf{s}} \to \hat{s}$, in turn, guarantees the non-negativity of $\Delta I^{L}$ and $\Delta I^{B}$, through the data processing inequalities.
In order to assess whether decoding-oriented measures are always larger or smaller than their encoding (or gray) counterparts, we performed a numerical exploration comparing each encoding/gray-oriented measure with its decoding-oriented homologue. The exploration was conducted by calculating the values of these measures for a large collection of possible stimulus prior probabilities $P(s)$ and response conditional probabilities $P(r|s)$ in the examples of Figure 2, Figure 3, Figure 4 and Figure 7. The details of the numerical exploration are in Appendix A. The measures in the first group were sometimes greater and sometimes smaller than those of the second group, depending on the case and the probabilities (Table 1). Consequently, our results demonstrate that there is no general rule by which measures of one type bound the measures of the other type.
Figure 7.
Stochastic codes may play different roles in encoding and decoding. (a) Hypothetical experiment with two stimuli □ and ◯, which are transformed (solid and dashed lines) into neural responses containing a single spike fired at different phases with respect to a cycle of 20 ms period starting at stimulus onset. The phases have been discretized in intervals of fixed size and wrapped to a single cycle. The encoding process is followed by a circular phase-shift that transforms $R$ into another code $\tilde{R}$ with transition probabilities defined by Equation (31). The set of all $\tilde{r}$ coincides with the set of all $r$; (b) Same as (a), except that there are four stimuli (the letters A and B, each surrounded by a square or a circular frame, as in Ⓐ and Ⓑ), with the phases measured with respect to a cycle of a different period, and discretized in intervals of a different size. The encoding process is followed by a stochastic transformation (lines on the right) that introduces jitter, thereby transforming $R$ into another code $\tilde{R}$ with transition probabilities defined by Equation (32).
Table 1.
Numerical exploration of the maximum and minimum differences between several measures of information and accuracy losses. The values are expressed as percentages of $I_{R}$ (the information encoded in $R$) or of the maximum accuracy above chance level attained when decoders operate on $R$. All examples involve two stimuli, so $I_{R} \leq 1$ bit. The absolute value of $\Delta I^{D}$ can become extremely large when the surrogate posterior probability of an observed response approaches zero. Dashes represent cases in which decoding-oriented measures are undefined, as explained in Section 2.4.
The exploration also included the example of Figure 7. In panel (a), the transition probabilities are given in Equation (31), where rows and columns enumerate the elements of the ordered set from which both $r$ and $\tilde{r}$ are sampled. For panel (b), the transition probabilities are given in Equation (32), with rows enumerating the elements of the set from which $r$ is sampled, and columns those of the set from which $\tilde{r}$ is sampled.
An important issue is to identify the situations in which $\Delta I$ gives exactly the same result as either $\Delta I^{D}$ or $\Delta I^{DL}$. It is not easy to determine the conditions for the equality between $\Delta I$ and $\Delta I^{DL}$. Yet, for the equality between $\Delta I$ and $\Delta I^{D}$, and in the specific case in which $P(\tilde{r}|s) = P_{\mathrm{ni}}(\tilde{r}|s)$ as given by Equation (17), the following theorem holds.
Theorem 3.
When assessing the relevance of noise correlations, $\Delta I = \Delta I^{D}$ if and only if the condition stated in Equation (33) holds. Moreover, the equality implies that both measures are non-negative.
Proof.
See Appendix B.4. □
Equation (33) implies that neither the prior stimulus probabilities nor the conditional response probabilities intervene in the condition for the equality, beyond the effect they have in fixing the values of $\Delta I$ and $\Delta I^{D}$. Each response makes a contribution to the difference between $\Delta I^{D}$ and $\Delta I$, favouring $\Delta I^{D} > \Delta I$ when the contribution is positive, and the reverse inequality in the opposite case. As pointed out by [10], all responses for which the real and the noise-independent posterior stimulus probabilities coincide give a null contribution to $\Delta I^{D}$, and a negative contribution to $\Delta I$, implying that correlations in such responses are irrelevant for decoding, and detrimental to encoding.
The fact that encoding-oriented measures neither bound nor are bounded by decoding-oriented measures is a daunting result. If, when working on a specific example, one gets a positive value with one measure and a negative value with another, the interpretation must carefully distinguish between the two paradigms. One may wonder, however, if such a distinction is also required when correlations are absolutely essential for one of the measures, in that they capture the whole of the encoded information. Could the other measure conclude that they are irrelevant? Or that they are only mildly relevant? Luckily, in this case, the answer is negative. In other words, when the tested feature is fundamental, then $\Delta I$ and $\Delta I^{D}$ coincide, and no conflict arises between encoding and decoding, as proven by the following theorem:
Theorem 4.
$\Delta I = I_{R}$ if and only if $\Delta I^{D} = I_{R}$, regardless of whether stochastic codes exist that map the actual responses into the surrogate responses generated assuming noise independence.
Proof.
See Appendix B.5. □
The conclusion is that if a given feature is 100% relevant for encoding, then it is also 100% relevant for decoding, and vice versa. Hence, although $\Delta I$ and $\Delta I^{D}$ often differ in the relevance they ascribe to a given feature, the discrepancy is only encountered when the tested feature is not the only informative feature in play. When the removal of the feature is catastrophic (in the sense that it brings about a complete information loss), then both $\Delta I$ and $\Delta I^{D}$ diagnose the situation equally.
2.5. Relation between Measures Based on Decoding Strategies $\Delta_{R}$ and $\Delta_{\tilde{R}}$
The results of Table 1 may seem puzzling because decoding happens after encoding. Therefore, one may naively reason, the data processing theorems should have forbidden $\Delta I$ to surpass $\Delta I^{D}$, $\Delta I^{L}$, or $\Delta I^{B}$, as well as $\Delta\tilde{A}$ to surpass $\Delta A^{B}$. However, even though decoding indeed happens after encoding, the data processing theorem is not violated. The theorem certainly ensures that $\Delta I$ and $\Delta\tilde{A}$ constitute lower bounds for measures related to decoders that operate on responses generated by $P(\tilde{r}|s)$, but not for measures related to decoders that operate on responses generated by $P(r|s)$, such as $\Delta I^{D}$, $\Delta I^{DL}$, $\Delta I^{L}$ and $\Delta A^{B}$.
This observation about the validity of the data processing inequality is different from the one discussed in Section 2.2. There, we discussed the conditions under which $\Delta I$ could be guaranteed to be non-negative, the crucial factor being the existence of a stochastic mapping $P(\tilde{r}|r)$. Now we are discussing a different aspect, regarding whether decoding-related measures can or cannot be bounded by encoding-oriented measures. The conclusion is that, in general terms, the answer is negative, because decoding-related measures operate with decoding strategy $\Delta_{R}$, a strategy never addressed by the encoding measures. The surrogate variable $\tilde{R}$ participating in the encoding measure is not the response decoded by the measures of Equations (25)–(28), so the data processing inequalities need not hold. That being said, there are specific instances in which both types of measures coincide, two of them discussed in Theorems 3 and 4, and a third case later in Theorem 5.
Other explanations have been given in the literature for the fact that sometimes, decoding-oriented measures surpass their encoding counterparts. For example, it has been alleged [10] that when decoding-oriented measures are smaller than $\Delta I$, this is either due to (a) the impossibility of defining a stimulus-independent reduction that yields $P(\tilde{r}|s)$ (and therefore the data-processing inequality is not guaranteed to hold), or due to (b) the fact that surrogate responses often sample values of the response space that are never reached by real responses (and therefore, the losses of matched decoders may be larger than those of mismatched ones). However, Figure 2c constitutes a counterexample to both arguments, since there, the stimulus-independent stochastic reduction exists, and the response sets of $R$ and $\tilde{R}$ coincide.
One could also wonder whether the discrepancy between the values obtained with encoding-oriented measures and decoding-oriented measures only occurs in examples where a stochastic reduction exists, and the involved transition matrix depends on the joint probabilities $P(r|s)$, and not only on the marginals, as discussed in Theorem 2. However, Figure 2b,c provide examples in which $P(\tilde{r}|r)$ does not depend on $P(r|s)$, and yet, the discrepancies are still observed.
The distinction between decoding strategies $\Delta_{R}$ and $\Delta_{\tilde{R}}$ is also crucial when using the measure $\Delta I^{D}$. This measure was introduced by Nirenberg et al. [8] for the specific case in which the tested feature is the amount of noise correlations, that is, when $P(\tilde{r}|s) = P_{\mathrm{ni}}(\tilde{r}|s)$. The measure was later extended to arbitrary deterministic mappings [10,12,13], with the instruction to use an expression like Equation (25), but with the real responses replaced by the surrogate ones. It should be noted, however, that as soon as this replacement is made, $\Delta I^{D}$ becomes exactly equal to its matched counterpart. Specifically, the measure now describes the information loss of a decoder that operates on a response variable generated with the surrogate distribution (decoding method $\Delta_{\tilde{R}}$). If we want to keep the original spirit, and associate $\Delta I^{D}$ with a decoder that operates on a response variable generated with the real distribution (decoding method $\Delta_{R}$), the responses entering Equation (25) should not be modified. Only the evaluation of the surrogate posterior at the experimentally observed value describes a mismatched decoder constructed with $P(\tilde{r}|s)$ and operated on $R$ (mathematical details in Appendix C).
2.6. Assessing the Type of Information Encoded by Individual Response Features
When the stimulus contains several attributes (such as shape, color, or sound), removing a specific response feature makes it possible to assess not only how much information is encoded by the feature, but also what type of information. Identifying the type of encoded information implies determining the stimulus attribute represented by the tested response feature. As shown in this section, the type of encoded information is as dependent on the method of removal as is the amount. In other words, the different measures defined in Equations (21)–(29) sometimes associate a feature with the encoding of different stimulus attributes.
In the example of Figure 8, we use four compound stimuli, generated by choosing independently a frame and a letter (Ⓐ or Ⓑ). Stimuli are transformed into neural responses with different numbers of spikes () fired at different first-spike latencies (; time has been discretized in 5 ms bins). Latencies are only sensitive to frames, whereas spike counts are only sensitive to letters, thereby constituting independent information streams: [33]. The equality of the numerical values of two measures does not imply that both measures assign the same meaning to the information encoded by the tested response feature. Indeed, the two measures may sometimes report the tested response feature to encode two different aspects of the set of stimuli. Consider a decoder that is trained using the noisy data shown in Figure 8a, but is asked to operate either on the same noisy data with which it was trained (strategy ), or on the quality data of Figure 8b (strategy ). The information losses , , and are all equal to of . Therefore, the information loss is independent of whether, in the operation phase, the decoder is fed with responses generated with or with .
Figure 8.
Assessing the amount and type of information encoded by . (a) Noisy data recorded in response to the compound stimulus ; (b) Quality data recorded as in panel (a), but without noise; (c) Stimulus-independent stochastic transformation with transition probabilities given by Equation (34), which introduces independent noise in both the latencies and the spike counts, thereby transforming into and rendering a stochastic code; (d) Degraded data obtained by adding latency noise to the quality data; (e) Representation of the stimulus-independent stochastic transformation with transition probabilities given by Equation (35), which adds the latency noise of panel (d).
The transformation causes some responses to occur for all stimuli, so when decoding with method , some information about frames is lost (that is, of ), as well as some information about letters (that is, of ). In other words, decoding causes a partial information loss that comprises both frame and letter information. Instead, when decoding with method , there is no information loss about letters: for the responses that actually occur, the decoder trained with can perfectly identify the letters, because . The information about frames, on the other hand, is completely lost, since whenever l adopts a value that actually occurs in , namely 2 or 3. This example shows that the fact that two decoding procedures give the same numerical loss does not mean that they draw the same conclusions regarding the role of the tested feature in the neural code. Analogous computations yield analogous results for the hypothetical experiment shown in Figure 7b.
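The notion of independent information streams can be made concrete with a noiseless toy version of this code (an illustrative mapping of our own, not the data of Figure 8): the latency is fixed by the frame alone and the spike count by the letter alone, so the informations of the two streams add up to the full information:

```python
import numpy as np
from collections import defaultdict

def mi(joint):
    """I(X;Y) in bits from a dict {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * np.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

latency = {0: 2, 1: 3}  # hypothetical: the frame alone fixes the latency (in bins)
count = {0: 1, 1: 2}    # hypothetical: the letter alone fixes the spike count

# Four equiprobable compound stimuli (frame, letter) -> response (l, c).
p_sr = {((f, b), (latency[f], count[b])): 0.25 for f in (0, 1) for b in (0, 1)}

full = mi(p_sr)                                          # I(S; (L, C)) = 2 bits
frames = mi({(f, l): 0.5 for f, l in latency.items()})   # I(frame; L) = 1 bit
letters = mi({(b, c): 0.5 for b, c in count.items()})    # I(letter; C) = 1 bit
print(full, frames, letters)  # the streams are additive: 2.0 = 1.0 + 1.0
```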
If responses and are written as vectors , and the values of are arranged in a rectangular structure, in Figure 8c the transition probabilities are
where rows and columns indicate the ordered sets and , where × denotes the Cartesian product with colexicographical order, that is, ordered as , etc. In Figure 8e
with rows and columns following the same convention as in Equation (34).
2.7. Conditions for Equality of the Amount and Type of Information Loss Reported by Different Measures
We now derive the conditions under which encoding/gray-oriented measures coincide with their decoding-oriented counterparts, as observed in Figure 2a and Figure 3d. That is, we derive the conditions under which the following equalities hold:
The example in Figure 7a showed that the existence of deterministic mappings does not suffice for a qualitative and quantitative equivalence of different measures. Furthermore, the example of Figure 3b showed that the equalities require the space of to include the space of , or else the decoding method may be undefined. We demonstrate that Equations (37)–(40) hold, and moreover, that there is no discrepancy in the type of information assessed by these different measures, whenever the mapping from into can be described using positive-diagonal idempotent stochastic matrices [45]. Specifically, we prove the following theorem:
Theorem 5.
Consider a stimulus-independent stochastic function f from a representation into another representation , such that the range of includes that of , and with transition probabilities that can be written as positive-diagonal idempotent right stochastic matrices with row and column indices that enumerate the elements of in the same order. Then, Equations (37)–(40) hold.
Proof.
See Appendix B.6. □
The theorem states that the equalities of Equations (37)–(40) can be guaranteed whenever the removal of the tested response feature involves a (deterministic or) stochastic mapping that induces a partition within the set of real responses , and is obtained by rendering all responses inside each partition indistinguishable (but not across partitions). To sample , the probabilities of individual responses inside each partition are re-assigned, rendering their distinction uninformative [30].
This theorem provides sufficient but not necessary conditions for the equalities to hold. The important aspect, however, is that it ensures that the equalities hold not only in numerical value, but also in the type of information that different measures ascribe to the tested feature. Two different methods preserve or lose information of different types if, when decoding a stimulus, the trials with decoding errors tend to confound different attributes of the stimulus, as in the example of Figure 8. The conditions of Theorem 5, however, ensure that the strategies and always decode exactly the same stimulus (see Appendix B.6), so there can be no difference in the confounded attributes. Pushing the argument further, one could even argue that responses (real or surrogate) encode more information than the identity of the stimulus that originated them. For a fixed decoded stimulus, the response still contains additional information [46], which refers to (a) the degree of certainty with which the stimulus is decoded, and (b) the ranking of the alternative stimuli, in case the decoded stimulus was mistaken [20]. Both meanings are embodied in the whole ranking of a posteriori probabilities , not just the maximal one. Yet, under the conditions of the theorem, the entire rankings obtained with methods and coincide (see Appendix B.6). Therefore, even within this broader interpretation, there can be no difference in the qualitative aspects of the information preserved or lost by one method and the other.
For example, in Figure 7b, we found that all information losses are equal (that is, , , , , , , and are all ), and both accuracy losses are equal (that is, and are both ). However, the conditions of Theorem 5 do not hold. The matrix of Equation (32) is not block-diagonal, nor can it be brought to that shape by incorporating new rows (to make it square) and permuting both rows and columns in such a way that the response vectors are enumerated in the same order by both indices. For this reason, the losses are not guaranteed to be of the same type.
Instead, the transition probabilities of Equations (15) and (16) can be turned into positive-diagonal idempotent right stochastic matrices. Equation (15) is already in the required format. To bring Equation (16) to the conditions of Theorem 5, two new rows need to be incorporated, associated with the responses and , which do not occur experimentally. Those rows can contain arbitrary values, since the condition renders them irrelevant. Arranging the columns so that both rows and columns enumerate the same list of responses, Equation (16) can be written as
with . Hence, in these two examples, both the amount and the type of information reported by encoding- and decoding-based measures coincide.
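The conditions of Theorem 5 are straightforward to test numerically. The sketch below uses an illustrative matrix of our own (not Equation (15) or (16)): it checks that a candidate transition matrix is right stochastic, positive on the diagonal, and idempotent, and reads the induced partition of the response set off its support:

```python
import numpy as np

# Illustrative block structure: responses {r0, r1} become indistinguishable,
# while r2 stays alone. Within a block, all rows coincide and contain no zeros.
Q = np.array([[0.6, 0.4, 0.0],
              [0.6, 0.4, 0.0],
              [0.0, 0.0, 1.0]])

assert np.allclose(Q.sum(axis=1), 1.0)  # right stochastic
assert np.all(np.diag(Q) > 0)           # positive diagonal
assert np.allclose(Q @ Q, Q)            # idempotent

# The support of each row identifies the partition class of that response.
partition = {tuple(np.flatnonzero(row)) for row in Q}
print(partition)  # {(0, 1), (2,)} -> two classes, as expected
```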
2.8. Improving the Performance of Decoders Operating with Strategy
In a previous paper [17], we demonstrated that neither nor constitute lower bounds on the information loss induced by decoders constructed by disregarding the tested response feature. This means that some decoders may exist that perform better than the one defined in Equation (6). In this section we discuss one possible way in which some of these improved decoders may be constructed, inspired by the example of Figure 8. Quite remarkably, the construction involves the addition of noise to the real responses before feeding them to the decoder of Equation (6). Panel (a) shows a decoder constructed with noisy data (), and then employed to decode quality data (; Figure 8b), thereby yielding information losses . These losses can be decreased by feeding the decoder with a degraded version of the quality data (Figure 8d) generated through a stimulus-independent transformation that adds latency noise (Figure 8e). Decoding as if it were by first transforming into results in , thereby recovering of the information previously lost. In contrast, adding spike-count noise tends to increase the losses. Thus, adding suitable amounts and types of noise can increase the performance of approximate decoders, and the result is not limited to the case in which the tested response aspect is the amount of noise correlations. In addition, this result also indicates that, contrary to what was previously thought [47], decoding algorithms need not match the encoding mechanisms to perform optimally from an information-theoretical standpoint. All these results are a consequence of the fact that decoders operating with strategy are not optimal, so it is possible to improve their performance by deterministic or stochastic manipulations of the response. In practice, our results open up the possibility of increasing the efficiency of decoders constructed with approximate descriptions of the neural responses, usually called approximate or mismatched decoders, by adding suitable amounts and types of noise to the decoder input.
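The mechanism can be reproduced with a deliberately small toy example (all numbers hypothetical, unrelated to Figure 8): a mismatched maximum a posteriori decoder systematically misreads one response, and a stimulus-independent noise that shifts part of that response onto a region where the model happens to be accurate raises the decoding accuracy. We use accuracy rather than information for brevity; the logic is the same.

```python
import numpy as np

# Mismatched model posterior q(s | r): 2 stimuli (rows) x 3 responses (columns).
q = np.array([[0.4, 0.3, 0.9],   # q(s = 0 | r)
              [0.6, 0.7, 0.1]])  # q(s = 1 | r)

# Assumed noiseless true code: stimulus s always evokes response r = s.
# Stimulus-independent input noise: half of the r = 0 trials become r = 2.
N = np.array([[0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

def accuracy(noise):
    """Probability of decoding s correctly with q, after noising the input."""
    acc = 0.0
    for s in (0, 1):                       # equiprobable stimuli
        for r, w in enumerate(noise[s]):   # the true response is r = s
            acc += 0.5 * w * (np.argmax(q[:, r]) == s)
    return acc

print(accuracy(np.eye(3)))  # 0.50: on raw responses the decoder always says 1
print(accuracy(N))          # 0.75: noisy inputs recover part of the loss
```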
3. Related Issues
3.1. Relation to Decomposition-Based Methods
Many measures of different types have been developed to assess how different response features of the neural code interact with each other. Some are based on direct comparisons between the information encoded by individual features, or collections of features (see for example [48,49,50], to cite just a few among many). Others distinguish between two or more potential dynamical models of brain activity [51], for example, by differentiating between conditional and unconditional correlations between neurons in the frequency domain [52]. Yet others rely on decompositions or projections based on information geometry. In those, the mutual information between stimuli and responses is broken down as , where represents the information contributed by the individual response feature , and the remaining terms incorporate the synergy or redundancy between them. In the original approaches [53,54,55,56,57], the terms represented the information encoded in single response aspects, irrespective of what is encoded in other aspects. In later studies [58,59,60,61,62], these terms accounted for the information that is only encoded in individual aspects, taking care to exclude whatever is redundant with other aspects. The approach discussed in this paper is in the line of the studies of Nirenberg et al. [8] and Schneidman et al. [9], and the work derived from them. This line has some similarities and some discrepancies with the decomposition-based studies. Here we comment on some of these relations.
- First, the measure quantifies the relevance of a given feature with the difference . When the surrogate response is equal to the original response with just a single component eliminated, is equal to , where is the collection of all response aspects except . In this case, coincides with the sum of the unique and the synergistic contributions of the dual decompositions in the newest set of methods [63].
- Second, when assessing the relevance of a given response feature, we are often inclined to draw conclusions about the cost of ignoring the tested feature when aiming to decode the original stimulus. As shown in this paper, those conclusions depend not only on how stimuli are encoded, but also on how they are decoded. The decomposition-based methods are mainly focused on the encoding problem, so they are less suited to draw conclusions about decoding.
- Finally, as discussed in Figure 8, not only the amount of (encoded or decoded) information matters, but also its type. Decomposition-based methods, although not yet having reached a full consensus in their formulation, provide a valuable attempt to characterize how both the type and the amount of information are structured within the set of analyzed variables, in a way that is complementary to the present approach, specifically in analyzing the structure of the lattices obtained by associating different response features [58,63].
3.2. The Problem of Limited Sampling
Throughout the paper we assumed that the distribution is known, or is accessible to the experimenter. In the examples, when we calculated information values, we plugged the true distributions into the formulas, without discussing the fact that such distributions may not be easily estimated from finite amounts of data. Whichever method is used to estimate , the outcome is, to a greater or lesser degree, no more than an approximation. Hence, even (which is supposed to be the full information) is estimated approximately. Since is a modified version of , it too can only be estimated approximately. Information measures, including Kullback-Leibler divergences, are highly sensitive to variations in the involved probabilities [20,32,64,65,66,67,68,69], and such variations are unavoidable in high-dimensional response spaces. The assessment of the relevance of a given feature hence requires experiments that contain enough samples to ensure that the correction methods work. When the response space is large, the measures , and the losses of accuracy are less sensitive to limited sampling than , and .
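The severity of the problem is easy to reproduce. In the sketch below (illustrative sizes of our own), stimuli and responses are in fact independent, so the true mutual information is exactly zero; the plug-in estimate is nonetheless positive, and its upward bias only fades as trials accumulate:

```python
import numpy as np

rng = np.random.default_rng(0)

def mi_plugin(counts):
    """Plug-in (maximum likelihood) estimate of I(S;R) in bits."""
    p = counts / counts.sum()
    p_s = p.sum(axis=1, keepdims=True)
    p_r = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (p_s @ p_r)[nz])).sum())

# True code: 4 stimuli x 16 responses, independent, so the true I(S;R) = 0 bits.
p_true = np.full((4, 16), 1.0 / 64)

for n_trials in (50, 500, 5000):
    counts = rng.multinomial(n_trials, p_true.ravel()).reshape(p_true.shape)
    print(n_trials, round(mi_plugin(counts), 4))  # the bias shrinks with n
```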
In addition, the problem of finite sampling can also be formulated as an attempt to determine the relevance of the feature “Accuracy in the estimation of ”. This feature is not a property of the nervous system, but rather of our ability to characterise it. Still, the framework developed here can also handle this methodological problem. The estimated distribution can be interpreted as a stochastic modification of the true distribution . As long as the caveats discussed in this paper are taken into account, the measures of Equations (21)–(29) may serve to evaluate the cost of modeling from finite amounts of data.
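A minimal sketch of this interpretation (all numbers illustrative): the posterior built from the estimated distribution is treated as a surrogate, and the average log-likelihood gap between the true and the estimated posteriors, a loss in the spirit of the mismatched-decoding measures, quantifies the cost of modeling from a finite sample:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed true joint distribution p(s, r), and an estimate from n trials
# (with a small pseudocount to keep the estimated posterior well defined).
p_true = np.array([[0.35, 0.10, 0.05],
                   [0.05, 0.10, 0.35]])
n = 200
counts = rng.multinomial(n, p_true.ravel()).reshape(p_true.shape)
p_hat = (counts + 0.5) / (counts + 0.5).sum()

post_true = p_true / p_true.sum(axis=0, keepdims=True)  # true p(s | r)
post_hat = p_hat / p_hat.sum(axis=0, keepdims=True)     # estimated p(s | r)

# Expected log-likelihood gap, in bits: a non-negative loss that vanishes
# only when the estimated posterior matches the true one.
loss = float((p_true * np.log2(post_true / post_hat)).sum())
print(loss)
```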
4. Conclusions
Several measures have been proposed in the literature to assess the relevance of specific response features in the neural code. All proposals are based on the idea that by removing the tested feature from the response, the neural code deteriorates, and the lost information is a useful measure of the relevance of the feature. In this paper, we demonstrated that the neural code may or may not deteriorate when removing a response feature, depending on the nature of the tested feature and on the method of removal, in ways previously unnoticed. First, we determined the conditions under which the data processing inequality can be invoked. Second, we showed that decoding-oriented measures may result in larger or smaller losses than their encoding (or gray) counterparts, even for response aspects that, unlike noise correlations, can be modeled as stimulus-independent transformations of the full response. Third, we demonstrated that both types of measures coincide under the conditions of Theorem 5. Fourth, we showed that evaluating the role of a response feature in the neural code involves assessing its contribution not only to the amount of encoded information, but also to the meaning of that information; such meaning depends on the employed measure just as much as the amount does. Finally, our results open up the possibility that simple and cheap decoding strategies, based on the addition of an adequate type and amount of noise, may be more efficient and resilient than previously thought. We conclude that the assessment of the relevance of a specific response feature cannot be performed without a careful justification of the selected method of removal.
Author Contributions
Conceptualization, methodology, software, validation, formal analysis, investigation, writing, visualization: H.G.E. Formal analysis, resources, writing, editing: I.S.
Funding
This work was supported by the Ella and Georg Ehrnrooth Foundation, Consejo Nacional de Investigaciones Científicas y Técnicas of Argentina (06/C444), Universidad Nacional de Cuyo (PIP 0256), and Agencia Nacional de Promoción Científica y Tecnológica (grant PICT Raíces 2016 1004).
Conflicts of Interest
The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Appendix A. On the Information and Accuracy Differences
Each value in Table 1 (except for those associated with Figure 3b; see below) was computed using the Nelder-Mead simplex algorithm for optimization, as implemented by the function fminsearch of Matlab 2016. For accuracy reasons, only examples in which and were considered. Furthermore, parameters defining the joint stimulus-response probabilities and the transition matrices were restricted to the interval . Each difference between two measures defined in Equations (21)–(29) was computed repeatedly, with random initial values for the stimulus-response probabilities and the transition matrices, until the value of the difference failed to increase or decrease in 20 consecutive runs.
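For reference, an analogous search can be run outside Matlab with SciPy's Nelder-Mead implementation. The sketch below is a toy analogue of our own, not one of the differences of Table 1: it maximizes the gap between the information of a code and that of a stochastically reduced code, parametrizing the joint distribution through a softmax and restarting from random initial conditions in the spirit of the procedure just described:

```python
import numpy as np
from scipy.optimize import minimize

def mi(p_sr):
    p_s = p_sr.sum(axis=1, keepdims=True)
    p_r = p_sr.sum(axis=0, keepdims=True)
    nz = p_sr > 0
    return float((p_sr[nz] * np.log2(p_sr[nz] / (p_s @ p_r)[nz])).sum())

# Fixed, illustrative stimulus-independent reduction.
Q = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])

def neg_gap(theta):
    p = np.exp(theta - theta.max())
    p_sr = (p / p.sum()).reshape(2, 3)  # softmax: a valid joint table
    return -(mi(p_sr) - mi(p_sr @ Q))   # maximize the information gap

rng = np.random.default_rng(2)
best = min((minimize(neg_gap, rng.normal(size=6), method="Nelder-Mead")
            for _ in range(20)), key=lambda res: res.fun)
print(-best.fun)  # largest gap found over 20 random restarts
```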
The values in Table 1 for Figure 3b were computed analytically with or , but not both. In those cases, the measures , , and are undefined, whereas , for the reasons given in Section 2.4. However, and can vary between and , for example, attaining when , and when and . The information equals the stimulus entropy, regardless of the response probabilities. The values in Table 1 for Figure 3d were computed by setting in Equation (16). The values in Section 2.4 for Figure 7b were obtained by setting for the stimulus-response pairs shown in the figure, and are valid for any transition probability matrix set as in Equation (32) with . The values in Section 2.4 for Figure 8 were obtained by setting for the stimulus-response pairs shown in the figure.
Appendix B. Proofs
Appendix B.1. Derivation of Equation (30)
The definition of involves the probability defined in [11,36,38] as proportional to , where the exponent is chosen so as to maximize . This definition has recently been shown to be invalid when such that for a stimulus s or a response for which [17]. This problem never appears when evaluating the relevance of noise correlations with as stated by Equation (17). Yet, it may well appear in more general cases, including those arising from stochastically reduced codes. To overcome it, we prove the following theorem:
Theorem A1.
The probability that appears in the definition of is
Proof.
According to Latham and Nirenberg [11], the probability is the one that minimizes the Kullback-Leibler divergence with respect to the distribution , subject to the constraints
The minimization problem can be formulated in terms of an objective function to be minimized, in which the constraints appear with Lagrange multipliers, and is the one accompanying Equation (A1). Using the standard conventions that and for , Equation (A1) is fulfilled if such that if . The first part of the theorem immediately follows by solving Equation (B15) in [11] as indicated there, with . If such that , then Equation (A1) is fulfilled only if when . The second and third parts of the theorem immediately follow from Bayes’ rule. □
Appendix B.2. Proof of Theorem 1
Proof.
The second part is proved by the two examples in Figure 4. The first part was proved in [9], at least for cases in which the set of surrogate responses differs from the set of real responses . When the two sets coincide, we can prove the first part by contradiction, assuming that a deterministic mapping exists from into . If both variables sample the same response space, the deterministic mapping must be one-to-one; otherwise the variable would sample a smaller set. Therefore, both and maximize the conditional entropy given S over the probability distributions with the same marginals, since one-to-one mappings do not modify the entropy, and is defined as the distribution with maximal conditional entropy with fixed marginals. Because the probability distribution achieving this maximum is unique [16], and must be the same, thereby proving the theorem. □
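The step “one-to-one mappings do not modify the entropy” can be verified directly: relabeling the responses with a permutation leaves the mutual information unchanged. A minimal sketch with illustrative numbers:

```python
import numpy as np

def mi(p_sr):
    p_s = p_sr.sum(axis=1, keepdims=True)
    p_r = p_sr.sum(axis=0, keepdims=True)
    nz = p_sr > 0
    return float((p_sr[nz] * np.log2(p_sr[nz] / (p_s @ p_r)[nz])).sum())

p_sr = np.array([[0.30, 0.15, 0.05],
                 [0.05, 0.15, 0.30]])

perm = [2, 0, 1]                    # a one-to-one relabeling of the responses
print(mi(p_sr), mi(p_sr[:, perm]))  # identical values
```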
Appendix B.3. Proof of Theorem 2
Proof.
We prove the dependency on the marginal likelihoods by computing for the hypothetical experiment of Figure 4a, and observing that the result depends on the marginal likelihood . To that end, we rewrite Equation (10) for as
Note that and . Using this and rearranging the terms, we obtain the quadratic equation
that is solved by
where . Hence, any change in must be followed by some change in , thereby proving the first part.
We prove the dependency on the noise correlations by computing for the hypothetical experiment of Figure 5, and observing that the result not only depends on the marginal likelihoods and , but in many cases, it also depends on the joint distributions . Hence, varying the amount of noise correlations, even if keeping the marginals fixed, yields a variation in the mapping .
We proceed by reductio ad absurdum. If does not depend on the amount of noise correlations in , we may assume that if we vary but keep the marginals fixed, the transition probabilities remain unchanged. Under this hypothesis, Equation (10) is valid for many choices of . In this context, consider the set of all response distributions with the same marginals as that can be turned into through . This set includes , and therefore, should be able to transform into itself. In addition, Property 2 requires that when because either or for those responses. Normalization yields . Furthermore, computing Equation (10) for yields
which shows that when . Consequently, the resulting yields through Equation (10) that . After noticing that
and that
we can show that, after some straightforward algebra, Equation (10) only holds if for all . Thus, the initial hypothesis yields a transition matrix that is unable to transform into when is noise-correlated, so necessarily depends on the amount of noise correlations in . □
Appendix B.4. Proof of Theorem 3
Proof.
The condition implies that
However,
Hence, Equation (A3) becomes
In addition, when evaluating the relevance of noise correlations, as established by Equation (17). Hence,
Replacing Equations (A5) and (A6) in Equation (A4),
Summing in s, and rearranging,
If we start with an inequality instead of an equality, the same inequality is preserved throughout the proof. □
Appendix B.5. Proof of Theorem 4
Proof.
Consider a neural code and recall that the range of includes that of . Therefore, implies that the minimum in Equation (26) is attained when . In that case, Equation (B13a) in [11] yields
After some more algebra, and recalling that the Kullback-Leibler divergence is never negative, this equation becomes , implying that, when read in isolation, single responses contain no information about the stimulus. Consequently , thereby proving the “only if” part. For the “if” part, it suffices to notice that the last equality implies that . □
Appendix B.6. Proof of Theorem 5
Proof.
The conditions on f and ensure that can be written as a block-diagonal matrix, each block composed of identical rows with no zero entries, and each block associated with a non-overlapping subset of the range of f. Under these conditions, when . Hence, for , , yielding and . Recomputing Equations (21)–(29) with these equalities in mind immediately yields the equalities in the theorem.
Even when the amount of information is equal, differences in the type of information may arise because the measures are based on different decoding strategies, here denoted and . However, under the conditions of the theorem, decoding strategy and decoding strategy are one and the same. Because , both decoding strategies choose s only based on the partition of or , respectively. Mathematically, both choose s according to
where denotes the mapping from into , which is the same regardless of whether is or . Because maps each partition onto itself, the responses within each partition of are completely generated by the responses in the corresponding partition of , and thus the decoding strategies are applied to the same set of . Hence, both decoding strategies are defined and operate in the same manner, yielding the same information. □
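The heart of the proof can be illustrated numerically (values of our own choosing): with a block-diagonal idempotent transition matrix, the surrogate posterior is constant inside each block, so every response of a given block is decoded to the same stimulus irrespective of the strategy:

```python
import numpy as np

Q = np.array([[0.6, 0.4, 0.0],   # illustrative block-diagonal idempotent
              [0.6, 0.4, 0.0],   # matrix: block {r0, r1} and block {r2}
              [0.0, 0.0, 1.0]])

p_sr = np.array([[0.20, 0.25, 0.05],   # illustrative joint p(s, r)
                 [0.10, 0.15, 0.25]])

p_sr_t = p_sr @ Q                     # surrogate joint p(s, r_tilde)
post_t = p_sr_t / p_sr_t.sum(axis=0)  # surrogate posterior

# The posterior is constant within the block {r0, r1} ...
assert np.allclose(post_t[:, 0], post_t[:, 1])
# ... so both strategies decode any response of that block identically.
print(np.argmax(post_t, axis=0))  # [0 0 1]
```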
Appendix C. On the Computation of ΔID
The information loss caused by mismatched decoders (decoding strategy ) when has previously been computed as but with replaced by [10,12,13]. The latter represents the probability of s given that takes the value , thereby limiting f to deterministic mappings. However, the probabilities and are not equivalent, since
These two definitions raise the question of which alternative is the appropriate one when computing the information loss caused by mismatched decoders.
To resolve this question, notice that replacing with in Equation (6) yields the decoding algorithm
This algorithm entails first transforming the observed into , and then choosing the stimulus with a matched probability. Hence, its operation is analogous to the decoding algorithm , and not, as originally intended, to the decoding algorithm .
To illustrate the difference, recall the experiment in Figure 7a and suppose that the observed response is . The decoding algorithm reads this value, computes , and decodes . Instead, the decoding algorithm proposed in [10,12,13] first transforms the value of into , then computes , and finally decodes . This mode of operation corresponds to the decoding algorithm .
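In code, the two modes of operation differ only in which response value is handed to the surrogate posterior, and this alone can change the decoded stimulus. A minimal sketch with a hypothetical posterior and reduction (not the actual values of Figure 7a):

```python
import numpy as np

# Hypothetical surrogate posterior over 2 stimuli and 3 responses.
post_tilde = np.array([[0.8, 0.2, 0.4],
                       [0.2, 0.8, 0.6]])

f = {0: 2, 1: 1, 2: 2}  # assumed deterministic reduction r -> r_tilde

def decode_mismatched(r):
    """Evaluate the surrogate posterior at the observed real response r."""
    return int(np.argmax(post_tilde[:, r]))

def decode_transform_then_match(r):
    """First reduce r to f(r), then decode with the matched surrogate."""
    return int(np.argmax(post_tilde[:, f[r]]))

print(decode_mismatched(0), decode_transform_then_match(0))  # 0 vs. 1
```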
The above discrepancy can also be seen from the change in the operational meaning of caused by the replacement. To that end, recall that was first introduced as a comparison between the average number of binary questions required to identify s after observing when using two optimal question-asking strategies, one tailored for and the other for [8]. Mathematically, this difference can be written as
In each term, the argument of the logarithms is determined by the question-asking strategy, whereas the weight of the averages is determined by the probability distribution of the variables on which the strategy is applied [8,10,16]. Equation (A7) describes the decoding strategy .
Replacing with turns Equation (A7) into
Unlike , this difference compares the average number of binary questions required to identify s after observing using a question-asking strategy that is optimal for , with the average number of binary questions required to identify s after observing using a question-asking strategy that is optimal for . This is the way the decoding strategy operates, not .
Naively, one may think that a change in , regardless of its size, may turn the measure , typically regarded as an encoding-oriented measure and here linked to the decoding algorithm , into the decoding-oriented measure . However, notice that this change cannot occur through the equations above due to the change induced in . For that to actually occur, one must write differently, as for example:
In this reformulation, the second term can be interpreted as the average number of binary questions required to identify s after observing using a question-asking strategy that is optimal for , but only after converting into . Any change in immediately renders a mismatched probability for , and makes the second term represent the average number of binary questions required to identify s after observing using the question-asking strategy that is optimal for an altered version of , but only after converting into ; this need not resemble the meaning of the second term in .
References
- Adrian, E.D. The impulses produced by sensory nerve endings. J. Physiol. 1926, 61, 49–72. [Google Scholar] [CrossRef] [PubMed]
- Hubel, D.H.; Wiesel, T.N. Receptive fields of single neurones in the cat’s striate cortex. J. Physiol. 1959, 148, 173–180. [Google Scholar] [CrossRef]
- Thorpe, S.; Fize, D.; Marlot, C. Speed of processing in the human visual system. Nature 1996, 381, 520–522. [Google Scholar] [CrossRef] [PubMed]
- Abeles, M. Corticonics: Neural Circuits of the Cerebral Cortex; Cambridge University Press: Cambridge, UK, 1991. [Google Scholar]
- Gray, C.M.; König, P.; Engel, A.K.; Singer, W. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature 1989, 338, 334–337. [Google Scholar] [CrossRef] [PubMed]
- Franke, F.; Fiscella, M.; Sevelev, M.; Roska, B.; Hierlemann, A.; da Silveira, R.A. Structures of Neural Correlation and How They Favor Coding. Neuron 2016, 89, 409–422. [Google Scholar] [CrossRef] [PubMed]
- O’Keefe, J. Hippocampus, theta, and spatial memory. Curr. Opin. Neurobiol. 1993, 3, 917–924. [Google Scholar] [CrossRef]
- Nirenberg, S.; Carcieri, S.M.; Jacobs, A.L.; Latham, P.E. Retinal ganglion cells act largely as independent encoders. Nature 2001, 411, 698–701. [Google Scholar] [CrossRef] [PubMed]
- Schneidman, E.; Bialek, W.; Berry, M.J. Synergy, redundancy, and independence in population codes. J. Neurosci. 2003, 23, 11539–11553. [Google Scholar] [CrossRef] [PubMed]
- Nirenberg, S.; Latham, P.E. Decoding neuronal spike trains: How important are correlations? Proc. Natl. Acad. Sci. USA 2003, 100, 7348–7353. [Google Scholar] [CrossRef] [PubMed]
- Latham, P.E.; Nirenberg, S. Synergy, redundancy, and independence in population codes, revisited. J. Neurosci. 2005, 25, 5195–5206. [Google Scholar] [CrossRef] [PubMed]
- Quiroga, R.Q.; Panzeri, S. Extracting information from neuronal populations: Information theory and decoding approaches. Nat. Rev. Neurosci. 2009, 10, 173–185. [Google Scholar] [CrossRef] [PubMed]
- Latham, P.E.; Roudi, Y. Role of correlations in population coding. In Principles of Neural Coding; Panzeri, S., Quian Quiroga, R., Eds.; CRC Press: Boca Raton, FL, USA, 2013; Chapter 7; pp. 121–138. [Google Scholar]
- Casella, G.; Berger, R.L. Statistical Inference, 2nd ed.; Duxbury Press: Duxbury, MA, USA, 2002. [Google Scholar]
- Panzeri, S.; Brunel, N.; Logothetis, N.K.; Kayser, C. Sensory neural codes using multiplexed temporal scales. Trends Neurosci. 2010, 33, 111–120. [Google Scholar] [CrossRef] [PubMed]
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: New York, NY, USA, 2006. [Google Scholar]
- Eyherabide, H.G.; Samengo, I. When and why noise correlations are important in neural decoding. J. Neurosci. 2013, 33, 17921–17936. [Google Scholar] [CrossRef] [PubMed]
- Knill, D.C.; Pouget, A. The Bayesian brain: The role of uncertainty in neural coding and computation. Trends Neurosci. 2004, 27, 712–719. [Google Scholar] [CrossRef] [PubMed]
- Van Bergen, R.S.; Ma, W.J.; Pratte, M.S.; Jehee, J.F.M. Sensory uncertainty decoded from visual cortex predicts behavior. Nat. Neurosci. 2015, 18, 1728–1730. [Google Scholar] [CrossRef] [PubMed]
- Ince, R.A.A.; Senatore, R.; Arabzadeh, E.; Montani, F.; Diamond, M.E.; Panzeri, S. Information-theoretic methods for studying population codes. Neural Netw. 2010, 23, 713–727. [Google Scholar] [CrossRef] [PubMed]
- Reinagel, P.; Reid, R.C. Temporal coding of visual information in the thalamus. J. Neurosci. 2000, 20, 5392–5400. [Google Scholar] [CrossRef] [PubMed]
- Panzeri, S.; Petersen, R.S.; Schultz, S.R.; Lebedev, M.; Diamond, M.E. The Role of Spike Timing in the Coding of Stimulus Location in Rat Somatosensory Cortex. Neuron 2001, 29, 769–777. [Google Scholar] [CrossRef]
- Rokem, A.; Watzl, S.; Gollisch, T.; Stemmler, M.; Herz, A.V.M.; Samengo, I. Spike-timing precision underlies the coding efficiency of auditory receptor neurons. J. Neurophysiol. 2006, 95, 2541–2552. [Google Scholar] [CrossRef] [PubMed]
- Lefebvre, J.L.; Zhang, Y.; Meister, M.; Wang, X.; Sanes, J.R. γ-Protocadherins regulate neuronal survival but are dispensable for circuit formation in retina. Development 2008, 135, 4141–4151. [Google Scholar] [CrossRef] [PubMed]
- Victor, J.D.; Purpura, K.P. Nature and precision of temporal coding in visual cortex: A metric-space analysis. J. Neurophysiol. 1996, 76, 1310–1326. [Google Scholar] [CrossRef] [PubMed]
- Victor, J.D. Spike train metrics. Curr. Opin. Neurobiol. 2005, 15, 585–592. [Google Scholar] [CrossRef] [PubMed]
- Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
- Fano, R.M. Transmission of Information; The MIT Press: Cambridge, MA, USA, 1961. [Google Scholar]
- DeWeese, M.R.; Meister, M. How to measure the information gained from one symbol. Netw. Comput. Neural Syst. 1999, 10, 325–340. [Google Scholar] [CrossRef]
- Eyherabide, H.G.; Samengo, I. Time and category information in pattern-based codes. Front. Comput. Neurosci. 2010, 4, 145. [Google Scholar] [CrossRef] [PubMed]
- Eckhorn, R.; Pöpel, B. Rigorous and extended application of information theory to the afferent visual system of the cat. I. Basic concepts. Kybernetik 1974, 16, 191–200. [Google Scholar] [CrossRef] [PubMed]
- Panzeri, S.; Treves, A. Analytical estimates of limited sampling biases in different information measures. Network 1996, 7, 87–107. [Google Scholar] [CrossRef] [PubMed]
- Eyherabide, H.G. Disambiguating the role of noise correlations when decoding neural populations together. arXiv, 2016; arXiv:1608.05501. [Google Scholar]
- MacKay, D.M.; McCulloch, W.S. The limiting information capacity of a neuronal link. Bull. Math. Biophys. 1952, 14, 127–135. [Google Scholar] [CrossRef]
- Fitzhugh, R. The statistical detection of threshold signals in the retina. J. Gen. Physiol. 1957, 40, 925–948. [Google Scholar] [CrossRef] [PubMed]
- Merhav, N.; Kaplan, G.; Lapidoth, A.; Shamai Shitz, S. On information rates for mismatched decoders. IEEE Trans. Inf. Theory 1994, 40, 1953–1967. [Google Scholar] [CrossRef]
- Oizumi, M.; Ishii, T.; Ishibashi, K.; Hosoya, T.; Okada, M. A general framework for investigating how far the decoding process in the brain can be simplified. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2009; pp. 1225–1232. [Google Scholar]
- Oizumi, M.; Ishii, T.; Ishibashi, K.; Hosoya, T.; Okada, M. Mismatched decoding in the brain. J. Neurosci. 2010, 30, 4815–4826. [Google Scholar] [CrossRef] [PubMed]
- Oizumi, M.; Amari, S.I.; Yanagawa, T.; Fujii, N.; Tsuchiya, N. Measuring Integrated Information from the Decoding Perspective. PLoS Comput. Biol. 2016, 12, e1004654. [Google Scholar] [CrossRef] [PubMed]
- Gochin, P.M.; Colombo, M.; Dorfman, G.A.; Gerstein, G.L.; Gross, C.G. Neural ensemble coding in inferior temporal cortex. J. Neurophysiol. 1994, 71, 2325–2337. [Google Scholar] [CrossRef] [PubMed]
- Warland, D.K.; Reinagel, P.; Meister, M. Decoding visual information from a population of retinal ganglion cells. J. Neurophysiol. 1997, 78, 2336–2350. [Google Scholar] [CrossRef] [PubMed]
- Optican, L.M.; Richmond, B.J. Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. III. Information theoretic analysis. J. Neurophysiol. 1987, 57, 162–178. [Google Scholar] [CrossRef] [PubMed]
- Salinas, E.; Abbott, L.F. Transfer of coded information from sensory neurons to motor networks. J. Neurosci. 1995, 10, 6461–6476. [Google Scholar] [CrossRef]
- Geisler, W.S. Sequential ideal-observer analysis of visual discriminations. Psychol. Rev. 1989, 96, 267–314. [Google Scholar] [CrossRef] [PubMed]
- Högnäs, G.; Mukherjea, A. Probability Measures on Semigroups: Convolution Products, Random Walks and Random Matrices, 2nd ed.; Springer: New York, NY, USA, 2011. [Google Scholar]
- Samengo, I.; Treves, A. The information loss in an optimal maximum likelihood decoding. Neural Comput. 2002, 14, 771–779. [Google Scholar] [CrossRef] [PubMed]
- Shamir, M. Emerging principles of population coding: In search for the neural code. Curr. Opin. Neurobiol. 2014, 25, 140–148. [Google Scholar] [CrossRef] [PubMed]
- Gawne, T.J.; Richmond, B.J. How independent are the messages carried by adjacent inferior temporal cortical neurons? J. Neurosci. 1993, 13, 2758–2771. [Google Scholar] [CrossRef] [PubMed]
- Gollisch, T.; Meister, M. Rapid Neural Coding in the Retina with Relative Spike Latencies. Science 2008, 319, 1108–1111. [Google Scholar] [CrossRef] [PubMed]
- Reifenstein, E.T.; Kemptner, R.; Schreiber, S.; Stemmler, M.B.; Herz, A.V.M. Grid cells in rat entorhinal cortex encode physical space with independent firing fields and phase precession at the single-trial level. Proc. Natl. Acad. Sci. USA 2012, 109, 6301–6306. [Google Scholar] [CrossRef] [PubMed]
- Park, H.J.; Friston, K. Nonlinear multivariate analysis of neurophysiological signals. Science 2013, 6158, 1238411. [Google Scholar] [CrossRef] [PubMed]
- Dahlhaus, R.; Eichler, M.; Sandkühler, J. Identification of synaptic connections in neural ensembles by graphical models. J. Neurosci. Methods 1997, 77, 93–107. [Google Scholar] [CrossRef]
- Panzeri, S.; Schultz, S.R.; Treves, A.; Rolls, E.T. Correlations and the encoding of information in the nervous system. Proc. R. Soc. B Biol. Sci. 1999, 266, 1001–1012. [Google Scholar] [CrossRef] [PubMed]
- Schultz, S.R.; Panzeri, S. Temporal Correlations and Neural Spike Train Entropy. Phys. Rev. Lett. 2001, 86, 5823–5826. [Google Scholar] [CrossRef] [PubMed]
- Panzeri, S.; Schultz, S.R. A Unified Approach to the Study of Temporal, Correlational, and Rate Coding. Neural Comput. 2001, 13, 1311–1349. [Google Scholar] [CrossRef] [PubMed]
- Pola, G.; Thiele, A.; Hoffmann, K.P.; Panzeri, S. An exact method to quantify the information transmitted by different mechanisms of correlational coding. Network 2003, 14, 35–60. [Google Scholar] [CrossRef] [PubMed]
- Hernández, D.G.; Zanette, D.H.; Samengo, I. Information-theoretical analysis of the statistical dependencies between three variables: Applications to written language. Phys. Rev. E. 2015, 92, 022813. [Google Scholar] [CrossRef] [PubMed]
- Williams, P.L.; Beer, R.D. Nonnegative decomposition of multivariate information. arXiv, 2010; arXiv:1004.2515. [Google Scholar]
- Harder, M.; Salge, C.; Polani, D. Bivariate Measure of Redundant Information. Phys. Rev. E. 2013, 87, 012130. [Google Scholar] [CrossRef] [PubMed]
- Griffith, V.; Koch, C. Quantifying Synergistic Mutual Information. In Guided Self-Organization: Inception; Prokopenko, M., Ed.; Springer: New York, NY, USA, 2014; Chapter 6; pp. 159–190. [Google Scholar]
- Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying unique information. Entropy 2014, 16, 2161–2183. [Google Scholar] [CrossRef]
- Ince, R.A.A. Measuring Multivariate Redundant Information with Pointwise Common Change in Surprisal. Entropy 2017, 19, 318. [Google Scholar] [CrossRef]
- Chicharro, D.; Panzeri, S. Synergy and Redundancy in Dual Decompositions of Mutual Information Gain and Information Loss. Entropy 2017, 19, 71. [Google Scholar] [CrossRef]
- Wolpert, D.H.; Wolf, D.R. Estimating functions of probability distributions from a finite set of samples. Phys. Rev. E 1995, 52, 6841–6854. [Google Scholar] [CrossRef]
- Samengo, I. Estimating probabilities from experimental frequencies. Phys. Rev. E 2002, 65, 046124. [Google Scholar] [CrossRef] [PubMed]
- Nemenman, I.; Bialek, W.; de Ruyter van Steveninck, R. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E. 2004, 69, 056111. [Google Scholar] [CrossRef] [PubMed]
- Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1253. [Google Scholar] [CrossRef]
- Panzeri, S.; Senatore, R.; Montemurro, M.A.; Petersen, R.S. Correcting for the sampling bias problem in spike train information measures. J. Neurophysiol. 2007, 98, 1064–1072. [Google Scholar] [CrossRef] [PubMed]
- Montemurro, M.A.; Senatore, R.; Panzeri, S. Tight data-robust bounds to mutual information combining shuffling and model selection techniques. Neural Comput. 2007, 11, 2913–2957. [Google Scholar] [CrossRef] [PubMed]
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).