Estimating Mutual Information for Spike Trains: A Bird Song Example

Zebra finches are a model animal used in the study of audition. They are adept at recognizing zebra finch songs, and the neural pathway involved in song recognition is well studied. Here, this example is used to illustrate the estimation of mutual information between stimuli and responses using a Kozachenko–Leonenko estimator. The challenge in calculating mutual information for spike trains is that there are no obvious coordinates for the data. The Kozachenko–Leonenko estimator does not require coordinates; it relies only on the distance between data points. In the case of bird songs, estimating the mutual information demonstrates that the information content of spiking does not diminish as the song progresses.


Introduction
The mutual information between two random variables X and Y is often conveniently described using a diagram like this: where the whole rectangle represents the entropy H(X, Y ) of the joint variable (X, Y ).This is, in general, less than the sum of H(X) and H(Y ) because X and Y are not independent.In this diagram, the purple and green regions together are intended to represent H(X) and the green and yellow regions H(Y ).The purple region on its own represents H(X|Y ): the entropy remaining, on average, when the value of Y is known.In the same way the yellow region represents H(Y |X).Now, the mutual information is represented by the green section: or, by substitution, Here, for illustrative purposes, mutual information is described relative to a specific example: the neural response of cells in the zebra finch auditory pathway to zebra finch song.This is both an interesting neuroscientific example and an example which is typical of a broad set of neuroscience problems.
The zebra finch is a model animal used to study both auditory processing and learning; the male finch sings, he has a single song which begins with a series of introductory notes, followed by two or three repetitions of the motif: a series of complex frequency stacks known as syllables, separated by pauses.Syllables are about 50ms long, with songs lasting about two seconds.The songs have a very rich structure and both male and female zebra finch can distinguish one zebra finch song from another.
Here we use a data set consisting of spike trains recorded while the bird is listening to one of a set of songs and we provide an estimate for the mutual information between the song identity and spike trains recorded from cells in the auditory pathway.This is an interesting and non-trivial problem.Generally, calculating mutual information is costly in terms of data because it requires the estimation of probabilities such as pY (y) and p Y |X (y|x).For this reason, some measure of correlation is often when quantifying the relationship between two random variables.However, not all data types have a correlation: calculating the correlation assumes algebraic properties of the data that are not universal.As an example, calculating the correlation between X and Y requires the calculation of E[XY ] which in turn assumes that it makes sense to multiply x and y values.This is not the case for the typical neuroscience example considered here, where the set of outcomes for X is song identities and for Y , spike trains.To circumvent this, spike trains are often replaced with something else, spike counts for example.However, this involves an implicit assumption about how information is coded.This is likely to be inappropriate in many cases.Indeed, the approach taken to calculating mutual information can involve making very strong assumptions about information coding, the very thing that is being studied.
The purpose of this review paper is to demonstrate a different approach: there is a metric-space version of the Kozachenko-Leonenko estimator [7,8] introduced in [13,3,4] and inspired by [15].This approach has been tested on simulated data, for example in [4], and this shows it to be promising.However, it is important to also test it on real data.Here it is applied in the zebra finch example.

Materials and Methods
be a data set, in our case the xi are the labels for songs in the set of stimuli, with each xi ∈ {1, . . ., ns}; ns is the number of different songs.For a given trial, yi is the spiking response.This will be a point in "the space of spike trains".What exactly is meant by the space of spike trains is less clear, but for our purposes here, the important point is that this can be regarded as a metric space, with a metric that gives a distance between any two spike trains, see [16,10], or, for a review, [5].
Given the data, the mutual information is estimated by where the particular choice of which conditional probability to use, p Y |X rather than p X|Y , has been made for later convenience.Thus, the problem of estimating mutual information is one of estimating the probability mass functions p Y |X and pY at the data points in D. In our example there is no challenge to estimating pX ; since each song is presented an equal number of times during the experiment pX (xi) = 1/ns for all xi and, in general pX (xi) is known from the experiment design.However, estimating p Y |X and pY is more difficult.In a Kozachenko-Leonenko approach this is done by first noting that for a small volume Ri containing the point yi with the estimate becoming more-and-more exact for smaller regions Ri.
If the volume of Ri were reduced towards zero pY (y) would be constant in the resulting tiny region.Here vol(Ri) denotes the volume of Ri.Now the integral R i pY (y) dy is just the probability mass contained in Ri and so it is approximated by the number of points in D that are in Ri: It should be noted at this point that this approximation becomes moreand-more exact as Ri becomes bigger.Using the notation this means This formula provides an estimate for pY (yi) provided a strategy is given for choosing the small regions Ri around each point yi.As will be seen, a similar formula can be derived for p Y |X (yi|xi), essentially by restricting the points to Di = {(xj, yj) ∈ D|xj = xi}: where, hi is the number of points in Ri with label xi and nc is the total number of points with label xi.In the example here nc = n/ns.Once the probability mass functions are estimated, it is easy to estimate the mutual information.However, there is a problem: the estimates also require the volume of Ri.In general, a metric space does not have a volume measure.Furthermore while many everyday metric spaces also have coordinates providing a volume measure, this measure it not always appropriate since the coordinates are not related to the way the data is distributed.However, the space that the yis belong to is not simply a metric space, it is also a space with a probability density, pY (y).This provides a measure of volume: In short, the volume of a region can be measured as the amount of probability mass it contains.This is useful because this quantity can in turn be estimated from data, as before, by counting points: The problem with this, though, is that it gives a trivial estimate of the probability.Substituting back into the estimate for pY (yi), Equation 8, gives pY (yi) = 1 for all points yi.This is not as surprising as it might at first seem, probability density is a volume-measure dependant quantity, that is what is meant by calling it a density and is the reason that entropy is not well-defined on continuous spaces.There is always a choice of coordinate that trivializes the density.
However, it is not the entropy that is being estimated here.It is the mutual information and this is well defined: its value does not change when the volume measure is changed.The mutual information uses more than one of the probability density on the space; in addition to pY (yi) it involves the conditional probabilities p Y |X (y|x).Using the measure defined by pY (y) does not make these conditional probability densities trivial.The idea behind the metric space estimator is to use pY (y) to estimate volumes.This trivialises the estimates for pY (yi) but it does allow us to estimate p Y |X (y|x) and use this to calculate an estimate of the mutual information.
In this way the volume of Ri is estimated from the probability that a data point is in Ri and this, in turn, is estimated by counting points.Thus, to fix the volume vol(Ri) a number h of data points is specified and for each point the h − 1 nearest data points are identified, giving h points in all when the "seed point" is included.This is equivalent to expanding a ball around yi until it has an estimated volume of h/n.This defines the small region Ri.The conditional probability is then estimated by counting how many points in Ri are points with label xi, that is, are points in Di.In fact, this just means counting how many of the h points that have been identified are in Di, or, put another way, it means counting how many of the h − 1 nearest points to the original seed point are from the same stimulus as the seed point.In summary, the small region consists of h points, to estimate p Y |X (yi|xi) the number of points in the small region corresponding to label xi is counted, this is referred to as hi so This is substituted into the formula for the density estimator, Equation 6 to get where, as before nc is the total number of trials for each song.It is assumed that each song is presented the same number of times.It would be easy to change this to allow for different numbers of trials for each song, but this assumption is maintained here for notational convenience.Substituting back into the formula for the estimated mutual information, Equation 4, gives The calculation of I0 is illustrated in Figure 1.The subscript zero has been added in order to preserve the unadorned I for the information itself and Ĩ for the debiased version of the estimator; this is discussed below.This estimate is biased and it gives a non-zero value even if the X and Y are independent.This is a common problem with estimators of mutual information.One advantage of the Kozachenko-Leonenko estimator described here is that the bias at zero mutual information can be calculated exactly.Basically, for the estimator to give a value of zero would require hi = h/ns for every i.In fact, while this is the expected value if X and Y are independent, hi has a probability distribution which can be calculated as a sort of urn problem.As detailed in [17] doing this calculation gives the debiased estimator where I b , the bias, is and u is the probability for the Hypergeometric distribution.Using the A B Figure 1: The calculation of I and the spiking data.A illustrates how the estimator is calculated.The circles and triangle are data points and red and blue represent two labels.The dashed line is the small region around the seed point in the center marked by a triangle ▲.Here h = 7 so the ball has been expanded until it includes seven points.It contains four red points, the colour of the central point, so h ▲ = 4.For illustration the points have been drawn in a two-dimensional space, but this can be any metric space.B describes the data.The spiking responses of a typical neuron to each presentation of a song is plotted as a raster plot, with a mark for each spike.The trials are grouped by song, so the ten responses in each group correspond to repeated presentations of a single stimulus.Stimulus onset is aligned at 0, with the shortest song lasting 1.65 seconds.
parameterization used by distributions.jl Obviously the estimator relies on the choice of the smoothing parameter h.Recall that for small h the counting estimates for the number of points in the small region and for the volume of the small regions are noisy.For large h the assumption the probability density is constant in the small region is poor.These two countervailing points of approximation affect I0 and I b differently.It seems a good strategy in picking h for real data is to maximize Ĩ(h) over h.This is the approach that will be adopted here.

Data
As an example we will use a data set recorded from zebra finch and made available on the Collaborative Research in Computational Neuroscience data sharing website2 [12].This data set contains a large number of recordings from neurons in different parts of zebra finch auditory pathway.The original analysis of these data are described in [2,1].The data set includes different auditory stimuli, here, though only the responses to zebra finch song are considered.There are 20 songs, so ns = 20, and each song is presented ten times, nc = 10, giving n = 200.The zebra finch auditory pathway is complex and certainly does not follow a single track, but for our purposes it looks like where CN is cochlear nuclei, MLd is mesencephalicus lateralis pars dorsalis, analogous to mammalian inferior colliculus, OV is nucleus ovoidalis, Field L is the primary auditory pallium, analogous to mammalian A1 and, finally, HVc is regarded as the locus of song recognition.The mapping of the auditory pathway and our current understanding of how to best associate features of this pathway to features of the mammalian brain is derived from, for example [6,14,9,18,1].
In the data set there are 49 cells from each of MLd and Field L and here the entropy is calculated for all 98 of these cells.

Results
Our interest in considering the mutual information for bird song was to check whether or not the early part of the spike train was more informative about the song identity.It seemed possible that the amount of information later in the spike train would be less than in the earlier portion.This does not seem to be the case.
There are a number of spike train metrics than could be used.Although these differ markedly in the mechanics of how they calculate a distance, it does appear that the more successful among them are equally good at capturing the information content.In Figure 2A the total mutual information between song identity and spike train is plotted.Here the Victor-Purpura (VP) metric [16], the spike count, earth mover distance (EMD) [11] and van Rossum metric [10] are considered.The Victor-Purpura metric and van Rossum metric both include a parameter which can be tuned, roughly corresponding to the precision of spike timing.Here the optimal value for each case has been used, chosen to maximize the average information.These values are q = 32.5 Hz for the VP metric and τ = 15 ms for the vR metric.The mutual information estimator uses the metric to order the points, each small region contains the h − 1 points nearest the seed point so the estimator does not depended on the distances themselves, just the order.Indeed, the estimated mutual information is not very sensitive to the choice of q or τ .This is demonstrated in Figure 2B where the mutual information is calculated as a function of q, the parameter for the VP metric.
The Victor-Purpura metric and van Rossum metric clearly have the highest mutual information and are very similar to each other.This indicates that the estimator is not sensitive to the choice of metric, provided the metric is one that can capture features of the spike timing as well as the overall rate.The spike count does a poor job, again indicating that A B Figure 2: Information content according to different distances.A shows mean mutual information (MI) among the 98 neurons from both regions according to different distance metrics, the Victor Purpura metric, the firing rate, the earth mover distance and the van Rossum metric.To calculate the mutual information 1.65 s of spike train is used, corresponding to the length of the short song.B shows how that mean MI varies according to the q parameter for the Victor-Purpura metric.In both case blue corresponds to MLd and red to Field L. In B the translucent band corresponds to middle 20% of data points; there is substantial variability in information across cells.
there is information contained in spike timing as well as the firing rate.Similar results were seen in [19] and in [5], though a different approach to evaluating the performance of the metrics was used there.
The cells from MLd have higher mutual information, on average, than the cells from Field L. Since Field L is further removed from the auditory nerve than MLd this is to be expected from the information processing inequality.This inequality stipulates that away from the source of information, information can only be lost, not created.
In Figure 3 the information content of the spike trains as a function of time is considered.To do this the spike trains are sliced into 100 ms slices and the information is calculated for each slice.The songs have variable length, so the mutual information becomes harder to interpret after the end of the shortest song, marked by a dashed line.Nonetheless, it is clear that the rate of information, and the information per spike, is largely unchanged through the song.

Discussion
As well as demonstrating the use of the estimator for mutual information, we were motivated here by an interest in the nature of coding in spike trains in a sensory pathway.It is clear that the neurons in MLd and Field L are not "grandmother" neurons, responding only to a specific song and only through the overall firing rate.The firing rate contains considerably less information than was measured using the spike metrics.The spike metrics, in turn, give very similar values for the mutual information, this A B Figure 3: Information content per time.These figures show the time resolved mutual information by calculating the mutual information for spiking response over 0.1 s slices; the centres of which, T , are plotted against the mean mutual information.A shows how this varies over time, with a vertical line showing the ending of the shortest stimulus.B shows the mean information per spike; although A shows a small decrease, B seems to indicate that this corresponds to a reduction in firing rate, not in the information contained in each spike.In both cases the metric is the VP metric with q = 30 Hz.
appears to indicate that the crucial requirement of a spike train metric is a "fuzzy" sensitivity to spike timing.This demonstrates the need for an estimator such as the KL estimator used here.
Approaches that do not incorporate spike timings underestimate the mutual information, but histogram methods, which do include timings are computational impractical for modest amounts of data.A pioneering paper, [19], also examines mutual information for zebra finch song, but using a histogram approach.The substantial conclusion of there was similar to the conclusion here: there was evidence that spike timings are important.However, it seems likely that this early paper was constrainted in its estimates by the size of the data set.This is suggested by the way the amount of information measured increased monotonically as the bin-width in the temporal discretization was reduced, a signature of a data-constrained estimate.
Finally it is observed that it is not the case that the precision of spiking diminishes as the song continues.Since that song can often be identified from the first few spikes of the response, it might be expected that the neuronal firing would become less precise.Precision is metabolically costly.However, although the firing rate falls slightly, the information remains constant on a per-spike basis.