#### 2.1. Individual Information Source

In order to examine ensembles of information sources, we must first describe an individual information source. Each source is itself a system that could produce a wide range of information results (e.g., entropy, mutual information, or transfer entropy values). We have chosen to use a simple system and mutual information values associated with this system as the individual information sources in this study. We wish to emphasize that this simple system is only being used to generate data to demonstrate methods for analyzing ensembles of information sources.

The model system consists of two discrete variables ($X$ and $Y$), each with only two states (0 and 1, where individual states of $X$ and $Y$ are noted as $x$ and $y$, respectively). We could imagine that the $X$ and $Y$ variables represent the spiking state (spike vs. no spike) of two neurons. Thus, $x=1$ would correspond to neuron $X$ spiking and $y=0$ would correspond to neuron $Y$ not spiking. The mutual information between the neurons could then represent the strength of the connection between the neurons in a network. Importantly, because the information theory-based approaches presented herein are highly generalizable across systems, the $X$ and $Y$ variables could be other pairs of neural signals (e.g., blood-oxygen-level-dependent (BOLD) signal values, electroencephalography (EEG) voltages), a neural signal paired with a non-neural signal (e.g., neuron spiking and a visual stimulus), or two other signals entirely (e.g., gene expression for a pair of genes). This point can be somewhat confusing when considering interactions between neural variables. In this case, the information sources are the interactions between the neural variables. In other words, the “ensembles of information sources” we refer to here are the connections between the neural variables in a network.

Because our primary interest is developing methods for use with experimental data, and experimental data usually consist of multiple observations (i.e., trials) of a system, we assume the data from the model contain ${n}_{obs}$ joint observations of both variables. To simplify the mathematics of the model, we further assume that ${n}_{obs}$ is a multiple of 4, though this is not a critical assumption for the analysis. Furthermore, for each of $X$ and $Y$ individually, we assume that half the observations produced a state of 0 and the other half produced a state of 1. By controlling how the states of $X$ and $Y$ are jointly related, we can control the strength of the interaction between $X$ and $Y$, which allows us to control the strength of the mutual information observed between $X$ and $Y$. In addition, by controlling the number of observations, we can explore how this critical experimental parameter influences the detection of significant results.

Specifically, the strength of interaction between $X$ and $Y$ is controlled by the interaction strength variable $s$, which can range from 0 to 1. The number of joint observations of the states of $X$ and $Y$ is shown in Table 1. In this simple model, when $s=0$ (i.e., no interaction is present between $X$ and $Y$), each joint state of $X$ and $Y$ is equally likely (i.e., the distribution across the cells is uniform). When $s=1$ (i.e., there is the strongest possible interaction between $X$ and $Y$), half of the observations consist of the joint state $\left(x=0,y=0\right)$ and the other half consist of the state $\left(x=1,y=1\right)$, which implies that the state of one variable completely determines the state of the other variable.
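Table 1 itself is not reproduced above, but the constraints it must satisfy (uniform cells at $s=0$, all observations in the matched states at $s=1$, and fixed 50/50 marginals throughout) are consistent with a simple linear parameterization of the counts. The sketch below assumes that parameterization; the function name `joint_counts` is ours, not the paper's:

```python
import numpy as np

def joint_counts(n_obs, s):
    """Joint observation counts for the binary variables X and Y.

    Assumes a linear parameterization consistent with the text: at s=0
    all four joint states are equally likely; at s=1 all observations
    fall in the matched states (0,0) and (1,1).  n_obs should be a
    multiple of 4 so every count is an integer.
    """
    matched = n_obs * (1 + s) / 4     # counts for (x=0,y=0) and (x=1,y=1)
    mismatched = n_obs * (1 - s) / 4  # counts for (x=0,y=1) and (x=1,y=0)
    return np.array([[matched, mismatched],
                     [mismatched, matched]])
```

Note that the row and column sums are always ${n}_{obs}/2$ for any $s$, so the marginal distributions of $X$ and $Y$ stay fixed while $s$ moves observations between the matched and mismatched cells.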

To simulate experimental noise in the system, the joint state observations can be randomized using a noise variable $a$, which can range from 0 to 1. The randomization process proceeds by randomly selecting (with uniform likelihood) $Round\left({n}_{obs}a\right)$ joint observations and randomly permuting the $Y$ variable state among the selected joint observations. Therefore, when $a=0$, there is no noise in the system and it is governed only by the interaction strength variable $s$. When $a=1$, the system is completely dominated by noise and the interaction strength variable $s$ has no impact on the eventual joint observations.
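A minimal sketch of this randomization step, assuming the observations are stored as parallel state arrays (`add_noise` is our illustrative name, not the paper's):

```python
import numpy as np

def add_noise(x, y, a, rng=None):
    """Randomly permute the Y state among Round(n_obs * a) selected
    joint observations, leaving X untouched.  a=0 leaves the data
    unchanged; a=1 shuffles Y across all observations."""
    rng = np.random.default_rng() if rng is None else rng
    n_sel = round(len(x) * a)
    sel = rng.choice(len(x), size=n_sel, replace=False)  # uniform selection
    y = y.copy()
    y[sel] = rng.permutation(y[sel])  # shuffle Y only within the selection
    return x, y
```

Because only the pairing of states is shuffled, the marginal distributions of $X$ and $Y$ are preserved for any value of $a$.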

To produce a mutual information value for this simple system, we estimate the joint probability distribution ${p}_{dist}\left(x,y\right)$ by dividing the distribution of joint observation counts by the total number of observations (${n}_{obs}$). (Note, we refer to probability distributions as ${p}_{dist}$ to avoid confusion with p-values from statistical tests, which we simply note as $p$.) We then calculate the mutual information using Equation (1):

$I\left(X;Y\right)=\sum _{x,y}{p}_{dist}\left(x,y\right){\mathrm{log}}_{2}\frac{{p}_{dist}\left(x,y\right)}{{p}_{dist}\left(x\right){p}_{dist}\left(y\right)}$
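As an illustrative sketch (not the authors' code), the plug-in estimate of the mutual information in Equation (1) can be computed directly from the table of joint observation counts:

```python
import numpy as np

def mutual_information(counts):
    """Mutual information (bits) from a table of joint observation
    counts, using the plug-in estimate of p_dist(x, y)."""
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p_dist(x)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p_dist(y)
    nz = p_xy > 0                          # 0 * log(0) = 0 by convention
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])))
```

For two binary variables, a perfectly matched table such as counts of 50 in $\left(x=0,y=0\right)$ and $\left(x=1,y=1\right)$ yields 1 bit, while a uniform table yields 0 bits.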

When assessing information theory values, significance testing is critical because the model data (and real experimental data) contain a limited number of observations and mutual information must be greater than or equal to zero [1]. This leads to a situation where even models with no interactions will produce non-zero mutual information results due to noise and/or the finite number of observations. To account for this effect, we assessed the statistical significance of each model's mutual information result using Monte Carlo techniques: we compared the mutual information produced by the original model to the distribution of mutual information values produced by many randomized null surrogate data sets. The distribution of mutual information values from randomized data estimates the likelihood of observing a given mutual information value by chance, even with no interactions present, given the number of observations and the marginal distributions.

The randomization was accomplished by randomly permuting the $Y$ variable state for all joint observations. This process preserved the number of observations (${n}_{obs}$) and the underlying marginal distributions. (In effect, the null data consisted of models with matching ${n}_{obs}$ and $a=1$.) The number of null surrogate data sets (${n}_{MC}$) was typically set to 100, 500, or 1000 (see below). The fraction of null surrogate data sets that produced mutual information values larger than or equal to the mutual information value from the original model was used as an estimate of the p-value for the mutual information result. For example, if the original model produced a mutual information value of 0.2 bits and 30 out of 1000 null surrogate data sets produced mutual information values equal to or larger than 0.2 bits, the p-value was estimated as $p=0.03$. If no null surrogate data sets produced mutual information values larger than or equal to the original model’s mutual information result, we estimated the p-value as $p=1/\left(2{n}_{MC}\right)$.
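The full significance-testing loop might look like the following sketch, assuming the data arrive as paired state arrays (function names are ours; the $1/\left(2{n}_{MC}\right)$ floor for zero exceedances follows the text):

```python
import numpy as np

def mi_bits(x, y):
    """Plug-in mutual information (bits) between two binary state arrays."""
    counts = np.zeros((2, 2))
    np.add.at(counts, (x, y), 1)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])))

def mc_p_value(x, y, n_mc=1000, seed=0):
    """Monte Carlo p-value: permute Y across all observations to build
    null surrogates (preserving n_obs and the marginals), then count how
    often the null MI reaches the observed MI."""
    rng = np.random.default_rng(seed)
    observed = mi_bits(x, y)
    hits = sum(mi_bits(x, rng.permutation(y)) >= observed for _ in range(n_mc))
    return hits / n_mc if hits > 0 else 1.0 / (2 * n_mc)
```

For strongly coupled data this returns the floor value $1/\left(2{n}_{MC}\right)$, while exactly independent data give $p=1$ because every null surrogate's mutual information is at least the observed value of zero.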

#### 2.2. Information Source Ensemble Analysis

So far, we have described a simple individual model system, methods to analyze an information theory measure from this system (in our case, mutual information), and methods to assess the statistical significance of that information theory result. Now, we will describe general methods to assess the behavior of the many information values produced by ensembles of individual information sources. These methods do not rely on the specifics of the individual model system. Rather, these methods only require that the ensemble consist of individual elements, each of which possesses an information value and a collection of null information values derived from that source, from which a p-value can be calculated. In general, assume we have a system of ${n}_{ens}$ individual information sources where each individual information value is noted as ${I}_{i}$ with $i=1,2,3,\dots ,{n}_{ens}$. Furthermore, associated with each information value are ${n}_{MC}$ null information values ${I}_{null,i,j}$ (where $j=1,2,3,\dots ,{n}_{MC}$) and a p-value ${p}_{i}$. The relevant parameters for our model are shown in Table 2. To demonstrate these methods, we generated model data for three ensembles with varying levels of noise (Figure 1).

The first question we wished to ask about an ensemble of information sources is whether the ensemble itself produced significant information results. Indeed, even in an ensemble of sources dominated by noise, some sources will randomly produce low p-values (i.e., false positives). Obviously, this is an important experimental question. For instance, one might wish to know whether a group of neurons significantly encodes a sensory stimulus or whether the group of neurons significantly shares information (i.e., forms a network).

To determine if the ensemble produces significant information results, we performed a Kolmogorov–Smirnov (KS) test between the information values from the real information sources and the null information values produced in the significance testing of the individual information sources (i.e., the set $\left\{{I}_{i}\right\}$ compared to the set $\left\{{I}_{null,i,j}\right\}$). In the three example ensembles shown in Figure 1, the low noise ensemble produced an information measure distribution very different from the null data (Figure 1A1,B1), which resulted in a low p-value from the KS test between the distribution of real information measures and the null data (Figure 1D). Conversely, an ensemble dominated by noise (Figure 1A3,B3) produced distributions of real and null information measures that were very similar and, as a result, a high p-value via the KS test (Figure 1D).
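A sketch of this ensemble-level test, assuming SciPy's two-sample KS implementation as the tooling (the text does not specify an implementation, and `ensemble_significance` is our illustrative name):

```python
import numpy as np
from scipy import stats

def ensemble_significance(real_info, null_info):
    """Two-sample KS test between the ensemble's real information values
    {I_i} and the pooled null values {I_null,i,j} from the per-source
    surrogate testing.  Returns the KS p-value."""
    stat, p = stats.ks_2samp(np.ravel(real_info), np.ravel(null_info))
    return p
```

Because the KS test compares entire distributions, it is sensitive to shifts in either direction, which is what permits the detection of suppressed as well as elevated information discussed below.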

Three important points should be made about this method. First, at the current time, this approach requires ensembles that consist of individual systems governed by the same rules (i.e., homogeneous systems). In the future, we hope to fully characterize how this method handles ensembles of heterogeneous systems (see Section 4.3), but we wish to emphasize that homogeneous systems are a requirement of this method at this time.

Second, the use of the KS test allows for the detection of information distributions that might be smaller than expected given the null distribution. For instance, if the information values are skewed to be smaller than expected by chance, or if the distribution is bimodal but has a mean value near the mean for the null data, this method will allow for the detection of information results significantly lower than null for an ensemble. In this way, this method is useful to detect the suppression of information in an ensemble.

Third, this method does not require generating large numbers of surrogate null information values because it does not seek to assess the significance of each information source. The null information value distribution will have ${n}_{ens}\ast {n}_{MC}$ values and the original data information distribution will have ${n}_{ens}$ values. Therefore, relatively few null information values per information source (perhaps as few as 10) should be sufficient to perform the KS test between the null information distribution and the distribution of the real information values.

Next, we sought a method to conveniently and compactly present the information values produced by an ensemble. The most complete presentation would be to show the full distribution of observed information results, but doing so is not ideal in many circumstances. For instance, when examining the time evolution of an information ensemble, it would be confusing to present distributions for each time point and attempt to compare distributions through time. Furthermore, it may also be difficult to present distributions along with p-value information, though a scatter plot or 2D histogram would be options for presenting both values simultaneously. Therefore, we developed a method of presenting the weighted mean, weighted standard error of the mean, and weighted standard deviation.

In order to calculate these weighted quantities, it was first necessary to create weights. We chose to use the p-value for each individual information value (see Section 2.1) in the ensemble to calculate the weight for that information value via Equation (2) (Figure 1C):

The weights were normalized by dividing each weight by the sum of all weights. This method of weighting the data allowed information values with lower p-values to exert more influence on the mean or error statements for the whole ensemble. We chose to perform this weighting to bias the mean and error statements in a manner that more closely reflects the information sources that are more likely to be significant (i.e., those with lower p-values). Furthermore, this weighting method does not require large numbers of null surrogate data sets to be generated. Given the methods used to determine whether an ensemble of information sources is significant (see above) and whether two ensembles are significantly different (see below), we did not assess family-wise error in these p-values or weights.

Next, the weighted mean (Equation (3)), weighted standard error of the mean (Equation (4)), and weighted standard deviation (Equation (5)) can be calculated and presented to convey general features of the ensemble of information sources (Figure 1D):
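Equations (3)–(5) are not reproduced above; the sketch below uses standard weighted-statistics forms under normalized weights, which should be treated as an assumption rather than the paper's exact definitions:

```python
import numpy as np

def weighted_stats(info, weights):
    """Weighted mean, weighted SEM, and weighted SD of the ensemble's
    information values.  Uses standard weighted forms with weights
    normalized to sum to 1 (an assumption; Equations (3)-(5) are not
    reproduced in the text).  For equal weights, the SEM reduces to
    SD / sqrt(n)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                     # normalize, as described in the text
    info = np.asarray(info, dtype=float)
    mean = float(np.sum(w * info))
    sd = float(np.sqrt(np.sum(w * (info - mean) ** 2)))
    sem = sd * float(np.sqrt(np.sum(w ** 2)))
    return mean, sem, sd
```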

Note that this weighting method will not correct for family-wise error and will still possess bias in information values resulting from randomly observed high information values (i.e., false-positives).

The third question we wished to ask is how to assess whether two ensembles of information sources are significantly different. This question might arise experimentally when a researcher wishes to examine the effects of a treatment. For instance, one might want to quantify whether a group of neurons in a treated subject shares more or less information than an ensemble from a control subject. Towards this goal, a Monte Carlo approach similar to the one outlined previously was utilized. We generated null surrogate data of differences in the weighted mean between ensembles of information sources by randomly permuting individual information values and their associated weights between ensembles while preserving the number of information sources in each ensemble. We estimated the p-value as the proportion of null comparisons with weighted mean differences greater than that observed in the comparison between the real ensembles, while accounting for the sign of the difference (i.e., ensemble A greater than ensemble B or vice versa). In the example systems shown in Figure 1, this ensemble comparison method produced a p-value of $p=0.038$ when comparing the two lower noise ensembles that had more similar weighted means, but lower p-values ($p<{10}^{-3}$) for comparisons with the highest noise system.
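A sketch of this ensemble-comparison permutation test (the function names and the reuse of the $1/\left(2{n}_{MC}\right)$ floor from the individual-source test are our choices, not specified in the text):

```python
import numpy as np

def wmean(info, w):
    """Weighted mean with weights normalized to sum to 1."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(w / w.sum() * np.asarray(info, dtype=float)))

def compare_ensembles(info_a, w_a, info_b, w_b, n_mc=1000, seed=0):
    """Permutation test on the difference in weighted means: (value,
    weight) pairs are shuffled between ensembles while preserving each
    ensemble's size; the p-value is the fraction of null differences at
    least as large as the observed one, in the same direction."""
    rng = np.random.default_rng(seed)
    observed = wmean(info_a, w_a) - wmean(info_b, w_b)
    pooled_i = np.concatenate([info_a, info_b])
    pooled_w = np.concatenate([w_a, w_b])
    n_a = len(info_a)
    hits = 0
    for _ in range(n_mc):
        idx = rng.permutation(len(pooled_i))
        d = wmean(pooled_i[idx[:n_a]], pooled_w[idx[:n_a]]) - \
            wmean(pooled_i[idx[n_a:]], pooled_w[idx[n_a:]])
        if np.sign(observed) * d >= abs(observed):  # same sign, at least as extreme
            hits += 1
    return hits / n_mc if hits > 0 else 1.0 / (2 * n_mc)
```

Shuffling value–weight pairs, rather than values alone, preserves the association between each information value and its significance-derived weight under the null hypothesis of exchangeable sources.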