Multiscale Information Decomposition: Exact Computation for Multivariate Gaussian Processes

Exploiting the theory of state space models, we derive the exact expressions of the information transfer, as well as redundant and synergistic transfer, for coupled Gaussian processes observed at multiple temporal scales. All of the terms, constituting the frameworks known as interaction information decomposition and partial information decomposition, can thus be analytically obtained for different time scales from the parameters of the VAR model that fits the processes. We report the application of the proposed methodology firstly to benchmark Gaussian systems, showing that this class of systems may generate patterns of information decomposition characterized by mainly redundant or synergistic information transfer persisting across multiple time scales or even by the alternating prevalence of redundant and synergistic source interaction depending on the time scale. Then, we apply our method to an important topic in neuroscience, i.e., the detection of causal interactions in human epilepsy networks, for which we show the relevance of partial information decomposition to the detection of multiscale information transfer spreading from the seizure onset zone.


Introduction
The information-theoretic treatment of groups of correlated degrees of freedom can reveal their functional roles as memory structures or information processing units. A large body of recent work has shown how the general concept of "information processing" in a network of multiple interacting dynamical systems described by multivariate stochastic processes can be dissected into basic elements of computation defined within the so-called framework of information dynamics Lizier et al. (2014). These elements essentially reflect the new information produced at each moment in time about a target system in the network Pincus (1995), the information stored in the target system Lizier et al. (2012), , the information transferred to it from the other connected systems Schreiber (2000),  and the modification of the information flowing from multiple source systems to the target Lizier et al. (2010), Wibral et al. (2015). The measures of information dynamics have gained more and more importance in both theoretical and applicative studies in several fields of science Lizier et al. (2011), Wibral et al. (2011), Hlinka et al. (2013), Barnett et al. (2013), Marinazzo et al. (2014), , Porta et al. (2015), , Wollstadt et al. (2017). While the information-theoretic approaches to the definition and quantification of new information, information storage and information transfer are well understood and widely accepted, the problem of defining, interpreting and using measures of information modification has not been fully addressed in the literature.
Information modification in a network is tightly related to the concepts of redundancy and synergy between source systems sharing information about a target system, which refer to the existence of common information about the target that can be retrieved when the sources are used separately (redundancy) or when they are used jointly (synergy) Schneidman et al. (2003). Classical multivariate entropy-based approaches refer to the interaction information decomposition (IID), which reflects information modification through the balance between redundant and synergetic interaction among different source systems influencing the target Stramaglia et al. (2012Stramaglia et al. ( , 2014Stramaglia et al. ( , 2016. The IID framework has the drawback that it implicitly considers redundancy and synergy as mutually exclusive concepts, because it quantifies information modification with a single measure of interaction information McGill (1954) (also called co-information Bell (2003)) that takes positive or negative values depending on whether the net interaction between the sources is synergistic or redundant. This limitation has been overcome by the elegant mathematical framework introduced by Williams and Beer Williams and Beer (2010), who proposed the so-called partial information decomposition (PID) as a nonnegative decomposition of the information shared between a target and a set of sources into terms quantifying separately unique, redundant and synergistic contributions. However, the PID framework has the drawback that the terms composing the PID cannot be obtained unequivocally from classic measures of information theory (i.e., entropy and mutual information), but a new definition of either redundant, synergistic or unique information needs to be provided to implement the decomposition. Accordingly, much effort has focused on finding the most proper measures to define the components of the PID, with alternative proposals defining new measures of redundancy Williams and Beer (2010), Harder et al. (2013), synergy Griffith et al. (2014), Quax et al. (2017) or unique information Bertschinger et al. (2014). The proliferation of different definitions is mainly due to the fact that there is no full consensus on which axioms should be stated to impose desirable properties for the PID measures. An additional problem which so far has seriously limited the practical implementation of these concepts is the difficulty in providing reliable estimates of the information measures appearing in the IID and PID decompositions. The naive estimation of probabilities by histogram-based methods followed by the use of plug-in estimators leads to serious bias problems Panzeri et al. (2007), Faes and Porta (2014). While the use of binless density estimators Kozachenko and Leonenko (1987) and the adoption of schemes for dimensionality reduction Vlachos and Kugiumtzis (2010), Marinazzo et al. (2012) have been shown to improve the reliability of estimates of information storage and transfer , the effectiveness of these approaches for the computation of measures of information modification has not been demonstrated yet. Interestingly, both the problems of defining appropriate PID measures and of reliably estimating these measures from data are much alleviated if one assumes that the observed variables have a joint Gaussian distribution. Indeed, in such a case, recent studies have proven the equivalence between most of the proposed redundancy measures to be used in the PID Barrett (2015) and have provided closed form solutions to the issue of computing any measure of information dynamics from the parameters of the vector autoregressive (VAR) model that characterizes an observed multivariate Gaussian process , Barrett et al. (2010), .
The second fundamental question that is addressed in this study is relevant to the computation of information dynamics for stochastic processes displaying multiscale dynamical structures. It is indeed well known that many complex physical and biological systems exhibit peculiar oscillatory activities, which are deployed across multiple temporal scales Ivanov et al. (1999), Chou (2011), Wang et al. (2013). The most common way to investigate such activities is to resample at different scales, typically through low pass filtering and downsampling Costa et al. (2002), Valencia et al. (2009), the originally measured realization of an observed process, so as to yield a set of rescaled time series, which are then analyzed employing different dynamical measures. This approach is well established and widely used for the multiscale entropy analysis of individual time series measured from scalar stochastic processes. However, its extension to the investigation of the multiscale structure of the information transfer among coupled processes is complicated by theoretical and practical issues Barnett and Seth (2015), Solo (2016). Theoretically, the procedure of rescaling alters the causal interactions between lagged components of the processes in a way that is not fully understood and, if not properly performed, may alter the temporal relations between processes and thus induce spurious detection of information transfer. In practical analysis, filtering and downsampling are known to degrade severely the estimation of information dynamics and to impact consistently the detectability, accuracy and data demand Florin et al. (2010), Barnett and Seth (2017).
In recent works, we have started tackling the above problems within the framework of linear VAR modeling of multivariate Gaussian processes, with the focus on the multiscale computation of information storage and information transfer Faes et al. (2016. In this study, we aim at extending these recent theoretical advances to the multiscale analysis of information modification in multivariate Gaussian systems performed through the IID and PID decomposition frameworks. To this end, we exploit the theory of state space (SS) models Aoki and Havenner (1991) and build on recent theoretical results Barnett and Seth (2015), Solo (2016) to show that exact values of interaction transfer, as well as redundant and synergistic transfer can be obtained for coupled Gaussian processes observed at different time scales starting from the parameters of the VAR model that fits the processes and from the scale factor. The theoretical derivations are first used in examples of benchmark Gaussian systems, reporting that these systems may generate patterns of information decomposition characterized by prevalently redundant or synergistic information transfer persisting across multiple time scales or even by alternating the prevalence of redundant and synergistic source interaction depending on the time scale. The high computational reliability of the SS approach is then exploited in the analysis of real data by the application to a topic of great interest in neuroscience, i.e., the detection of information transfer in epilepsy networks.
The proposed framework is implemented in the msID MATLAB R toolbox, which is uploaded as Supplementary Material to this article and is freely available for download from www.lucafaes. net/msID.html and https://github.com/danielemarinazzo/multiscale_PID

Information Transfer Decomposition in Multivariate Processes
Let us consider a discrete-time, stationary vector stochastic process composed of M real-valued zero-mean scalar processes, Y n = [Y 1,n · · · Y M,n ] T , −∞ < n < ∞. In an information-theoretic framework, the information transfer between scalar sub-processes is quantified by the well-known transfer entropy (TE), which is a popular measure of the "information transfer" directed towards an assigned target process from one or more source processes. Specifically, the TE quantifies the amount of information that the past of the source provides about the present of the target over and above the information already provided by the past of the target itself Schreiber (2000). Taking Y j as target and Y i as source, the TE is defined as: represent the past of the source and target processes and I(·; ·|·) denotes conditional mutual information (MI). In the presence of two sources Y i and Y k and a target Y j , the information transferred toward Y j from the sources Y i and Y k taken together is quantified by the joint TE: Under the premise that the information jointly transferred to the target by the two sources is different than the sum of the amounts of information transferred individually, in the following, we present two possible strategies to decompose the joint TE into amounts eliciting the individual TEs, as well as redundant and/or synergistic TE terms.

Interaction Information Decomposition
The first strategy, which we denote as interaction information decomposition (IID), decomposes the joint TE (2) as: where I ik→j is denoted as interaction transfer entropy (ITE) because it is equivalent to the interaction information McGill (1954) computed between the present of the target and the past of the two sources, conditioned to the past of the target: The interaction TE quantifies the modification of the information transferred from the source processes Y i and Y k to the target Y j , being positive when Y i and Y k cooperate in a synergistic way and negative when they act redundantly. This interpretation is evident from the diagrams of Figure 1: in the case of synergy (Figure 1a), the two sources Y i and Y k taken together contribute to the target Y j with more information than the sum of their individual contributions (T ik→j > T i→j + T k→j ), and the ITE is positive; in the case of redundancy (Figure 1b), the sum of the information amounts transferred individually from each source to the target is higher than the joint information transfer (T i→j + T k→j > T ik→j ), so that the ITE is negative. Figure 1: Venn diagram representations of the interaction information decomposition (IID) (a,b)) and the partial information decomposition (PID) (c)). The IID is depicted in a way such that all areas in the diagrams are positive: the interaction information transfer I ik→j is positive in (a), denoting net synergy, and is negative in (b), denoting net redundancy.

Partial Information Decomposition
An alternative expansion of the joint TE is that provided by the so-called partial information decomposition (PID) Williams and Beer (2010). The PID evidences four distinct quantities measuring the unique information transferred from each individual source to the target, measured by the unique TEs U i→j and U k→j , and the redundant and synergistic information transferred from the two sources to the target, measured by the redundant TE R ik→j and the synergistic TE S ik→j . These four measures are related to each other and to the joint and individual TEs by the following equations (see also Figure 1c): In the PID defined above, the terms U i→j and U k→j quantify the parts of the information transferred to the target process Y j , which are unique to the source processes Y i and Y k , respectively, thus reflecting contributions to the predictability of the target that can be obtained from one of the sources alone, but not from the other source alone. Each of these unique contributions sums up with the redundant transfer R ik→j to yield the information transfer from one source to the target as is known from the classic Shannon information theory. Then, the term S ik→j refers to the synergy between the two sources while they transfer information to the target, intended as the information that is uniquely obtained taking the two sources Y i and Y k together, but not considering them alone. Compared to the IID defined in (3), the PID (5) has the advantage that it provides distinct non-negative measures of redundancy and synergy, thereby accounting for the possibility that redundancy and synergy may coexist as separate elements of information modification. Interestingly, the IID and PID defined in Equations (3) and (5) are related to each other in a way such that: thus showing that the interaction TE is actually a measure of the 'net' synergy manifested in the transfer of information from the two sources to the target. An issue with the PID (5) is that its constituent measures cannot be obtained through classic information theory simply subtracting conditional MI terms as done for the IID; an additional ingredient to the theory is needed to get a fourth defining equation to be added to (5) for providing an unambiguous definition of U i→j , U k→j , R ik→j and S ik→j . While several PID definitions have been proposed arising from different conceptual definitions of redundancy and synergy Harder et al. (2013), Griffith et al. (2014), Bertschinger et al. (2014), here, we make reference to the socalled minimum MI (MMI) PID Barrett (2015). According to the MMI PID, redundancy is defined as the minimum of the information provided by each individual source to the target. In terms of information transfer measured by the TE, this leads to the following definition of the redundant TE: This choice satisfies the desirable property that the redundant TE is independent of the correlation between the source processes. Moreover, it has been shown that, if the observed processes have a joint Gaussian distribution, all previously-proposed PID formulations reduce to the MMI PID Barrett (2015).

Multiscale Representation of Multivariate Gaussian Processes
In the linear signal processing framework, the M -dimensional vector stochastic process Y n = [Y 1,n · · · Y M,n ] T is classically described using a vector autoregressive (VAR) model of order p: where A k are M × M matrices of coefficients and U n = [U 1,n · · · U M,n ] T is a vector of M zero mean Gaussian processes with covariance matrix Σ ≡ E[U n U T n ] (E is the expectation operator). To study the observed process Y at the temporal scale identified by the scale factor τ , we apply the following transformation to each constituent process Y m , m = 1, . . . , M : This rescaling operation corresponds to transforming the original process Y through a two-step procedure that consists of the following filtering and downsampling steps, yielding respectively the processesỸ andȲ :Ỹ The change of scale in (9) generalizes the averaging procedure originally proposed in Costa et al. (2002), which sets q = τ − 1 and b l = 1/τ and, thus, realizes the step of filtering through the simple procedure of averaging τ subsequent samples. To improve the elimination of the fast temporal scales, in this study, we follow the idea of Valencia et al. (2009), in which a more appropriate low pass filter than averaging is employed. Here, we identify the b l as the coefficients of a linear finite impulse response (FIR) low pass filter of order q; the FIR filter is designed using the classic window method with the Hamming window Oppenheim and Schafer (1975), setting the cutoff frequency at f τ = 1/2τ in order to avoid aliasing in the subsequent downsampling step. Substituting (8) in (10a), the filtering step leads to the process representation: Hence, the change of scale introduces a moving average (MA) component of order q in the original VAR(p) process, transforming it into a VARMA(p, q) process. As we will show in the next section, the downsampling step (10b) keeps the VARMA representation, altering the model parameters.

Formulation of State Space Models
State space models are models that make use of state variables to describe a system by a set of firstorder difference equations, rather than by one or more high-order difference equations Hannan and Deistler (2012), Aoki (2013). The general linear state space (SS) model describing an observed vector process Y has the form: where the state Equation (12a) describes the update of the L-dimensional state (unobserved) process through the L × L matrix A, and the observation Equation (12b) describes the instantaneous mapping from the state to the observed process through the M × L matrix C. W n and V n are zero-mean white noise processes with covariances Thus, the parameters of the SS model (12) are (A, C, Q, R, S). (2013). This new SS representation, usually referred to as the "innovations form" SS model (ISS), is characterized by the state process Z n = E[X n |Y − n ] and by the L × M Kalman gain matrix K:

Another possible SS representation is that evidencing the innovations
The parameters of the ISS model (13) Note that the ISS (13) is a special case of (12) in which W n = KE n and V n = E n , so that Q = KVK T , R = V and S = KV.
Given an SS model in the form (12), the corresponding ISS model (13) can be identified by solving a so-called discrete algebraic Riccati equation (DARE) formulated in terms of the state error variance matrix P Solo (2016): Under some assumptions Solo (2016), the DARE (14) has a unique stabilizing solution, from which the Kalman gain and innovation covariance can be computed as: thus completing the transformation from the SS form to the ISS form.

State Space Models of Filtered and Downsampled Linear Processes
Exploiting the close relation between VARMA models and SS models, first we show how to convert the VARMA model (11) into an ISS model in the form of (13) that describes the filtered process Ỹ n . To do this, we exploit Aoki's method Aoki and Havenner (1991) defining the state process Z n = [Y T n−1 · · · Y T n−p U T n−1 · · · U T n−q ] T that, together withỸ n , obeys the state Equation (13) with parameters (Ã,C,K,Ṽ), where: whereṼ is the covariance of the innovationsẼ n = B 0 U n . Now, we turn to show how the downsampled processȲ n can be represented through an ISS model directly from the ISS formulation of the filtered processỸ n . To this end, we exploit recent theoretical findings providing the state space form of downsampled signals (Theorem III in Solo (2016)). Accordingly, the SS representation of the process downsampled at scale τ ,Ȳ n =Ỹ nτ has parameters (Ā,C,Q,R,S), whereĀ =Ã τ ,C =C,Q = Q τ ,R =Ṽ andS = S τ , with Q τ and S τ given by: Therefore, the downsampled process has an ISS representation with state processZ n =Z nτ , innovation processĒ n =Ẽ nτ and parameters (Ā,C,K,V), whereK andV are obtained solving the DARE (14) and (15) for the SS model with parameters (Ā,C,Q,R,S).
To sum up, the relations and parametric representations of the original process Y , the filtered processỸ and the downsampled processȲ are depicted in Figure 2a. The step of low pass filtering (FLT) applied to a VAR(p) process yields a VARMA(p, q) process (where q is the filter order, and the cutoff frequency is f τ = 1/2τ ); this process is equivalent to an ISS process Aoki and Havenner (1991). The subsequent downsampling (DWS) yields a different SS process, which in turn can be converted to the ISS form solving the DARE. Thus, both the filtered processỸ n and the downsampled processȲ n can be represented as ISS processes with parameters (Ã,C,K,Ṽ) and (Ā,C,K,V) which can be derived analytically from the knowledge of the parameters of the original process (A 1 , . . . , A p , Σ) and of the filter (q, f τ ). In the next section, we show how to compute analytically any measure appearing in the information decomposition of a jointly Gaussian multivariate stochastic process starting from its associated ISS model parameters, thus opening the way to the analytical computation of these measures for multiscale (filtered and downsampled) processes. Figure 2: Schematic representation of a linear VAR process and of its multiscale representation obtained through filtering (FLT) and downsampling (DWS) steps. The downsampled process has an innovations form state space model (ISS) representation from which submodels can be formed to compute the partial variances needed for the computation of information measures appearing in the IID and PID decompositions. This makes it possible to perform multiscale information decomposition analytically from the original VAR parameters and from the scale factor.

Multiscale IID and PID
After introducing the general theory of information decomposition and deriving the multiscale representation of the parameters of a linear VAR model, in this section, we provide expressions for the terms of the IID and PID decompositions of the information transfer valid for multivariate jointly Gaussian processes. The derivations are based on the knowledge that the linear parametric representation of Gaussian processes given in (8) captures all of the entropy differences that define the various information measures Barrett et al. (2010) and that these entropy differences are related to the partial variances of the present of the target given its past and the past of one or more sources, intended as variances of the prediction errors resulting from linear regression . Specifically, let us denote as E j|j,n = Y j,n − E[Y j,n |Y − j,n ], E j|ij,n = Y j,n − E[Y j,n |Y − i,n , Y − j,n ] the prediction error of a linear regression of Y j,n performed respectively on Y − j,n and (Y − j,n , Y − i,n ) and as λ j|j = E[E 2 j|j,n ], λ j|ij = E[E 2 j|ij,n ], the corresponding prediction error variances. Then, the TE from Y i to Y j can be expressed as: In a similar way, the joint TE from (Y i , Y k ) to Y j can be defined as: where λ j|ijk = E[E 2 j|ijk,n ] is the variance of the prediction error of a linear regression of Based on these derivations, one can easily complete the IID decomposition of TE by computing T k→j as in (17) and deriving the interaction TE from (3) and the PID decomposition, as well by deriving the redundant TE from (7), the synergistic TE from (6) and the unique TEs from (5).
Next, we show how to compute any partial variance from the parameters of an ISS model in the form of (13) Barnett and Seth (2015), Solo (2016). The partial variance λ j|a , where the subscript a denotes any combination of indexes ∈ {1, . . . , M }, can be derived from the ISS representation of the innovations of a submodel obtained removing the variables not indexed by a from the observation equation. Specifically, we need to consider the submodel with state Equation (13b) and observation equation: where the superscript (a) denotes the selection of the rows with indices a of a vector or a matrix. It is important to note that the submodels (13a) and (19) are not in innovations form, but are rather an SS model with parameters (A, C (a) , KVK T , V(a, a), KV (:, a)). This SS model can be converted to an ISS model with innovation covariance V (a) solving the DARE (14) and (15), so that the partial variance λ j|a is derived as the diagonal element of V (a) corresponding to the position of the target Y j . Thus, with this procedure, it is possible to compute the partial variances needed for the computation of the information measures starting from a set of ISS model parameters; since any VAR process can be represented at scale τ as an ISS process, the procedure allows computing the IID and PID information decompositions for the rescaled multivariate process (see Figure 2). It is worth remarking that, while the general formulation of IID and PID decompositions introduced in Section 2 holds for arbitrary processes, the multiscale extension detailed in Section 3 is exact only if the processes have a joint Gaussian distribution. In such a case, the linear VAR representation captures exhaustively the joint variability of the processes, and any nonlinear extension has no additional utility (a formal proof of the fact that a stationary Gaussian VAR process must be linear can be found in Barrett et al. (2010)). If, on the contrary, non-Gaussian processes are under scrutiny, the linear representation provided in Section 3.1 can still be adopted, but may miss important properties in the dynamics and thus provide only a partial description. Moreover, since the close correspondence between conditional entropies and partial variances reported in this subsection does not hold anymore for non-Gaussian processes, all of the obtained measures should be regarded as indexes of (linear) predictability rather than as information measures.

Simulation Experiment
To study the multiscale patterns of information transfer in a controlled setting with known dynamical interactions between time series, we consider a simulation scheme similar to some already used for the assessment of theoretical values of information dynamics . Specifically, we analyze the following VAR process of order M = 4: where U n = [U 1,n · · · U 4,n ] T is a vector of zero mean white Gaussian noises with unit variance and uncorrelated with each other (Σ= I). The parameter design in Equation (20) is chosen to allow autonomous oscillations in the processes Y i , i = 1, . . . , 3, obtained placing complex-conjugate poles with modulus ρ i and frequency f i in the complex plane representation of the transfer function of the vector process, as well as causal interactions between the processes at a fixed time lag of one sample and with strength modulated by the parameters b and c (see Figure 3). In this study, we set the coefficients related to self-dependencies to values generating well-defined oscillations in all processes (ρ 1 = ρ 2 = ρ 3 = 0.95) and letting Y 1 fluctuate at slower time scales than Y 2 and Y 3 (f 1 = 0.1, f 2 = f 3 = 0.025). We consider four configurations of the parameters, chosen to reproduce paradigmatic conditions of interaction between the processes: (a) isolation of Y 1 and Y 2 and unidirectional coupling Y 3 → Y 4 , obtained setting b = c = 0; (b) common driver effects Y 2 ← Y 1 → Y 3 and unidirectional coupling Y 3 → Y 4 , obtained setting b = 0 and c = 1; (c) isolation of Y 1 and unidirectional couplings Y 2 → Y 4 and Y 3 → Y 4 , obtained setting b = 0.5 and c = 0; (d) common driver effects Y 2 ← Y 1 → Y 3 and unidirectional couplings Y 2 → Y 4 and Y 3 → Y 4 , obtained setting b = 0.5 and c = 1.  (20) that we use to explore the multiscale decomposition of the information transferred to Y 4 , selected as the target process, from Y 2 and Y 3 , selected as the source processes, in the presence of Y 1 , acting as the exogenous process. To favor such exploration, we set oscillations at different time scales for Y 1 (f 1 = 0.1) and for Y 2 and Y 3 (f 2 = f 3 = 0.025), induce common driver effects from the exogenous process to the sources modulated by the parameter c and allow for varying strengths of the causal interactions from the sources to the target as modulated by the parameter b. The four configurations explored in this study are depicted in (a-d).
With this simulation setting, we compute all measures appearing in the IID and PID decompositions of the information transfer, considering Y 4 as the target process and Y 2 and Y 3 as the source processes. The theoretical values of these measures, computed as a function of the time scale using the IID and the PID, are reported in Figure 4. In the simple case of unidirectional coupling Y 3 → Y 4 (b = c = 0, Figure 4a), the joint information transferred from (Y 2 , Y 3 ) to Y 4 is exclusively due to the source Y 3 without contributions from Y 2 and without interaction effects between the sources When the causal interactions towards Y 4 are still due exclusively to Y 3 , but the two sources Y 2 and Y 3 share information arriving from Y 1 (b = 0, c = 1; Figure 4b), the IID evidences that the joint information transfer coincides again with the transfer from Y 3 (T 23→4 = T 3→4 ), but a non-trivial amount of information transferred from Y 2 to Y 4 emerges, which is fully redundant (T 2→4 = −I 23→4 ). The PID highlights that the information from Y 3 to Y 4 is not all unique, but is in part transferred redundantly with Y 2 , while the unique transfer from Y 2 and the synergistic transfer are negligible.
In the case of two isolated sources equally contributing to the target (b = 0.5, c = 0, Figure  4c), the IID evidences the presence of net synergy and of identical amounts of information transferred to Y 4 from Y 2 or Y 3 (I 23→4 > 0, T 2→4 = T 3→4 ). The PID documents that there are no unique contributions, so that the two amounts of information transfer from each source to the target coincide with the redundant transfer, and the remaining part of the joint transfer is synergistic Finally, when the two sources share common information and contribute equally to the target (b = 0.5, c = 1; Figure 4d), we find that they send the same amount of information as before, but in this case, no unique information is sent by any of the sources (T 2→4 = T 3→4 , U 2→4 = U 3→4 = 0). Moreover, the nature of the interaction between the sources is not trivial and is scale dependent: at low time scales, where the dynamics are likely dominated by the fast oscillations of Y 1 , the IID reveals net redundancy, and the PID shows that the redundant transfer prevails over the synergistic (I 23→4 < 0, R 23→4 > S 23→4 ); at higher time scales, where fast dynamics are filtered out and the slow dynamics of Y 2 and Y 3 prevail, the IID reveals net synergy, and the PID shows that the synergistic transfer prevails over the redundant (I 23→4 > 0, S 23→4 > R 23→4 ).  Equation (20). Plots depict the exact values of the entropy measures forming the interaction information decomposition (IID, upper row) and the partial information decomposition (PID, lower row) of the information transferred from the source processes Y 2 and Y 3 to the target process Y 4 generated according to the scheme of Figure 3 with four different configurations of the parameters. We find that linear processes may generate trivial information patterns with the absence of synergistic or redundant behaviors (a), patterns with the prevalence of redundant information transfer (b) or synergistic information transfer (c) that persist across multiple time scales, or even complex patterns with the alternating prevalence of redundant transfer and synergistic transfer at different time scales (d).

Application
As a real data application, we analyze intracranial EEG recordings from a patient with drugresistant epilepsy measured by an implanted array of 8 × 8 cortical electrodes and two left hip-pocampal depth electrodes with six contacts each. The data are available in epi, and further details on the dataset are given in Kramer et al. (2008). Data were sampled at 400 Hz and correspond to 10-s segments recorded in the pre-ictal period, just before the seizure onset, and 10 s during the ictal stage of the seizure, for a total of eight seizures. Defining and locating the seizure onset zone, i.e., the specific location in the brain where the synchronous activity of neighboring groups of cells becomes so strong so as to be able to spread its own activity to other distant regions, is an important issue in the study of epilepsy in humans. Here, we focus on the information flow from the sub-cortical regions, probed by depth electrodes, to the brain cortex. In Stramaglia et al. (2014), it has been suggested that Contacts 11 and 12, in the second depth electrode, are mostly influencing the cortical activity; accordingly, in this work, we consider Channels 11 and 12 as a pair of source variables for all of the cortical electrodes and decompose the information flowing from them using the multiscale IID and PID here proposed, both in the pre-ictal stage and in the ictal stage. An FIR filter with q = 12 coefficients is used, and the order p of the VAR model is fixed according to the Bayesian information criterion. In the analyzed dataset, the model order assessed in the pre-ictal phase was p = 14.61 ± 1.07 (mean ± std. dev.across 64 electrodes and eight seizures) and during the ictal phase decreased significantly to p = 11.09 ± 3.95.
In Figure 5, we depict the terms of the IID applied from the two sources (Channels {11, 12}) to any of the electrodes as a function of the scale τ , averaged over the eight seizures. We observe a relevant enhancement of the joint TE during the seizure, w.r.t. the pre-ictal period. This enhancement is determined by a marked increase of both the individual TEs from Channels 11 and 12 to all of the cortical electrodes; the patterns of the two TEs are similar to each other in both stages. The pattern of interaction information transfer displays prevalent redundant transfer for low values of τ and prevalent synergistic transfer for high τ , but the values of the interaction TE have relatively low magnitude and are only slightly different in pre-ictal and ictal conditions. It is worth stressing that at scale τ , the algorithm analyzes oscillations, in the time series, slower than 1 2τ fs s, where f s = 400 Hz.
Figure 5: Interaction information decomposition (IID) of the intracranial EEG information flow from subcortical to cortical regions in an epileptic patient. The joint transfer entropy from depth Channels 11 and 12 to cortical electrodes (a); the transfer entropy from depth Channel 11 to cortical electrodes (b); the transfer entropy from depth Channel 12 to cortical electrodes (c) and the interaction transfer entropy from depth Channels 11 and 12 to cortical electrodes (d) are depicted as a function of the scale τ , after averaging over the eight pre-ictal segments (left column) and over the eight ictal segments (right column). Compared with pre-ictal periods, during the seizure, the IID evidences marked increases of the joint and individual information transfer from depth to cortical electrodes and low and almost unvaried levels of interaction transfer.
In Figure 6, we depict, on the other hand, the terms of the PID computed for the same data. This decomposition shows that the increased joint TE across the seizure transition seen in Figure 5a is in large part the result of an increase of both the synergistic and the redundant TE, which are markedly higher during the ictal stage compared with the pre-ictal. This explains why the interaction TE of Figure 5d, which is the difference between two quantities that both increase, is nearly constant moving from the pre-ictal to the ictal stage. The quantity that, instead, clearly differentiates between Channels 11 and 12 is the unique information transfer: indeed, only the unique TE from Channel 12 increases in the ictal stage, while the unique TE from Channel 13 remains at low levels. Figure 6: Partial information decomposition (PID) of the intracranial EEG information flow from subcortical to cortical regions in an epileptic patient. The synergistic transfer entropy from depth Channels 11 and 12 to cortical electrodes (a); the redundant transfer entropy from depth Channels 11 and 12 to cortical electrodes (b); the unique transfer entropy from depth Channel 11 to cortical electrodes (c) and the unique transfer entropy from depth Channel 12 to cortical electrodes (d) are depicted as a function of the scale τ , after averaging over the eight pre-ictal segments (left column) and over the eight ictal segments (right column). Compared with pre-ictal periods, during the seizure, the PID evidences marked increases of the information transferred synergistically and redundantly from depth to cortical electrodes and of the information transferred uniquely from one of the two depth electrodes, but not from the other.
In order to investigate the variability across trials of the estimates of the various information measures, in Figure 7, we depict the terms of both IID and PID expressed for each ictal episode as average values over all 64 cortical electrodes. The analysis shows that the higher average values observed in Figures 5 and 6 at Scales 1-4 during the ictal state for the joint TE, the two individual TEs, the redundant and synergistic TEs and the unique TE from depth Channel 12 are the result of an increase of the measures for almost all of the observed seizure episodes.
These findings are largely in agreement with the increasing awareness that epilepsy is a network phenomenon that involves aberrant functional connections across vast parts of the brain on virtually all spatial scales Richardson (2012), Dickten et al. (2016). Indeed, our results document that the occurrence of seizures is associated with a relevant increase of the information flowing from the subcortical regions (associated with the depth electrode) to the cortex and that the character of this information flow is mostly redundant both in the pre-ictal and in the ictal state. Here, the need for a multiscale approach is testified by the fact that several quantities in the ictal state (e.g., the joint TE, the synergistic ITand the unique ITfrom Channel 12) attain their maximum at scale τ > 1.
Moreover, the approaches that we propose for information decomposition appear useful to improve the localization of epileptogenic areas in patients with drug-resistant epilepsy. Indeed, our analysis suggests that Contact 12 is the closest to the seizure onset zone, and it is driving the cortical oscillations during the ictal stage, as it sends unique information to the cortex. On the other hand, to disentangle this effect, it has been necessary to include also Channel 11 in the analysis and to make the PID of the total information from the pair of depth channels to the cortex; indeed, the redundancy between Channels 11 and 12 confounds the informational pattern unless the PID is performed. Figure 7: Multiscale representation of the measures of interaction information decomposition (IID, top) and partial information decomposition (PID, bottom) computed as a function of the time scale for each of the eight seizures during the pre-ictal period (black) and the ictal period (red). Values of joint transfer entropy (TE), individual TE, interaction TE, redundant TE, synergistic TE and unique TE are obtained taking the depth Channels 11 and 12 as sources and averaging over all 64 target cortical electrodes. Increases during seizure of the joint TE, individual TEs from both depth electrodes, redundant and synergistic TE and unique TE from the depth electrode 12 are evident at low time scales for almost all considered episodes.

Conclusions
Understanding how multiple inputs may combine to create the output of a given target is a fundamental challenge in many fields, in particular in neuroscience. Shannon's information theory is the most suitable frame to cope with this problem and thus to assess the informational character of multiplets of variables describing complex systems; IID indeed measures the balance between redundant and synergetic interaction within the classical multivariate entropy-based approach. Recently Shannon's information theory has been extended, in the PID, so as to provide specific measures for the information that several variables convey individually (unique information), redundantly (shared information) or only jointly (synergistic information) about the output.
The contribution of the present work is the proposal of an analytical frame where both IID and PID can be exactly evaluated in a multiscale fashion, for multivariate Gaussian processes, on the basis of simple vector autoregressive identification. In doing this, our work opens the way for both the theoretical analysis and the practical implementation of information modification in processes that exhibit multiscale dynamical structures. The effectiveness of the proposed approach has been demonstrated both on simulated examples and on real publicly-available intracranial EEG data. Our results provide a firm ground to the multiscale evaluation of PID, to be applied in all applications where causal influences coexist at multiple temporal scales.
Future developments of this work include the refinement of the SS model structure to accommodate the description of long-range linear correlations Sela and Hurvich (2009) or its expansion to the description of nonstationary processes Kitagawa (1987) and the formalization of exact crossscale computation of information decomposition within and between multivariate processes Paluš (2014). A major challenge in the field remains the generalization of this type of analysis to non-Gaussian processes, for which exact analytical solutions or computationally-reliable estimation approaches are still lacking. This constitutes a main direction for further research, because realworld processes display very often non-Gaussian distributions, which would make an extension to nonlinear models or model-free approaches beneficial. The questions that are still open in this respect include the evaluation of proper theoretical definitions of synergy or redundancy for nonlinear processes Williams and Beer (2010) Papana et al. (2011) and the assessment of the extent to which non-linear model-free methods really outperform the linear model-based approach adopted here and in previous investigations .