Efficient Algorithms for Searching the Minimum Information Partition in Integrated Information Theory

The ability to integrate information in the brain is considered to be an essential property for cognition and consciousness. Integrated Information Theory (IIT) hypothesizes that the amount of integrated information (Φ) in the brain is related to the level of consciousness. IIT proposes that, to quantify information integration in a system as a whole, integrated information should be measured across the partition of the system at which information loss caused by partitioning is minimized, called the Minimum Information Partition (MIP). The computational cost for exhaustively searching for the MIP grows exponentially with system size, making it difficult to apply IIT to real neural data. It has been previously shown that, if a measure of Φ satisfies a mathematical property, submodularity, the MIP can be found in a polynomial order by an optimization algorithm. However, although the first version of Φ is submodular, the later versions are not. In this study, we empirically explore to what extent the algorithm can be applied to the non-submodular measures of Φ by evaluating the accuracy of the algorithm in simulated data and real neural data. We find that the algorithm identifies the MIP in a nearly perfect manner even for the non-submodular measures. Our results show that the algorithm allows us to measure Φ in large systems within a practical amount of time.


Introduction
The brain receives various information from the external world.Integrating this information is an essential property for cognition and consciousness.In fact, phenomenologically, our consciousness is unified.For example, when we see an object, we cannot experience only its shape independently of its color.Or, we cannot experience only the left half of the visual field independently of the right half.Integrated Information Theory of consciousness (IIT) considers that the unification of consciousness should be realized by the ability of the brain to integrate information [1][2][3].That is, the brain has internal mechanisms to integrate information about the shape and color of an object or information of the right and left visual field, and therefore our visual experiences are unified.IIT proposes to quantify the degree of information integration by an information theoretic measure "integrated information" and hypothesizes that integrated information is related to the level of consciousness.Although the hypothesis is indirectly supported by experiments which showed the breakdown of effective connectivity in the brain during loss of consciousness [4,5], only a few studies have directly quantified integrated information in real neural data [6][7][8][9] because of the computational difficulties described below.
Conceptually, integrated information quantifies the degree of interaction between parts or equivalently, the amount of information loss caused by splitting a system into parts [10,11].IIT proposes that integrated information should be quantified at the weakest links of a system so that it quantifies information integration in a system as a whole.For example, if a system consists of two independent subsystems, the weakest links of the system are the ones connecting the two independent subsystems.In the case of two independent subsystems, integrated information is 0 because there is no information loss when the system is partitioned into the two independent subsystems.Such a critical partition of the system is called the Minimum Information Partition (MIP), where information is minimally lost, or equivalently where integrated information is minimized.In general, searching for the MIP requires an exponentially large amount of computational time because the number of partitions exponentially grows with the arithmetic growth of system size N.This computational difficulty hinders the application of IIT to experimental data, despite its potential importance in consciousness research and even in broader fields of neuroscience.
In the present study, we exploit a mathematical concept called submodularity to resolve the combinatorial explosion of finding the MIP.Submodularity is an important concept in set functions which is analogous to convexity in continuous functions.It is known that an exponentially large computational cost for minimizing an objective function is reduced to the polynomial order if the objective function satisfies submodularity.Previously, Hidaka and Oizumi showed that the computational cost for finding the MIP is reduced to O(N 3 ) [12] by utilizing Queyranne's submodular optimization algorithm [13].They used mutual information as a measure of integrated information that satisfies submodularity.The mutual information was used in the first version of IIT [1].However, the measures of integrated information proposed in the later versions of IIT are not submodular [2,3].Also, several practical measures of integrated information inspired by the original measures [11,[14][15][16] are not submodular.
In this paper, we aim to extend the applicability of submodular optimization to non-submodular measures of integrated information.We specifically consider the three measures of integrated information; mutual information Φ MI [1], stochastic interaction Φ SI [14,17,18], and geometric integrated information Φ G [11].Mutual information is strictly submodular but the others are not.Oizumi et al. previously showed a close relationship among these three measures [11,19].From this relationship, we speculate that Queyranne's algorithm might work well for the non-submodular measures.Here, we empirically explore to what extent Queyranne's algorithm can be applied to the two non-submodular measures of integrated information by evaluating the accuracy of the algorithm in simulated data and real neural data.We find that Queyranne's algorithm identifies the MIP in a nearly perfect manner even for the non-submodular measures.Our results show that Queyranne's algorithm can be utilized even for non-submodular measures of integrated information and makes it possible to practically compute integrated information across the MIP in real neural data, such as multi-unit recordings used in EEG and ECoG, which typically consist of around 100 channels.
This paper is organized as follows.We first explain that the three measures of integrated information, Φ MI , Φ SI , Φ G , are closely related from a unified theoretical framework [11,19] and there is an order relation among the three measures; Φ MI ≥ Φ SI ≥ Φ G .Next, we compare the MIP found by Queyranne's algorithm with the correct MIP found by exhaustive search in randomly generated small networks (N = 20).We also evaluate the performance of Queyranne's algorithm in larger networks (N = 50).Since the exhaustive search is intractable, we compare Queyranne's algorithm with a different optimization algorithm called the replica exchange Markov Chain Monte Carlo (REMCMC) method [20][21][22][23].Finally, we evaluate the performance of Queyranne's algorithm in ECoG data recorded in monkeys and investigate the applicability of the algorithm in real neural data.Let us consider a stochastic dynamical system consisting of N elements.We represent the past and present states of the system as X = {x 1 , . . ., x N } and X ′ = {x ′ 1 , . . ., x ′ N }, respectively.Conceptually, integrated information is designed to quantify the degree of spatio-temporal interactions between subsystems.The previously proposed measures of integrated information are generally expressed as the Kullback-Leibler divergence between the actual probability distribution p (X, X ′ ) and a "disconnected" probability distribution q (X, X ′ ) where interactions between subsystems are removed [11].

Measures of integrated information
The Kullback-Leibler divergence measures the difference between the probability distributions, and can be interpreted as the information loss when q (X, X ′ ) is used to approximate p (X, X ′ ) [24].Thus, integrated information is interpreted as information loss caused by removing interactions.In Eq. ( 2), the minimum over q should be taken to find the best approximation of p [11].
There are many ways of removing interactions between units, which lead to different disconnected probability distributions q, and also different measures of integrated information (Fig. 1).Below, we will show that three different measures of integrated information are derived from different probability distributions q.

Multi (Mutual) information Φ MI
First, consider the following partitioned probability distribution q, q X, X where M i and M ′ i are the past and present states of the i-th subsystem, respectively.In this model, all of the interactions between the subsystems are removed, i.e., the subsystems are totally independent (Fig. 1 (a)).In this case, the corresponding measure of integrated information is given by where H(•, •) represents the joint entropy.This measure is called total correlation [25] or multi information [26].As a special case when the number of subsystems is two, this measure is simply equivalent to the mutual information between the two subsystems, The measure of integrated information used in the first version of IIT (IIT 1.0) [1] is based on mutual information.Note that in IIT, a perturbational approach is used for evaluating probability distributions, which attempts to quantify actual causation by perturbing a system into all possible states [1,3,10,27].The perturbational approach requires full knowledge of the physical mechanisms of a system, i.e., how the system behaves in response to all possible perturbations.The measure defined in Eq. 5 is based on an observational probability distribution that can be estimated from empirical data.Because we aim for the empirical application of our method, we do not consider the perturbational approach here.

Stochastic interaction Φ SI
Second, consider the following partitioned probability distribution q, which partitions the transition probability from the past X to the present X ′ in the whole system into the product of the transition probability in each subsystem.This corresponds to removing the causal influences from M i to M ′ j (j = i) as well as the equal time influences between M ′ i and M ′ j (j = i) (Fig. 1 (b)).In this case, the corresponding measure of integrated information is given by where H(•|•) indicates the conditional entropy.This measure was proposed as a practical measure of integrated information by Barrett & Seth [14] following the measure proposed in the second version of IIT (IIT 2.0) [10].This measure was also independently derived by Ay as a measure of complexity [17,18].

Geometric integrated information Φ G
Aiming at only the causal influences between parts, Oizumi et al. [11] proposed to measure integrated information with the probability distribution that satisfies which means the present state of a subsystem i, M ′ i only depends on its past state M i .This corresponds to removing only the causal influences between subsystems while retaining the equal-time interactions between them (Fig. 1

(c)).
There is no closed-form expression for this measure in general.However, if the probability distributions are Gaussian, we can analytically solve the minimization over q.

Minimum Information Partition
In this section, we provide the mathematical definition of Minimum Information Partition (MIP).Then, we formulate the search for MIP as an optimization problem of a set function.The MIP is the partition that divides a system at the weakest links so that information loss caused by removing interactions among the subsystems is minimized.The information loss is quantified by the measure of integrated information.Thus, the MIP, π MIP , is defined as a partition where integrated information is minimized: where P is a set of partitions.In general, P is the universal set of partitions, including bi-partitions, tri-partitions, and so on.In this study, however, we focus only on bi-partitions for simplicity and computational time.By bi-partition, a whole system Ω is divided into a subset S (S ⊂ Ω, S = ∅) and its complement S = Ω \ S. Since a bi-partition is uniquely determined by specifying a subset S, integrated information can be considered as a function of a set S, Φ(S).Finding the MIP is equivalent to finding the subset, S MIP , that achieves the minimum of integrated information: In this way, the search of the MIP is formulated as an optimization problem of a set function.
Since the number of bi-partitions for the system with N-elements is 2 N−1 − 1, exhaustive search of the MIP in a large system is intractable.However, by formulating the MIP search as an optimization of a set function as above, we can take advantage of a discrete optimization technique and can reduce computational costs to a polynomial order, as described in the next section.

Submodular optimization
The submodularity is an important concept in set functions, which is an analogue of convexity in continuous functions [28].When objective functions are submodular, efficient algorithms are available for solving optimization problems.In particular, for symmetric submodular functions, there is a well-known algorithm by Queyranne which minimizes them [13].We utilize this method for finding the MIP in this study.

Submodularity
Mathematically, the submodularity is defined as follows.
Definition 1 (Submodularrity).Let Ω be a finite set and 2 Ω its power set.A set function f : 2 Ω → R is submodular if it satisfies the following inequality for any S, T ⊆ Ω: Equivalently, a set function f : 2 Ω → R is submodular if it satisfies the following inequality for any S, T ⊆ Ω with S ⊆ T and for any u ∈ Ω \ T: The second inequality means that the function increases more when an element is added to a smaller subset than when the element is added to a bigger subset.

Queyranne's algorithm
for any S ⊆ Ω. Integrated information Φ(S) computed by bi-partition is a symmetric function, because S and Ω \ S specifies the same bi-partition.If a function is symmetric and submodular, we can find the minimum of the function by Queyranne's algorithm with O(N 3 ) function calls [13].

Submodularity in measures of integrated information
In a previous study, Queyranne's algorithm was utilized to find the MIP when Φ MI is used as the measure of integrated information [12].As shown previously, Φ MI is submodular [12].However, the other measures of integrated information are not submodular.In this study, we apply Queyranne's algorithm to non-submodular functions, Φ SI and Φ G .When the objective functions are not submodular, Queyranne's algorithm does not necessarily find the exact MIP.We evaluated how accurately Queyranne's algorithm can find the MIP when it is used for non-submodular measures of integrated information.
There is an order relation among the three measures of integrated information, This inequality can be graphically understood from Fig. 1.The more the connections are removed, the larger the corresponding integrated information (the information loss) is.Φ MI removes all the interactions between the subsystems, Φ SI removes the causal influences between subsystems as well as the equal-time interactions between the present states, while Φ G removes only the causal influences between subsystems.
In terms of the difference from the submodular function Φ MI , Φ SI is closer to Φ MI than Φ G .One may guess that Queyranne's algorithm would provide more accurate results for Φ SI than for Φ G because Φ SI is closer to the submodular function Φ MI than to Φ G .We will show that this is indeed the case and more importantly, that Queyranne's algorithm works almost perfectly for both measures, Φ SI and Φ G .

Replica Exchange Markov Chain Monte Carlo Method
To evaluate the accuracy of Queyranne's algorithm, we compare the MIP found by Queyranne's algorithm with the "correct" MIP found by exhaustive search when the number of elements n is small enough (n 30).However, when n is large, we cannot know the correct MIP because the exhaustive search is unfeasible.To evaluate the performance of Queyranne's algorithm in a large system, we compare it with a different method, the Replica Exchange Markov Chain Monte Carlo (REMCMC) method [20][21][22][23].REMCMC, also known as parallel tempering, is a method to draw samples from probability distributions.REMCMC is an improved version of the MCMC methods.Simple MCMC methods like the Metropolis method often suffer from the problem of slow convergence.That is, a sample sequence is trapped in a local area and sample distribution takes time to converge to a target distribution.REMCMC aims at overcoming this problem.Here, we briefly explain how the MIP search problem is represented as a problem of drawing samples from a probability distribution.Details of the REMCMC method are given in Appendix B.
Let us define a probability distribution p(S; β) using integrated information Φ(S) as follows: where β(> 0) is a parameter called inverse temperature.This probability is higher/lower when Φ(S) is smaller/larger.The MIP gives the highest probability by definition.If we can draw samples from this distribution, we can selectively scan subsets with low integrated information and efficiently find the MIP, compared to randomly exploring partitions independent of the value of integrated information.

Results
We first evaluated the performance of Queyranne's algorithm in simulated networks.Throughout the simulations below, we consider the first order autoregressive (AR) model, where X and X ′ are present states and past states of a system, A is the connectivity matrix, and E is Gaussian noise.The stationary distribution of this AR model is considered.The stationary distribution of p(X, X ′ ) is a Gaussian distribution.The covariance matrix of p(X, X ′ ) consists of covariance of X, Σ(X), and cross-covariance of X and X ′ , Σ(X, X ′ ).Σ(X) is computed by solving the following equation, By using these covariance matrices, Φ SI and Φ G are analytically calculated [11] (see Appendix A).The details of the parameter settings are described in each subsection.We first evaluated the computation time of Queyranne's algorithm and compared it with that of the exhaustive search when the number of elements N changed.For each N, five connectivity matrices A were randomly generated.Each element of the connection matrix A was sampled from a normal distribution with mean 0 and variance 0.01/N.The covariance of Gaussian noise E was generated from a Wishart distribution W(σI, 2N) with covariance σI and degrees of freedom 2N, where σ corresponded to the amount of noise E and I was the identity matrix.We set σ to 0.1.
Figure 2 (a) shows the results for Φ SI .The red circles, which indicate the computational time of Queyranne's algorithm, are fit by the red solid line, log 10 T = 3.066 log 10 N − 3.681.In contrast, the black triangles, which indicate those of the exhaustive search, are fit by the black dotted line, log 10 T = 0.3074N − 3.563.This means that the computational time of Queyranne's algorithm increases in polynomial order (T ∝ N 3.066 ), while that of the exhaustive search exponentially increases (T ∝ 2.030 N ).For example, when N = 100, Queyanne's algorithm takes ∼ 283 sec while the exhaustive search takes 1.50 × 10 27 sec.This is in practice impossible to compute even with a super computer.Similarly, as shown in Fig. 2 (b), when Φ G is used, Queyranne's algorithm takes T ∝ N 4.120 while the exhaustive search takes T ∝ 2.091 N .Note that the reason why the order of the computational time of Queyrannes algorithm for Φ G is higher than that of SI is because the multi-dimensional optimization is needed to compute Φ G (see Appendix A).We evaluated the accuracy of Queyranne's algorithm by comparing the estimated MIP found by Queyranne's algorithm with the correct MIP found by exhaustive search.We used Φ SI and Φ G as the measures of integrated information.We considered two different architectures in connectivity matrix A of AR models.The first one was just a random matrix: Each element of A was randomly sampled from a normal distribution with mean 0 and variance 0.01/N.The other one was a block matrix consisting of N/2 by N/2 sub-matrices, A ij (i, j = 1, 2).Each element of diagonal sub-matrices A 11 and A 22 was drawn from a normal distribution with mean 0 and variance 0.02/N.Off-diagonal sub-matrices A 12 and A 21 were zero matrices.The covariance of Gaussian noise E in the AR model was generated from a Wishart distribution W(σI, 2N).The parameter σ was set to 0.1 or 0.01.The number of elements N was set to 20 for Φ SI and to 14 for Φ G .The reason for the difference in N is because Φ G requires much heavier computation than Φ SI (see Appendix A. We randomly generated 100 connectivity matrices A for each setting and evaluated performance using the following four measures.The following measures are averaged over 100 trials.

Accuracy of Queyranne's algorithm
Correct rate (CR) Correct rate (CR) is the rate of correctly finding the true MIP.Rank (RA) Rank (RA) is the rank of the MIP found by Queyranne's algorithm among all possible partitions.The rank is based on the Φ values computed at each partition.The partition that gives the lowest Φ is rank 1.The highest rank is equal to the number of possible bi-partitions, 2 N−1 .Error ratio (ER) Error ration (ER) is the deviation of the value of integrated information computed across the MIP found by Queyranne's algorithm from that computed across the true MIP, which is normalized by the mean error computed at all possible partitions.Error ratio is defined by where Φ MIP , Φ Q , and Φ are the amount of integrated information computed across the true MIP, that computed across the MIP found by Queyranne's algorithm, and the mean of the amounts of integrated information computed across all possible partitions, respectively.Correlation (CORR) Correlation (CORR) is the correlation between the MIP found by Queyranne's algorithm and that found by the correct MIP.Let us represent a bi-partition of N-elements as an N-dimensional vector σ = (σ 1 , . . ., σ N ) ∈ {−1, 1} N , where ±1 indicates one of the two subgroups.The absolute value of the correlation between the vector given by the true MIP (σ MIP ) and that given by the approximate MIP found by Queyranne's algorithm (σ Q ), |corr(σ MIP , σ Q )|, is computed.
The results are summarized in Table 1.This table shows that, when Φ SI was used, Queyranne's algorithm perfectly found the true MIPs for all 100 trials, even though Φ SI is not strictly submodular.Similarly, when Φ G was used, Queyranne's algorithm almost perfectly found the true MIPs.The correct rate was 100% for the normal models and 97% for the block structured models.Additionally, even when the algorithm missed the true MIP, the rank of the MIP it did find was 2 or 3.The averaged rank over 100 trials were 1.03 and 1.05 for the block structured models.Also, the error ratio in error trials were around 0.1 and the average error ratios were very small.Thus, such miss trials would not affect evaluation of the amount of integrated information in practice.However, in terms of partitions, the MIPs found by Queyranne's algorithm in error trials were markedly different from the true MIPs.
In the block structured model, the true MIPs for Φ G was the partition that split the system in halves.In contrast, the MIPs found by Queyranne's algorithm were one-vs-all partitions.
In summary, Queyranne's algorithm perfectly worked for Φ SI .With regards to Φ G , although Queyranne's algorithm almost perfectly evaluated the amount of integrated information, we may need to treat MIPs found by the algorithm carefully.This slight difference in performance between Φ SI and Φ G can be explained by the order relation Eq. (11).Φ SI is closer to the strictly submodular function Φ MI than is Φ G , which we consider to be why Queyranne's algorithm worked better for Φ SI than Φ G .We evaluated the performance of Queyranne's algorithm in large systems where an exhaustive search is impossible.We compared it with the Replica Exchange Markov Chain Monte Carlo Method (REMCMC).We applied the two algorithms to AR models generated similarly as in the previous section.The number of elements was 50 for Φ SI and 20 for Φ G , respectively.The difference in the number of elements used for the simulations is due to the difference in computational costs.We randomly generated 20 connectivity matrices A for each setting.We compared the two algorithms in terms of the amount of integrated information and the number of evaluations of Φ. REMCMC was run until a convergence criterion was satisfied.See Appendix B.3 for details of the convergence criterion.

Comparison between Queyranne's algorithm and REMCMC
The results are shown in Tables 2 and 3. "Winning percentage" indicates the fraction of trials each algorithm won in terms of the amount of integrated information at the MIPs.We can see that the MIPs found by the two algorithms exactly matched for all the trials.We consider that the algorithms probably found the true MIPs for the following two reasons.First, it is well known that REMCMC can find a minima if it is run for a sufficiently long time in many applications [23,[29][30][31].Second, the two algorithms are so different that it is unlikely that they both incorrectly identified the same partitions as the MIPs.Third, Queyranne's successfully finds the true MIPs in smaller systems as shown in the previous section.This suggests that Queyranne's algorithm works well for the larger systems.
We also evaluated the number of evaluations of Φ in both algorithms before the end of the computational processes.In our simulations, the computational process of Queyranne's algorithm ended much faster than the convergence of REMCMC.Queyranne's algorithm ends at a fixed number of evaluations of Φ depending only on N. In contrast, the number of the evaluations before the convergence of REMCMC depends on many factors such as the network models, the initial conditions, and pseudo random number sequences.Thus, the time of convergence varies among different trials.Note that by "retrospectively" examining the sequence of the Monte Carlo search, the MIPs turned out to be found at earlier points of the Monte Carlo searches than Queyranne's algorithm (which are indicated as "MIP found" in Tables 2 and 3).However, it is impossible to stop the REMCMC algorithm at these points where the MIPs were found because there is no way to tell whether these points reach the minima until the algorithm is run for enough amount of time.

Evaluation with real neural data
Finally, to ensure the applicability of Queyranne's algorithm to real neural data, we similarly evaluated the performance with electrocorticogram (ECoG) data recorded in a macaque monkey.The dataset is available at an open database, Neurotycho.org (http://neurotycho.org/).One hundred and twenty-eight channel ECoG electrodes were implanted in the left hemisphere.The electrodes were placed at 5-mm intervals, covering the frontal, parietal, temporal, and occipital lobes, and medial frontal and parietal walls.Signals were sampled at a rate of 1 kHz and down-sampled to 100 Hz for the analysis.The monkey "Chibi" was awake with the eyes covered by an eye-mask to restrain visual responses.To remove line noise and artifacts, we performed bipolar re-referencing between nearest neighbor electrode pairs.The number of re-referenced electrodes was 64 in total.
In the first simulation, we evaluated the accuracy.We extracted a 1-minute length of the signals of the 64 electrodes.Then, we randomly selected 20 and 14 electrodes 100 times for Φ SI and Φ G , respectively.We approximated the probability distribution of the signals with multivariate Gaussian distributions.The covariance matrices were computed with a time window of 1 min and a time step of 10 ms.We applied the algorithms to the 100 randomly selected sets of electrodes and measured the accuracy similarly as in Subsection 6.2.The results are summarized in Table 4.We can see that Queyranne's algorithm worked perfectly for both Φ SI and Φ G .
Next, we compared Queyranne's algorithm with REMCMC.We applied the two algorithms to the 64 re-referenced signals, and evaluated the performance in terms of the amount of integrated information and the number of evaluations of Φ, as in Subsection 6.3.We segmented 15 non-overlapping sequences of 1 minute each, and computed covariance matrices with a time step of 10 ms.We measured the average performance over the 15 sets.Here, we only used Φ SI , because Φ G requires heavy computations for 64 dimensional systems.The results are shown in Table 5.We can see that the MIPs selected by the two algorithms matched for all 15 sequences.In terms of the amount of computation, Queyranne's algorithm ended much faster than the convergence of REMCMC.

Discussions
In this study, we proposed an efficient algorithm for searching for the Minimum Information Partition (MIP) in Integrated Information Theory (IIT).The computational time of an exhaustive search for the MIP grows exponentially with the arithmetic growth of system size, which has been an obstacle to applying IIT to experimental data.We showed here that by using a submodular optimization algorithm called Queyranne's algorithm, the computational time was reduced to O(N 3.066 ) and O(N 4.120 ) for stochastic interaction Φ SI and geometric integrated information Φ G , respectively.These two measures of integrated information are non-submodular, and thus it is not theoretically guaranteed that Queyranne's algorithm will find the true MIP.We empirically evaluated the accuracy of the algorithm by comparing it with an exhaustive search in simulated data and in ECoG data recorded from monkeys.We found that Queyranne's algorithm worked perfectly for Φ SI and almost perfectly for Φ G .We also tested the performance of Queyranne's algorithm in larger systems (N ∼ 50) where the exhaustive search is intractable by comparing it with the Replica Exchange Markov Chain Monte Carlo method (REMCMC).We found that the MIPs found by these two algorithms perfectly matched, which suggests that both algorithms most likely found the true MIPs.In terms of the computational time, the number of evaluations of Φ taken by Queyranne's algorithm was much smaller than that taken by REMCMC before the convergence.Our results indicate that Queyranne's algorithm can be utilized to effectively estimate MIP even for non-submodular measures of integrated information.
Here, we discuss the pros and cons of Queyranne's algorithm in comparison with REMCMC.Since the MIPs found by both algorithms perfectly matched in our experiments, they were equally good in terms of accuracy.With regards to computational time, Queyranne's algorithm ended much faster than the convergence of REMCMC.Thus, Queyranne's algorithm would be a better choice in rather large systems (N ∼ 50).Note that if we retrospectively examine the sampling sequence in REMCMC, we find that REMCMC found the MIPs much earlier than its convergence and that the estimated MIPs did not change in the later parts of sampling process.Thus, if we could introduce a heuristic criterion to determine when to stop the sampling based on the time course of the estimated MIPs, REMCMC could be stopped earlier than its convergence.However, setting such a heuristic criterion is a non-trivial problem.Queyranne's algorithm ends within a fixed number of function calls regardless of the properties of data.If the system size is much larger (N 100), Queyranne's algorithm will be computationally very demanding because of O(N 3 ) time complexity and may not practically work.In that case, REMCMC would work better if the above-mentioned heuristics are introduced to stop the algorithm earlier than the convergence.
In this study, we considered that the three different measures of integrated information, Φ MI , Φ SI , and Φ G .Φ MI , are submodular but the other two measures, Φ SI and Φ G , are not.As we described in Section 4.3, there is a clear order relation among them (Eq.( 11)).Φ SI is closer to a submodular function Φ MI than Φ G is.This relation implies that Queyranne's algorithm would work better for Φ SI than for Φ G .We found that it was actually the case in our experiments because there were a few error trials for Φ G whereas there were no miss trials for Φ SI .For the practical use of these measures, we note that there are two major differences among the three measures.One is what they quantify.As shown in Fig. 1, Φ G measures only causal interactions between units across different time points.In contrast, Φ SI and Φ MI also measure equal time interactions as well as causal interactions.Φ G best follows the original concept of IIT in the sense that it measures only the "causal" interactions.One needs to acknowledge the theoretical difference whenever they apply one of these measures in order to correctly interpret the obtained results.The other difference is in computational costs.The computational costs of Φ MI and Φ SI are almost the same while that of Φ G is much larger, because it requires multi-dimensional optimization.Thus, Φ G may not be practical for the analysis of large systems.In that case, Φ MI or Φ SI may be used instead with care taken of the theoretical difference.
Although we resolved one of the major computational difficulties in IIT, an additional issue still remains.Searching for the MIP is an intermediate step in identifying the informational core, called the "complex".The complex is the subnetwork in which integrated information is maximized, and is hypothesized to be the locus of consciousness in IIT.Identifying the complex is also represented as a discrete optimization problem which requires exponentially large computational costs.Queyranne's algorithm cannot be applied to the search for the complex because we cannot formulate it as a submodular optimization.We expect that REMCMC would be efficient in searching for the complex and will investigate its performance in a future study.
An important limitation of this study is that we only showed the nearly perfect performance of Queyranne's algorithm in limited simulated data and real neural data.In general, we cannot tell whether Queyranne's algorithm works well for other data beforehand.For real data analysis, we recommend that the procedure below should be applied.First, as we did in Section 6.2, accuracy should be checked by comparing it with the exhaustive search in small randomly selected subsets.Next, if it works well, the performance should be checked by comparing it with REMCMC in relatively large subsets, as we did in Section 6.3.If Queyranne's algorithm works better than or equally as well as REMCMC, it is reasonable to use Queyranne's algorithm for the analysis.By applying this procedure, we expect that Queyranne's algorithm could be utilized to efficiently find the MIP in a wide range of time series data.nearby values of β.By this exchange, the sampled sequences at high inverse temperatures can escape from local minima and can explore many subsets.
We consider M-probabilities at different inverse temperatures β 1 > β 2 > • • • > β M and introduce the following joint probability: Then, the simulation process of the REMCMC consists of the following two steps.
Sampling from each distribution Samples are drawn from each distribution p(S m ; β m ) separately by using the Metropolis method as described in the previous subsection.Exchange between neighboring inverse temperatures After a given number of samples are drawn, subsets at neighboring inverse temperatures are swapped, according to the following probability p(S m ↔ S m+1 ): This probability indicates that if the integrated information at a higher inverse temperature is larger than that at a lower inverse temperature, subsets are always swapped; and otherwise, they are swapped with the probability r ′ .By iterating these two steps for sufficient time, the sample distribution converges to the joint distribution Eq.(37).
To maximize the efficiency of the REMCMC, it is important to appropriately set the multiple inverse temperatures.If the neighboring temperatures are far apart, the acceptance ratio of exchange (Eq.( 38)) becomes too small.The REMCMC is then reduced to just separately simulating distributions at different temperatures without any exchange.In a previous study [35], it was recommended to keep the average ratio higher than 0.2 for every temperature pair.At the same time, the highest/lowest inverse temperatures should be high/low enough so that sample sequence at the highest inverse temperature can reach the tips of (local) minima and that at the lowest one can search around many subsets.To satisfy these constraints, a sufficient number M of inverse temperatures are accommodated and the inverse temperatures are optimized to equalize the average of the acceptance ratio of exchanges at all temperature pairs [35][36][37][38][39]. Details of temperature setting are described below.

Initial setting
Inverse temperatures β m (m = 1, . . ., M) are initially set as follows.First, a subset is randomly selected for each m.Then, a randomly chosen element is added to or eliminated from each subset, and the absolute value of the change ∆Φ m in the amount of integrated information is taken.By using these absolute values, the highest and lowest inverse temperatures are determined by a bisection method so that the respective averages of the acceptance ratio exp(−β∆Φ 1 ) and exp(−β∆Φ M ) match the predefined values.The intermediate inverse temperatures are set to be a geometric progression:

Updating
The difference in the amount of integrated information between the candidate subset Φ(S c ) and the current subset Φ(S (t) ) is stored when the difference is positive (Φ(S c ) − Φ(S (t) ) ≥ 0).Then, by using the stored values at all the inverse temperatures, the highest and lowest inverse temperatures are determined by a bisection method so that the average of the acceptance ratio exp β Φ(S (t) ) − Φ(S c ) matches the predefined value, as in the initial setting.The intermediate inverse temperatures are set to approximately equalize the expected values of acceptance ratio of the exchange at all temperature pairs [35][36][37][38][39].The expected value is represented as a sum of two probabilities: +p(Φ m < Φ m+1 )e (β m −β m+1 )(Φ m −Φ m+1 ) dΦ m dΦ m+1 . (39) In [39], this expected value is approximated as where µ(T) and σ 2 (T) are the mean and variance of Φ, represented as functions of temperature T.
In [39], these functions are given by interpolating the sample mean and variance.In this study, these functions are estimated by regression, because the sample mean and variance are highly variable.The mean and variance at each temperature are computed at every update, and these means and variances are regressed on temperature using a continuous piecewise linear function, the anchor points of which are current temperatures.Then, to roughly equalize the expected values of the acceptance ratio of the exchange at all temperature pairs, we minimize the following cost function by varying temperatures [39]: The minimization is performed by a quasi-newton method, the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method.

Appendix B.3. Convergence criterion
One of the most commonly used MCMC convergence criteria is potential scale reduction factor (PSRF), which was proposed by Gelman & Rubin (1992) [40], and modified by Brooks & Gelman (1998) [41].In this criterion, multiple MCMC sequences are run.If all of them converge, statistics of the sequences must be about the same.This is assessed by comparing between-sequence variance and within-sequence variance of a random variable and calculating the PSRF, Rc .Large Rc suggests that some of the sequences do not converge yet.If Rc is close to 1, we can diagnose them as converged.In this study, we cut the sequence at each inverse temperature into the former and the latter halves, and applied the criterion to these two half sequences.If Rc of all the temperatures were below a predefined threshold, we regarded the sequences as converged.

Appendix B.4. Parameter settings
The number of inverse temperatures M was fixed at 6 throughout out the experiments.The highest/lowest inverse temperatures were set so that the averages of acceptance ratio become 0.01 and 0.5, respectively.The exchange process was done every 5 MCSs.The update of inverse temperatures was performed every 5 MCSs for the 200 initial MCSs.The threshold of Rc was set to 1.01.When computing Rc , we discarded the first 200 MCSs as a burn-in period and started to computing it after 300 MCSs.

Figure 1 .
Figure 1.Measures of integrated information represented by the Kullback-Leibler divergence between the actual distribution p and q.(a) Mutual information.(b) Stochastic interaction.(c) Geometric integrated information.

Figure 2 .
Figure 2. Computational time of Queyranne's algorithm and the Exhaustive search.

Table 4 .
Accuracy of Queyranne's algorithm in ECoG data

Table 5 .
Comparison of Queyranne's algorithm with REMCMC in ECoG data (SI)