Evaluating Approximations and Heuristic Measures of Integrated Information

Integrated information theory (IIT) proposes a measure of integrated information, termed Phi (Φ), to capture the level of consciousness of a physical system in a given state. Unfortunately, calculating Φ itself is currently possible only for very small model systems and far from computable for the kinds of system typically associated with consciousness (brains). Here, we considered several proposed heuristic measures and computational approximations, some of which can be applied to larger systems, and tested if they correlate well with Φ. While these measures and approximations capture intuitions underlying IIT and some have had success in practical applications, it has not been shown that they actually quantify the type of integrated information specified by the latest version of IIT and, thus, whether they can be used to test the theory. In this study, we evaluated these approximations and heuristic measures considering how well they estimated the Φ values of model systems and not on the basis of practical or clinical considerations. To do this, we simulated networks consisting of 3–6 binary linear threshold nodes randomly connected with excitatory and inhibitory connections. For each system, we then constructed the system’s state transition probability matrix (TPM) and generated observed data over time from all possible initial conditions. We then calculated Φ, approximations to Φ, and measures based on state differentiation, coalition entropy, state uniqueness, and integrated information. Our findings suggest that Φ can be approximated closely in small binary systems by using one or more of the readily available approximations (r > 0.95) but without major reductions in computational demands. Furthermore, the maximum value of Φ across states (a state-independent quantity) correlated strongly with measures of signal complexity (LZ, rs = 0.722), decoder-based integrated information (Φ*, rs = 0.816), and state differentiation (D1, rs = 0.827). 
These measures could allow for the efficient estimation of a system’s capacity for high Φ or function as accurate predictors of low- (but not high-)Φ systems. While it is uncertain whether the results extend to larger systems or systems with other dynamics, we stress the importance that measures aimed at being practical alternatives to Φ be, at a minimum, rigorously tested in an environment where the ground truth can be established.


Introduction
The nature of consciousness, defined as a subjective experience, has been a philosophical topic for centuries but has only recently become incorporated into mainstream neuroscience [1]. However, as consciousness is a subjective phenomenon, and thus not directly measurable, it must be operationalized to allow for empirical investigation of its nature and underlying mechanisms [2].

Figure 1. (A) Networks were randomly generated with n binary linear threshold nodes (S_i ∈ {0, 1}, θ ≥ 1.0) and connections (W_ij ∈ {−1, 0, 1}). Each network was perturbed into each possible initial state, and the following state transitions were recorded. (B) The networks' node mechanism and connection weights were used to generate a transition probability matrix (TPM), containing the probability of one state leading to any other state. (C) From the TPM, we generated an "observed" time series using frequent perturbations of the initial states. The sequence of state transitions following an initial-state perturbation is termed an epoch.
To investigate various measures and approximations, we needed functional information about the networks in the form of a probabilistic description of the transitions from any given state to any other state, i.e., a transition probability matrix (TPM). For each network, a TPM was constructed based on the node mechanism (linear threshold with θ = 1) and the connection weights W_ij. As the generated networks were deterministic, the TPM contained only a single '1' in each row, representing the next state of the network.
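As an illustration of this construction, the following sketch builds the deterministic state-by-state TPM for a linear threshold network (a minimal example with our own function and variable names; we assume node 0 occupies the least significant bit of the state index):

```python
import numpy as np

def build_tpm(W, theta=1.0):
    """Deterministic state-by-state TPM for a binary linear threshold network.

    W[i, j] is the connection weight from node i to node j (in {-1, 0, 1});
    a node turns on when its summed weighted input meets the threshold.
    """
    n = W.shape[0]
    tpm = np.zeros((2 ** n, 2 ** n))
    for s in range(2 ** n):
        # Decode the state index into a binary node-state vector (node 0 = LSB).
        state = np.array([(s >> i) & 1 for i in range(n)])
        next_state = (state @ W >= theta).astype(int)
        s_next = int(np.dot(next_state, 2 ** np.arange(n)))
        tpm[s, s_next] = 1.0  # deterministic: a single '1' per row
    return tpm

# Two nodes exciting each other: states 01 and 10 swap, 00 and 11 are fixed points.
W = np.array([[0, 1],
              [1, 0]])
tpm = build_tpm(W)
```

Because the dynamics are deterministic, every row of the resulting matrix holds exactly one '1'.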
From the TPM, given an initial condition, we were able to generate "observed" time-series data for each network. From a given initial condition, a network may only explore part of its state space before reaching an attracting fixed point or periodic sequence. While generating the observed data, we periodically perturbed the network into a new state, ensuring that our data fully explored the state space of the network and that the results were not dependent on our choice of initial condition. This procedure resembles the perturbations applied by transcranial magnetic stimulation (TMS) during empirical studies of consciousness [14]. The generated time-series data consisted of 2^n epochs, where one epoch was generated by initializing/perturbing a network to an initial state and then simulating for a total of α(n)(2^n + 1) timesteps. The function α(n) ensured parity of bits between the generated time series for networks of different sizes (see Appendix A.1). This perturbation and simulation process was repeated for all possible network states (2^n) sequentially, with each epoch appended to the last preceding epoch. The resulting simulated time series (sequence of epochs) produced an α(n)(2^n + 1)·2^n-by-n matrix, where each of the n columns reflected the state of a single node over time, and each row reflected the current state of each network node (0/1) at a given time. In sum, we derived a TPM from the mechanism and connectivity profile of individual nodes and then, using the TPM and perturbations, generated a time series of observed data that explored the entire state space of the network (see Figure 1b,c).
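The epoch-generation procedure can be sketched as follows (a simplified version with our own names; the α(n) bit-parity scaling is omitted and a fixed number of steps per epoch is used instead):

```python
import numpy as np

# Deterministic TPM for a toy two-node network: states 01 and 10 swap,
# 00 and 11 are fixed points (state index: node 0 is the least significant bit).
tpm = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

def generate_observed(tpm, n, steps_per_epoch):
    """Concatenate one epoch per possible initial state: perturb the network
    into state s0, then follow the deterministic TPM for a fixed number of
    steps, recording the binary state of every node at each timestep."""
    rows = []
    for s0 in range(2 ** n):
        s = s0
        for _ in range(steps_per_epoch):
            rows.append([(s >> i) & 1 for i in range(n)])
            s = int(np.argmax(tpm[s]))  # deterministic next state
    return np.array(rows)

data = generate_observed(tpm, n=2, steps_per_epoch=5)
```

Each epoch starts from a perturbed initial state, so the concatenated rows cover transitions out of every point in the state space.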

Integrated Information
For the networks defined above, we calculated Φ 3.0 as implemented through PyPhi v1.0 [6]. Here, we just give a brief summary of how Φ 3.0 was defined and calculated, but see reference [5] for a more detailed account. Generally, IIT proposes that a physical system's degree of consciousness is identical to its level of state-dependent causal irreducibility (Φ max ), i.e., the amount of information of a system in a specific state above and beyond the information of the system's parts.
The calculation of Φ 3.0 began with "mechanism-level" computations. For a given candidate system (subset of a network) in a state, we identified all possible mechanisms (subsets of system nodes in a state that irreducibly constrained the past and future state of the system). For each mechanism, we considered all possible purviews (subsets of nodes) that the mechanism constrained. For a given mechanism-purview combination, we found its cause-effect repertoire (CER; a probability distribution specifying how the mechanism causally constrained the past and future states of the purview). To find the irreducibility of the CER, the connections between all permissible bipartitions of elements in the purview and the mechanism were cut (see [6]); the bipartition producing the least difference is called the minimum information partition (MIP). Irreducibility, or integrated information, ϕ, is quantified by the earth mover's distance (EMD) between the CER of the uncut mechanism and the CER of the mechanism partitioned by the MIP. A mechanism, together with the purview over which its CER is maximally irreducible and the associated ϕ value, specifies a concept, which expresses the causal role played by the mechanism within the system. The set of all concepts is called the cause-effect structure of the candidate system.
Once all irreducible mechanisms of a candidate system were found, a similar set of operations was done at the "system level" to understand whether the set of mechanisms specified by the system were reducible to the mechanisms specified by its parts. The irreducibility of the candidate system was quantified by its conceptual integrated information, Φ. This process was repeated for all candidate systems, and the candidate system that was maximally irreducible among all candidate systems was termed a major complex (MC). According to IIT then, the MC was the substrate that specified a particular conscious experience for the (physical) system in a state, and Φ 3.0 quantified the irreducibility of the cause-effect structure it specified in that state. As such, Φ 3.0 was calculated for every reachable state of the system, i.e., state-dependently.
As many of the heuristics and approximations outlined below are state-independent, there is no direct comparison to the state-dependent Φ 3.0. To facilitate comparisons with these measures, we further computed a state-independent quantity, Φ peak 3.0, as the maximum value of Φ 3.0 across all states of the network. The quantity Φ peak 3.0 can be thought of as a measure of a network's capacity for consciousness, rather than its currently realized level of consciousness. Alternatively, we could compute the mean value of Φ 3.0, which has some relation to the state-dependent value of Φ 3.0 under certain regularity conditions [15], but the results were similar (see Figure 5d).

Approximations and Heuristics
To speed up the calculation of Φ 3.0, one can implement several shortcuts or approximations based on assumptions about the system under consideration. Here, we aimed to test six specific approximations: three approximations that are already implemented in the toolbox for calculating Φ 3.0 (PyPhi; [6]) that reduce the complexity of evaluating the information lost during partitioning of a network; two shortcuts based on estimating the elements included in the MC rather than explicitly testing every candidate subsystem; and one estimation of a system's Φ peak 3.0 from the Φ of a few states, rather than taking the maximum over all possible states. All approximations were likely to compare well against Φ 3.0 but were unlikely to yield significant savings in computational demand.
Another approach is to use heuristics that capture aspects of Φ 3.0 . These heuristics can be separated into two classes: those that require the full TPM and discrete dynamics (heuristics on discrete networks requiring perturbational data) and those that require time-series data (heuristics from observed data). While these measures may reduce the computational demands, the heuristics based on discrete dynamics still require full structural and functional knowledge of the system, which reduces their applicability. On the other hand, measures based on observed data significantly broaden the potential applicability at the cost of estimating the underlying causal structure by using the observed time series.
All approximations and heuristics that were tested are listed in Table 1, together with an identifier (from "A" to "N") that will be used in the text for ease of reading, as well as a reference and brief description. We calculated several approximations to Φ 3.0. (A) The cut-one approximation (CO) reduces the number of partitions considered when searching for the MIP by assuming that the MIP is achieved by cutting only a single node out of the candidate system; (B) the no-new-concepts approximation (NN) eliminates the need to rebuild the entire cause-effect structure for every partition under the assumption that a partition does not give rise to new concepts. Thus, one only needs to check for changes to existing mechanisms, rather than reevaluating the entire powerset of potential mechanisms.
We also tested two approximations based on estimates of which nodes are included in the MC. These approximations assumed the MC consisted of either (C) all the nodes in the system taken as a whole (whole system; WS), or (D) the subsystem of the network where all nodes with no recursive connectivity (no input and/or output connections) or an unreachable state (nodes that were always "on" or always "off", such as a node with only inhibitory inputs) had been removed, iteratively (iterative cut; IC). Note that by unreachable, we mean there was no state of the network that would lead to a particular node being "on" (or "off") in the next time step. This does not mean that we could not use an external perturbation to set the node into any state (which we did when generating the observed data). In IIT 3.0 , such a node (either with no inputs, no outputs, or an unreachable state) can be partitioned without loss, leading to Φ 3.0 = 0. Simply excluding these nodes from the MC is not an approximation but a computational shortcut, as they will necessarily be outside the MC. However, the approximation consisted in assuming that the remaining set of recursively connected nodes was the MC.
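The pruning step of the IC approximation might look as follows (our own minimal sketch; the unreachable-state check described above is omitted for brevity):

```python
import numpy as np

def iterative_cut(W):
    """Iteratively remove nodes with no incoming or no outgoing connections
    within the remaining subnetwork; the surviving, recursively connected
    nodes serve as the IC estimate of the major complex."""
    keep = list(range(W.shape[0]))
    changed = True
    while changed:
        changed = False
        sub = W[np.ix_(keep, keep)]
        for idx, node in enumerate(keep):
            if not sub[:, idx].any() or not sub[idx, :].any():
                keep.remove(node)
                changed = True
                break  # recompute the subnetwork after each removal
    return keep

# Nodes 0 and 1 are mutually connected; node 2 only receives input,
# so it is pruned and the estimated MC is {0, 1}.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 0, 0]])
```

Pruning is repeated until no further node qualifies, since removing one node can leave its neighbors without inputs or outputs.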
As with Φ 3.0 , these measures were calculated in a state-dependent and state-independent manner.
Finally, we tested (E) whether the state-independent Φ peak 3.0 could be estimated by randomly sampling the state-dependent Φ 3.0, termed here "Est. nΦ" (in our analyses, five sample states were used; see Table 2).

Heuristics on Discrete Networks
To estimate Φ 3.0, we investigated several heuristic measures defined for discrete networks. While the latest iteration of IIT takes steps to make the mathematical formalism more in tune with the intended interpretation of its axioms and postulates, IIT 3.0 is more computationally intractable than previous versions (see S1 of [5]). To compare the results of the two newest versions of the theory, we tested (F) Φ based on IIT 2.0, Φ 2.0 [3], and (G) a variant of Φ 2.0 in which the minimization is taken over both cause and effect, not only cause, Φ 2.5 [12]. These measures are, however, still limited by the exponential growth in computational time and are included here because IIT 2.0 was used as inspiration for other measures, whose validity depends on the correspondence between IIT 2.0 and IIT 3.0.
As Φ 3.0 is sensitive to a large state repertoire, i.e., divergent and convergent behavior weakening cause/effect constraints (assuming irreducibility), we also included two measures that capture the dynamical differentiation of states in the system: (H) the number of reachable states, D1, quantifying the system's available repertoire of states, and (I) the cumulative variance of system elements, D2, indicating the degree of difference between system states [15]. For D1, we calculated the number of states that were reachable, i.e., states that had a valid precursor state. Accordingly, D1 was inversely related to a system's degeneracy of state transitions. D2 calculated the cumulative variance of activity in each system node given the maximum entropy distribution of initial conditions. As such, D2 reflected how different the system's reachable states were from each other. See [15] for a more thorough account.
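For instance, D1 can be read directly off the state-by-state TPM as the number of columns that receive any probability mass (a minimal sketch with our own helper names; D2 is omitted):

```python
import numpy as np

def d1_reachable_states(tpm):
    """D1: the number of reachable states, i.e., states (columns of the
    state-by-state TPM) that receive nonzero probability from some state."""
    return int((tpm.sum(axis=0) > 0).sum())

# A permutation TPM makes every state reachable (D1 = 4), whereas a TPM
# that maps every state to state 0 leaves only one reachable state (D1 = 1).
perm = np.eye(4)[[1, 2, 3, 0]]
collapse = np.zeros((4, 4))
collapse[:, 0] = 1.0
```

The collapsing network illustrates the degeneracy point above: many-to-one transitions shrink the reachable repertoire and thus lower D1.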
Both Φ 2.0 and Φ 2.5 were calculated in a state-dependent and in a state-independent manner (Φ peak 2.0 /Φ peak 2.5 ), while both D1 and D2 were only defined state-independently. All the heuristics on discrete systems were calculated using the system TPM. As such, while these measures were faster to calculate and flexible in terms of network size, they still required full knowledge of the functional dynamics of the system (i.e., the full TPM).

Heuristics from Observed Data
To alleviate the full knowledge requirement, we considered heuristic measures that are defined for observed (time-series) data. Given their relative success in distinguishing conscious from unconscious states in experiments and clinical populations [13,22,23] and their apparent similarity to central IIT intuitions, we focused on measures of signal diversity. There are many candidates to choose from, but here, we included (J) coalition entropy (S), measured by the entropy of the observed state distribution, indicating a system's average diversity of visited states [22], and (K) signal complexity, measured by algorithmic compressibility through Lempel-Ziv compression (LZ), indicating the degree of order or patterns in the observed state sequences of a system [22]. Both entropy and complexity measures have been used in EEG to distinguish between states of consciousness [13,24]. In addition, several measures have been developed that share many of IIT's underlying intuitions, such as capturing the integrated information of a system above and beyond its parts while staying computationally tractable [10,11,19,21,25]. Although these measures can be applied to continuous data in the time domain such as EEG, here, we focused on a selection of measures that can be applied to discrete, binary data. Specifically, we tested: (L) decoder-based integrated information (Φ*) based on IIT 2.0 [21], (M) integrated stochastic interaction (SI) based on IIT 2.0 [11], and (N) mutual information (MI) based on IIT 1.0 [21]. The integrated information measures were implemented using the "Practical PHI toolbox for integrated information analysis" [26] with the discrete forms of the formulae, employing an exhaustive MIP search with a bipartition scheme (powerset; 2^(n−1) − 1 bipartitions) and a normalization factor according to IIT 2.0 [3]. All heuristics were calculated in a state-independent manner, using the time-series data generated for the whole network (no searching through subsystems).
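For binary data, the two signal-diversity measures can be sketched as follows (our own minimal implementations; published analyses typically normalize LZ against surrogate data, which is omitted here):

```python
import numpy as np

def coalition_entropy(state_indices):
    """Entropy (bits) of the observed distribution of visited states."""
    _, counts = np.unique(state_indices, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def lz_complexity(bits):
    """Number of phrases in the Lempel-Ziv (1976) parsing of a binary
    sequence, a simple proxy for signal complexity."""
    s = ''.join(str(int(b)) for b in bits)
    i, c, n = 0, 0, len(s)
    while i < n:
        l = 1
        # grow the current phrase while it already occurs in the preceding text
        while i + l <= n and s[i:i + l] in s[:i + l - 1]:
            l += 1
        c += 1
        i += l
    return c
```

A constant sequence parses into very few phrases, while diverse state sequences yield both higher entropy and higher LZ counts.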

Analysis
Comparisons between Φ 3.0 and the approximate measures (CO, NN, WS, IC) were analyzed using Pearson correlations (r) and separate ordinary least-squares linear regression models, as the approximations were expected to be closely related to Φ 3.0. Statistics of the linear fits are reported. For comparisons between Φ 3.0 and all other measures, we used Spearman's correlation (rs) to investigate the monotonicity of the relationship, as a linear relationship was not necessarily expected. All state-dependent measures were compared to Φ 3.0, while all state-independent measures were compared to Φ peak 3.0. Metrics of significance (p values) are not reported because of our large sample size; for our sample (n > 1981), correlations as small as |r| = 0.044 were statistically significant at the 0.05 level, but such small correlations were not meaningful in the context of the study. As we focused on high correspondence, we instead report correlations as weak (0.5 < r < 0.7), medium (0.7 < r < 0.8), strong (0.8 < r < 0.9), and very strong (r > 0.9) (for both r and rs).
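Spearman's rs is simply the Pearson correlation of the rank-transformed values; a minimal sketch (our own helper, assuming no tied values; libraries such as SciPy handle ties via average ranks):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation, computed as the Pearson correlation of
    the ranks (assumes no ties; tied values would need average ranks)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
```

A nonlinear but monotonic relationship (e.g., y = x³) gives rs = 1 even when the Pearson r is below 1, which is why rs suits the heuristic comparisons here.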

Results
We analyzed 2032 randomly generated networks, with 131 three-node, 675 four-node, 866 five-node, and 360 six-node networks. In total, 61,224 states were analyzed. Note that the heuristic measures were only analyzed in 309 of the six-node networks due to time constraints. See Table 2 for an overview of the main results and Figure 2 for four example networks. (Table 2 abbreviations: Est. nΦ: Φ peak 3.0 estimated from five sample states; D1/D2: state differentiation; S: coalition entropy; LZ: Lempel-Ziv complexity; Φ*: decoder-based Φ; SI: stochastic interaction; MI: mutual information.)

Descriptive Statistics
Mean and variance of Φ 3.0 grew as a function of the number of network elements (n = 3: M = 0.015 ± 0.121 SD to n = 6: M = 0.386 ± 0.487 SD). As the systems increased in size, the fraction of Φ peak 3.0 = 0 networks (indicating a completely reducible system, e.g., a feedforward network) decreased. We also monitored a class of networks with Φ peak 3.0 = 1, as this typically indicated that the MC was a stereotyped unidirectional "loop". The fraction of these stereotyped networks stayed relatively stable as n increased, while the fraction of networks with Φ peak 3.0 > 1 increased. See Figure 3.


Approximations
Both the no-new-concepts (NN) and the cut-one (CO) approximations were nearly perfectly correlated with state-dependent (S.D.) Φ 3.0 and state-independent (S.I.) Φ peak 3.0.
Heuristics
The state differentiation measures D1 and D2 showed strong (rs = 0.827) and medium (rs = 0.718) rank-order correlations with S.I. Φ peak 3.0 (Figure 5a,b).
The state-independent heuristics LZ and S were medium correlated with Φ peak 3.0 (0.71 < rs < 0.72) (Figure 5c, only LZ shown). The state-independent measures SI and MI were weakly or less correlated with Φ peak 3.0 (rs < 0.54), while Φ* was strongly rank-order correlated with Φ peak 3.0 (rs = 0.82) (Figure 5d, only Φ* shown). For Φ*, the results showed two clusters of values, one seemingly linearly related to Φ peak 3.0 and one non-correlated cluster consisting of low-Φ peak 3.0/high-Φ* outliers. A post-hoc analysis removing outliers above two standard deviations of the mean negligibly influenced the results (see Appendix A.2).
Together, these results suggest that the tested heuristics might be accurate predictors of Φ peak 3.0 on a group level, though not necessarily for individual networks; they also drastically reduce computational demands (see Appendix A.4). In addition, all heuristics showed an increased variance of Φ peak 3.0 at higher values, suggesting reduced correspondence for higher values.

Post-hoc Tests
For all measures, removing non-integrated (Φ peak 3.0 = 0) or irreducible circular networks (Φ peak 3.0 = 1) reduced the correlational values. This was true for all heuristics, while the approximations were minimally affected. After this adjustment, S.I. D1 and Φ* were the heuristics highest correlated with Φ peak 3.0 (rs = 0.703 and rs = 0.698, respectively), with LZ third (rs = 0.616).
Finally, we tested whether the estimated MCs (WS and IC) could predict Φ 3.0. Together, these results suggest that the tested approximations can be used as strong predictors of Φ; however, these approximations still require knowledge of the system's TPM, and their computational cost grows exponentially, leading to only a marginal increase in the size of networks that can be analyzed (see Appendix A.4).



Discussion
We randomly generated a population of small networks (three to six nodes) with linear threshold logic and both excitatory and inhibitory connections. We evaluated several approximations and heuristic measures of integrated information based on how well they corresponded to Φ 3.0, according to the definition proposed by integrated information theory. The purpose of the work was to determine which methods, if any, might be used to test the theory. Since the accuracy of these methods cannot be evaluated for large networks of the size typically of interest for consciousness studies, we considered success in the current study (correspondence in small networks where Φ 3.0 can be computed) as a minimal requirement for any such measure. In summary, we observed that the computational approximations were strong predictors (as defined in Section 2.4) of both Φ 3.0 and Φ peak 3.0, while the heuristic measures were only able to capture Φ peak 3.0. The approximation measures were still computationally intensive and required full knowledge of the system's TPM, meaning they provided only a marginal increase in the size of the systems that can be studied. Heuristic measures, on the other hand, provided greater reductions in computation and knowledge requirements and can be applied to much larger systems, but only in a coarser, state-independent manner.

Approximation Measures
The approximation measures we tested were developed by starting from the definition of Φ 3.0 and then making assumptions to simplify the computations. Although they did not reduce computation enough to substantially increase the applicability of Φ 3.0 , their success provides a blueprint for future approximations. We discuss two aspects of Φ 3.0 computation that should be investigated in future work: finding the MC of a network and finding the MIP of a mechanism-purview combination.
Regarding the estimates of the MC, the Φ 3.0 value of any subsystem within a network is a lower bound on the Φ 3.0 of the MC of that network. Moreover, the WS approximation (assuming the MC is the whole system) and the IC approximation (assuming the MC is the whole system after removing nodes without inputs or without outputs and inactive nodes) were both highly predictive of Φ 3.0 (and of Φ peak 3.0 ). Estimating the MC provided computational savings by eliminating the need to compute Φ 3.0 for all possible subsets of elements. However, the computational cost of computing Φ 3.0 for an individual subsystem still grows exponentially with the size of the subsystem. Any MC estimate close to the full size of the network will still require substantial computation. Therefore, finding a minimal MC that still accurately estimates Φ 3.0 would be most efficient for reducing the computational demands. While this may limit the usability of MC estimates (for highly integrated systems, the MC is more likely to be the whole system), such methods could be used to investigate questions regarding which part of a system is conscious (e.g., cortical location of consciousness [27]).
Using the CO approximation (assuming that at the system level, the MIP results from partitioning a single node), we observed very strong correlations with Φ 3.0 (and Φ peak 3.0 ). Usually, the number of partitions to check grows exponentially with the number of nodes in the system, but with the CO approximation it grew linearly, providing a substantial computational savings. Extending the CO approximation (or some variant of it, see [28][29][30]) from the system-level MIP to the mechanism-level MIPs could provide even greater computational savings. While only a single system-level MIP needs to be found to compute Φ 3.0 , a mechanism-level MIP must be found for every mechanism-purview combination (the number of which grows exponentially with the system size).
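To make the scaling concrete, assuming the exhaustive search evaluates every unordered, non-trivial bipartition while CO evaluates one candidate cut per node (a simplification of the actual PyPhi search):

```python
def bipartition_count(n):
    """Unordered, non-trivial bipartitions of an n-node system
    (the powerset count 2^(n-1) - 1 noted in the Methods)."""
    return 2 ** (n - 1) - 1

def cut_one_count(n):
    """Candidate cuts under the cut-one (CO) approximation: one per node."""
    return n
```

For n = 6 this is 31 versus 6 candidate partitions; for n = 20 it is 524,287 versus 20, which is the exponential-versus-linear gap discussed above.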
As an aside, the IIT 3.0 formalism only considers bipartitions of nodes when searching for the MIP, presumably on the basis that further partitioning a mechanism (or system) could only cause additional information loss (and, thus, never be a minimum information partition). To explore this, we employed an alternative definition of the MIP requiring a search over all partitions (AP, as opposed to bipartitions) for a subset of our networks. While we observed a very high correlation between the all-partitions and bipartition schemes (S.I. Φ peak 3.0: R^2 = 0.966; S.D. Φ 3.0: R^2 = 0.921; see Appendix A.7), the correspondence was not exact. Note that the definition of a partition used for the 'all partitions' option is slightly different from the definition for 'bipartitions', so the set of partitions in the AP option is not strictly a superset of the set of bipartitions (see PyPhi v1.0 and its documentation [6] or Appendix A.7 for more details). Despite this difference, we saw a very strong correlation between the methods, suggesting that different rules for permissible cuts could be considered as potential approximations.

Heuristic Measures
Although the heuristic measures did not capture state-dependent Φ 3.0, most were rank-correlated with state-independent Φ peak 3.0. The heuristic D1 measures the number of states accessible by a system [15], and the strong correlation we observed indicates that systems with a large repertoire of available states are also likely to have high Φ peak 3.0 (assuming the systems are irreducible, i.e., Φ peak 3.0 > 0). This finding is interesting because clinical results also corroborate state differentiation as a factor in unconsciousness: it has been observed that the state repertoire of the brain is reduced during anesthesia [31]. While D1 is computationally tractable, it requires full knowledge of the system (i.e., a TPM with 2^(2n) bits of information), that the system is integrated, and that transitions are relatively noise-free. As such, unfortunately, D1 cannot be applied to larger artificial or biological systems of interest (such as the brain). The second measure that correlated well with Φ peak 3.0 can also be seen to quantify state differentiation to some extent. LZ is a measure of signal complexity [32], offering a concrete algorithm to quantify the number of unique patterns in a signal. While LZ has been used to differentiate conscious and unconscious states [13,33], it cannot, from observed data alone, distinguish between a noisy system and an integrated but complex one. Thus, some knowledge of the structure of the system in question is required for its interpretation. In addition, while LZ allows for the analysis of real systems based on time-series data, it is also the measure that is the furthest removed from IIT (but see [14]). It is highly dependent on the size of the input and is hard to interpret without normalization, which makes it difficult to compare systems of varying size.
Finally, the measure Φ* is aimed at providing a tractable measure of integrated information using mismatched decoding and is applicable to time-series data, both discrete and continuous [10]. Φ* is relatively fast to compute and can also be applied to continuous time series like EEG. However, while we observed a high correlation with Φ peak 3.0 , a cluster of high Φ* values with corresponding low Φ peak 3.0 values limited the interpretation. This suggests that Φ* might not be reliable for low Φ peak 3.0 networks, but the analysis of larger networks is needed to draw a conclusion. While the results did not suggest a clear tractable alternative to Φ 3.0 , several of the measures could be useful in statistical comparisons of groups of networks.
Prior work directly comparing Φ 3.0 with measures of differentiation (e.g., D1, LZ) reported lower correlations than those observed here for Φ 3.0 [15]. There are at least three possible reasons for this: (a) the current work considered only linear nodes instead of nodes implementing general logic, (b) we compared against Φ peak 3.0 and not Φ mean 3.0 , and (c) we considered only the whole system as a basis for the heuristics, and not the subset of elements that constitutes the MC. For (b), we reran the analysis replacing Φ peak 3.0 with Φ mean 3.0 , producing negligible deviances in the results (see Appendix A.5). For (c), the results of the WS (whole-system approximation) suggested that using the whole system to approximate the MC does not make a substantial difference (at least for networks of this size). This leaves (a), the types of network studied, as the likely reason for the differences in the strength of the correlations.
The rank correlations of all heuristic measures with Φ_3.0^peak were negatively impacted by removing networks with Φ_3.0^peak = 0 or 1. This suggests that such networks are indeed relevant to consider and that finding a tractable measure that separates Φ_3.0^peak = 0 from Φ_3.0^peak > 0 networks would be useful in its own right. Evident in the results was that all heuristics, except S, SI, and MI, showed an inverse predictability with Φ_3.0^peak: low scores on a given heuristic corresponded to a low Φ_3.0^peak, but the higher the scores, the larger the spread of Φ_3.0^peak (see Figure 5). This could explain why the correlations drop when networks with Φ_3.0^peak = 0 or 1 are removed. This inverse predictability indicates two things. First, the tested measures could be useful as negative markers: low scores on a measure can indicate a low-Φ_3.0^peak network, but not the converse. Second, it suggests that Φ_3.0^peak depends on aspects of the underlying network that are not captured by any of the heuristic measures.
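The effect of the zero cluster on rank correlations can be illustrated with a small synthetic example (the numbers below are invented for illustration, not taken from the study): a tight cluster of networks scoring zero on both a heuristic and Φ_3.0^peak inflates the overall rank correlation, which then drops once those networks are excluded.

```python
# Synthetic illustration: Spearman correlation with and without a
# cluster of fully reducible (zero-scoring) networks. Pure-Python
# implementation with average ranks for ties.

def _ranks(xs):
    """1-based ranks, averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's r_s as Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores: four reducible networks (zero on both measures),
# then eight networks where the heuristic only loosely tracks Phi_peak.
heuristic = [0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8]
phi_peak  = [0, 0, 0, 0, 3, 1, 4, 2, 8, 5, 7, 6]

rho_all = spearman(heuristic, phi_peak)              # ~0.91
rho_nonzero = spearman(heuristic[4:], phi_peak[4:])  # ~0.71
```

The aligned zero cluster contributes a block of perfectly agreeing ranks, so removing it lowers r_s even though the heuristic's behavior on the remaining networks is unchanged.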

Future Outlook
Finally, we discuss several topics that we consider relevant for future work. First, there are several conceptual aspects of Φ_3.0 that are worth considering when developing future methods. Composition: one of the major changes in IIT 3.0 from previous iterations of the theory is the role of all possible mechanisms (subsets of nodes) in the integration of the system as a whole. To our knowledge, all existing heuristic measures of integrated information are holistic, always looking at the system as a whole. Future heuristics could take a compositional approach, combining integration values from subsets of measurements rather than using all measurements at once. State dependence: we report that heuristic measures do not correlate with state-dependent Φ_3.0 (see Appendix A.6 for a perturbation-based approach), but a more accurate statement is that there are no (data-based) state-dependent heuristics; the nature of heuristic measures does not naturally accommodate state dependence. Cut directionality: Φ_3.0 uses unidirectional cuts, i.e., severing one directed connection, while other heuristics use bidirectional cuts (Φ_2.0, Φ_2.5) or even total cuts, separating system elements (Φ*, SI, MI). This leads, in effect, to an overestimation of integrated information, even for feedforward and ring-shaped networks (see Figure 2). This could partially explain the inverse predictability noted above.
Second, there are differences in the data used for the different measures. Only the approximations (and D1/D2/Φ_2.0/Φ_2.5) were calculated on the full TPM; the other heuristics were calculated from the generated time-series data. While deterministic networks such as those considered here can be fully described by both time-series data and the TPM, given that the system is initialized to all possible states at least once, data from deterministic systems might be "insufficient" as a time series, as such systems often converge on a few cyclical states and therefore need to be regularly perturbed. One solution could be to add noise to the system to avoid fixed points. In addition, as all heuristics considered here (except D1/D2/Φ_2.0/Φ_2.5) were dependent on the size of the generated time series (see Appendix A.1), future work should control for the number of samples and consider the impact of non-self-sustaining activity (convergence on a set of attractor states).
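The update rule of the binary linear threshold networks studied here, together with the noise-injection idea above, can be sketched as follows. The function name, the strict zero threshold, and the bit-flip noise model are illustrative assumptions, not the paper's exact specification.

```python
# Sketch of one synchronous update of a binary linear threshold
# network, with optional bit-flip noise to keep a deterministic system
# from settling into a fixed point or short cycle.
import random

def step(state, W, theta=0.0, p_flip=0.0, rng=random):
    """state: list of 0/1 node states; W[i][j]: weight from node j to
    node i; theta: (strict) firing threshold; p_flip: probability of
    flipping each updated bit."""
    n = len(state)
    new = [1 if sum(W[i][j] * state[j] for j in range(n)) > theta else 0
           for i in range(n)]
    if p_flip > 0.0:
        new = [1 - s if rng.random() < p_flip else s for s in new]
    return new

# Two mutually excitatory nodes: without noise the trajectory from
# [1, 0] is [0, 1], [1, 0], [0, 1], ... -- a two-state attractor cycle.
W = [[0, 1], [1, 0]]
```

With p_flip = 0 such a network quickly revisits the same few states, which is exactly the "insufficient as a time series" problem noted above; a small p_flip perturbs it off the cycle at the cost of determinism.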
Third, studies comparing measures of information integration, differentiation, and complexity have also observed both qualitative and quantitative differences between the measures, even for simple systems [19,20]. Thus, there might be a large number of networks for which the tested heuristics would correspond to Φ_3.0 only if certain prerequisites are met, such as a certain degree of irreducibility or small-worldness. One could, for example, imagine systems that have evolved to become highly integrated through interacting with an environment [34]. Such evolved networks might have qualities beyond being integrated, such as state differentiation that serves distinctive roles for the system, i.e., differences that make a behavioral difference to an organism, an important concept in IIT (although considered from an internal perspective in the theory) [5]. While it is still an open question what Φ_3.0 captures of the underlying network beyond the heuristics considered here, investigation into the structural and functional aspects that lead to systems with high Φ_3.0 could point to avenues for developing new measures inspired by IIT. Further, while estimates of the upper bound of Φ_3.0 for a given system size have been proposed (e.g., see [15]), not much is known about the actual distribution of Φ_3.0 over different network types and topologies. Here, we explored a variety of network topologies, but systematic variation of system properties, such as weights, noise, thresholds, and element types, was omitted because of the limited scope of the paper. Investigating the relation between such network properties and Φ_3.0 would be an interesting research project moving forward. It could serve as a testbed for future IIT-inspired measures and be informative about what kinds of properties could be important for high Φ_3.0 in biological systems, and which properties to aim for in artificial systems to produce "consciousness".

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A.
Appendix A.1. Input Size
For each network N with n ∈ {3, 4, 5, 6} elements, we generated an observed time series as a matrix A_N consisting of n columns and m rows. To cover the full state space of N, we perturbed each N into its 2^n possible initial conditions S_i. For each initial condition S_i, we simulated 2^n + 1 observations (referred to as an epoch) to ensure that we explored the full behavior of the network. Thus, A_N was a matrix of at least size n × m(n), where

m(n) = (2^n + 1) · 2^n.

However, as the LZ compression is dependent on the amount of data to compress, we wanted the size of A_N to be equal for all n. Hence, we needed to adjust the number of timesteps that we ran the simulation for, so that the size of A_N would always match that of the largest network in the set, ň. Thus, for the specific case of ň = 6, the size of A_N is given by

n × m(ň) = 6 × m(6) = 24,960.

To get the same size of A_N for a network N of size n ∈ {3, 4, 5, 6}, we needed an adjusted number of timesteps m′(n) ≈ α(n) × m(n) (rounded to the nearest integer), where the adjustment factor α(n) satisfies

n × α(n) × m(n) = 24,960, i.e., α(n) = 24,960 / (n(2^n + 1)2^n).

For the general case, the shape of A_N is n-by-m′(n), where

m′(n) ≈ α(n) × m(n),
m(n) = (2^n + 1) · 2^n,
α(n) = ň(2^ň + 1)2^ň / (n(2^n + 1)2^n),

with n ∈ {a, a + 1, ..., ň} for some a, ň ∈ ℕ with ň > a.
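The adjustment above can be sketched as follows; this is a minimal illustration of the formulas, not the authors' code, and the function names are our own.

```python
# Sketch of the epoch-length adjustment: choose m'(n) so that every
# observation matrix A_N has the same total number of entries as that
# of the largest network (n_max = 6).

def m(n: int) -> int:
    """Baseline rows: (2^n + 1) observations per epoch, one epoch for
    each of the 2^n initial conditions."""
    return (2 ** n + 1) * 2 ** n

def alpha(n: int, n_max: int = 6) -> float:
    """Adjustment factor so that n * m'(n) = n_max * m(n_max)."""
    return (n_max * m(n_max)) / (n * m(n))

def m_adjusted(n: int, n_max: int = 6) -> int:
    """Adjusted number of rows, rounded to the nearest integer."""
    return round(alpha(n, n_max) * m(n))

# Every A_N then has n * m'(n) = 6 * m(6) = 24,960 entries:
sizes = {n: n * m_adjusted(n) for n in (3, 4, 5, 6)}
```

For n = 6 the factor is α(6) = 1 and nothing changes; smaller networks get proportionally longer simulations (e.g., m′(3) = 8320 rows versus m(3) = 72).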
To test the effect of varying the amount of data, i.e., the size of A_N, we generated data based on two networks with n = 6: one with high Φ_3.0 and one with low Φ_3.0. Evident in the results is that all heuristics on generated data were affected by the number of timesteps (i.e., the size of A_N), indicating that the measures were dependent on the amount of data they were calculated on.
However, as we forced each generated time series to have the same size, the networks with fewer elements generated longer time series, i.e., fewer columns required more rows. As this could potentially confound the observed results, we reanalyzed Spearman's r_s between the heuristics on the observed data and Φ_3.0 for each network size class separately. Except for the heuristic SI, which increased relative to the results presented in the main text, the other measures were less affected (see Table A1).
Table A1: n denotes the size of the network in number of nodes.

Appendix A.2. Φ* Post-hoc Analysis
To investigate the distribution of Φ* relative to Φ_3.0^peak after removing a cluster of high Φ* and low Φ_3.0^peak values, we removed the outliers above two standard deviations of the mean. This did not improve the results drastically, as the bulk of observations lay within a narrow band of low Φ* values (see Figure A2).

Appendix A.3. Post-hoc Analysis of Networks not Totally Reducible or Reducible to Circular Systems
Systems that are completely reducible (Φ_3.0^peak = 0) or reducible to a circular or ring-shaped (sub)network (Φ_3.0^peak = 1) might not be representative for candidate heuristics, as these networks can be considered "special" cases in terms of IIT 3.0. The absolute differences in correlation values can be seen in Table A2, and the corresponding scatter plots of some select measures are shown in Figure A3. Note that we have here included the bipartitioning versus all-partitioning comparison (AP) (see Appendix A.7). Most measures dropped in correlational value, while those that increased were low to begin with. Only measures A to D had r > 0.8, while measures H and L stayed close to r_s = 0.7. The other measures had r_s < 0.65. This suggests that the reported correlational values for most heuristics (F to N) were primarily driven by a cluster of non- or trivially integrated networks (Φ_3.0^peak = 0 or 1). For measures F to N, Spearman's rank-order correlation was used; Pearson's correlation otherwise.


Appendix A.4. Estimated Computational Demands
To estimate the computational demands, seven networks of each n ∈ {3, 4, 5, 6} were randomly generated, with p(W_ij = 1) ∈ {0.7, 0.8, 0.9, 1.0} and p(W_ij = −1) ∈ {0.3, 0.4, 0.5}. The average times were recorded for each measure and then fitted, via log-log regression, to a power law of the form time = b·n^x, where b is a constant, n is the system size in nodes, and x is the reported exponent. In essence, x > 1 indicates a super-linear (faster than linear) increase, while x < 1 indicates a sub-linear increase. The reported exponents, especially for the measures of Φ, were likely underestimates. Moreover, these estimates were highly dependent on the underlying computational power, parallelization, efficiency of the algorithmic implementation, and the use of shortcuts; as such, the estimated computational demands are a guide at best. Here, we used a 32 GB, 16-core machine (Intel Xeon E5-1660 v4 @ 3.20 GHz, 20480 KB cache), parallelized at the level of states for Φ_2.0/Φ_2.5/Φ_3.0, at the level of partitions (MIP search) for Φ*, SI, and MI1, and non-parallelized for LZ, S, D1, and D2. See Table A3 for the average time taken to compute the measures (in seconds) for each network size and the fitted power-law regression, and Figure A4 for an overview of the relationship between computational time and correlation with Φ_3.0^peak. Note that we have here included the all-partitioning (AP) "approximation" (see Appendix A.7).

Figure A4. Overview of the computational times recorded for each measure (Φ_3.0^peak marked red) against correlation (r/r_s) with Φ_3.0^peak. The y-axis corresponds to the exponent x of the power-law fit of the times (in seconds) for networks of size n = 3, 4, 5, 6, in the form y = b·n^x, where b is a constant.

Appendix A.5. Φ_3.0^mean versus Φ_3.0^peak
We reran the analysis replacing Φ_3.0^peak with Φ_3.0^mean (similarly for measures A–D, F, G, O) to test whether this affected the overall results. Analysis and statistical comparisons were performed as in Sections 2.3 and 2.4. All the approximations and heuristics were negligibly affected, suggesting that for small networks of n ∈ {3, 4, 5, 6}, the mean and peak state-dependent Φ_3.0 were estimated with similar accuracy (Table A4). Note that we have here included the bipartitioning versus all-partitioning (AP) comparison (see Appendix A.7).

Appendix A.7. Bipartitioning versus All-Partitioning
[...] nor how it would be affected by different rules for cutting. While IIT 3.0 is defined using bipartitioning (BP), a criticism against the theory is that one could use tripartitioning, or more, and that BP should be considered an approximation in its own right with respect to more extensive partitioning schemes. As such, we tested the default BP against all possible partitions (AP) [6] to investigate how well they corresponded (on networks with n ∈ {3, 4, 5}). While a superset of BP should result in less than or equal Φ_3.0, owing to the usually increased information loss with an increased number of partitions, the way AP is implemented in PyPhi v1.0 [6] requires that any partition includes at least one mechanism element. As such, AP is not a superset of BP, but the results might be informative in terms of other, more expedient partitioning schemes based on other requirements for permissible cuts. Statistical comparisons were performed as defined in Section 2.4. Bipartitioning was a very strong linear predictor of all partitioning, both with S.I. Φ
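The power-law fit used in Appendix A.4 can be sketched as an ordinary least-squares regression in log-log space. The code below uses synthetic timings with a known exponent, not the recorded data, and the function name is our own.

```python
# Sketch: estimate (b, x) in time = b * n**x by linear regression of
# log(time) on log(n), since log(time) = log(b) + x * log(n).
from math import exp, log

def fit_power_law(ns, times):
    """Least-squares fit of time = b * n**x; returns (b, x)."""
    xs = [log(n) for n in ns]
    ys = [log(t) for t in times]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((u - mx) * (v - my) for u, v in zip(xs, ys))
             / sum((u - mx) ** 2 for u in xs))
    intercept = my - slope * mx
    return exp(intercept), slope

# Synthetic timings lying exactly on a power law with b = 2, x = 3:
ns = [3, 4, 5, 6]
times = [2.0 * n ** 3 for n in ns]
b, x = fit_power_law(ns, times)  # recovers b ≈ 2.0, x ≈ 3.0
```

On real, noisy timings the same fit yields the least-squares exponent x reported in Table A3; an exponent near 3 would then indicate roughly cubic growth of computation time with network size.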