Neural Information Squeezer for Causal Emergence

Conventional studies of causal emergence have revealed that stronger causality can be obtained at the macro-level than at the micro-level of the same Markovian dynamical system if an appropriate coarse-graining strategy is applied to the micro-states. However, identifying this emergent causality from data remains a difficult, unsolved problem because the appropriate coarse-graining strategy cannot be found easily. This paper proposes a general machine learning framework called Neural Information Squeezer that automatically extracts an effective coarse-graining strategy and the macro-level dynamics, and identifies causal emergence directly from time series data. By using invertible neural networks, we can decompose any coarse-graining strategy into two separate procedures: information conversion and information discarding. In this way, we can not only exactly control the width of the information channel, but also derive several important properties analytically. We also show how our framework extracts coarse-graining functions and dynamics on different levels, and identifies causal emergence from data, on several example systems.


Introduction
Emergence, one of the most important concepts in complex systems, describes the phenomenon that some overall properties of a system cannot be reduced to its parts [1,2]. Causality, another significant concept, characterizes the connection between cause and effect events over time [3,4] in a dynamical system. As pointed out by Hoel et al. [5,6], causality can be emergent, meaning that events at the macro level of a system may have stronger causal connections than those at the micro level, where the strength of causality is measured by effective information (EI) [5,7]. This theoretical framework of causal emergence provides a new way to understand emergence and other important concepts quantitatively [8-10].
Although many concrete examples of causal emergence across different temporal and spatial scales have been shown in [5], a method to identify causal emergence merely from data is still lacking [10]. One of the difficulties is how to search, in a systematic way, all possible coarse-graining strategies (functions, mappings) on which causal emergence can be exhibited [10]. On a networked complex system, a coarse-graining strategy includes the way of grouping nodes and the method of mapping the micro-states within a group to a macro-state [11]. Existing methods solve the problem by fixing the mapping function of states and searching all grouping methods with heuristic optimization algorithms [5,11]. However, there is no reason why some strategies of state coarse-graining should be preferred over others. Therefore, we should search the space of all possible coarse-graining strategies so that the most informative dynamics can be identified. Nevertheless, two difficulties must be confronted: the overly large search space, and the unavoidability of trivial mappings between micro- and macro-states.
To illustrate the latter, consider a coarse-graining method that maps all micro-states to one identical value as the macro-state. In this case, the macroscopic dynamics is only an identity mapping, which will have a large effective information (EI) measure. However, this cannot be called causal emergence, because all the information is eliminated by the coarse-graining method itself. Thus, we must find a way to exclude such trivial strategies.
An alternative way to identify causal emergence, and even other types of emergence, is based on partial information decomposition [10]. Although this method avoids the discussion of coarse-graining strategies, a time-consuming search over subsets of the system's state space is still needed, and the method cannot give the explicit coarse-graining strategy and the corresponding macro-dynamics, which are useful in practice. Furthermore, a shortcoming shared by both methods is that explicit Markov transition matrices for both macro- and micro-dynamics are needed, and the transition probabilities must be estimated from data. As a result, large biases on rare events can hardly be avoided, particularly for continuous data.
On the other hand, machine learning methods empowered by neural networks have developed rapidly in recent years, with many cross-disciplinary applications [12-15]. Equipped with these methods, automated discovery of causal relationships, and even of the dynamics of complex systems, in a data-driven way becomes possible [16-23]. Machine learning and neural networks can also help us find good coarse-graining strategies [24-28]. If we treat a coarse-graining mapping as a function from micro-states to macro-states, then we can approximate this function with a parameterized neural network. For example, [27] and [25] used normalizing flow models equipped with invertible neural networks to learn how to renormalize a multi-dimensional field (a quantum field, images, or joint probability distributions), and how to generate the field from Gaussian noise. Therefore, both the coarse-graining strategy and the generative model can be learned from data automatically.
These techniques can also help us reveal macro-level causality from data. Causal representation learning aims to use unsupervised representation learning to extract the causal latent variables behind observational data [29,30]. The encoding process from the original data to the latent causal variables can be understood as a kind of coarse-graining. This shows the similarity between causal emergence identification and causal representation learning; however, their basic objectives differ. Causal representation learning aims to extract the causality hidden in data, whereas causal emergence identification aims to find a good coarse-graining strategy to reduce the given micro-level dynamics. Furthermore, introducing multiscale modeling and coarse-graining operations into causal models raises new theoretical problems [31-33]. For example, [32,33] discuss the basic requirements of model abstraction (coarse-graining). However, these studies only consider static random variables and structural causal models, not Markovian dynamics.
In this paper, we formulate the problem of causal emergence identification as maximization of the effective information (EI) of the macro-dynamics under the constraint of precise prediction of the micro-dynamics. We then propose a general machine learning framework called Neural Information Squeezer (NIS) to solve the problem. By using an invertible neural network to model the coarse-graining strategy, we can decompose any mapping from R^p to R^q (q ≤ p) into a series of invertible information conversion processes and information discarding processes. In this way, the framework not only allows us to control information conversion and discarding precisely, but also enables us to analyze the whole framework mathematically. We prove a series of theorems revealing the properties of NIS. Finally, we show how NIS learns effective coarse-graining strategies and macro-state dynamics numerically on a set of examples.

Basic Notions and Problems Formulation
First, we formulate our problems under a general setting, and lay out our framework for solving them.

Background
Suppose the dynamics of the complex system under consideration can be described by a set of differential equations:

dx(t)/dt = g(x(t), ξ),    (1)

where x(t) ∈ R^p is the state of the system, p ∈ Z^+ is a positive integer, and ξ is a random noise variable. Normally, the micro-dynamics g is Markovian, which means it can equivalently be modeled as a conditional probability Pr(x(t + dt)|x(t)). However, we cannot directly observe the evolution of the system, only discrete samples of its states, which we define as micro-states.

Definition 1 (Micro-state): Each sample x_t of the state of the dynamical system (Equation 1) is called a micro-state at time step t. The multivariate time series x_1, x_2, ..., x_T, sampled with equal intervals over a finite number of time steps T, forms a micro-state time series.
We always want to reconstruct g from the observable micro-states. However, an informative dynamical mechanism g with strong causal connections is hard to reconstruct from the micro-states when the noise ξ is strong. Instead, we can ignore some information in the micro-state data and convert it into a macro-state time series. In this way, we may reconstruct a macro-dynamics with stronger causality to describe the evolution of the system. This is the basic idea behind causal emergence [5,6]. We formalize the information-ignoring process as a coarse-graining strategy (or mapping, method).

Definition 2 (q-dimensional coarse-graining strategy): Suppose the dimension of the macro-states is 0 < q < p ∈ Z^+. A q-dimensional coarse-graining strategy is a function that maps a micro-state x_t ∈ R^p to a macro-state y_t ∈ R^q. The coarse-graining is denoted φ_q.
After coarse-graining, we obtain a new time series of macro-states, y_1 = φ_q(x_1), y_2 = φ_q(x_2), ..., y_T = φ_q(x_T). We then try to find another dynamical model (or Markov chain) f_{φ_q} to describe the evolution of y_t:

dy/dt = f_{φ_q}(y, ξ′),    (2)

where ξ′ is the noise in the macro-state dynamics.

Definition 3 (Macro-state dynamics): A macro-state dynamics is a set of differential equations (Equation 2) whose solution y(t) is as close to the macro-states y_t as possible; that is, we try to minimize ||y_t − y(t)|| for all t = 1, 2, ..., T, where ||·|| is any vector norm.
However, this formulation cannot reject some trivial strategies. For example, suppose a q = 1 dimensional φ_q is defined as φ_q(x_t) = 1 for all x_t ∈ R^p. Then the corresponding macro-dynamics is simply dy/dt = 0 with y(0) = 1. This is meaningless, because the macro-dynamics is trivial and the coarse-graining mapping is too arbitrary. Therefore, we must place restrictions on coarse-graining strategies and macro-dynamics so that such trivial cases are avoided.

Effective Coarse-graining Strategy and Macro-dynamics
We define an effective coarse-graining strategy as a compressive map such that the macro-states preserve as much information of the micro-states as possible. Formally:

Definition 4 (Effective q coarse-graining strategy and macro-dynamics): A q coarse-graining strategy φ_q: R^p → R^q is effective if there exists a function φ†_q: R^q → R^p such that, for a given small real number ε,

||φ†_q(y(t)) − x_t|| < ε,    (3)

for all t = 1, 2, ..., T, and the derived macro-dynamics f_{φ_q} is also effective, where y(t) is the solution of Equation 2, that is,

y(t) = φ_q(x_1) + ∫_1^t f_{φ_q}(y(τ), ξ′) dτ,    (4)

for all t = 1, 2, ..., T. That is, we can reconstruct the micro-state time series through φ†_q, so that the macro-state variables contain as much of the information of the micro-states as they can.
Notice that this definition is consistent with approximate causal model abstraction [34].

Problem Formulation
Our final objective is to find the most informative macro-dynamics. Therefore, we need to optimize the coarse-graining strategy and the macro-dynamics among all possible effective strategies and dynamics. Our problem can thus be formulated as:

max_{φ_q, f_{φ_q}, φ†_q, q} I(f_{φ_q}),    (5)

under the constraints of Equations 3 and 4, where I is a measure of effective information; it could be EI, Eff, or the dimension-averaged EI (denoted dEI), which is mainly used in this paper and will be introduced in Section 2.3.3. Here φ_q is an effective coarse-graining strategy, and f_{φ_q} is an effective macro-dynamics.

Methods
The problem (Equations 5 and 3) is hard to solve because the objects to be optimized are functions φ_q, f_{φ_q}, and φ†_q, not numbers. Thus, we use neural networks to parameterize these functions and convert the functional optimization problem into a parameter optimization problem.

Neural Information Squeezer Model
We propose a new machine learning framework called Neural Information Squeezer (NIS), based on invertible neural networks, to solve the problem (Equation 5). NIS is composed of three components: an encoder, a dynamics learner, and a decoder. They are represented by the neural networks ψ_α, f_β, and ψ_α^{-1}, with parameters α, β, and (shared) α, respectively. The entire framework is shown in Figure 1. Next, we describe each module separately.

Encoder
Note that ψ_α is an invertible neural network (INN); therefore ψ_α and ψ_α^{-1} share the parameters α. However, because an invertible function has no information loss, we must introduce a new operator, the projection.

Definition 5 (Projection operator): A projection operator χ_{p,q} is a function from R^p to R^q such that

χ_{p,q}(x_q ⊕ x_{p−q}) = x_q,

where ⊕ is the operation of vector concatenation, x_q ∈ R^q, and x_{p−q} ∈ R^{p−q}. Sometimes we abbreviate χ_{p,q} as χ_q if there is no ambiguity.
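As a minimal sketch (the function names are ours, not from the paper), the projection operator simply keeps the first q components of a vector and discards the rest:

```python
import numpy as np

def chi(x, q):
    """Projection operator chi_{p,q}: R^p -> R^q.
    Keeps the first q components x_q and discards the remaining p - q."""
    return x[:q]

x = np.array([1.0, 2.0, 3.0])  # a vector in R^3
y = chi(x, 2)                  # its projection onto R^2
```

The discarded tail x_{p−q} is exactly the "information discarding" half of the encoder; the invertible map supplies the "information conversion" half.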
Thus, the encoder φ_q maps the micro-state x_t to the macro-state y_t, and this mapping can be separated into two steps:

φ_q = χ_q ∘ ψ_α,

where ∘ represents function composition. The first step is a bijective (invertible) mapping ψ_α: R^p → R^p from x_t ∈ R^p to x′_t ∈ R^p without information loss, realized by an invertible neural network; the second step projects the resulting vector to q dimensions, mapping x′_t ∈ R^p to y_t ∈ R^q by discarding the information in the remaining p − q dimensions. There are several ways to realize an invertible neural network [35,36]. We select the RealNVP module [37], as shown in Figure 2, to concretely implement the invertible computation.
In the module, the input vector x is separated into two parts; both parts are scaled, translated, and merged again. The magnitudes of the scaling and translation operations are computed by the corresponding feed-forward neural networks: s_1 and s_2 are neural networks with shared parameters for scaling, ⊗ represents the element-wise product, and t_1 and t_2 are neural networks with shared parameters for translation. In this way, an invertible computation from x to y is realized. The same module can be repeated multiple times (three times in this paper) to realize complex invertible computations; the details are given in Appendix A.
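A single RealNVP affine coupling step can be sketched as follows. The toy s and t functions below stand in for the feed-forward networks of Figure 2 (they are our placeholders, not the paper's architecture); the point is that the map is invertible by construction, whatever s and t are:

```python
import numpy as np

def coupling_forward(x, s, t):
    """One RealNVP affine coupling layer: the first half of the vector
    conditions the scaling and translation of the second half."""
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    y1 = x1                              # first half passes through unchanged
    y2 = x2 * np.exp(s(x1)) + t(x1)      # element-wise scale and translate
    return np.concatenate([y1, y2])

def coupling_inverse(y, s, t):
    """Exact inverse: undo the translation, then the scaling."""
    d = len(y) // 2
    y1, y2 = y[:d], y[d:]
    x1 = y1
    x2 = (y2 - t(y1)) * np.exp(-s(y1))
    return np.concatenate([x1, x2])

# Toy stand-ins for the scaling and translation networks.
s = lambda u: np.tanh(u)
t = lambda u: 0.5 * u

x = np.array([0.3, -1.2, 0.7, 2.0])
y = coupling_forward(x, s, t)
x_rec = coupling_inverse(y, s, t)   # recovers x exactly
```

Stacking several such layers (with the roles of the two halves alternating) yields the full bijector ψ_α used by the encoder.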
The reasons why we use an invertible neural network are: 1) an INN reduces the complexity of the model by reusing the structure and parameters of the encoder in the decoder, since we can simply reverse the running direction of the encoder to implement decoding; 2) the encoder equipped with an INN can separate the information conversion process from the information discarding process; 3) it enables mathematical analysis of the whole framework, and several theorems reflecting its basic properties can be proved.

Decoder
The decoder converts the predicted macro-state at the next time step, y(t + 1), into the predicted micro-state at the next time step, x̂_{t+1}. In our framework, because the coarse-graining strategy φ_q can be decomposed into a bijector ψ_α and a projector χ_q, we can simply reverse ψ_α to obtain ψ_α^{-1} as the decoder. However, because the dimension of the macro-state is q and the input dimension of ψ_α^{-1} is p > q, we need to fill the remaining p − q dimensions with a (p − q)-dimensional Gaussian random vector. That is, for any φ_q, the decoding mapping can be defined as

φ†_q = ψ_α^{-1} ∘ χ†_q,

where ψ_α^{-1} is the inverse function of ψ_α, and χ†_q: R^q → R^p is defined for any x_q ∈ R^q as

χ†_q(x_q) = x_q ⊕ z_{p−q},

where z_{p−q} ∼ N(0, I_{p−q}) is a (p − q)-dimensional random Gaussian vector and I_{p−q} is the identity matrix of the same dimension. That is, we generate a micro-state by concatenating x_q with a random sample z_{p−q} from a (p − q)-dimensional standard normal distribution.
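A minimal sketch of the decoding mapping φ†_q = ψ_α^{-1} ∘ χ†_q, using a toy rotation as the bijector (our hypothetical stand-in for the trained INN):

```python
import numpy as np

rng = np.random.default_rng(0)

def chi_dagger(y, p):
    """chi_q^dagger: pad the q-dim macro-state with a (p-q)-dim
    sample from a standard normal distribution."""
    z = rng.standard_normal(p - len(y))
    return np.concatenate([y, z])

# Toy bijector psi: a fixed rotation of R^2; its inverse is the transpose.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
psi_inv = lambda v: R.T @ v

y_next = np.array([0.8])                    # predicted macro-state in R^1
x_hat = psi_inv(chi_dagger(y_next, p=2))    # decoded micro-state in R^2
```

Because the padding z_{p−q} is random, repeated decodings of the same macro-state sample different micro-states, which is why the decoder acts as a generative model of Pr(x̂_{t+1}|y(t+1)).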
From the point of view of [25,27], the decoder can be regarded as a generative model of the conditional probability Pr(x̂_{t+1}|y(t + 1)), while the encoder performs a renormalization process.

Dynamics Learner
The dynamics learner f_β is a common feed-forward neural network with parameters β; it learns the effective Markov dynamics at the macro level. Concretely, we first use f_β to replace f_{φ_q} in Equation 2; second, we use the Euler method with dt = 1 to solve Equation 2; and supposing the noise is additive Gaussian (or Laplacian) [38], we can reduce Equation 4 to

y(t + 1) = y_t + f_β(y_t) + ξ′,  ξ′ ∼ N(0, Σ),

where Σ = diag(σ_1², σ_2², ..., σ_q²) is the covariance matrix and σ_i is the standard deviation in the i-th dimension, which can be learned or fixed. Thus, the transition probability of this dynamics can be written as

Pr(y(t + 1)|y_t) = D(µ(y_t), Σ),

where D represents the density of the Gaussian (or Laplace) distribution and µ(y_t) ≡ y_t + f_β(y_t) is the mean vector of the distribution.
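A sketch of one Euler step (dt = 1) of the learned macro-dynamics, with a toy linear map standing in for the feed-forward network f_β:

```python
import numpy as np

def macro_step(y_t, f_beta, sigma=0.0, rng=None):
    """One Euler step of the macro-dynamics:
    y(t+1) = y_t + f_beta(y_t) + xi, with optional additive Gaussian noise."""
    mu = y_t + f_beta(y_t)   # mean vector of the transition distribution
    if sigma > 0:
        mu = mu + sigma * rng.standard_normal(len(y_t))
    return mu

# Toy dynamics learner (a stand-in for the trained network).
f_beta = lambda y: -0.1 * y

y_next = macro_step(np.array([1.0, 0.5]), f_beta)
```

With sigma = 0 the step is deterministic and returns the mean µ(y_t) = y_t + f_β(y_t) directly.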
By training the dynamics learner in an end-to-end manner, we can avoid estimating the Markov transition probabilities from the data, reducing bias, because neural networks typically fit data and generalize to unseen cases much better.

Two stage optimization
Although the functions to be optimized have been parameterized by neural networks, Equation 5 is still hard to optimize directly, because the objective function and the constraint must be considered together, and q, as a hyper-parameter, affects the structure of the neural networks. Thus, in this paper we propose a two-stage optimization method. In the first stage, we fix the hyper-parameter q and minimize the difference between the predicted micro-state and the observed data, ||φ†_q(y(t)) − x_t|| (Equation 3), so that the coarse-graining strategy φ_q and the macro-dynamics f_{φ_q} become effective. Then, we search over all possible values of q to find the optimal one that maximizes I.

Stage 1: training a predictor
In the first stage, we use likelihood maximization and stochastic gradient descent to obtain an effective q coarse-graining strategy and an effective predictor of the macro-state dynamics. The objective function is defined as the likelihood of the micro-state prediction.
A feed-forward neural network can be understood as a machine that models a conditional probability with a Gaussian or Laplacian distribution [38]. Thus, the entire NIS framework can be understood as a model of the conditional probability

P(x̂_{t+1}|x_t) = D(µ(x_t), Σ),

with the output x̂_{t+1} being the mean value, and the objective function (Equation 13) is the log-likelihood (or cross-entropy) of the observed data under the given form of the distribution, where Σ is the covariance matrix, which is always taken to be diagonal, and whose magnitude can be calculated as the mean squared error for l = 2 or the mean absolute error for l = 1.
Substituting the concrete form of the Gaussian or Laplacian distribution into the conditional probability, we see that maximizing the log-likelihood is equivalent to minimizing the l-norm objective function

L = Σ_{t=1}^{T−1} ||x̂_{t+1} − x_{t+1}||_l,    (13)

where l = 1 or 2.
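The equivalence can be checked numerically: for a fixed-variance Gaussian, the log-likelihood of an observation is a constant minus the scaled squared error, so the mean with the smaller l = 2 error always has the higher likelihood (the numbers below are illustrative):

```python
import numpy as np

def gauss_loglik(x, mu, sigma):
    """Log-density of x under N(mu, sigma^2 I): a constant term minus
    the squared error scaled by 1/(2 sigma^2)."""
    d = len(x)
    return (-0.5 * d * np.log(2 * np.pi * sigma**2)
            - np.sum((x - mu) ** 2) / (2 * sigma**2))

x = np.array([1.0, 2.0])
ll_near = gauss_loglik(x, mu=np.array([1.1, 2.1]), sigma=1.0)  # small error
ll_far  = gauss_loglik(x, mu=np.array([3.0, 0.0]), sigma=1.0)  # large error
# ll_near > ll_far: maximizing likelihood == minimizing the l=2 objective.
```

The same argument with a Laplace density gives the l = 1 case.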
We can then use stochastic gradient descent to optimize Equation 13.

Stage 2: search for the optimal scale
After a large number of training epochs in the first stage, we obtain an effective q coarse-graining strategy and macro-state dynamics, but the results depend on q.
To select the optimal q, we compare the effective information measure I of the coarse-grained macro-dynamics for different q. Because the parameter q is one-dimensional and its value range is limited (0 < q < p), we can simply iterate over all q to find the optimal q* and the corresponding optimal effective strategy.
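The two-stage procedure can be sketched as a simple search loop; `train_nis` and `effective_info` below are placeholders for stage-1 training and the EI measure (the toy stand-ins make EI peak at q = 2 purely for illustration):

```python
import numpy as np

def search_optimal_scale(train_nis, effective_info, p):
    """Stage 2: iterate over all candidate macro-dimensions 0 < q < p,
    run stage-1 training at each q, and keep the q maximizing the EI measure."""
    best_q, best_I = None, -np.inf
    for q in range(1, p):
        model = train_nis(q)          # stage-1 training at this scale
        I_q = effective_info(model)   # e.g. dimension-averaged EI
        if I_q > best_I:
            best_q, best_I = q, I_q
    return best_q, best_I

# Toy stand-ins: "training" just returns q; the EI surrogate peaks at q = 2.
best_q, best_I = search_optimal_scale(lambda q: q,
                                      lambda q: -(q - 2) ** 2, p=5)
```

Because q ranges over at most p − 1 integers, exhaustive iteration is cheap relative to the training inside each step.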

About Effective Information
In the second stage, to compare coarse-graining strategies and macro-dynamics, we need to compute an important indicator: effective information (EI). However, the conventional computations of EI in most previous works are for discrete Markov dynamics [5,6], and we confront difficulties when applying EI to continuous dynamics [9].
First, the conventional methods of mutual information computation for discrete variables cannot be used here; new methods for continuous variables and mappings, especially in high-dimensional spaces, must be devised. To solve this problem, we treat the mapping of the dynamics learner neural network as a conditional Gaussian distribution, and then calculate EI for this Gaussian distribution. Concretely, we have the following theorem:

Theorem 1 (EI of a neural network): Suppose the input of a neural network is X = (x_1, x_2, ..., x_n) ∈ [−L, L]^n, which means X is defined on a hyper-cube of size L, where L is a very large integer. The output is Y = (y_1, y_2, ..., y_m), with Y = µ(X). Here µ is the deterministic mapping implemented by the neural network, µ: R^n → R^m, with Jacobian matrix ∂_X µ(X) at X. If the neural network can be regarded as a Gaussian distribution conditional on a given X,

Pr(Y|X) = N(µ(X), Σ),

where Σ = diag(σ_1², σ_2², ..., σ_m²) is the covariance matrix and σ_i is the standard deviation of the output y_i, which can be estimated by the mean squared error of y_i, then the effective information (EI) of the neural network can be calculated as follows:

(i) If there exists an X such that det(∂_X µ(X)) ≠ 0, then the effective information (EI) can be calculated as

EI ≈ n ln(2L) − (m/2)(1 + ln 2π) − Σ_{i=1}^m ln σ_i + E_{X∼U([−L,L]^n)} [ln |det(∂_X µ(X))|],

where U([−L,L]^n) is the uniform distribution on [−L, L]^n, |·| is the absolute value, and det is the determinant.
(ii) If det(∂_X µ(X)) ≡ 0 for all X, then EI ≈ 0.

Although Theorem 1 solves the problem of computing EI for continuous variables and functions, two new problems must be confronted: 1) EI is easily affected by the output dimension m, which hampers the comparison of EI across dynamics of different dimensions; and 2) EI depends on L, and diverges as L becomes very large.
To solve the first problem, we define a new indicator called the dimension-averaged effective information, or effective information per dimension. Formally:

Definition 6 (Dimension-Averaged Effective Information (dEI)): For a dynamics f with an n-dimensional state space, the dimension-averaged effective information is defined as

dEI(f) = EI(f)/n.

Therefore, if the dynamics f is continuous and can be regarded as a conditional Gaussian distribution, then according to Theorem 1 the dimension-averaged EI can be calculated as (with m = n)

dEI = ln(2L) − (1 + ln 2π)/2 − (1/n) Σ_{i=1}^n ln σ_i + (1/n) E_{X∼U([−L,L]^n)} [ln |det(∂_X µ(X))|].    (17)

All the standalone terms involving the dimension n in Equation 17 are eliminated. However, L still appears in the equation, which may cause divergence as L becomes very large.
Therefore, to solve this problem, we calculate the dimension-averaged causal emergence (dCE) to eliminate the influence of L.

Definition 7 (Dimension-averaged causal emergence (dCE)): For macro-dynamics f_M with dimension n_M and micro-dynamics f_m with dimension n_m, the dimension-averaged causal emergence is defined as

dCE = dEI(f_M) − dEI(f_m).

Thus, if the dynamics f_M and f_m are continuous and can be regarded as conditional Gaussian distributions, then according to Definition 7 and Equation 17, the dimension-averaged causal emergence can be calculated as

dCE = (1/n_M) ( E_Y [ln |det(∂_Y µ_M(Y))|] − Σ_{i=1}^{n_M} ln σ_{M,i} ) − (1/n_m) ( E_X [ln |det(∂_X µ_m(X))|] − Σ_{i=1}^{n_m} ln σ_{m,i} ),    (19)

because the ln(2L) and constant terms are identical at both levels and cancel. Therefore, all effects of the dimension n and of L are eliminated in Equation 19, and the result is influenced only by the relative magnitudes of the variances and the logarithms of the determinants of the Jacobian matrices. In the following numerical computations, we mainly use Equation 19. The reason we do not use Eff is also that it contains L.
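Assuming the conditional-Gaussian form of Theorem 1, the L-dependent terms are shared by the macro and micro levels and cancel, leaving only the Jacobian log-determinants and the noise scales. A sketch of the resulting dCE computation (variable names and the illustrative numbers are ours):

```python
import numpy as np

def dEI_terms(log_det_jac_samples, sigmas):
    """The L-independent part of the dimension-averaged EI for a Gaussian
    dynamic: (E ln|det J| - sum_i ln sigma_i) / n. The ln(2L) and constant
    terms are identical at both levels and cancel in dCE."""
    n = len(sigmas)
    return (np.mean(log_det_jac_samples) - np.sum(np.log(sigmas))) / n

def dCE(logdet_M, sigmas_M, logdet_m, sigmas_m):
    """Dimension-averaged causal emergence: dEI(macro) - dEI(micro)."""
    return dEI_terms(logdet_M, sigmas_M) - dEI_terms(logdet_m, sigmas_m)

# Hypothetical numbers: a 1-D macro-dynamic with small noise (sigma = 0.1)
# versus a 2-D micro-dynamic with large noise (sigma = 1.0).
# A positive value indicates causal emergence.
value = dCE(logdet_M=[0.0], sigmas_M=[0.1],
            logdet_m=[0.0, 0.0], sigmas_m=[1.0, 1.0])
```

In practice, the log-determinant samples would come from Monte Carlo evaluations of the Jacobian of the trained dynamics learner over the input hyper-cube, and the sigmas from the prediction errors per dimension.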

Results
In this section, we first lay out several theoretical properties of NIS, and then apply it to some numerical examples.

Theoretical Analysis
To understand why the Neural Information Squeezer framework can find the most informative macro-dynamics, and how the effective strategy and dynamics change with q, we first lay out some major theoretical results via mathematical analysis. Note that although all of the theorems are about mutual information, the conclusions are also applicable to effective information, because all the theoretical results are independent of the distribution of the input data. First, we note that the framework (Figure 1) can be regarded as an information channel, as shown in Figure 3, and due to the projection operation, the channel is squeezed in the middle. We therefore call it a squeezed information channel (see Appendix B for the formal definition).
As proved in Appendix B, we have a theorem for the squeezed information channel:

Theorem 2 (Information bottleneck of the Information Squeezer): For the squeezed information channel shown in Figure 3, and for any bijector ψ, projector χ_q, macro-dynamics f, and random noise z_{p−q} ∼ N(0, I_{p−q}), we have

I(y_t; y(t + 1)) = I(x_t; x̂_{t+1}),

where x̂_{t+1} is the prediction of NIS, and y(t + 1) follows Equation 2.
That is, for any neural network implementing the general framework of Figure 3, the mutual information of the macro-dynamics f_{φ_q} is identical to that of the entire dynamical model, i.e., the mapping from x_t to x̂_{t+1}, at any time. Theorem 2 is fundamental for NIS. In effect, the macro-dynamics f is the information bottleneck of the entire channel [39].

What happens during training
With Theorem 2, we can understand what happens when the neural squeezer framework is trained by data in an intuitive way.
First, as the neural networks are trained, the output of the entire framework, x̂_{t+1}, approaches the real data x_{t+1} for any given x_t, and so does the mutual information. This is the following theorem:

Theorem 3 (Mutual information of a well-trained model approaches that of the data): If the neural networks in the NIS framework are well trained, then

I(x_t; x̂_{t+1}) ≈ I(x_t; x_{t+1}).

The proof is in Appendix C. Second, we suppose that the mutual information I(x_t; x_{t+1}) is always large, because the micro-state time series x_t contains information; otherwise, we would not be interested in {x_t}. Therefore, as the neural network is trained, I(x_t; x̂_{t+1}) increases toward I(x_t; x_{t+1}).
Third, according to Theorem 2, I(y_t; y(t + 1)) = I(x_t; x̂_{t+1}) will also increase toward I(x_t; x_{t+1}).
Because the macro-dynamics is the information bottleneck of the entire channel, its information must increase during training. At the same time, the determinant of the Jacobian of ψ_α and the entropy of y_t will, in general, also increase. This conclusion is implied by Theorem 4.
Theorem 4 (Information at the bottleneck lower-bounds the encoder): For the squeezed information channel shown in Figure 3, the determinant of the Jacobian matrix of ψ_α and the Shannon entropy of y_t are bounded below by the information of the entire channel:

H(y_t) ≥ I(x_t; x̂_{t+1}),
E_{x_t} [ln |det(J_{ψ_α,y_t}(x_t))|] ≥ I(x_t; x̂_{t+1}) − H(x_t),

where H is the Shannon entropy measure, J_{ψ_α}(x_t) is the Jacobian matrix of the bijector ψ_α at the input x_t, and J_{ψ_α,y_t}(x_t) is the sub-matrix of J_{ψ_α}(x_t) corresponding to the projection y_t of x′_t.
The proof is also given in Appendix D.
Because the distribution of x_t, and hence its Shannon entropy, is given, Theorem 4 states that the expectation of the logarithm of |det(J_{ψ_α}(x_t))| and the entropy of y_t must be at least as large as the information of the entire information channel.
Therefore, if the initial values of E[ln |det(J_{ψ_α}(x_t))|] and H(y_t) are small, then as the model is trained and the mutual information of the entire channel increases, the determinant of the Jacobian must also increase, and the distribution of the macro-state y_t must become more dispersed. These changes may not occur if I(x_t; x̂_{t+1}) is already close to I(x_t; x_{t+1}), or if E[ln |det(J_{ψ_α}(x_t))|] and H(y_t) are already large enough.

The Effective Information is mainly determined by the Bijector
The preceding analysis concerns mutual information, not the effective information of the macro-dynamics, which is the key ingredient of causal emergence. In fact, thanks to the good properties of the squeezed information channel, we can write down an expression for the EI of the macro-dynamics without knowing its explicit form. Accordingly, we find that the major factor determining causal emergence is the bijector ψ_α.
Theorem 5 (Mathematical expression for the effective information of the macro-dynamics): Suppose the probability density of x_{t+1} given x_t can be described by a function Pr(x_{t+1}|x_t) ≡ G(x_{t+1}, x_t), and the Neural Information Squeezer framework is well trained; then the effective information of the macro-dynamics f_β can be expressed in terms of G and ψ_α alone, with σ ≡ [−L, L]^p the integration region for x and x′. The proof, together with the explicit expression, is detailed in Appendix D.1.

Change with the Scale(q)
According to Theorems 2 and 3, we have the following corollary:

Corollary 1 (The mutual information of the macro-dynamics does not change if the model is well trained):

For a well-trained NIS model, the mutual information of the macro-dynamics f_β is independent of all the parameters, including the scale q.
If the neural networks are well trained, the mutual information of the macro-dynamics approaches the information in the data {x_t}. So no matter how small q is (i.e., how large the scale), the mutual information of the macro-dynamics f_β remains constant.
It may seem that the scale q is irrelevant to causal emergence. However, according to Theorem 6, a smaller q forces the encoder to carry more effective information.

Theorem 6 (Narrower is harder): If the dimension of x_t is p, then for 0 < q_1 < q_2 < p,

I(x_t; y_t^{q_1}) ≤ I(x_t; y_t^{q_2}),

where y_t^q denotes the q-dimensional macro-state y_t.
The mutual information in Theorem 6 concerns the encoder, i.e., the micro-state x_t and the macro-state y_t at different dimensions q. The theorem states that as q decreases, the mutual information of the encoder part must also decrease, approaching the information limit I(x_t; x̂_{t+1}) ≈ I(x_t; x_{t+1}). Therefore, as the entire information channel becomes narrower, the encoder must carry more useful, effective information to transfer to the macro-dynamics, and the prediction becomes harder.

Empirical Results
We test our model on several data sets, all generated by simulated dynamical models, including both continuous dynamics and discrete Markovian dynamics.

Spring Oscillator with Measurement Noise
The first experiment to test our model is a simple spring oscillator with the dynamical equations

dz/dt = v,  dv/dt = −z,    (25)

where z and v are the one-dimensional position and velocity of the oscillator, respectively. The state of the system can be represented as x = (z, v). However, we can only observe the state through two sensors with measurement errors. Suppose the observational model is

x̃_1 = x + ζ,  x̃_2 = x − ζ,    (26)

where ζ ∼ N(0, σ) is a random vector following a two-dimensional Gaussian distribution, and σ is the vector of standard deviations for position and velocity. In this example, we can regard the state x as a latent macro-state and the measurements x̃_1, x̃_2 as micro-states. What NIS should do is recover the latent macro-state x from the measurements. According to Equation 26, although noise disturbs the measurement of the state, it can easily be eliminated by adding the measurements on the two channels. Therefore, if NIS can discover a macro-state that is the sum of the two measurements, it can easily obtain the correct dynamics. We sample the data for 10,000 batches (with the Euler method and dt = 1); in each batch, we randomly generate 100 initial states and perform one dynamical step to obtain the state at the next time step. We use these data to train the neural network. For comparison, we also use the same data set to train an ordinary feed-forward neural network with the same number of parameters.
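The data-generation step can be sketched as follows; the two-sensor model x + ζ, x − ζ follows the text's remark that adding the two channels cancels the noise (function names and the uniform initial-state range are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_batch(n, dt=1.0, sigma=0.1):
    """Generate (micro-state, next micro-state) pairs for the spring
    oscillator dz/dt = v, dv/dt = -z, observed through two noisy sensors."""
    x = rng.uniform(-1, 1, size=(n, 2))        # latent states (z, v)
    x_next = np.empty_like(x)
    x_next[:, 0] = x[:, 0] + dt * x[:, 1]      # Euler step for z
    x_next[:, 1] = x[:, 1] - dt * x[:, 0]      # Euler step for v
    # Two noisy sensor readings per state: x + zeta and x - zeta.
    zeta = sigma * rng.standard_normal((n, 2))
    micro = np.concatenate([x + zeta, x - zeta], axis=1)
    zeta2 = sigma * rng.standard_normal((n, 2))
    micro_next = np.concatenate([x_next + zeta2, x_next - zeta2], axis=1)
    return micro, micro_next

micro, micro_next = simulate_batch(100)
# Averaging the two sensor channels cancels the noise exactly,
# recovering the latent macro-state that NIS is expected to find.
recovered = 0.5 * (micro[:, :2] + micro[:, 2:])
```

Under this observation model, the averaged channels satisfy the noiseless Euler dynamics exactly, which is why a two-dimensional macro-state is the natural optimum.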
The results are shown in Figure 4. To test whether NIS can learn the real latent macro-state, we directly plot the predicted against the real latent states. As shown in Figure 4(a), the predicted and real curves collapse together, which means NIS can recover the macro-state in the data even though it is unknown. In comparison, the feed-forward neural network cannot recover the macro-state. We can also check whether NIS learns the dynamics of the macro-states by plotting the derivatives of the states (dz/dt, dv/dt) against the macro-state variables (z, v). If the learned dynamics follows Equation 25, two crossed lines, dz/dt = v and dv/dt = −z, should be observed, as in Figure 4(c). The same pattern cannot be reproduced by the common feed-forward network, as shown in Figure 4(d). We also test the well-trained NIS with multi-step prediction, as shown in Figure 4(e): although the deviations between the predictions and the real data grow over time, the general trends are captured by the NIS model. We further study how the dimension-averaged causal emergence dCE changes with the scale q, measured as the number of effective information channels, on the well-trained NIS model, as shown in Figure 4(f): dCE peaks at q = 2, exactly as in the ground truth.
Further, we use the experimental results to verify the theorems stated in the previous section and the information bottleneck theory [39]. First, we show how the mutual information terms I(x t , x t+1 ), I(y t , y(t + 1)), and I(x t , xt+1 ) change with training epochs for different values of q, as shown in Figures 5(c) and (d). The results show that all the mutual information terms converge, as predicted by Theorems 2 and 3. We also plot the mutual information between x t and y t for different q to test Theorem 6; the results show that this mutual information increases as q increases, as shown in Figure 5(a).
According to the information bottleneck theory [39], the mutual information between the latent variable and the output may increase, while the information between the input and the latent variable should increase in the early stage and then decrease as the training process proceeds. As shown in Figure 5(b), this conclusion is confirmed by the NIS model, where the macro-state y t and the prediction y(t + 1) are both latent variables. Although the same conclusion is obtained, the information bottleneck is reflected much more clearly by the architecture of the NIS model than by general neural networks, because y t and y(t + 1) form the bottleneck and all other irrelevant information is discarded through the variable x t , as shown in Figure 3.
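Tracking these curves requires estimating mutual information from samples. A crude histogram-based estimator, our illustrative stand-in rather than necessarily the estimator used in the experiments, can be sketched as:

```python
import numpy as np

def mi_binned(a, b, bins=16):
    """Histogram-based mutual information estimate between two 1-D samples,
    I(a; b) = sum p(a,b) * log(p(a,b) / (p(a) p(b))) over occupied bins (in nats)."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()           # empirical joint distribution
    pa, pb = p.sum(axis=1), p.sum(axis=0)  # marginals
    nz = p > 0                        # avoid log(0) on empty bins
    return float(np.sum(p[nz] * np.log(p[nz] / np.outer(pa, pb)[nz])))
```

Plugging in, for instance, samples of x t against samples of y t at each logging step produces curves like those in Figure 5; strongly coupled variables give large values, while independent variables give values near zero (up to a small positive bias from binning).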
We also visualize the learned macro-dynamics, as shown in Figure 6(c). The mapping is linear when y t < 0 and almost constant when y t > 0. The learned coarse-graining strategy (Figure 6(b)) shows two clearly separated clusters on the y-axis, which means the macro states are discrete. We found that the two discrete macro states and the mapping between micro and macro states, in which the first seven micro-states are grouped together and separated from the last state, are identical to the example in ref [6]. This means the correct coarse-graining strategy can be discovered by our algorithm automatically, without any prior information. Figure 6(d) plots the mutual information I(x t , y q=3 t ), I(x t , y q=2 t ), and I(x t , xt ) as the number of iterations increases (a moving average has been taken over each group of data to make the trends clearer). Within the specified number of iterations, I(x t , xt ) ≤ I(x t , y q=3 t ) holds, which can be regarded as a verification of Theorem 6. Theorem 2 is also verified in Figure 6(d).

Simple Boolean Network
Our framework works not only on continuous time series and Markov chains, but also on networked systems in which each node follows a discrete micro-level mechanism.
For example, a Boolean network is a typical discrete dynamical system in which each node has two possible states (0 or 1), and the state of each node is affected by the states of the neighboring nodes connected to it. We follow the example in [5]. Figure 7 shows an example Boolean network with 4 nodes, and each node follows the same micro-level mechanism shown in the table of Figure 7. In the table, each entry is the probability of a node's next state conditioned on the state combination of its neighbors. For example, if the current node is A, then the first entry is Pr(x A t+1 = 0 | x C t = 0, x D t = 0) = 0.7, which means that A will take the value 0 with probability 0.7 when the state combination of C and D is 00. By combining all the single-node mechanisms, we can obtain a large Markovian transition matrix over 2^4 = 16 states, which is the complete micro-level mechanism of the whole network.
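The construction of the full 16-state transition matrix from per-node tables can be sketched as follows. Only the entry Pr(A=0 | CD=00) = 0.7 is given in the text; the wiring and all remaining table entries below are illustrative placeholders, since the full table lives in Figure 7.

```python
import itertools
import numpy as np

# Hypothetical wiring and per-node tables (placeholder values except 0.7 for CD=00).
neighbors = {"A": ("C", "D"), "B": ("C", "D"), "C": ("A", "B"), "D": ("A", "B")}
# p0[node][(s1, s2)] = probability the node is 0 next step given its neighbors' states
p0 = {n: {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.3} for n in neighbors}

nodes = ["A", "B", "C", "D"]
T = np.zeros((16, 16))  # full micro-level Markov transition matrix over 2^4 states
for s in itertools.product([0, 1], repeat=4):       # current micro-state
    cur = dict(zip(nodes, s))
    for t in itertools.product([0, 1], repeat=4):   # next micro-state
        p = 1.0
        for n, nxt in zip(nodes, t):                # nodes update independently
            q0 = p0[n][(cur[neighbors[n][0]], cur[neighbors[n][1]])]
            p *= q0 if nxt == 0 else 1.0 - q0
        T[int("".join(map(str, s)), 2), int("".join(map(str, t)), 2)] = p
```

Because the nodes update independently given the current state, each row of T is a product of per-node probabilities and sums to one.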
We sample one-step state transitions of the entire network for 50,000 batches, each containing 100 different initial conditions sampled uniformly at random from the possible state space, and we feed these data to the NIS model. By systematically searching over different q, we found that the dimension averaged causal emergence peaks at q = 1, as shown in Figure 8(a). Under this condition, we can visualize the coarse-graining strategy in Figure 8(b), where the x-coordinate is the decimal coding of the binary micro-states (e.g., 5 denotes the state 0101) and the y-coordinate represents the codes of the macro-states. The data points can be clearly classified into 4 clusters according to their y-coordinates. This means the NIS network found 4 discrete macro-states, even though the learned states are continuous real numbers. Interestingly, we found that the mapping between the 16 micro-states and the 4 macro-states is identical to the coarse-graining strategy in the example of ref [5]. However, our algorithm was given no prior information: neither the method of grouping the nodes, nor the coarse-graining strategy, nor the dynamics. Finally, Theorems 2 and 6 are verified in this example, as shown in Figures 8(c) and (d).
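The decimal coding used on the x-axis of Figure 8(b) is simply the binary-to-decimal conversion of a micro-state:

```python
def to_decimal(bits):
    """Decimal code of a binary micro-state, e.g. (0, 1, 0, 1) -> 5."""
    return int("".join(str(b) for b in bits), 2)

codes = [to_decimal(s) for s in [(0, 0, 0, 0), (0, 1, 0, 1), (1, 1, 1, 1)]]
```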

Concluding Remarks
In this paper, we propose a novel neural network framework, the Neural Information Squeezer, for discovering the coarse-graining strategy, the macro-level dynamics, and emergent causality in time series data. We first define an effective coarse-graining strategy and macro-dynamics by constraining the coarse-graining strategies to predict the future micro-state within a precision threshold. Then, the problem of identifying causal emergence can be understood as a maximization problem for effective information under this constraint.
We then use an invertible neural network combined with a projection operation to realize the coarse-graining strategy. The use of an invertible neural network not only allows us to reduce the number of parameters by sharing them between the encoder and the decoder, but also facilitates the analysis of the mathematical properties of the whole NIS architecture.
By treating the framework as a squeezed information channel, we can prove four important theorems. The results show that if the causal connection in the data is strong, then as we train the neural networks, the macro-dynamics will become more informative, and during this process the determinant of the Jacobian of the bijector will increase at the same time. We also found a mathematical expression for the effective information of the macro-dynamics without explicit dependence on the macro-dynamics: when the whole framework is well trained, it is determined solely by the bijector and the data. Furthermore, if the framework has been trained for a sufficient time, the mutual information of the macro-dynamics stays constant regardless of the scale q. However, as q decreases, the mutual information, or bandwidth, of the encoder part also decreases and approaches the information limit of the entire channel. Thus, the task becomes harder for the encoder, because more effective information must be encoded and passed to the dynamics learner so that it can make correct predictions of future micro-states with less information. Numerical experiments show that our framework can reconstruct the dynamics on different scales and can discover emergent causality in data on several classic causal emergence examples.
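The explicit EI expression mentioned above (Equation A5 in the appendix) lends itself to a direct Monte-Carlo estimate. A sketch for a deterministic map with Gaussian output noise follows; the function and variable names are ours, not from the paper.

```python
import numpy as np

def ei_gaussian(mu_jacobian, sigmas, L=1.0, n=2, n_samples=10000, seed=0):
    """Monte-Carlo estimate of EI following the appendix closed form (Equation A5):
    interventions X ~ U([-L, L]^n), conditional Y | X ~ N(mu(X), diag(sigma^2))."""
    rng = np.random.default_rng(seed)
    m = len(sigmas)
    X = rng.uniform(-L, L, size=(n_samples, n))
    # E_X[ ln |det(Jacobian of mu at X)| ] estimated by sample average
    log_det = np.mean([np.log(abs(np.linalg.det(mu_jacobian(x)))) for x in X])
    neg_entropy = -(m + m * np.log(2 * np.pi)
                    + np.sum(np.log(np.array(sigmas) ** 2))) / 2
    return neg_entropy + n * np.log(2 * L) + log_det

# Example: the spring-oscillator Euler step y = (z + v*dt, v - z*dt) is linear,
# so its Jacobian is the constant matrix W and the average is exact.
W = np.array([[1.0, 1.0], [-1.0, 1.0]])  # dt = 1
ei = ei_gaussian(lambda x: W, sigmas=[0.1, 0.1], L=1.0, n=2)
```

For a linear map the Jacobian term reduces to ln |det W|, so the estimate matches the closed-form value; for a trained network, mu_jacobian would be evaluated by automatic differentiation.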
There are several weak points in our framework. First, it can only work on small data sets; the major reason is that the invertible neural network is very difficult to train on large data sets. Therefore, we will apply special techniques to optimize the architecture in the future. Second, the framework still lacks explainability: the grouping method for variables is implicitly encoded in the invertible neural network, although we can illustrate what the coarse-graining mapping is and clearly decompose it into information conversion and information discarding parts. A more transparent neural network framework with more explanatory power deserves future study. Third, the conditional distributions that the model can predict are limited to Gaussian or Laplacian, and should be extended to more general distributional forms in future studies.
There are several theoretical problems left for future studies. For example, we conjecture that all coarse-graining strategies can be decomposed into a bijection and a projection, but this needs a rigorous mathematical proof. Second, although an explicit expression for the EI of the macro-dynamics has been derived under NIS, we still cannot directly predict the causal emergence in the data. We believe that a more concise analytic result for the EI could be derived by imposing some constraints on the data. Furthermore, we think the meaning and usage of the discarded variable x t should be explored further, since it may relate to the redundant information of a pair of variables toward a target [40]. Therefore, we conjecture that deeper connections between the NIS framework and mutual information decomposition may exist, and that NIS may serve as a numerical tool for decomposing mutual information.
Suppose the Jacobian matrix of the neural network µ at X is ∂ X µ(X) ≡ {∂µ i (X′)/∂X′ j | X′=X } nm , and suppose the neural network can be regarded as a Gaussian distribution conditioned on a given X, where Σ = diag(σ 1 ², σ 2 ², ..., σ m ²) is the covariance matrix and σ i is the standard deviation of the output y i , which can be estimated by the mean squared error of y i . Then the effective information (EI) of the neural network can be calculated in the following way: (i) if there exists X such that det(∂ X µ(X)) ≠ 0, then the EI is given by Equation A5. However, it is hard to derive an explicit expression for the second term in Equation A6 because it contains an integration, so we expand µ(X′) into a Taylor series around the point X and keep only the first-order term: µ(X′) ≈ µ(X) + ∂ X µ(X)(X′ − X). Applying this theorem to the squeezed information channel (Figure A1), we obtain the form of Equation 22.
Appendix D.1 Proof for Theorem 5. Because H(U q 2 −q 1 | U q 1 ) ≥ 0, we have H(U q 2 ) = H(U q 1 ) + H(U q 2 −q 1 | U q 1 ) ≥ H(U q 1 ). Moreover, the matrices ∂U q 2 /∂X and ∂U q 1 /∂X are both sub-matrices of ∂ψ/∂X, and the former contains the latter. Thus, according to Lemma 3: I(X; U q 2 ) = H(U q 2 ) + E X (ln | det(∂U q 2 /∂X)|) ≥ H(U q 1 ) + E X (ln | det(∂U q 1 /∂X)|) = I(X; U q 1 ). (A54) Thus, if the number of dimensions is smaller, the mutual information between X and U will also be smaller. That is, a narrower channel transfers less information.
Combining this with Theorems 2 and 4, we obtain the following inequalities for the squeezed information channel of Figure 3: I(x t ; xt+1 ) ≤ I(x t ; y q 1 t ) ≤ I(x t ; y q 2 t ).
(A55) This is the form of Theorem 6 in the main text.

Figure 1 .
Figure 1. The Workflow and The Framework of the Neural Information Squeezer

Figure 2 .
Figure 2. The RealNVP neural network implementation of the basic module of the bijector ψ, where s 1 , s 2 and t 1 , t 2 are all feed-forward neural networks with three layers, 64 hidden neurons, and ReLU activation functions. The s i and the t i share parameters, respectively. The ⊙ and + operators represent element-wise product and addition, respectively. The input is split as x = (x 1 , x 2 ), and the output is x′ = (x′ 1 , x′ 2 ).

Figure 3 .
Figure 3. The graphical model of the Neural Information Squeezer as a squeezed information channel

Figure 4.
Experimental Results for the Simple Spring Oscillator with Measurement Noise. We sample data from Equations 25 and 26, and we use the Euler method to simulate with dt = 0.1. (a) and (b) show the real macro-states versus the predicted ones for NIS and for the ordinary feed-forward neural network, respectively; (c) and (d) show the real and predicted dynamics, i.e., the dependence of dz/dt on v and of dv/dt on z, for both neural networks for comparison; (e) shows the real and predicted trajectories over 400 time steps starting from the same latent state; (f) shows the dependence of the dimension averaged Causal Emergence (dCE) on q (the number of effective channels).

Figure 5 .
Figure 5. Various mutual information quantities between variables change with training iterations. (a) shows the change of the mutual information I(x t , y q t ) between x t and the macro-state y t for different values of q; (b) verifies the information bottleneck theory on NIS when the scale q = 2; (c) and (d) show the change of I(x t , x t+1 ), I(y t , y(t + 1)), and I(x t , xt+1 ) with the number of training iterations.

Figure 6 .
Figure 6. The dependence of the dimension averaged Causal Emergence (dCE) on different scales q of the Markov dynamics (a), the learned mapping between micro states and macro states on the optimal scale q (b), and the learned macro-dynamics, i.e., the mapping from y t to y(t + 1) (c). There are two clearly separated clusters on the y-axis in (b), which means the macro states are discrete. We found that the two discrete macro states and the mapping between micro and discrete macro states are identical to the example in ref [6], which means the correct coarse-graining strategy can be discovered by our algorithm automatically without any prior information. In (d), I(x t , xt ) ≤ I(x t , y q=3 t ) is reflected.

Figure 7 .
Figure 7. An example Boolean network (left) and its micro-level mechanisms on the nodes (right). Each node's state at the next time step is randomly affected by the state combination of its neighboring nodes. The transition probabilities (micro-level mechanisms) for each case are shown in the table.

Figure 8 .
Figure 8. Experimental Results for the Boolean Network. The dependence of the dimension averaged Causal Emergence (dCE) on different scales q (a) and the learned mapping between micro states and macro states on the optimal scale q (b). There are four clearly separated clusters on the y-axis in (b), which means the macro states are discrete. We found that the four discrete macro states and the mapping between micro and discrete macro states are identical to the example in ref [5], which means the correct coarse-graining strategy can be discovered by our algorithm automatically without any prior information. (c) shows the change of the mutual information I(x t , y t ) with the number of iterations.
EI L (µ) = I(do(X ∼ U([−L, L] n )); Y) ≈ −[m + m ln(2π) + ∑ m i=1 ln σ i ²]/2 + n ln(2L) + E X∼U([−L,L] n ) (ln | det(∂ X µ(X))|), (A5) where U([−L, L] n ) is the uniform distribution on [−L, L] n , | • | is the absolute value, and det is the determinant. (ii) If det(∂ X µ(X)) ≡ 0 for all X, then EI ≈ 0. Proof. The calculation of the mutual information can be separated into two parts: I(X; Y) = ∫ X p(X) ∫ Y p(Y|X) ln p(Y|X) dY dX − ∫ Y p(Y) ln p(Y) dY. (A6) By inserting Equation A4 into Equation A6, the first term becomes the negative Shannon entropy of the Gaussian distribution: ∫ Y p(Y|X) ln p(Y|X) dY = −[m + m ln(2π) + ln det(Σ)]/2. (A7) If there exists X such that | det(∂ X µ(X))| ≠ 0, the second term can then be evaluated with the first-order Taylor expansion of µ, which yields Equation A5.

Figure A2 .
Figure A2. The graphical model of the squeezed information channel of Figure 3 after the do operation. (a) The do operator acts on the y t node; (b) the do operator acts on the x t node. The nodes with double circles are those on which the do operator acts.