Neural Causal Information Extractor for Unobserved Causes

Causal inference aims to faithfully depict the causal relationships between given variables. However, in many practical systems, variables are often partially observed, and some unobserved variables could carry significant information and induce causal effects on a target. Identifying these unobserved causes remains a challenge, and existing works have not considered extracting the unobserved causes while retaining the causes that have already been observed and included. In this work, we aim to construct the implicit variables with a generator–discriminator framework named the Neural Causal Information Extractor (NCIE), which can complement the information of unobserved causes and thus provide a complete set of causes with both observed causes and the representations of unobserved causes. By maximizing the mutual information between the targets and the union of observed causes and implicit variables, the implicit variables we generate could complement the information that the unobserved causes should have provided. The synthetic experiments show that the implicit variables preserve the information and dynamics of the unobserved causes. In addition, extensive real-world time series prediction tasks show improved precision after introducing implicit variables, thus indicating their causality to the targets.


Introduction
Identifying the complete and valid causes of a variable has been a significant pursuit in causal inference toward complex systems.A causal inference approach is called "causally faithful" [1] when it reveals the causal relationships shown in the observed variables, and it is "causally sufficient" [2] if it does not omit any of the hidden confounder or unobserved variables that should be included.To hold causal sufficiency when deriving faithful causal relationships, traditional causality inference methods [1][2][3][4][5][6] often assume the complete observability of the system, i.e., all of the variables and information within the system are collected and observed.In the perspective of [7,8], the information within the system flows from the causes to the targets; thus, the complete causes can provide most of the target's information and can describe it (except the noise terms) in detail, as shown in Figure 1a.
However, in practical scenarios, complex systems are often partially observable since only a subset of variables can be directly measured.For instance, in financial markets, the influence of media [9] and company-specific factors [10], which are hardly measurable, drive important causal effects in local tradings [9] and price reversals [10].Another example can be shown in ecosystems, in which the dynamics are described by species populations, habitat quality, and threats; however, one can rarely acquire a perfect knowledge of the system states [11,12] to perform adequate management actions.The partial observability also induces analyzing challenges in domains such as monitoring biophysical objects [13], weather and climate forecasting [14], and industrial management [15].While part of the causes are unobserved, the target's information that should be provided is unknown, as shown in Figure 1b.To describe the causal structures in such incomplete systems, causal sufficiency in traditional works is no longer guaranteed, and their results do not necessarily identify the real causes even when they are causally faithful [1,16].Therefore, to avoid identifying spurious causes, the common idea shared by most practices [17][18][19][20][21][22][23][24][25][26][27][28] is to introduce implicit variables that compensate for the missing causes.Most of them have applied neural networks as a powerful tool for generating variables, which could be categorized into two streams: either (i) employing neural networks with latent layers to uncover the significant features in observed variables that better formulate the targets, or (ii) utilizing the ability to preserve information in representation learning to extract unobserved or latent causes that provide significant information of the targets.
The works from the first approach adopt observed causes as the inputs to train a neural network that minimizes the fitting loss between the output and the target.The intermediate layers in the neural networks help to better fit the target in supervised learning [17], for example, by capturing the hidden long-/short-term dependencies in the dynamics [18,19].Granger [29] suggested that we could consider these hidden variables as causes, since including them could perform a better regression of the targets.However, such methods are, in actuality, just recombining the neurons from the input to extract advanced features, and they cannot provide information beyond the input [30].Therefore, they do not necessarily provide unobserved information; thus, we can hardly identify these hidden variables as those that are unobserved in the system.
The second stream aims to generate implicit variables that provide more information to the targets.This echoes another definition of causality [3,31], in which a cause should conditionally correlate with the target given any of the sets of other variables, or it could provide information that all the other variables could not provide to the target [32].These methods [20][21][22][23][24][25][26][27][28] adopt the idea of representation learning, which inputs the target (and all the observed variables) into neural networks to project an embedding.This preserves the information while describing the prominent dynamics of the original data.Since these latent variables generate a cover of most of the targets' information, they could be referred to as the causes that drive the system.However, in general, the observed and unobserved causes are entangled and inseparable in the causal structure.While refs.[20][21][22][23][24][25][26][27][28] excluded all of the observed causes and only considered the generated latent variables as the targets' causes, their depiction does not necessarily reveal the real causal relationships.
Motivated by the aforementioned issues of either missing the unobserved information or giving up on the observed causes, this paper proposes a framework to complete the causal structure by extracting the unobserved variables while retaining the observed causes.We assume the existence of unobserved causes to the target variables; hence, the targets' information should be provided by both the observed and unobserved causes.Given the part of the information from the observed causes, our objective is to generate implicit variables from the observed variables that provide as much information to the target as possible, thus covering the information that should have been provided by the unobserved causes.As shown in Figure 1c, the implicit variables Z would cover the information of the target Y that should have been provided by the unobserved causes W, and hence it could provide an alternative representation ({X, Z} → Y) of the information flow in the causal relationships {X, W} → Y.
In detail, we employ a generator-discriminator architecture named the Neural Causal Information Extractor (NCIE) that extracts and embeds the targets' information into the implicit variables.The generator obtains implicit variables fromthe time series inputs, and the discriminator maximizes the mutual information between the targets and the union of explicit and implicit causes by constraining implicit variables to have diverse marginal and joint distributions.This results in a holistic causal structure that encompasses both observed and unobserved factors to the target.To highlight the ability of NCIE to complement unobserved information and recover unobserved dynamics, we generate implicit variables for three synthetic cases.To further demonstrate the efficacy of the generated implicit causes in improving time series predictions and to verify their causality, we perform extensive experiments on real-world time series data.
The contribution of our work can be highlighted as follows: 1.
We propose a generator-discriminator architecture that generates implicit variables to complement the unobserved causes in the causal structures while retaining the observed causes.

2.
The implicit variables we generate could carry information from the unobserved variables and reveal their dynamics.

3.
Time series prediction tasks show that the combination of observed and implicit variables helps improve the prediction of targets, verifying that they are better candidates for causal inference.
This paper is organized as follows.Section 2 provides a comprehensive review of related works.Section 3 presents the basic theories in information theory and causal inference and states the objective of this work.Section 4 introduces the methodology of the proposed approach.Section 5 details the experimental setup and dataset descriptions, and then presents the results and discussion of our experiments.Finally, in Section 6, we provide a discussion on the implications of our work and present future possible advanced works.

Literature Review
In the scope of our work, we categorize the related works on causal inference into two parts, which cover the methods with and without considering unobserved variables.

Causal Inference without Considering Unobserved Variables
Assuming that the given observations are complete and thus sufficient, the objectives of traditional works often include drawing causal faithful graphs, in which nodes represent the variables of interest and edges indicate the existence of causal relationships between variables.The constrained-based methods, including the Peter-Clark (PC) algorithm [2] and its improvements [2,3,33], are graph algorithms that start with a fully connected graph and iteratively delete the edges between two variables if they are conditionally independent or d-separated [2].They provide causal faithful graphs that satisfy the causal Markov condition [31], i.e., a variable is conditionally independent to all variables except its effects given its causes.On the other hand, score-based approaches present the causal structure with the Bayesian network with the highest score, measured by the Bayesian Dirichlet equivalent (BDe) score [4] and the K2 score [5].Though the latter methods do not ensure faithfulness, they pursue a representation that shows the causal structure and, at the same time, the strength of cause-to-effect dependencies.Other approaches include those based on Granger causality [6], which also aim to identify the existence and magnitudes of the causal edges among variables.

Causal Inference Considering Unobserved Variables
There are plenty of works on causal inference that are aware of the existence of unobserved variables, and for clarity, we attempt to classify them into three groups.The first group refers to those that detect the existence of hidden confounders without knowing the effects, including Fast Casual Inference (FCI) [2] and Latent PCMCI (LPCMCI, where PCMCI [3] is a PC-based condition selection with the Momentary Conditional Independence (MCI) test) [34].They are the modified versions of the PC algorithm [2] and PCMCI [3], respectively, which follow similar procedures but identify an extra kind of edge (relationship) between variables: bidirectional arrows.This indicates the conditional correlational relationship between two variables that could not be explained by any directional causal path from one to another, thus identifying the existence of a hidden confounder driving the two variables.
The second group attempts to explain the causal effects originating from unobserved variables toward the target variable.Assuming linear, additive, and first-order Markovian causal effects, ref. [16] constructs a linear regression model that describes the observed variables by past observed variables, unobserved causes, and noises.Based on this model, the effects of the unobserved causes, which refer to their coefficients, could be obtained by variational expectation maximization or by solving the autocovariance.
The third group [20][21][22][23][24][25][26][27][28] aims to identify the unobserved causes via latent representations.They assume that the observed signals are driven or caused by a set of unobserved latent signals, and there exists a latent dynamic that drives the latent variables from past to future.Therefore, given the observed dynamics, their objectives are to identify the posterior distributions of future latent signals given past latent signals, the posterior distributions of observed signals given the contemporaneous latent signals, and the latent signals themselves.They identify latent signals via different approaches, including Variational Autoencoder (VAE) [20,21,35,36], Dynamical Component Analysis [25][26][27], and maximizing the (conditional) mutual information [22,28].

Mutual Information Estimator
In Equation ( 1), Shannon provided an information theory-based definition for the entropy of random variables with known distributions.However, the distribution of variables in real-world time series remains unknown in many scenarios.To sample the unknown distributions, one of the fundamental methods is binning [37,38].By partitioning the continuous variables into discrete bins and counting the sampled distributions, one can apply Equation (1) to estimate entropy.Provided the entropy of variables, one can calculate the mutual information between two variables, X and Y, with The Mutual Information Neural Estimator (MINE) [39] is another popular method that uses a neural network to estimate the KL Divergence of two random variables, which could then provide a tight lower bound for their mutual information.Deep InfoMax [22] provides a similar approach by estimating Jensen-Shannon (JS) Divergence.Under the Gaussian assumption, Refs.[25,26] measure entropy by H(Z) = r1 2 ln(2πe) dT |Σ T (Z)|, and prove that the estimation is still tight if the assumption is violated.Refs.[40][41][42] provide other effective estimators for information based on classifiers and k-nearest neighbor (kNN).

Problem Statement
Suppose we have target variable Y t at time t.First, we consider the case where Y t is auto-correlated, and all the causes of Y t are known, denoted by {X t , Y <t }, where where H(Y) is the entropy of Y, a measure of uncertainty, and I(Y; X, Y <t ) is the mutual information (MI) between Y and {X, Y <t }.MI is a measure of "information" or "reduction of entropy" that the two variables provide to each other.In Equation ( 3), the entropy reduction that X and Y <t provide to Y equals the entropy of Y. Hence, the causes of Y could eliminate all the uncertainty of Y, in other words, provide all the information of Y. Now, we consider another case in which, besides some observed causes X = {X 1 , X 2 , ... , X n }, there are other unobserved causes W. Therefore, we assume The absence of information about W induces As shown in Figure 2, X and Y <t could only provide a part (the red part) of the information to Y. Therefore, our objective is to generate some implicit variables Z = {Z 1 , Z 2 , ... , Z m } to maximize I(Y; X, Y <t , Z) (the yellow part), which attempts to approach its upper bound H(Y) (the large black box).Hence, if we could search for a Z that could perfectly provide this information, we could say that Z somehow represents the unobserved causes W.More specifically, Z traces the information for Y from W.
While the upper bound of the MI equals the entropy of Y, i.e., the objective of our work is to generate Z which maximizes the MI and makes it approach H(Y): which is equivalent to maximizing the conditional mutual information of Z and Y given X and Y <t : Since the term without Z could be considered constant here, Objective ( 7) is maximized if Z covers all the conditional mutual information from W, and meanwhile X, Y <t , Z together cover all the entropy (Equation ( 6)) and provide all the information of Y.
There are some assumptions made on Z: 1.
We focus on the unobserved variables that provide information to Y.Those without information of Y do not affect the distributions of Y.

2.
In the case of time series, we denote X <t , Z <t as the explicit and implicit causes of Y t to uphold the assumption of temporal priority [1], i.e., causal relationships only move from past variables to future variables in time series.For simplicity, we denote X <t and Z <t as X and Z, and Y t as Y.

3.
We assume X t , Y t , and Z t are in a closed system and there is a causal loop among them, i.e., X <t and Z <t may induce Y t , and X <t and Y <t may induce Z t .4.
While we are generating Z <t , we do not necessarily obey temporal priority [1], which is a common practice in refs.[20][21][22][23][24][25][26][27][28]30].Therefore, we employ X <t and Y t to generate Z <t .We note that we do not focus on finding the exact causes of Z here.Instead of finding causes of Z by applying the whole structures we use on Y, we just use a basic RNN to present Z from the explicit causes.
To better illustrate our problem and objective, we provide a simple numerical example as follows: Example 1.We consider a temporal Boolean network with three Boolean variables: Y t , X t , and W t , where Y t at time t is determined by At every time step t, X t and W t are sampled from a uniform binomial distribution, i.e., P(X t = 0) = P(X t = 1) = P(W t = 0) = P(W t = 1) = 0.5.
In this context, the time-varying variables Y t , X t−1 , and W t−1 at a particular time t are considered as the static random variables generated by the random process [1][2][3]22,24,28].We list all the possible circumstances of the truth table of these three variables.
Then, we calculate H(Y t ), I(Y t ; X t−1 ), and I(Y t ; X t−1 , W t−1 ) respectively: We notice that H , which means that the information of Y t could be completely covered when X t−1 , W t−1 are known.When W t−1 is unobserved, the observed X t−1 can only provide part of the information of Y t .This gap of information is described by the conditional mutual information, I(Y t ; W t−1 |X t−1 ), where which quantifies the information that W t−1 provides to Y t that X t−1 cannot provide.
Example 2. Now, we consider the scenario in which we do not know Y t is generated by Equation (11) or the distribution of W t−1 , but we are provided with X t−1 and Y t , and we wish to generate Z t−1 to complement I(W t−1 ; Y t |X t−1 ) and maximize the objective in Equation (7).The ideal case is to reverse the Markov chain (W t−1 , X t−1 → Y t ) to (X t−1 , Y t → W t−1 ) and make the posterior distribution p , where the joint and posterior distributions are In such a case, we calculate that Example 3.However, there actually exist more than one p(Z t−1 |X t−1 , Y t ) that can maximize I(Y t ; Z t−1 |X t−1 ).For example, we let and we can again list all the possible circumstances of the truth table of these three variables.
In such a case, we can obtain the same result as that in Equation (17), which means a different Z from Equations ( 16) and (18) can achieve the optimal objective in (7).This indicates that they are providing the same conditional information that W provides to Y.The value of Y t is known when X t−1 = 0, but is uncertain when X t−1 = 1.W provides the conditional information by helping to eliminate this uncertainty (or conditional entropy H(Y|X)), which can also be eliminated by not including W but including Z from Equation ( 16) or (18).Therefore, Z can express conditional information I(W; Y|X).

Neural Causal Information Extractor
We name our architecture the Neural Causal Information Extractor (NCIE), which is composed of a generator and a discriminator.The architecture is shown in Figure 3.

Generator
Although Z is not explicitly collected and observed from the system, we consider part of it to be involved in the system, specifically the part that provides information to Y.While we suppose that Y is generated by {X, Y <t , Z}, reversely, the information of Z that we are interested in should be covered by (a part of) Y. Therefore, Z could be represented as a function of X, Y <t , and Y.Other variables V besides {X, Y <t , Y} in the raw data should not be included to generate Z in case they have not provided information via path V → Z → Y (or V should be detected as an observed direct cause before Z is included).
While finding the causes of Z is not the focus of this work (we only focus on finding the explicit and implicit causes of Y t ), we apply a basic RNN [18] to capture the lagged dependencies from X and Y to generate Z.The RNN consists of two neural networks to manipulate and propagate its hidden states.The first neural network, f 1 , generates H t from X <t , Y <t , Y t , and H t−1 (Equation ( 20)), where H t denotes the hidden states at time t.Then, the second neural network, f 2 , reads H t and generates the output Z (Equation ( 21)), while H t is propagated to the next episode.
With the propagation of H t , the memory of X and Y in previous episodes is also propagated.Furthermore, since Z is generated directly from H, the propagation of H also passes the memory and information of Z.It is worth noting here that Z is jointly generated by the explicit {X, Y} and the implicit variables Z <t .Suppose we employ another architecture that generates Z solely by explicit variables {X, Y} (and without Z <t ); Z then acts as an intermediate hidden layer in our architecture, which is equivalent to directly depicting Y by a neural network structure with {X, Y} and without Z.In other words, Y would then be solely caused by explicit variables.This is why we apply RNN to generate Z jointly by implicits and explicits, which matches our assumption that Y is causally related to implicits and explicits.

Discriminator
The discriminator provides a measurement of I(Y; X, Y <t , Z), where Z is the output of the generator.Here, we consider Y and {X, Y <t , Z} as two sets of random variables, and their mutual information is equivalent to Kullback-Leibler (KL) divergence between their marginal and joint distributions, where D KL denotes the KL divergence of two distributions and P Y denotes the distribution of Y.The lower bound of KL divergence can be provided by Donsker-Varadhan representation (Equation ( 22)), where T(Y, X, Y <t , Z) is any class function.We follow MINE [39] to implement function T by a neural network with parameters Θ and maximize the lower bound of mutual information I Θ by searching for optimized function T θ : Optimization is performed using stochastic gradient descent (SGD).However, a naive derivation of Equation ( 24) leads to biased gradient estimation.MINE [39] proposed addressing this bias by reformulating the loss function.For batch sample B, the gradient is MINE suggests estimating the denominator of the second term using the moving average from previous epochs, while the other parts are the averaged gradient from the network parameters.
Among all (conditional) mutual information estimators, MINE estimates MI by stochastic gradient descent (some do not provide MI, and some are not trained by gradient descent).The available gradient is essential in our architecture because one of the inputs of MINE, Z, is exactly the output of the generator.Hence, the generator and discriminator are fused together by Z, which acts as the bridge to back-propagate the gradient from the discriminator to the generator and thus update the parameters in both neural networks.

NCIE: Maximizing Mutual Information
In Section 3, we claim that the objective of our architecture is to find implicit variables Z such that I(Y; X, Y <t , Z) is maximized.In this case, Z could represent the information provided by unobserved variables.
Since the generator and discriminator are fused together, they could be jointly considered as a whole neural network that trains both components simultaneously, where Z could be viewed as a layer in between that we are interested in.While training together by gradient descent, the generator is trained to provide better Z to push up I θ , and the discriminator is trained to provide a better (higher) estimation of I θ given the Z from the generator.Therefore, NCIE pursues objective max In an alternative view, the whole NCIE structure could be interpreted as a variant of an RNN that pursues to provide a maximized value of the Donsker-Varadhan representation given the batched inputs of X, Y <t , and Y.

Motivation and Architecture
To verify the capability of extracting effective implicit causes, we modify the NCIE in Section 4.1 and apply the architecture to perform multi-variate time series prediction in Section 5.2.2.To predict Y t , we cannot use Y t itself or any variables at time t or after t as the input; therefore, we have to find an alternative for the input of the generator.Here, we use a simple RNN to pre-train a Ŷt that tries to estimate Y t given all the explicit causes {X, Y <t } of Y t .After the pre-training stage, Ŷt is plugged into the input of the generator in the NCIE and replaces Y t , to avoid the information leakage in the original architecture in Section 4.1.
The overall procedure is shown in Figure 4, which is summarized as follows: 1.
We select the explicit causes X from the raw time series with a maximum lag of k using PCMCI [3], which is a framework that identifies the time-lagged causal relationship given the assumption of the completeness of the given data (no unobserved data).
With its high precision, we assume the resulting X is the set of all the real causes of Y. Furthermore, we select the lagged Y <t from t − k to t − 1 as the autoregressive terms, together with X to be the set of observed causes; 2.
We train the pre-training module and generate Ŷt to estimate Y t ; 3.
After finishing training the previous module, we use output Ŷt as the input of this modified NCIE module and train the generator and discriminator (Section 4.1) together to generate Z; 4.
We train neural network f 3 (Equation ( 27)) to generate Ŷt ′ , which is applied to examine the improvement of the prediction effect on Y given by Z.We note that although there are four neural network architectures in Figure 4 and some of their inputs and outputs are connected, only the two networks in the third module train and update their parameters simultaneously.The framework in Figure 4 proceeds module by module, and the next one starts after the previous one finishes.

Prediction
We test whether the generated Z can help to improve the prediction of Y by comparing Ŷt and Ŷt ′ .In Section 5.2.2, we employ different neural networks to implement f 3 , which performs the prediction of Y by minimizing the mean squared error (MSE):

Experiments 5.1. Synthetic Data Experiments: Dynamics of Z
In the following experiments on synthetic data, we demonstrate that the proposed NCIE framework can accurately extract the unobserved causal variables.We consider three synthetic systems (Figure 5) to compare the generated Z with the unobserved cause W that Z mimics.Targets Y in the synthetic systems are constructed with causes exhibiting long-term and short-term dynamics.We investigate the ability of NCIE in Section 4.1 to extract causes W from target Y when Y and the other causes X are given.

Case 1: A Linear System
We suppose the time-varying X and Y are observable while W is unobserved:

Case 2: A Non-Linear System
We suppose the time-varying X and Y are observable while W is unobserved:

Case 3: A Non-Linear System
We suppose the time-varying X and Y are observable while W is unobserved:  We investigate our approach with the predator-prey model, which is a classical problem in ecology [43,44].Specifically, we choose the model with 1 predator species, wolves, and 2 prey species, sheep and rabbits.The prey species share the same food source: grass.In total, there are four species in the ecosystem, and their populations grow according to the Lotka-Volterra model [45,46]: where W(t), S(t), R(t), and G(t) represent the population of wolves, sheep, rabbits, and grass at time t, respectively.Because this dynamic system can lead to negative populations and we want to prevent extinction from halting our simulation, we add a modification that sets any population to 1 if it drops below 1.We consider the scenario in which some intermediate species in the biological chain are unknown or unobserved.Therefore, we assume that S(t) and R(t) are unobserved.We set W(t) as the target variable and G(t − 1) as the observed cause, and attempt to recover the information provided by the unobserved causes S(t) and R(t).It is worth noting that G(t − 1) is not a direct cause of W(t) in this case.The absence of S(t − 1) and R(t − 1) makes G(t − 1) and W(t) conditionally dependent, thus creating a spurious causal link.Since it is not the focus of our work, we do not discuss how to eliminate spurious links after recovering unobserved causes in this study.
The dynamics of the population of the four species and Z are shown in Figure 6.From this, we can observe that Z provides information about the unobserved variables S(t) and R(t).

The Interpretation of Z
Our target is to generate Z that carries the unique information that unobserved causes provide to Y.However, in the three synthetic examples, instead of revealing the dynamic of W, Z exhibits a variable that tries to introduce Y while eliminating the information of X.
We try to explain the result using partial information decomposition theory [47], which suggests that we could decompose the Venn Diagram (Figure 2) of mutual information into the following parts shown in Figure 7: (; )   (; , )   (; )   (; , ) () . The Venn Diagram illustrates the partial information decomposition of Y into four discrete parts: the unique information from X (red), the unique information from Z (blue), the redundant information from X and Z (purple), and the synergistic information from X and Z (yellow).
The information to Y from all of its causes consists of four parts: • I uni (Y; W), the unique information that could only be provided by W; • I uni (Y; X, Y <t ), the unique information that could only be provided by {X, Y <t }; • I syn (Y; X, Y <t , W), the synergistic information provided only when both {X, Y <t } and W are present; • I red (Y; X, Y <t , W), the redundant information that could be provided by either {X, Y <t } or W.
If our objective is to recover W, Z should only contain the unique information I uni (Y; W) and the synergistic information I syn (Y; X, Y <t , W), while the unique information I uni (Y; X, Y <t ) and the redundant information I red (Y; X, Y <t , W) are not present in this case.However, our objective in generating Z is not to exactly mimic W, but to maximize the information that the observed and unobserved causes provide to Y, so that we can acquire a better causal representation for Y.
Moreover, it is difficult to integrate different methods; therefore, our architecture does not measure the four components [48][49][50] separately but rather measures and maximizes the entire I(Y; X, Y <t , Z) at once.Thus, I uni (Y; Z), I syn (Y; X, Y <t , Z), and I red (Y; X, Y <t , Z) in Equation (32) are maximized.As a result, the implicit causes Z may contain all four components, which is the information from both the observed {X, Y <t } and the unobserved W.
Therefore, in the synthetic cases, we still observe the trend information from X that is not fully eliminated in Z.Still, in this case, we argue that Z carries significant information and reveals evident dynamics of W. As shown in Table 1, the observed causes X only provide very little information to the target Y in simple synthetic cases, where the mutual information of target Y provided by itself, I(Y; Y), is equivalent to the entropy of Y, H(Y), provided that we consider Y as a random variable.In contrast, after we add the implicit causes Z, all the causes together provide most of the information of target Y and thus complete the causal structure of Y.We first examine the ability of NCIE (Section 4.1) to extract implicit causes that recover unobserved information of the targets in several real-world datasets, including

•
The Electricity Transformer (Oil) Temperature (ETT) [17], which includes four datasets sampled at different time intervals (1 min, 2 min, 1 h, and 2 h).The observed causes include useful and useless loads in different levels; • The daily exchange rates [51] of eight countries, with one of them set as the target and the other seven as the observed causes for predicting the target; • Minneapolis-St Paul interstate metro traffic volume [52], in which observed causes include temperature, weather, and holidays; • The PM2.5 quantity in Beijing (https://www.kaggle.com/datasets/rupakroy/lstmdatasets-multivariate-univariate)(accessed on 6 November 2023), in which observed causes include dew, temperature, atmospheric pressure, weather, as well as wind direction and speed (all the codes and data are available at https://github.com/jhliang/NeuralCausalInfoExtractor)(accessed on 6 November 2023).
These data provide partial observations of complete systems in industrial monitoring, financial markets, and urban operations.In the provided datasets, it is obvious that some causes, W, are unobserved.For example, exchange rates are influenced not only by foreign exchange rates but also by the domestic economy and complex international trade.Similarly, the PM2.5 datasets do not involve urban human activities and pollution emissions from power plants.We aim to recover conditional information I(W; Y|X) using Z, which advances the precision of time series forecasting.
Table 2 shows the information provided by only observed causes and by the combination of observed and implicit causes.It illustrates that the generated Z can provide significant extra information besides the observed causes X in real-world scenarios.The Z in this table is generated by NCIE (Section 4.1).We do not discuss the dimensionality of Z in this paper.While the higher dimensionality of Z provides the wider information channel, we simply set the dimension of Z to be the dimension of {X, Y <t }.While we claim that our architecture helps to identify unobserved causes of the target variables, it holds massive potential for extracting useful information from the available features and for conducting time series forecasting.We evaluate our architecture by performing time series forecasting on the real-time series datasets and comparing our results with the baselines [18,19] listed in Table 3.To prevent future information leakage in time series prediction, we employ the modified NCIE architecture in Section 4.2 to generate Z.
In this work, we are considering one-dimensional targets Y with multiple features that are candidates for observed causes chosen by PCMCI.Additionally, our architecture focuses on single-step time series forecasting because our goal is to find the direct causes for Y, which could be X t−k and Z t−k with a small lag k.The results are shown in Table 3.For the three chosen time series analyzing tools, we compare the prediction results with and without Z.It is illustrated that including Z in most cases does outperform those predictions without implicit variables Z.Our problem could be considered a feature selection task.The Z we generate could be considered as a feature that helps to describe Y.We see that in a few cases, Z does not necessarily improve the prediction of Y.This could be due to two reasons.First, although we know there is a gap between the entropy of Y and the information that X and Y <t bring to Y, this gap might be too narrow for NCIE to generate a Z that brings sufficient information from it.Second, when the system is time-varying, the underlying causality structure of the system could also vary with time.Z that acts as a significant cause in (part of) the training dataset may become a redundant variable that perturbs the prediction.
It is worth mentioning that RNN and LSTM do not outperform simple neural networks (NN) in all cases.Our explanation is that the features X that PCMCI chooses cover most of the (observed) direct causes of Y, and we also choose autoregressive features Y <t from t − k to t − 1 that already cover short-and long-term memories of Y. Therefore, the memories that RNN and LSTM extract could not significantly help the prediction, but, in contrast, overfit Y.
Table 4 shows the mutual information between different sets of variables and Y.For real data, the Z that NCIE generates can provide extra information to Y.It demonstrates NCIE's ability to extract part of the implicit causes in the system, although we find a gap between I(Y; X, Y <t , Z) and I(Y; Y).It is unavoidable in cases in which the systems are open or noisy because external noises are not driven by variables within the system.Furthermore, for a large system with too many causes, some causes may carry very little unique information to the target, and it is difficult to express them with limited Z.The first column of Table 4 shows the mutual information between the pre-trained Ŷ and the actual Y.We recall that the pre-trained Ŷ is generated by an RNN architecture that minimizes the mean square error (MSE), given the explicit causes X and Y <t .The gap from the first column to the others shows that RNN and other recurrent structures do not necessarily provide a regression result with sufficient information to the target.While it is not a key point to argue the effectiveness of different objectives in this work, it reminds us that NCIE provides more information than general neural architectures.

Conclusions
In this work, we propose NCIE to generate implicit variables that complement the unobserved information of the target not provided by the observed variables.We can complete the causal structure of the target with the implicit variables while retaining the observed causes, as they together provide most of the target's information.Furthermore, the generated implicit variables have similar dynamics to the unobserved causes in the synthetic experiments and help to predict the target in the real-world time series.These results provide sufficient evidence that the implicit variables can be an effective candidate to substitute the unobserved causes and complete the causal structure.
While this work only focuses on discovering unobserved causes for single targets, future attempts could be made to apply similar methods to construct a complete graph for multiple targets, in which all the causes provide complete information for all the targets.With such methods, we are able to depict the evolution of systems by the information flow among variables, which could contribute to the research of emergence in real-world complex systems.Funding: This research was funded by the Science and Technology Innovation Commission of Shenzhen (JCYJ20210324135011030, JCYJ20210324115604012, WDZC20200818121348001), the National Natural Science Foundation of China (71971127), the Guangdong Pearl River Plan (2019QN01X890), the High-End Foreign Expert Talent Introduction Plan (G2021032022L), the Tsinghua Shenzhen International Graduate School Fund (JC2021004), and the Science and Technology Innovation Committee of Shenzhen-Platform and Carrier (International Science and Technology Information Center).
Institutional Review Board Statement: Not applicable.

Figure 1 .
Figure 1.Illustration of the information in a Venn diagram, as well as the causal relationships of the (a) ground truth; (b) partially observable scenario in reality; and (c) an alternative representation of causal relationships with the implicit variables Z, which is generated by NCIE, where Y, X, and W denote the target, observed causes, and unobserved causes, respectively.

Figure 2 .
Figure 2. The Venn Diagram illustration of the information provided to Y by X (red), Z (blue), and {X, Z} (yellow).

Figure 3 .
Figure 3.The architecture of the Neural Causal Information Extractor.

Figure 6 .
Figure 6.Dynamics of Z and the population of four species for Synthetic Case 4.
For simplicity, we abbreviate Y t as Y.While all the causes {X, Y <t } are known, we assume

Table 1 .
Mutual information provided to Y from different sets of variables.

Table 2 .
Mutual information provided to Y from different sets of variables.The Z in this table is generated by the original NCIE (Section 4.1).

Table 4 .
Mutual information provided to Y from different sets of variables.The Z in this table is generated by the modified NCIE (Section 4.2).