1. Introduction
One use of causal graphical models is to formulate and solve problems of identification of causal effects—or causal identification—in which hypotheses about the existence and direction of causal relationships between variables are used together with observational data to infer the effects of interventions. The interventions with effects to be computed may be either “atomic” interventions, represented with Pearl’s “do” operator [1], or more complicated ones involving randomness or dependence on other variables in a model [2]. Similarly, the data available for inference may be simply the joint probabilities of outcomes of what we will call perfect passive observation, or may be outcome probabilities associated with various combinations of perfect passive observation and controlled experiments [3]. In this work, we define graphical causal models in a way that is equivalent to standard ones but is designed to support reasoning about quite general classes of interactions between an agent and a causal data-generating process. This framework allows us to pose and solve new kinds of causal identification problems based on these more general types of probing schemes. We demonstrate the application of such a framework in the case of Markovian models, i.e., those in which every variable can be observed at least in some way. It is well known that for a Markovian causal model associated with a directed acyclic graph, all the model’s parameters, and hence essentially all interventional quantities, are identifiable from the graph and the joint distribution of the model’s variables under perfect passive observation. This identifiability might seem to depend on the existence of a certain factorization of that distribution. We consider probing schemes that may yield only partial information about variables’ values and that may also disturb those values. The outcome distributions for these schemes cannot generally be factored in the same way as distributions obtained via perfect passive observation; however, if a probing scheme satisfies certain abstract criteria, then it may serve as well as ordinary perfect passive observation for the purpose of identification in Markovian models.
In a graphical causal model as we define it, each node in a directed graph corresponds to an intervention locus where a variable “arrives” and then “leaves”, with the possibility for probing or manipulation in between. A generic interaction with a variable at a locus is represented by an instrument, examples of which include a perfect passive observation, which reports the value of a variable without changing it, and a surgical intervention, which discards the incoming value and sets the outgoing value. With all types of probing schemes represented by the same sort of mathematical entity, we can pose different kinds of causal identification problems by declaring different kinds of interaction to be available as means of gathering data for inference. Thus, we can ask about identification of causal quantities from statistics generated under combinations of passive observation and controlled experiments [4], or more generally under probing schemes that cannot obviously be classified as purely observational or purely experimental. Formulating and solving identification problems using these more generic probing schemes is the main contribution of the present paper.
The mathematical technology used in this paper is that of process theories, which are categories for composing input–output processes in sequence and in parallel. These categories arise in logic, computation, physics, and topology [5]. They admit graphical calculi of string diagrams, which we use to represent and calculate with processes. The present article builds on an approach to causal inference [6,7] in which the qualitative causal hypotheses in a directed acyclic graph specify a morphism in a syntactic process theory; then, a concrete data-generating process conforming to those hypotheses is a functorial interpretation of that morphism in a semantic process theory of matrices (the first of these cited works addresses causal identification only with ordinary observational data, i.e., data from perfect passive observations; the second does not discuss identification at all). This article also shares technical infrastructure with [8] and the conference paper [9]. It generalizes a result from [9] (specifically, it generalizes the case of Proposition 28 of [9] to graphs with arbitrarily many nodes and to instruments that may not be ∘-separable), but uses a different technique and is not concerned with the parallel treatment of classical and quantum causal models.
3. Instruments, Tomography, Combs, and Examples of Causal Inference
Because a set of matrices of non-negative real numbers is closed under sums, every hom-set in can be given the structure of a commutative monoid. Moreover, this monoidal structure is compatible with both sequential and parallel composition of processes in the sense that (for instance) composing the zero matrix (i.e., the additive identity element) in sequence or in parallel with another process yields another zero matrix, and addition is distributive over both sequential and parallel composition. This “enrichment” of the process theory in the category of commutative monoids (see Mac Lane [11] for further details) implies that sums distribute over entire diagrams. Sums can therefore be introduced without parenthetical grouping, as shown below.
When composed with a normalized state, the sum of standard basis effects i and j provides the probability of the ith configuration plus the probability of the jth configuration. This sum can be said to represent the property of being in the ith or the jth configuration (for ). The sum of all the effects i where the label i runs from 1 to is the discarding map. This map’s composition with a normalized state yields the number 1. The sum can be said to represent the property of “being in some configuration”, which is satisfied with probability 1 by any system with a normalized state.
Sums in allow for the definition of mathematical entities representing quite generic ways in which agents may interact with processes. The concept of instrument used here is derived from quantum information science.
Definition 4. A -valued instrument of type , where A and are system-types in , is a finite set of maps, each of which has the form , and of which the sum is a normalized map. Each map ϕ is called a branch of the instrument.
When an agent carries out a procedure represented by an instrument, one of the instrument’s branches is implemented and reported to the agent. The probability of a specific branch generally depends on both the branch and the process that the agent is probing with the instrument. An instrument with a single branch represents a procedure that results in the realization of that branch with probability 1.
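Definition 4 has a direct reading in the matrix semantics, which can be sketched as follows. This is our own illustrative code, not notation from the paper: a "normalized map" is modeled as a column-stochastic matrix, an instrument as a list of non-negative matrices whose sum is normalized, and the probability of a branch is obtained by applying the branch to a state and then discarding (summing) the output.

```python
import numpy as np

def is_normalized(m, tol=1e-9):
    """In this sketch, a map is normalized when every column sums to 1."""
    return np.allclose(np.asarray(m).sum(axis=0), 1.0, atol=tol)

def is_instrument(branches):
    """Definition 4: each branch is non-negative and the sum of branches
    is a normalized map."""
    return all((b >= 0).all() for b in branches) and is_normalized(sum(branches))

def branch_probabilities(branches, state):
    """Probability of each branch on a normalized state: compose the branch
    with the state, then discard the output by summing its entries."""
    return [float((b @ state).sum()) for b in branches]

# Illustration: perfect passive observation on a binary variable, with
# branch i keeping the value only when it equals i (cf. Example 11).
obs = [np.array([[1.0, 0.0], [0.0, 0.0]]),
       np.array([[0.0, 0.0], [0.0, 1.0]])]
rho = np.array([0.8, 0.2])            # a normalized state (column vector)
probs = branch_probabilities(obs, rho)
```

On this state, the two observation branches fire with probabilities 0.8 and 0.2, and the probabilities sum to 1, as guaranteed by normalization of the sum of branches.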
Example 2. For system-type A in , the standard basis effects constitute an instrument of type , as the sum of the standard basis effects is a discarding map, which is normalized. Suppose that an agent is confronted with (as many independent preparations as wanted of) an unknown normalized state in . When the agent “applies” the instrument to a single preparation, exactly one of the effects/branches is realized; effect i is realized with probability . By applying the instrument to a large number of preparations of ρ, the agent can use the frequencies to estimate the probabilities . These probabilities determine the state ρ; indeed, they are the entries in the column vector.
Example 3. Let denote the system-type of a binary random variable with values denoted 1 and 2. (Mathematically, the system-type is the number 2, corresponding to the number of values, but we adopt alternative notation to distinguish between the values and the system-type.) Let

Effects ϕ and are the branches of an instrument of type . Suppose that we apply this instrument to an unknown normalized state representing the variable’s distribution. On a single “trial”, if the variable’s true value is 1, then ϕ is realized with probability and with probability . If the true value of the variable is 2, then ϕ is realized with probability and with probability . On each trial, the received information is simply which of ϕ and has been realized. This information does not allow us to deduce from the trial which value the variable has taken, as we could in the previous example. Over a large number of trials, we can learn the probabilities and . Knowing the values of the instrument branches allows us to infer the entries in the vector ρ from these probabilities, i.e., to infer the unknown state.

In the above example, branch has higher likelihood than branch for either value of the random variable modeled by . However, the difference in relative likelihoods suffices for inference of . A somewhat more intuitive example is one in which each value of the random variable makes a different instrument branch more likely but not certain, in such a way that the instrument represents a small deviation from the perfect observation in Example 2.
Example 4. Let

If we consider ϕ and fuzzy versions of the predicate “has value 1” and “has value 2”, respectively, then the instrument with branches ϕ and “errs” twenty percent of the time when the true value of the random variable is 1 and ten percent of the time when the true value is 2. If we know that we are using this instrument, we can infer an unknown state (on system-type ) exactly in the limit of infinitely many trials. Although this instrument is more informative on a single trial than the instrument in Example 3, the way in which the empirically revealed branch probabilities determine an initially unknown state is mathematically the same in both examples.

Example 5. A discarding map in is the sole branch of an instrument of type . An agent applying this instrument to an unknown normalized state on system-type A is told on each trial only that the single branch has been realized; the agent learns nothing about the state.
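The inference in Example 4 can be sketched numerically. The display equations for the effects are not reproduced above, so we assume here that they are ϕ = (0.8, 0.1) and its complement (0.2, 0.9), which is consistent with the stated error rates (20% when the true value is 1, 10% when it is 2); the state below is our own illustrative choice.

```python
import numpy as np

# Assumed branch effects of Example 4's instrument (rows of E); their sum
# is the discarding effect (1, 1), so they form a valid instrument.
phi     = np.array([0.8, 0.1])
phi_bar = np.array([0.2, 0.9])
E = np.vstack([phi, phi_bar])

rho_true = np.array([0.3, 0.7])   # unknown normalized state (illustrative)
probs = E @ rho_true              # branch probabilities learned from trials

# The two effects are linearly independent, so the branch probabilities
# determine the state: solve E @ rho = probs.
rho_inferred = np.linalg.solve(E, probs)
```

The solve step is the "limit of infinitely many trials" inference: once the exact branch probabilities are known, the state is recovered exactly.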
Example 6. Let denote the system-type of a ternary random variable with values denoted 1, 2, and 3 ( is convenient notation for the system-type 3). Let

These two effects are the branches of an instrument of type . If we apply this instrument to an unknown normalized state on system-type , then on a trial when we learn that ϕ has been realized, we can be certain that the true value of the random variable on that trial was not 2. If, on the other hand, we learn that has been realized, we can be certain that the true value of the random variable was not 1. This instrument, however, does not allow us to infer from infinitely many trials the value (i.e., probability distribution) of an arbitrary state. Let the arbitrary state be

The probabilities of ϕ and when the instrument is applied to this state are and , respectively. If we learn these probabilities from many trials, then we learn two linear equations in three unknowns. We also know the normalization condition . These three equations do not generally have a unique solution for a, b, and c.
Now, suppose that we have access to both the instrument and another instrument with

Although neither instrument by itself suffices to determine arbitrary unknown normalized states, both instruments together do suffice: with the entries in the state called a, b, and c, as above, using the first instrument allows us to estimate and , while using the second allows us to estimate c and . The resulting system of linear equations in , and c has a unique solution.

The procedure of inferring an unknown state using the probabilities of various instrument branches (that may not be standard basis effects) is called state tomography. For an unknown state and a known effect of appropriate type, once the probability is known, it does not matter what instrument (of which is a branch) was used to learn that probability. This irrelevance of the instrument under which an effect was realized extends to the instruments described below, for which the branches are not effects.
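The rank argument behind this example can be sketched numerically. The paper's concrete effect values are not reproduced above, so the vectors below are hypothetical choices with the stated qualitative features: the first instrument's branches never fire on value 2 and never on value 1, respectively, and the second instrument separates c from a + b.

```python
import numpy as np

# Hypothetical branches of the first instrument on the ternary system;
# their sum is the discarding effect (1, 1, 1).
phi = np.array([1.0, 0.0, 0.5])    # zero on value 2
psi = np.array([0.0, 1.0, 0.5])    # zero on value 1

# One instrument plus the normalization condition a + b + c = 1 gives a
# rank-deficient system: the third row is the sum of the first two.
M1 = np.vstack([phi, psi, np.ones(3)])
rank_one_instrument = int(np.linalg.matrix_rank(M1))

# Hypothetical branches of the second instrument, which estimate c and a + b.
chi   = np.array([0.0, 0.0, 1.0])
omega = np.array([1.0, 1.0, 0.0])

# Both instruments together give a full-rank system: unique (a, b, c).
M2 = np.vstack([phi, psi, chi, omega])
rank_two_instruments = int(np.linalg.matrix_rank(M2))
```

The first rank is 2 (no unique solution for three unknowns), while the combined system has rank 3, mirroring the text's conclusion that the two instruments together suffice for state tomography.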
An agent’s preparation of a normalized state is represented by a one-branch instrument . State preparation instruments can be used along with instruments made up of effects to perform process tomography on unknown normalized processes in with generic input and output system-types.
Example 7. Suppose that an agent is confronted with an unknown normalized process in . In addition, suppose that the agent has access to a -branch instrument consisting of the standard basis effects for B and that for each standard basis state the agent has access to a state preparation instrument (that is, the agent can prepare any standard basis state at will). If the agent prepares standard basis state i at the input of f, then the result is the state shown below.

If the agent now takes the instrument of type B → I consisting of the standard basis effects j and applies it at the output of f, then each branch j is realized with the following probability.

The agent can learn these probabilities (for all standard basis effects j) by performing the experiment with the fixed state preparation instrument for infinitely many trials. By repeating this procedure with the preparation instruments for all standard basis states i, the agent can learn the probabilities (10) for all i and j. These probabilities are the entries in the matrix f; therefore, the agent has now learned the value of f.

As some of these examples of state tomography suggest, other instruments could be substituted for the instrument consisting of the standard basis effects, and process tomography would remain possible. Process tomography would also remain possible if the standard basis state preparations were replaced by certain other state preparations.
Example 8. Let single-branch state preparation instruments and be defined as follows.

These instruments together with the two instruments and of type from Example 6 suffice for tomography of an unknown normalized process in . By applying each of the four possible combinations consisting of an instrument of type and an instrument of type , we can learn the following probabilities from infinitely many trials.

These numbers are not themselves the entries of the matrix f, but they allow the entries to be determined. Here, we have essentially performed a controlled experiment on the black-box process f, but it is controlled in an unusual sense, as we could not choose the value to feed into f, only which of two known distributions the value would be drawn from. What matters for the tomography of a process in is whether we can prepare enough linearly independent states and whether enough linearly independent effects appear in the instruments of type .
Definition 5. A set of effects in a process theory is called informationally complete for system-type A if any state is uniquely determined by the set of numbers . Similarly, a set of states is called informationally complete for A if any effect is uniquely determined by the set of numbers .
Example 9. In , an informationally complete set of states for system-type A is a set of linearly independent column vectors of length , while an informationally complete set of effects for A is the set of transposes of such a set of column vectors. In particular, the set of standard basis states on A is informationally complete, as is the associated set of effects.
Tomography of arbitrary unknown normalized processes (for fixed A and B) in is possible if one has access to preparation instruments for all of an informationally complete set of normalized states on A and access to instruments of type for which the union (a set of effects) is informationally complete for B.
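The linear-algebraic criterion in Example 9 suggests a simple check, sketched below with names of our own choosing: a finite set of states (or effects) for an n-dimensional system is informationally complete exactly when the vectors span the n-dimensional space, i.e., when the matrix having them as columns has rank n.

```python
import numpy as np

def informationally_complete(vectors):
    """Check whether a finite set of vectors (states as column vectors, or
    effects as their transposes) spans the space for the system-type,
    i.e., whether the stacked matrix has full rank."""
    m = np.column_stack(vectors)
    return int(np.linalg.matrix_rank(m)) == m.shape[0]

basis_states  = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
noisy_effects = [np.array([0.8, 0.1]), np.array([0.2, 0.9])]  # cf. Example 4
degenerate    = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]  # rank 1 only
```

The standard basis is informationally complete, as is the pair of noisy effects assumed for Example 4; two copies of the uniform effect are not, since they give only one independent linear equation.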
The following property of regulates the tomography of processes with multiple inputs or multiple outputs.
Proposition 1. In the process theory , any process

is determined by the numbers

where and π index any informationally complete sets of states or effects for the appropriate system-types. We call this property tomographic locality.

Proof. The proposition follows from the prior characterization of informationally complete sets together with facts about tensor products in algebra, specifically, that tensor products of basis elements in two vector spaces span the tensor product of the spaces. □
Thus, one way to perform process tomography on an unknown normalized process in is to select an informationally complete set of normalized states , an informationally complete set of normalized states , an informationally complete set of effects , and an informationally complete set of effects , then learn the probabilities in (11). In terms of instruments, for every pair of states and , it is necessary to probe f over many “trials”. On each trial, prepare the states and on f’s inputs by applying the corresponding state preparation instruments, and measure f’s outputs with an instrument and an instrument for which the branches are the effects from and , respectively. (We assume for simplicity that each of the two sets of effects and forms a single instrument; in this case, only one pair of measurement instruments is needed for the process tomography protocol, while many single-branch state preparation instruments are needed. Alternatively, it may be that there are multiple instruments comprising different subsets of, e.g., . Then, for each fixed pair and , one would need to vary the measurement instrument.) On each trial with states and , an effect and an effect will be realized. As the number of trials with fixed states and approaches infinity, it becomes possible to estimate exactly the probabilities (11) for the fixed and and for all and . By varying the states and in this procedure, we can obtain all of the probabilities necessary to determine the value of f.
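Tomographic locality can be illustrated for a process with two inputs. The sketch below uses standard basis states and effects (informationally complete, per Example 9) and our own randomly generated process; the numbers against product states and effects determine every matrix entry.

```python
import numpy as np

# A random normalized process f with two binary inputs (joint dimension 4)
# and one binary output; columns sum to 1.
rng = np.random.default_rng(0)
f = rng.random((2, 4))
f /= f.sum(axis=0)

states  = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # complete for each input
effects = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # complete for the output

# Collect the numbers  pi . f . (rho (x) rho')  over all choices, as in (11).
numbers = {}
for i, r1 in enumerate(states):
    for j, r2 in enumerate(states):
        inp = np.kron(r1, r2)              # product state on the joint input
        for k, pi in enumerate(effects):
            numbers[(i, j, k)] = float(pi @ (f @ inp))

# With basis states and effects, these numbers are exactly the entries of f:
# the product state (e_i (x) e_j) is the joint basis state with index 2i + j.
reconstructed = np.zeros_like(f)
for (i, j, k), v in numbers.items():
    reconstructed[k, 2 * i + j] = v
```

The reconstruction matches f exactly, reflecting the proof of Proposition 1: tensor products of basis elements span the tensor product space, so product states and effects suffice.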
The causal models to be defined in this paper involve second-order processes that take various first-order processes as input and determine the probabilities of outcomes of either “observational” or more “interventional” interactions with systems. Thus, for example, a causal model based on the graph

will involve a process in , where x and y are normalized. Each variable now corresponds to an input–output pair, depicted as a gap in the wire. This splitting of variables into an input and an output is similar to the situation in a single-world intervention graph [16], where nodes are split in two. The idea of a causal model as a second-order process with an input and an output for each node appears in works [17,18] focused on “quantum causal models”.
The gap formed by an input–output pair represents an intervention locus where a “system” arrives and then leaves, with an opportunity for interaction in between. The interactions in this case correspond to instruments of types and ; that is, when an instrument is applied at a gap, a branch is realized and reported to the agent applying the instrument. The “state” of the system that is fed forward depends on which branch has been realized.
If is a branch of the instrument applied at the X gap and is a branch of the instrument applied at the Y gap, then the joint probability of realizing f and g is as follows.

The sum of these numbers over all pairs of branches f and g is 1, because the sums of the branches f and of the branches g are normalized maps.
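This normalization of the joint branch probabilities can be checked numerically. The sketch below is ours: the two-locus comb for the graph with an edge from X to Y is built from a state x and a channel y (illustrative values), a branch plugged into a gap is a matrix, and the joint probability is the scalar obtained by composing everything and discarding the final output.

```python
import numpy as np

x = np.array([0.8, 0.2])          # normalized state at X (illustrative)
y = np.array([[0.6, 0.3],
              [0.4, 0.7]])        # normalized process X -> Y (illustrative)

# Perfect passive observation branches for a binary variable.
obs = [np.array([[1.0, 0.0], [0.0, 0.0]]),
       np.array([[0.0, 0.0], [0.0, 1.0]])]

def joint_prob(f, g):
    """Plug branch f into the X gap and branch g into the Y gap:
    discard . g . y . f . x  (the trailing sum is the discarding map)."""
    return float((g @ (y @ (f @ x))).sum())

table = [[joint_prob(f, g) for g in obs] for f in obs]
total = sum(sum(row) for row in table)
```

With perfect passive observation at both gaps, entry (i, j) of the table is x[i] * y[j, i], the usual observational joint probability, and the entries sum to 1.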
Following the quantum information literature [19], we call second-order processes with gaps combs. Each of these combs can still be specified as an ordinary first-order process in . Thus, the comb in (13) is a matrix , and is equivalently depicted as follows.

For technicalities regarding translation between the two representations—ensuring, e.g., consistency of calculations of the numbers generated by composition with processes and —see Jacobs, Kissinger, and Zanasi [6]. In this paper, all combs are depicted in the style of (13). It is useful to keep in mind, however, that a comb is ultimately a matrix of numbers (in our setting, a matrix of conditional probabilities governing a causal scenario); for our purposes, causal inference will involve learning these matrix entries by applying instruments.
Example 10. For any system-type A in , there is an instrument for which the only branch is . Applying this instrument at a gap does not extract any information for the agent, and leaves the variable to pass through undisturbed.
Example 11. For system-type A in , the perfect passive observation instrument of type has branches forming a set (mathematically, the instrument is the set). Knowing which branch of a perfect passive observation instrument has been realized means being certain of the value that a random variable has taken and being certain that the variable retains that value after observation.
Example 12. For the comb in (13), for perfect passive observation instruments and at the X and Y loci, respectively, the joint branch probabilities are

where i runs from 1 to and j from 1 to . If we apply the instruments for infinitely many trials and record the frequency of each pair of branches and , we learn a table indexed by i and j in which the entries are the probabilities in (14). These probabilities would typically be written as . They factorize as expected for a causal model conforming to the graph (12)—they have the form , or equivalently the form shown below.

Example 13. Another important example of an instrument of type that can be applied at a gap in a comb is one that discards the incoming system and prepares a fixed normalized state ψ for the outgoing system. A single-branch “discard-and-prepare” instrument has the following branch.

If ψ is a standard basis state, then this instrument corresponds to what is called an “atomic” or “do” intervention. The state ψ may also be mixed, either because the agent carrying out the intervention procedure cannot fully control which value the variable has after the intervention—and the resulting uncertainty is represented by a specific probability distribution—or because the intervention procedure is designed to randomize the value in a specific way. One example of a causal identification task is to use joint branch probabilities for perfect passive observation instruments in order to infer how perfect passive observations at one locus would turn out if a discard-and-prepare instrument were applied at another locus. Suppose that in the comb (13), and, initially unbeknownst to us, we have

We learn the probabilities (14) for all four pairs of values of i and j. Now, we would like to infer the probability

i.e., the probability that passive observation at locus Y results in branch 1 if the value leaving X is forcibly set to 2. Because x and the standard basis state are normalized, we know that among the three scalars in (15), the top and bottom are each equal to 1 and the desired probability is as follows.

The reader, knowing the value of the process y, can see that the desired probability is ; however, from the perspective of the agents confronted with the inference problem, we do not yet know the value of y. We can transform the expression for the desired quantity into an expression with a value that is immediately computable from the known numbers (14). First, we note the following:

where the inverse of a scalar is indicated by the scalar inside . Next, we rewrite the scalar being inverted in terms of the numbers (14).
For the first equality, we have used normalization of both y and a standard basis state; for the second equality, we have used normalization of the sum of the perfect passive observation branches at Y. The equation combining the far left-hand number with the far right-hand number essentially recovers the former from a joint probability distribution by marginalization. We now know that the desired interventional probability is as follows:

which we can compute by referring to the table of joint probabilities from perfect passive observations; we obtain , as the reader expected.

The last example demonstrates how probabilities from perfect passive observation instruments are in fact what would normally be called observational probabilities. In our example, the “Markov assumption”, i.e., the assumption encoded in the graph (12) that the unknown comb has the form in (13), is used in the inference in the sense that the inference procedure does not apply to a two-locus comb of totally unknown shape. Of course, the direction of the edge indicates that the comb has locus X before locus Y; just as importantly, though, the Markov assumption rules out (most) combs of the form

for which the just-presented inference protocol would not work. Put another way, for generic combs of the form in (17), the probabilities for perfect passive observation instruments at the X and Y loci do not uniquely determine the probabilities for discard-and-prepare instruments at the X locus and perfect passive observation instruments at the Y locus. This is essentially a statement of the well-known fact that observational data do not suffice to distinguish between causal influence of variable X on variable Y and joint influence of a latent confounder on both X and Y. To continue emphasizing the language of instruments, let us state explicitly what would suffice to determine the interventional quantity

namely, the ability to implement the discard-and-prepare instrument with sole branch

and implement a perfect passive observation instrument at the Y locus, for infinitely many trials.
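The inference of Example 13 can be sketched numerically. The concrete values of x and y are elided above, so we assume x = (0.8, 0.2) and y with columns (0.6, 0.4) and (0.3, 0.7), values consistent with the table of branch probabilities given later in Example 14; the marginalize-and-divide steps below mirror the derivation in the text.

```python
import numpy as np

# Assumed model: x is the distribution of X, and column j of y is the
# distribution of Y given X = j + 1.
x = np.array([0.8, 0.2])
y = np.array([[0.6, 0.3],
              [0.4, 0.7]])

# Observational probabilities (14): P[j, i] = P(X = i+1, Y = j+1).
P = y * x

# Desired interventional quantity: probability that observation at Y yields
# value 1 when the value leaving X is forcibly set to 2.  Marginalize the
# table to recover P(X = 2), then divide, as in the text's rewriting.
p_x2 = P[:, 1].sum()        # marginalization recovers x[1]
p_do = P[0, 1] / p_x2       # the desired probability, equal to y[0, 1]
```

The computation yields 0.3 under these assumed values, and the division step is exactly the scalar inversion described in the derivation.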
Example 14. Let the effects ϕ and on system-type be as in Example 4:

In addition, let

Consider the instrument of type of which the branches are and . Suppose that an agent applies this instrument to an intervention locus representing a binary variable. On a single trial, if the true value of the variable arriving at the locus is 1, then with probability , branch is realized and reported to the agent, and value 1 is fed forward; with probability , branch is realized and reported, and value 2 is fed forward. If the true value of the variable arriving at the locus is 2, then with probability , branch is realized and reported, and value 2 is fed forward; with probability , branch is realized and reported, and value 1 is fed forward. (The specification of the value that is fed forward in each case is redundant, as it follows from whether the branch that is realized on the trial contains state ψ or state .) The agent knows the values of the branches, and for its own record-keeping can name branch “1” and branch “2”. Now, on each trial, the instrument reports the correct value of the variable with probability if the true value is 1 and with probability if the true value is 2. Whichever value is reported is also fed forward; thus, when the instrument misreports the value of the variable, it also changes the value to match the report.

Now, suppose that we (taking the place of “the agent”) apply this instrument at the X locus in the comb (13) and a perfect passive observation instrument of type at the Y locus. Again, . (, X, and Y all denote the same system-type in , but X and Y also denote loci and the associated input and output wires.) The matrix value of the comb, determined by the values of x and y in Example 13, is initially unknown to us, but we do know the comb shape, perhaps from the graph (12). We are going to infer the value of the entire comb, which will allow us to infer the results of any counterfactual combination of instruments at the two loci. The probabilities learned from infinitely many trials are the following.

|               | Y, i = 1 | Y, i = 2 |
| X, branch “1” | 0.396    | 0.264    |
| X, branch “2” | 0.102    | 0.238    |
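The comb reconstruction in Example 14 can be sketched numerically from this table. The effect values are elided in the text, so we assume they are ϕ = (0.8, 0.1) and its complement (0.2, 0.9), as in our reading of Example 4, with the first branch feeding value 1 forward and the second feeding value 2 forward.

```python
import numpy as np

phi     = np.array([0.8, 0.1])    # assumed effect in branch 1
phi_bar = np.array([0.2, 0.9])    # assumed effect in branch 2

# Joint branch probabilities learned from infinitely many trials:
# rows index the branch at the X locus, columns the observation at Y.
table = np.array([[0.396, 0.264],
                  [0.102, 0.238]])

# The row sums are the scalars (phi . x) and (phi_bar . x).
p_phi, p_phi_bar = table.sum(axis=1)

# The effects form an informationally complete set, so these two scalars
# determine the state x (the "change of basis" computation in the text).
x = np.linalg.solve(np.vstack([phi, phi_bar]), [p_phi, p_phi_bar])

# Each entry of y is a table entry divided by the probability of the branch
# that fixed y's input, e.g. y[0, 0] = P(branch 1, Y = 1) / (phi . x).
y = np.array([[table[0, 0] / p_phi, table[1, 0] / p_phi_bar],
              [table[0, 1] / p_phi, table[1, 1] / p_phi_bar]])
```

Under these assumptions, the table determines x = (0.8, 0.2) and y with columns (0.6, 0.4) and (0.3, 0.7), so the whole comb is identified even though the probing instrument at X was neither a perfect observation nor a surgical intervention.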
First, we determine the scalars and by rewriting them in terms of probabilities we can look up in the table:

and

thus, and . These numbers allow us to determine the value of x, as the effects ϕ and form an informationally complete set. The computation of the matrix entries of x from these numbers is essentially a “change of basis” computation, and we omit it. Now, we wish to determine the process y. Its matrix entries

are also known as

respectively. We can compute each one using the already known numbers and and the probabilities in the table. For example,

that is, the entry of y is the entry of the table of branch probabilities times , or , as the reader expects from prior knowledge of the matrix y. We can compute the other entries of y similarly.

In the next section, we are going to return to the language of syntactic and semantic causal structures to more rigorously reformulate the causal models from Section 2 (which in general include copy maps corresponding to forks in graphs) using combs and instruments, thereby treating the kind of observation implicit in the state-based representation on the same footing as other kinds of probing, such as that represented by discard-and-prepare instruments or the instrument in Example 14. The semantic causal structure involved in a causal model will be a comb in with an intervention locus for each variable, producing probabilities when instruments are assigned to all the loci. What are normally called “observational data” are then the probabilities generated under one particular instrument assignment, namely, the assignment of a perfect passive observation instrument to every locus.
For example, instead of the state in (5), a causal model based on the graph G in (1) can be represented with a comb , as shown below.

This comb (18) provides the observational data encoded in the state in (5): the probability of any particular joint outcome for the variables in the joint state in (5) is provided by the composite of with the appropriate standard basis effects, as shown below.

Per Equation (6), the right-hand expression is equal to the joint probability of perfect passive observation branches at all five gaps in the comb:

where the middle expression involves the original Bayesian network representation from (5). This equivalence helps to justify the definition of perfect passive observation instruments. Knowing the probabilities of “values of random variables” encoded in the state , i.e., knowing the state, is the same as knowing the outcome probabilities for perfect passive observation instruments.
The comb in (18) assigns probabilities to joint outcomes for any combination of instruments , , , , and applied at the five gaps. For example, the joint probabilities for a discard-and-prepare instrument preparing standard basis state i at the B gap and perfect passive observation instruments at all other gaps follow.

These probabilities are the same as those encoded in the joint state on the right-hand side of (7). In order to obtain the probabilities from (7), we simply compose the joint state with the appropriate standard basis effects. (There is a slight difference in the handling of discard-and-prepare interventions between representations such as (7) and the new comb representations. In the comb representation of the causal Bayesian network in our running example, if the single-branch discard-and-prepare instrument at the B gap prepares a mixed state instead of a standard basis state i, then there is no way to model observation of the actual value about which encodes uncertainty. In contrast, in the representation from (7) with i replaced by , the copy map on B models perfect passive observation of variable B immediately after the preparation of .)
By passing from states (as in (5)) to combs (as in (18)), we treat perfect passive observation as just one among many kinds of instrument that can in principle be applied to probe the process governing a causal scenario. Hence, we can precisely pose new kinds of causal identification problems, such as the one solved in Example 14, by declaring that the statistics available for use in the inference are the outcome probabilities associated with instruments other than perfect passive observation. The task is then to use those statistics together with qualitative assumptions encoded in a graph to deduce a property of the data-generating process that implies how the process would respond to other instruments. For simplicity, we focus on identifying the value of the entire comb. In the next section, we will define in full the syntactic process theory associated with an arbitrary directed acyclic graph, explain how to construct the syntactic causal structure from the graph, and define causal models in terms of interpretation functors from the syntactic process theory into .
4. Functorial Causal Models
Given a directed acyclic graph G, we formulate the syntactic causal structure for G-based Markovian causal models as a process in a syntactic process theory constructed as follows. First, we form a free process theory, i.e., a free strict symmetric monoidal category over a signature determined by the graph G. Here, a process-theoretic signature provides the symbols from which processes in the free process theory may be formed. It consists of a set of system-type symbols and a set of labeled boxes and other icons, all of which have input and output wires labeled with system-type symbols. For each vertex X in G, the signature has a system-type symbol X and a box labeled x with an output wire labeled X and input wires labeled by the system-type symbols for X’s parents in G. Also for each vertex X, there are (i) an icon called “copy” with as many outputs as vertex X has children in G, and (ii) an icon called “discard”. If vertex X has no children in G, then (i) and (ii) are identical.
The free process theory has the following system-types: a unit system-type I, the system-type symbols in the signature, and terms that are formal combinations of those symbols joined by ⊗. Its processes include identity maps for all system-types, swap maps for all pairs of system-types, and diagrams that can be formed from those maps and the boxes in the signature via sequential and parallel composition. Processes are equal if and only if their diagrams are identical up to rearrangements that preserve connections.
To form the syntactic process theory for G-based causal models, we impose equations between processes in the free process theory. “Imposing” equations means constructing a quotient process theory, which is defined as one wherein the system-types are the same as those of the original but wherein the processes are equivalence classes of processes from the original theory. We can treat individual representatives of the equivalence classes as being themselves processes in the quotient process theory, replacing them with other representatives at any point in a calculation. The defining equations include those in (4), which make the “copy” icon behave as an abstract copy map that interacts with discarding in a way that mirrors the copy–discard interaction in the semantic theory (the syntactic process theory is a copy–discard process theory), along with, for every map in the theory, an equation stating that discarding the map’s output is equal to discarding its inputs. The latter condition mirrors the normalization condition on stochastic matrices, and we accordingly call the processes in the syntactic process theory (which all satisfy it) “normalized”.
The syntactic causal structure for graph G is now a process in this syntactic process theory, constructed as follows. For each vertex X, we connect the output wires of the X copy icon to the input wires of the boxes associated with X’s children. The resulting diagram has an input and an output wire for each vertex in G. We arrange these input–output pairs to form gaps in a comb.
For example, the graph in (1) gives as its syntactic causal structure the process (18) (viewed as a formal diagram, not a matrix).
A semantic causal structure realizing the qualitative hypotheses in G is the image of the syntactic causal structure under a certain functor of process theories, called an interpretation functor.
Definition 6. For process theories with unit system-types I and I′ and families of swap maps σ and σ′, respectively, a functor of process theories is a strict symmetric monoidal functor, i.e., an ordinary functor of categories F such that, for system-types A and B in the domain theory:

F(I) = I′, F(A ⊗ B) = F(A) ⊗ F(B), and F(σ_{A,B}) = σ′_{F(A),F(B)}.
In order for a functor of process theories to be a legitimate interpretation functor, we require that F send copy and discard maps in the syntactic theory to copy and discard maps in the semantic theory, respectively (this requirement makes F a copy–discard functor). Normalization of all processes in the syntactic theory ensures that these processes’ images under F are stochastic matrices. In particular, the entire semantic causal structure is a stochastic matrix.
For a DAG G, a copy–discard functor F is called a G-based causal model. A G-based causal model implicitly includes the syntactic causal structure defined by G along with the semantic causal structure, its image under F. The same term will be used to refer to the semantic causal structure itself.
4.1. Non-Markovian Models
A model constructed according to the procedure presented thus far is Markovian; each vertex in the graph has a corresponding input–output pair in the syntactic and semantic causal structures, meaning that the model mathematically allows us to consider probing at any node in the network. On the other hand, we may want to consider non-Markovian models based on a directed acyclic graph, in which some loci corresponding to vertices are “latent”. For a directed acyclic graph G with a list of vertices that are to correspond to latent loci, the prescription for forming the syntactic causal structure is as above, with one additional final step: for each locus that is to be latent, we join the output to the input with an identity map of the appropriate type, thereby closing the gap in the diagram.
In the example with graph (1), if A is to be the sole latent locus, then the syntactic causal structure is as follows.
4.2. Causal Structures for ADMGs
More complicated kinds of graphs—featuring, e.g., bidirected edges or directed hyperedges—can represent causal hypotheses for non-Markovian scenarios without specifying all of the latent variables [20,21]. For example, the following Acyclic Directed Mixed Graph (ADMG) encodes, in addition to familiar hypotheses represented by black directed edges, the hypothesis (represented by a red bidirected edge) that there may be latent variables which simultaneously influence both B and C. Examples of DAGs consistent with these hypotheses include G from (1) with A latent, along with other DAGs in which the vertices playing the role of common causes are latent.
It is possible to state a recipe for constructing a syntactic causal structure from an ADMG, similar to the way a syntactic causal structure is constructed from a DAG. The difference is that there are now processes (states) corresponding to bidirected edges. For example, the syntactic causal structure for the ADMG in (21) is as follows.
Again, a causal model based on the graph is provided by a copy–discard functor, and the image of the syntactic causal structure is called the semantic causal structure.
We leave a more thorough description of ADMG-based functorial causal models to other work; here, we simply note that the process-theoretic approach with both a syntactic and a semantic process theory allows us to understand very clearly how an ADMG represents “margins” of DAG-based models in the causal context. For example, at the level of syntactic causal structures, “marginalizing” (by plugging in an identity map) locus A in (18), which is based on the DAG in (1), yields (20), which also has the form (22) based on the ADMG (21). Alternatively, marginalizing in a single semantic interpretation of (18) results in a semantic causal structure that can be specified by a semantic interpretation of either (20) or (22).
5. Local Intervention Regimes and Identification
We use the term intervention locus to refer to an input–output pair in the semantic causal structure as well as to the corresponding input–output pair in the syntactic causal structure. If an instrument of the appropriate type is assigned to each locus X, then the semantic causal structure assigns probabilities to all possible combinations of branches. From here on, a single label will often stand for a locus (in both the syntactic and semantic causal structures), its input and output system-types in the syntactic causal structure, and its input and output system-types in the semantic causal structure.
Definition 7. A local intervention regime assigns an instrument of the appropriate type to each locus A in an interventional causal model.
“Implementing” a local intervention regime for one iteration of a repeated causal scenario represented by a causal model results in the joint realization of a combination of maps at all the loci: at each locus, one branch of the instrument assigned to that locus is realized. The joint probability of this combination of local outcomes is the number resulting from plugging the maps into their loci in the semantic causal structure. The sense in which the causal models just defined are “causal” is that for a given local intervention regime, changing the instrument assigned to a locus (thereby creating a new local intervention regime that agrees with the first at all but one locus) might result in different outcome probabilities for the loci where the instruments are unchanged. In other words, a choice at one locus might “cause” events at other loci.
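As a concrete sketch of the “plugging in” just described, consider a two-locus chain; the mechanisms, instrument branches, and all numbers below are illustrative and not taken from the paper’s running example. The joint probability of a combination of branch outcomes is obtained by composing the branches with the mechanisms and discarding (summing) the final output.

```python
import numpy as np

# Hypothetical two-locus chain X -> Y (values illustrative).
x = np.array([0.3, 0.7])                  # state at X: P(X)
y = np.array([[0.9, 0.2],                 # mechanism: P(Y|X), columns indexed by X
              [0.1, 0.8]])

def ppo_branch(i, dim=2):
    """Perfect passive observation branch for outcome i: the matrix |i><i|."""
    b = np.zeros((dim, dim))
    b[i, i] = 1.0
    return b

def joint_prob(branch_X, branch_Y):
    """Plug one branch into each locus of the comb; discarding = summing entries."""
    out = branch_Y @ y @ branch_X @ x     # state on Y after both probings
    return out.sum()

# Under perfect passive observation, these are just the joint P(X=i, Y=j):
p = np.array([[joint_prob(ppo_branch(i), ppo_branch(j))
               for j in range(2)] for i in range(2)])
print(p)         # rows: outcome at X, columns: outcome at Y
print(p.sum())   # the branch probabilities of one regime sum to 1
```

Changing the instrument at X (say, to a disturbing one) changes the outcome probabilities at Y, which is the sense in which a choice at one locus “causes” events at another.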
In a causal identification problem, one is given a graph (indicating a syntactic causal structure) and outcome probabilities from probing the initially unknown semantic causal structure with one or more local intervention regimes. Causal inference consists essentially in using outcome probabilities for some local intervention regimes to infer those for other local intervention regimes. The problem of causal identification is to use outcome probabilities from a limited set of local intervention regimes together with the shape of the syntactic causal structure (or the graph) to infer probabilities of outcome combinations under other local intervention regimes.
In the standard identification setup (e.g., in Pearl [1]), the probabilities available for inference are those that in our framework are generated under local intervention regimes consisting only of perfect passive observation instruments. We will study identification tasks based on data from other kinds of instruments.
The set of local intervention regimes whose outcome statistics are to be used for inference is specified as follows: for each locus A, an accessible set of instruments is given. An accessible local intervention regime is then a local intervention regime in which each instrument comes from the accessible set for the locus to which the instrument is assigned. The probabilities available for inference are the probabilities that can be “learned” from accessible local intervention regimes—that is, for each accessible local intervention regime, the joint probability of each combination of branches will be considered known.
The quantity to be identified in a causal identification problem is the image of some map in the syntactic process theory under the interpretation functor F. In this paper, we focus for simplicity on identifying the entire semantic causal structure.
In full, the causal identification problem studied in this paper is defined by the following in each instance: a causal model consisting of a directed acyclic graph G, a set of latent loci, and an interpretation functor F, together with an accessible set of instruments for each locus A. The inputs for the identification task are the graph G and the set of latent loci—which together prescribe how to construct the syntactic causal structure—along with the data generated by the model under each accessible local intervention regime. For each accessible local intervention regime, a table is provided in which every combination of instrument branches realizable at the various loci on a single “trial” is associated with its probability. The collection of tables specifies the set of accessible local intervention regimes, and consequently the accessible sets of instruments. The probabilities in these tables are called accessible probabilities. The task is to compute the semantic causal structure. If this task is possible, we say that the semantic causal structure (or “the model”) is identifiable from the accessible sets of instruments (and the graph with the list of latent loci).
We assume from here on that all DAG-based models are strictly positive in the following sense (it is straightforward to formulate the corresponding assumption for models based on ADMGs and other kinds of decorated graphs): a causal model F based on a DAG G is called strictly positive if, for every process x corresponding to a vertex X in G, the stochastic matrix F(x) has only strictly positive entries. Our strict positivity assumption plays a similar role to that of requirements in the standard causal inference literature that distributions be strictly positive; in both cases, strict positivity guarantees that the conditional probabilities needed for calculations of interventional quantities are defined. Intuitively, if one wants to infer a “do”-conditional probability from observational data, the conditioning event needs to occur with non-zero probability in those data. Our restriction to strictly positive models brings the following fact into our calculations: when a matrix f with only strictly positive entries is composed with a state having at least one non-zero entry, the resulting state has only strictly positive entries. If f and the state are normalized, then the result corresponds to a strictly positive probability distribution.
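A minimal numerical illustration of this fact (matrix and state values are ours, chosen for illustration): even a deterministic input state, which does not have full support, is mapped by a strictly positive stochastic matrix to a strictly positive distribution.

```python
import numpy as np

# A column-stochastic matrix with strictly positive entries (illustrative values).
f = np.array([[0.6, 0.1],
              [0.4, 0.9]])

# A normalized state with at least one non-zero entry; here it is deterministic,
# so it does not have full support.
omega = np.array([1.0, 0.0])

out = f @ omega
print(out)   # every entry is strictly positive, and the entries sum to 1
```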
Our version of the most common kind of identification task involves accessible sets consisting only of perfect passive observation instruments (each accessible set is a singleton). In these cases, we speak of quantities being identifiable “from perfect passive observation”.
Proposition 2. Models based on the DAG in (1) with A and C latent, or based on the ADMG in (21), are not identifiable from perfect passive observation.

Proof. As in the discussion of comb shape (17), this is an example illustrating our framework using an established fact from the causal inference literature: in general, (interventional quantities associated with) non-Markovian models are not identifiable from “observational data”. □
Proposition 3. Models based on the DAG in (1) with only A being latent are identifiable from perfect passive observation. Put another way, any semantic interpretation of (20) is discernible from knowledge of the syntactic causal structure and the statistics from a local intervention regime consisting only of perfect passive observation instruments.
From here on, we distinguish boxes in a syntactic process theory from their images in the semantic theory by indicating the latter in boldface; thus, a boldface symbol indicates the image of the corresponding process under the interpretation functor F.
Proof. An arbitrary semantic causal structure is as shown, and the accessible probabilities are indexed by the outcomes at the observed loci, each index ranging over the values of the corresponding variable. Identification proceeds in several steps. First, we learn the value of the state (23) by computing its matrix entries. The reason for the first equality is that the discarded factor is the scalar 1, since the maps involved are normalized; the second equality follows from the fact that adding the branches of an instrument (here, a perfect passive observation instrument) results in a normalized map (here, an identity map), which discarding “falls through”.
Next, we use the matrix entries of (23) together with the accessible probabilities to compute the process (24), whose matrix elements take the form of a product of an accessible probability and the inverse of a number already known from the previous step.
We now know the values of processes (23) and (24), which together with the two discard maps make up the semantic causal structure; hence, we know the semantic causal structure. □
This proof has two notable features. First, it was not necessary to compute the value of the image of every box on its own. Indeed, doing so is impossible in this example, as the state (23) (equivalently, its matrix entries) does not fix unique values of the matrices of the individual mechanisms. Second, the accessible instruments were perfect passive observation instruments; thus, we computed matrix entries rather directly from accessible probabilities, in the sense that the accessible probabilities were products of matrix entries of processes in the semantic causal structure. This would not be the case for accessible instruments other than perfect passive observation instruments, but it might still be possible to compute the values of component processes in the semantic causal structure. For example, if we could somehow obtain the relevant numbers from the accessible probabilities for informationally complete sets of effects and states, then, even if the effects were not standard basis effects, we would be able to compute the value of the process in (23).
Proposition 4. Markovian models based on DAGs are identifiable from perfect passive observation.

Proof. This is a statement of a well-known fact represented in our framework. It follows from the uniqueness of the factorization described in Example 1 together with the equivalence between the matrix entries of a state and the probabilities of perfect passive observation branches, as in (19). □
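The well-known fact can be sketched numerically for a three-node chain X → Y → Z with illustrative mechanisms (not the paper’s five-mechanism example): the conditional probabilities computed from the observational joint distribution are exactly the stochastic matrices of the mechanisms.

```python
import numpy as np

# Hypothetical Markovian chain X -> Y -> Z (values illustrative).
x = np.array([0.4, 0.6])                   # P(X)
y = np.array([[0.7, 0.3], [0.3, 0.7]])     # P(Y|X), columns indexed by X
z = np.array([[0.5, 0.1], [0.5, 0.9]])     # P(Z|Y), columns indexed by Y

# Observational joint distribution under perfect passive observation:
# P(X=i, Y=j, Z=k) = P(X=i) P(Y=j|X=i) P(Z=k|Y=j).
P = np.einsum('i,ji,kj->ijk', x, y, z)

# Identification: conditionals of the observed joint recover the mechanisms.
x_hat = P.sum(axis=(1, 2))                                  # P(X)
y_hat = (P.sum(axis=2) / P.sum(axis=(1, 2))[:, None]).T     # P(Y|X)
z_hat = (P.sum(axis=0) / P.sum(axis=(0, 2))[:, None]).T     # P(Z|Y)
print(np.allclose(x_hat, x), np.allclose(y_hat, y), np.allclose(z_hat, z))
```

This is the factorization-based identification the proposition refers to; the comb-based Theorem 1 below generalizes it beyond perfect passive observation.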
The proposition for Markovian models defined with combs in this paper follows from Theorem 1 below. For now, we can consider the representation of a DAG-based causal model as a state (for example, in (5)). In causal identification from perfect passive observation, the state (i.e., probability distribution) is provided together with the graph G indicating that the causal mechanisms are arranged as in (5). A special feature of a Markovian model—i.e., one in which each box’s output is copied to an output of the overall state, as opposed to being discarded as in (8)—is that the probability distribution is consistent with only one set of values for the five mechanisms. The five factors on the right-hand side of the factorization of the observational distribution specify the five stochastic matrices in (5). In other words, for a Markovian model, the observed conditional probabilities are also the values of the stable and autonomous mechanisms. This narrative for the state-based representation suggests that it is the Markovianity of a probability distribution relative to a DAG that allows for the identification of a Markovian causal model based on that DAG. We will show with the comb representation that Markovianity of accessible probability distributions need not be understood as the condition making Markovian models identifiable. If the accessible instruments satisfy a certain abstract condition (met in particular by perfect passive observation instruments and some other instruments in Section 3), then Markovian models are identifiable even though the accessible probability distributions may not be Markovian for the graphs.
Definition 8. A set of maps f : A → A′ is called marginally informationally complete if it satisfies both of the following conditions:
- 1. The set of effects d_{A′} ∘ f is informationally complete for A.
- 2. For any normalized state ω on A with full support, the set of states f ∘ ω is informationally complete for A′.
The following proposition follows from Proposition 6 below.
Proposition 5. The set of branches of a perfect passive observation instrument is marginally informationally complete.
Definition 9. A set of instruments of a common type is called marginally informationally complete if the union of the instruments (with each instrument viewed as a set of maps) is a marginally informationally complete set of maps.
Thus, Proposition 5 states that a perfect passive observation instrument by itself constitutes a marginally informationally complete set of instruments.
The following two examples include assertions of marginal informational completeness of certain singleton sets of instruments, justified by Proposition 6 below.
Example 15. Consider a locus A corresponding to a binary random variable. Suppose that one is able to probe it with an instrument that reports the true value of the variable arriving at that locus, but that also disturbs the variable such that the value to be fed forward is flipped with some fixed probability. Each branch of this instrument pairs the standard basis effect for the observed value with a state that feeds that value forward flipped with the given probability. The set containing only this instrument is marginally informationally complete. It is not necessary that the disturbance be “symmetric”, as in this instrument, nor that the effects be standard basis effects.
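Such an instrument can be sketched in matrix form; the flip probability p = 0.2 below is an illustrative choice (the spanning check at the end fails exactly when p = 1/2, in which case the two fed-forward states coincide up to permutation-symmetry and no longer span). The two rank conditions checked are the criterion of Proposition 6 below.

```python
import numpy as np

p = 0.2  # illustrative flip probability; assume p != 1/2

# Branch i: observe value i (standard basis effect), then feed the value forward,
# flipped with probability p. Each branch is psi ∘ phi, hence ∘-separable.
effects = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
states  = [np.array([1 - p, p]), np.array([p, 1 - p])]
branches = [np.outer(s, e) for s, e in zip(states, effects)]

# Sanity check: the branches of one instrument sum to a stochastic matrix.
total = sum(branches)
print(total.sum(axis=0))   # column sums are all 1

# Proposition-6-style criterion: effects span the input space, states the output.
rank_effects = np.linalg.matrix_rank(np.array(effects))
rank_states = np.linalg.matrix_rank(np.array(states))
print(rank_effects, rank_states)   # both 2 when p != 1/2
```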
Example 16. Consider an instrument with two branches that constitutes a single-instrument marginally informationally complete set of instruments on a binary system-type (recall that we notationally distinguish the system-type from the values 1 and 2 of the random variable). When a locus representing a binary random variable is probed with this instrument, each branch is realized with a probability that depends on the variable’s true value. If the first branch is realized, then the value of the variable fed forward after the probing is totally randomized. If the second branch is realized, then the value fed forward is one of the two values with fixed probabilities. The instruments in these examples have the property that each branch has the form ψ ∘ ϕ for an effect ϕ and a state ψ, i.e., each branch is ∘-separable. If all instruments in a set consist only of ∘-separable branches, then marginal informational completeness of the set of instruments can be checked with the following criterion.
Proposition 6. A set of maps, each of the form ψ ∘ ϕ for an effect ϕ on A and a state ψ on A′, is marginally informationally complete if and only if: (i) the set of all effects ϕ is informationally complete for A, and (ii) the set of all states ψ is informationally complete for A′.
Proof. Suppose that we have a set of maps of the form ψ ∘ ϕ such that the set of effects ϕ is informationally complete for A and the set of states ψ is informationally complete for A′.
- (i) The set of effects d_{A′} ∘ ψ ∘ ϕ is informationally complete for A, as scaling each of a linearly independent set of row vectors results in a linearly independent set.
- (ii) For any state ω with full support, the set of states ψ ∘ ϕ ∘ ω is informationally complete for A′ by a similar consideration about preservation of linear independence under scaling.

Conversely, suppose that the set of maps (25) satisfies conditions 1 and 2 in Definition 8. Then, the effects ϕ are re-scalings of the effects d_{A′} ∘ ψ ∘ ϕ supposed to form an informationally complete set. To see that the states ψ form an informationally complete set, we can plug a fixed state ω with full support into each of the maps (25). The resulting states ψ ∘ ϕ ∘ ω on A′ are supposed to form an informationally complete set, and the states ψ are re-scalings of these states. □
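For small examples, the two conditions of Definition 8 can be tested directly by matrix-rank computations; the function below is our own illustrative helper, and it exploits the fact, implicit in the proof above, that checking condition 2 for a single full-support state suffices (linear independence is preserved under the positive re-scalings that different full-support states induce).

```python
import numpy as np

def marginally_informationally_complete(branches, dim_in, dim_out):
    """Check Definition 8 for a set of maps given as dim_out x dim_in matrices:
    (1) the effects  discard ∘ f  (column sums of f) span the input space;
    (2) for one full-support state omega, the states  f ∘ omega  span the output."""
    effects = np.array([f.sum(axis=0) for f in branches])   # row vectors on A
    omega = np.full(dim_in, 1.0 / dim_in)                   # a full-support state
    states = np.array([f @ omega for f in branches])        # states on A'
    return (np.linalg.matrix_rank(effects) == dim_in
            and np.linalg.matrix_rank(states) == dim_out)

# A perfect passive observation instrument has branches |i><i| (Proposition 5):
ppo = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
print(marginally_informationally_complete(ppo, 2, 2))   # True

# A single discard-and-prepare branch alone is not marginally inf. complete:
prep = [np.outer(np.array([0.5, 0.5]), np.array([1.0, 1.0]))]
print(marginally_informationally_complete(prep, 2, 2))  # False
```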
We would also like to deal with instruments for which the branches are not ∘-separable. For a given locus, consider a probing procedure for which one of the possible outcomes tells the observer the following: the variable may have had either value 1 or 2 when arriving at the locus, and both possibilities should be assigned equal probability, but with certainty that the value upon leaving the locus is the same as the value upon arriving. The instrument branch corresponding to this outcome is an equal-weight sum of the two perfect passive observation branches, which is non-∘-separable. An instrument with branches that are sums of perfect passive observation branches represents an observation that is “imperfect” or “coarse-grained” but not “disturbing”. Loosely speaking, although a ∘-separable instrument can represent a procedure for which outcomes carry only incomplete “information” about the values arriving at and leaving the locus, it cannot represent a procedure that results in “shared information” between the incoming and outgoing systems that is inaccessible to the observer. However, procedures of the latter sort, such as coarse-grained non-disturbing observations, are among the first that come to mind when one thinks of generalizing from perfect passive observation. The (classical version of the) identifiability result demonstrated for a three-node graph in [9] covers only ∘-separable instruments. The main result of the present paper, Theorem 1, is that a Markovian model is identifiable whenever the accessible set of instruments at each locus is marginally informationally complete, regardless of whether or not all the accessible instruments are ∘-separable.
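In matrix terms, the coarse-grained non-disturbing branch just described is half the identity matrix. A ∘-separable branch ψ ∘ ϕ is an outer product of a column vector and a row vector, hence has rank 1; a quick rank computation confirms that this branch is non-∘-separable.

```python
import numpy as np

# "Value is 1 or 2 with equal probability, and it passes through undisturbed":
# half the sum of the two perfect passive observation branches |1><1| and |2><2|.
branch = 0.5 * (np.diag([1.0, 0.0]) + np.diag([0.0, 1.0]))
print(branch)                              # 0.5 times the 2x2 identity

rank = np.linalg.matrix_rank(branch)
print(rank)   # rank 2 > 1, so the branch is not an outer product psi ∘ phi
```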
Theorem 1. Markovian models are identifiable whenever the accessible set of instruments at each locus is marginally informationally complete.
We begin by demonstrating identifiability for the example of three-locus models based on the graph with the following syntactic causal structure.
A semantic causal structure has the form shown below. First, the state at the first locus is determined by the probabilities formed by its composition with an informationally complete set of effects, the maps f forming a marginally informationally complete set. These probabilities are obtained from accessible probabilities via marginalization of the outcomes at Y and Z. Here, g and h vary over the branches of single instruments; that is, all the local intervention regimes from which data are collated share the same choices of instruments at Y and Z.
Next, the mechanism at Y is determined as follows: now with both f and g varying over marginally informationally complete sets, but with h varying over the branches of only one instrument. The numbers on the left for various f and g determine this mechanism, because the assumption of marginal informational completeness of the sets of branches at X and Y implies that the set of states obtained by applying the branches f to the (full-support) state at X is informationally complete for X (the input system-type of the mechanism) and the set of effects obtained by discarding after the branches g is informationally complete for Y (the output system-type of the mechanism).
Now, consider the set of accessible probabilities in which f, g, and h vary over marginally informationally complete sets. Marginal informational completeness of the sets of branches at Z and X implies that the corresponding set of effects and set of states are informationally complete for Z and X, respectively. Thus, knowing these probabilities amounts (by tomographic locality) to knowing the value of the intervening process for each g. Therefore, for every pure (normalized) state i on X, we can compute the corresponding quantity for each g. For each i, the state fed into the branches g has full support due to the assumption that the model is strictly positive. Hence, by the assumed marginal informational completeness of the branches at Y, varying g with fixed i yields an informationally complete set of states. All in all, we now know a set of quantities with the states and the inputs i varying independently over informationally complete sets. This set of quantities determines the mechanism at Z by tomographic locality.
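The tomographic-locality step used repeatedly here can be sketched numerically: given the numbers effect ∘ M ∘ state for informationally complete sets of effects and states, the unknown matrix M is recovered by solving a linear system in its entries. The matrix and the (deliberately non-standard-basis) spanning sets below are illustrative choices of ours.

```python
import numpy as np

# "Unknown" 2x2 stochastic matrix to be identified (values illustrative).
M = np.array([[0.75, 0.35],
              [0.25, 0.65]])

# Informationally complete sets that are NOT standard bases:
states  = [np.array([0.5, 0.5]), np.array([0.9, 0.1])]
effects = [np.array([1.0, 1.0]), np.array([1.0, 0.0])]

# The accessible numbers: effect ∘ M ∘ state for each pair.
pairs = [(e, s) for e in effects for s in states]
data = [e @ M @ s for e, s in pairs]

# Each number is linear in the entries of M:  e @ M @ s = (e ⊗ s) · vec(M),
# with vec(M) the row-major flattening of M. Solve the resulting system.
A = np.array([np.kron(e, s) for e, s in pairs])
M_hat = np.linalg.solve(A, np.array(data)).reshape(2, 2)
print(np.allclose(M_hat, M))
```

Informational completeness of both sets is exactly what makes the coefficient matrix A invertible.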
Proof of Theorem 1. If the given graph has m nodes, then the unknown model can be rewritten in a form without explicit common-cause maps. Here, the processes for nodes 1 through m have been linearly ordered in correspondence with a topological ordering of the graph. The topological ordering also dictates the labeling of nodes in the graph and loci in the syntactic and semantic causal structures. The copy map for the type of each node has output wires that connect as inputs to the boxes for all subsequent nodes. Typically, the topological ordering does not have the property that every vertex is a parent (in the graph) of every vertex subsequent to it (in the ordering); however, there is no loss of generality in adding connections to the diagram of the unknown model. In effect, we declare that we might not make full use of the causal hypotheses encoded in the graph. For example, if one node precedes another in the topological ordering but there is no edge between them in the graph, then the box for the later node in the syntactic causal structure has no input of the earlier node’s type, and similarly, the corresponding map in the original semantic causal structure has no such input. The semantic causal structure remains unchanged if we substitute for that map one (composed with appropriate swap maps) that does have such an input but simply discards it. We make this kind of substitution everywhere it is possible, but retain the names of maps from the semantic causal structure for convenience.
The proof of identifiability constructs an iterative identification procedure, proceeding by double induction. In the outer induction, suppose that the images of the boxes at all loci preceding a given locus are known; we show how to compute the image of the box at that locus.
For any locus, let the set of branches appearing in instruments in its accessible set be given. Consider the set of processes in which, for all loci from index k up to the locus being identified, all possible accessible branches are plugged in, so that the set of processes is indexed by the choices of those branches. The left-hand diagram has one input for each locus preceding index k, each followed by a copy map whose outputs connect to the subsequent boxes, and one output, from the box being identified. We will show by induction that this set of processes can be computed for every k from 1 through the index of the locus being identified. In the base case, there are no input wires. Each locus up to the locus being identified is filled with an instrument branch. Each later locus is filled with each of the branches of one accessible instrument and the results are summed, so that the discarding map following the final locus “falls through” every component process after the box being identified. For every such combination of branches, the resulting state is tomographically determined by the results of the various instruments at the locus being identified; that is, for a fixed combination, the state is determined by the known numbers obtained as the branch at that locus varies, since marginal informational completeness of the accessible set of branches implies that the corresponding set of effects is informationally complete.
Now, suppose the set of processes (28) is known for some fixed k. We will show how to compute the set for the next value of k. For each open input, we introduce the set of standard basis states on the corresponding system-type (i.e., column vectors, each having entry 1 in exactly one position and 0s elsewhere). This set is informationally complete for that system-type (and the set of product states, where each factor varies independently, is informationally complete for the tensor product). Moreover, standard basis states are copied by the copy map. We can plug an arbitrary standard basis state into each open input wire. The result is shown below.
The copy maps in the right-hand diagram each have outputs connected as inputs to the subsequent boxes. The standard basis states, as used here, have not been realized via local intervention regimes; rather, they are simply known processes for which composites with other known processes can be computed.
The state produced by the earliest remaining box has full support per the standard strict positivity assumption about unknown models. Therefore, the set of states obtained by varying the branch at that box’s locus is informationally complete, meaning that the process downstream of that box is known. Now, because the plugged-in standard basis states are arbitrary choices from the relevant sets of vectors, varying them yields knowledge of the corresponding process with open inputs. This completes the inner inductive step. Iterating eventually eliminates from the known process all boxes prior to the one being identified, while reducing each copy map’s number of outputs to one, making the final known process simply the box in question, up to reordering of the input wires. Thus, the outer inductive step is completed. □
Example 17. The theorem guarantees that Markovian models based on the two-node graph in (12) are identifiable if the accessible set of instruments at X consists of the single instrument with the branches given in Example 14 and if the accessible set of instruments at Y consists of the same instrument. By Proposition 6, the accessible sets in this case are marginally informationally complete. The theorem also covers Example 14 itself, in which the instrument used at Y is a perfect passive observation instrument.

Example 18. For the three-node graph in (26), with the loci representing variables of appropriate finite cardinality, Markovian models are identifiable if the accessible set of instruments at X is the singleton consisting of the instrument in Example 15, the accessible set at Y is the singleton consisting of the instrument in Example 16, and the accessible set at Z is the singleton consisting of a perfect passive observation instrument.
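The procedure of Theorem 1 can be run end-to-end for a two-locus chain; the sketch below uses illustrative Example-15-style flip instruments at both loci (not the exact instruments of Example 14) and illustrative model parameters. Step 1 identifies the state at X because discarding “falls through” the summed branches at Y; step 2 identifies the mechanism at Y tomographically.

```python
import numpy as np

def flip_instrument(r):
    """Example-15-style instrument: observe value i, flip it forward with prob r.
    Returns (branches, effects, fed-forward states)."""
    e = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
    s = [np.array([1 - r, r]), np.array([r, 1 - r])]
    return [np.outer(s[i], e[i]) for i in range(2)], e, s

fX, eX, sX = flip_instrument(0.2)   # accessible instrument at X
fY, eY, sY = flip_instrument(0.3)   # accessible instrument at Y

# "Unknown" strictly positive model for the chain X -> Y (values illustrative).
x = np.array([0.4, 0.6])            # state at X
y = np.array([[0.8, 0.3],
              [0.2, 0.7]])          # mechanism P(Y|X)

# Accessible probabilities from the single accessible regime:
P = np.array([[(fY[j] @ y @ fX[i] @ x).sum() for j in range(2)]
              for i in range(2)])

# Step 1: marginalizing over Y's outcome identifies x (discard falls through).
x_hat = P.sum(axis=1)

# Step 2: P[i, j] / x_hat[i] equals  (discard ∘ fY[j]) ∘ y ∘ sX[i];
# since discard ∘ fY[j] = eY[j] here, solve for y by tomography.
A = np.array([np.kron(eY[j], sX[i]) for i in range(2) for j in range(2)])
b = np.array([P[i, j] / x_hat[i] for i in range(2) for j in range(2)])
y_hat = np.linalg.solve(A, b).reshape(2, 2)
print(np.allclose(x_hat, x), np.allclose(y_hat, y))
```

Note that the accessible joint distribution P is not the observational distribution of the model, yet the model is still identified, illustrating the point above that Markovianity of the accessible distribution is not what drives identifiability.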