On a Class of Tensor Markov Fields

Here, we introduce a class of Tensor Markov Fields intended as probabilistic graphical models from random variables spanned over multiplexed contexts. These fields are an extension of Markov Random Fields for tensor-valued random variables. By extending the results of Dobruschin, Hammersley and Clifford to such tensor valued fields, we proved that tensor Markov fields are indeed Gibbs fields, whenever strictly positive probability measures are considered. Hence, there is a direct relationship with many results from theoretical statistical mechanics. We showed how this class of Markov fields it can be built based on a statistical dependency structures inferred on information theoretical grounds over empirical data. Thus, aside from purely theoretical interest, the Tensor Markov Fields described here may be useful for mathematical modeling and data analysis due to their intrinsic simplicity and generality.


General Definitions
Here, we introduce Tensor Markov Fields, i.e., Markov random fields [1,2] over tensor spaces. Tensor Markov Fields (TMFs) represent the joint probability distribution for a set of tensor-valued random variables.
Let X = X β α be one of such tensor-valued random variables. Here X j i ∈ X may represent either a variable i ∈ α, that may exist in a given context or layer j ∈ β (giving rise to a class of so-called multilayer graphical models or multilayer networks) or a single tensor-valued quantity X j i . A TMF will be an undirected multilayer graph representing the statistical dependency structure of X as given by the joint probability distribution P(X).
As an extension of the case of Markov random fields, a TMF is a multilayer graphĜ = (V, E) formed by a set V of vertices or nodes (the X j i 's) and a set E ⊆ V × V of edges connecting the nodes, either on the same layer or through different layers ( Figure 1). The set of edges represents a neighborhood law N stating which vertex is connected (dependent) to which other vertex in the multilayer graph. With this in mind, a TMF can be also represented (slightly abusing notation) aŝ G = (V, N). The set of neighbors of a given point X j i will be denoted N

Configuration
It is possible to assign to each point in the multilayer graph, one of a finite set S of labels. Such assignment will be called a configuration. We will assign probability measures to the set Ω of all possible configurations ω. Hence, ω A represents the configuration ω restricted to the subset A of V. It is possible to think of ω A as a configuration on the smaller multilayer graphĜ A restricting V to points of A ( Figure 2).

Local Characteristics
It is also possible to extend the notion of local characteristics from MRFs. The local characteristics of a probability measure P defined on Ω are the conditional probabilities of the form: i.e., the probability that the point t is assigned the value ω t , given the values at all other points of the multilayer graph. In order to make explicit the tensorial nature of the multilayer graphĜ, let us re-write Equation (1). Let us also recall the fact that the probability measure will define a tensor Markov random field (a TMF) if the local characteristics depend only of the knowledge of the outcomes at neighboring points, i.e., if for every ω

Cliques
Given an arbitrary graph (or in the present case a multilayer graph), we shall say that a set of points C is a clique if every pair of points in C are neighbors (see Figure 3). This definition includes the empty set as a clique. A clique is thus a set whose induced subgraph is complete, for this reason cliques are also called complete induced subgraphs or maximal subgraphs (although these latter term may be ambiguous). Figure 3. Cliques on a Tensor Markov Field: The set {X I 2 , X I 3 , X I 4 } forms an intra-layer 2-clique (as marked by the red edges, all on layer I), the set {X I 3 , X I I 3 } forms an inter-layer 1-clique (marked by the blue edge connecting layers I and I I). However, the set {X I 3 , X I I 3 , X I 4 , X I I 4 , } is not a clique since there are no edges between X I 3 and X I I 4 nor between X I 3 and X I I 4 .

Configuration Potentials
A potential η is a way to assign a number η A (ω) to every subconfiguration ω A of a configuration ω in the multilayer graphĜ. Given a potential, we shall say that it defines (or better, induces) a dimensionless energy U(ω) on the set of all configurations ω by In the preceeding expression, for fixed ω, the sum is taken over all subsets A ⊆ V including the empty set. We can define a probability measure, called the Gibbs measure induced by U as with Z a normalization constant called the partition function.
In physics, the term potential is often used in connection with the so-called potential energies. Physicists often call η A a dimensionless potential energy, and they call φ A = e −η A a potential.
Equations (4) and (5) can be thus rewritten as: Since this latter use is more common in probability and graph theory, we will refer to Equations (6) and (7) as the definitions of Gibbs measure and partition function (respectively) unless otherwise stated.

Gibbs Fields
A potential is called a nearest neighbor Gibbs potential if φ A (ω) = 1 whenever A is not a clique. It is customary to refer as a Gibbs measure to a measure induced by a nearest neighbor Gibbs potential. However, it is possible to define more general Gibbs measures by considering other types of potentials.
The inclusion of all cliques in the calculation of the Gibbs measure is necessary to establish the equivalence between Gibbs random fields and Markov random fields. Let us see how a nearest neighbor Gibbs measure on a multilayer graph determines a TMF.
Let P(ω) be a probability measure determined on Ω by a nearest neighbor Gibbs potential φ: With the product taken over all cliques C on the multilayer graphĜ. Then, Here ω is any configuration which agrees with ω at all points except X j i .
For any clique C that does not contain X j i , φ C (ω) = φ C (ω ), So that all the terms that correspond to cliques that do not contain the point X j i cancel both from the numerator and the denominator in Equation (10), therefore this probability depends only on the values x j i at X j i and its neighbors. P defines thus a TMF.
A more general proof of this equivalence is given by Hammersley-Clifford theorem that will be presented in the following section.

Extended Hammersley Clifford Theorem
Here we will outline a proof for an extension of Hammersley-Clifford theorem for Tensor Markov Fields (i.e., we will show that a Tensor Markov Field is equivalent to a Tensor Gibbs Field). LetĜ = (V, N) be a multilayer graph representing a TMF as defined in the previous section. With V = X β α = {X j i }, a set of vertices over a tensor field and N a neighborhood law that connects vertices over this tensor field. The fieldĜ obeys the following neighborhood law given its Markov property (see Equation (2)) (11) Here X N j i is any neighbor of X j i . The Hammersley-Clifford theorem states that a MRF is also a local Gibbs field. In the case of a TMF we have the following expression: In order to prove the equivalence of Equations (11) and (12), we will first bult a deductive (backward direction) part of the proof to be complemented with a constructive (forward direction) part as presented in the following subsections.

Backward Direction
Let us consider Equation (11) at the light of Bayes' theorem: Using a clique-approach to calculate the joint and marginal probabilities (see next subsection to support the following statement): Let us split the product ∏ c∈CĜ φ c (X c ) into two products, one over the set of cliques that contain Factoring out the terms depending on X j i (that do not contribute to cliques in the domainĜ \ X j i ): i (by construction) so, it can be factored out from the summation over X j i in the denominator.
We can cancel the term in the numerator and denominator: Then we multiply by Equation (20) is nothing but the definition of a local Gibbs Tensor Field (Equation (12)).

Forward Direction
In this subsection we will show how to express the clique potential functions φ c (X c ), given the joint probability distribution over the tensor field and the Markov property.
Consider any subset σ ⊂Ĝ of the multilayer graphĜ. We define a candidate potential function (following Möbius inversion lemma) [3] as follows: In order for f σ to be a proper clique potential, it must satisfy the following two conditions: To prove (i), we need to show that all factors in f σ (X σ = x σ ) cancel out, except for P(X).
To do this, it will be useful to consider the following combinatorial expansion of zero: Here, of course C A B is the number of combinations of B elements from an A-element set. Let us consider any subset ζ ofĜ. Let us consider a factor ∆ = P(X ζ = x ζ , XĜ \ζ = 0). For the case of f ζ(X ζ ) it occurs as ∆ −1 0 = ∆. Such factor also occurs in subsets containg ζ and other additional elements. If it includes ζ and one additional element, there are C |Ĝ|−|ζ| 1 such functions. The additional element creates an inverse factor ∆ −1 1 = ∆ −1 . The functions over subsets containg ζ and two additional elements contributes with a factor ∆ −1 2 = ∆ 1 = ∆. If we continue this process and consider Equation (22), it is evident that all odd cardinality difference terms cancel out with all even cardinality difference terms so that the only remaining factor corresponds to ζ =Ĝ equal to P(X) thus fulfilling condition (i).
In order to show how condition (ii) is fulfilled, we will need to use the Markov property of TMFs. Let us consider σ * ⊂Ĝ that is not a clique. Then it will be possible to find two nodes X h i and X k j in σ * that are not connected to each other. Let us recall Equation (21): An arbitrary subset ζ may belong to any of the following classes: (i) ζ = ω a generic subset of σ; If we write down Equation (23) factored down to these contributions we get: Let us consider two of the factors in Equation (24) at the light of Bayes' theorem: We can notice that the priors in the numerator and denominator of Equation (25) are the same. We can then cancell them out. Since by definition X h i and X k j are conditionally independent given the rest of the multilayer graph, we can also replace the default value X k j = 0 for X k j instead.
Since X h i and X k j are conditionally independent given the rest of the multilayer graph, we can also replace the condition for X k j with any other, without affecting X h i . By adjusting this prior conveniently, we can write out: By substituting Equation (27) into Equation (24) we get (condition (ii)):

An Information-Theoretical Class of Tensor Markov Fields
Let us consider again the set of tensor-valued random variables X = X β α . It is possible to calculate, for every duplex in X, the mutual information function I(·, ·) [4]: Let us consider a multilayer graph scenario. From now on, the indices i, j will refer to the random variables, whereas h, k will be indices for the layers. Ω and Ω are the respective sampling spaces (that may, of course, be equal). In order to discard self-information, let us define the off-diagonal mutual information as follows: With the bi-delta function δ X h i X k j defined as: By having the complete set of off-diagonal mutual information functions for all the random variables and layers, it is possible to define the following hyper-matrix elements: as well as: Here Θ[·] is Heavyside's function and I 0 is a lower bound for mutual information (a threshold) to be considered significant.
We call A hk ij and W hk ij the adjacency hypermatrix and the strength hypermatrix respectively (notice that • in Equation (33) represents the product of a scalar times a hypermatrix). The adjacency hyper-matrix and the strength hyper-matrix define the (unweighted and weighted, respectively) neighborhood law of the associated TMF, hence the statistical dependency structure for the set of random variables and contexts (layers).
Although the adjacency and strength hypermatrices are indeed, proper representations of the undirected (unweighted and weighted) dependency structure of P(X), it has been considered advantageous to embed them into a tensor linear structure, in order to be able to work out some of the mathematical properties of such fields relying on the methods of tensor algebra. One relevant proposal in this regard, has been advanced by De Domenico and collaborators, in the context of multilayer networks.
Following the ideas of De Domenico and co-workers [5], we introduce the unweighted and weighted adjacency 4-tensors (respectively) as follows: Here, ξ αγ βδ = ξ αγ βδ [ijhk] is a unit four-tensor whose role is to provide the hypermatrices with the desired linear properties (projections, contractions, etc.). Square brackets indicate that the indices i, j, h and k belong to the α, β, γ and δ dimensions and ⊗ represents a form of a tensor matricization product (i.e., the one producing a 4-tensor out of a 4-index hypermatrix times a unitary 4-tensor).

Conditional Independence in Tensor Markov Fields
In order to discuss the conditional independence structure induced by the present class of TMFs, let us analyze Equation (32). As already mentioned, the hyper-adjacency matrix A hk ij represents the neigborhood law (as given by the Markov property) on the multilayer graphĜ (i.e., the TMF). Every non-zero entry on this hypermatrix represents a statistical dependence relation between two elements on X. The conditional dependence structure on TMFs inferred from mutual information measures via Equation (32) are related not only to the statistical independence conditions (as given by a zero mutual information measure between two elements), but also to the lower bound I 0 and in general to the dependency structure of the whole multilayer graph.
The definition of conditional independence (CI) for tensor random variables is as follows: Here ⊥ ⊥ represents conditional independence between two random variables, were is the joint conditional cumulative distribution of X h i and X k j given X m l and X h i * , X k j * and X m l * are realization events of the corresponding random variables.
In the case of MRFs (and by extension TMFs), CI is defined by means of (multi)graph separation: in this sense we say that X h i ⊥ ⊥Ĝ X k j |X m l iff X m l separates X h i from X k j in the multilayer graphĜ. This means that if we remove node X m l there are no undirected paths from X h i to X k j inĜ. Conditional independence in random fields is often considered in terms of subsets of V. Let A, B and C be three subsets of V. The statement X A ⊥ ⊥Ĝ X B |X C , which holds only iff C separates A from B in the multilayer graphĜ, meaning that if we remove all vertices in C there will be no paths connecting any vertex in A to any vertex in B is called the global Markov property of TMFs.
The smallest set of vertices that renders a vertex X h i conditionally independent of all other vertices in the multilayer graph is called its Markov blanket, denoted mb(X h i ). If we define the closure of a node . It is possible to show that in a TMF, the Markov blanket of a vertex is its set of first neighbors. This is called the undirected local Markov property. Starting from the local Markov property it is possible to show that two vertices X h i and X k j are conditionally independent given the rest if there is no direct edge between them. This has been called the pairwise Markov property.
If we denote byĜ X h i →X k j the set of undirected paths in the multilayer graphĜ connecting vertices X h i and X k j , then the pairwise Markov property of a TMF can be stated as: It is clear that the global Markov property implies the local Markov property which in turn implies the pairwise Markov property. For systems with positive definity probability densities, it has been probed (in the case of MRFs) that pairwise Markov actually implied global Markov (See [6] p. 119 for a proof). For the present extension this is important since it is easier to assess pairwise conditional independence statements.

Indepence Maps
Let IĜ denote the set of all conditional independence relations encoded by the multilayer graphĜ (i.e., those CI relations given by the Global Markov property). Let I P be the set of all CI relations implied by the probability distribution P(X j i ). A multilayer graphĜ will be called an independence map (I-map) for a probability distribution P(X j i ), if all CI relations implied byĜ hold for P(X j i ), i.e., IĜ ⊆ I P [6]. The converse statement is not necessarily true, i.e., there may be some CI relations implied by P(X j i ) that are not encoded in the multilayer graphĜ. We may be usually interested in minimal I-maps, i.e., I-maps from which none of the edges could be removed without destroying its CI properties.
Every distribution has a unique minimal I-map (and a given graph representation). Let P(X j i ) > 0. LetĜ † be the multilayer graph obtained by introducing edges between all pairs of vertices X h i , X k j such that X h i ⊥ ⊥ X k j |X \ {X h i , X k j }, thenĜ † is the unique minimal I-map. We callĜ a perfect map of P when there is no dependenciesĜ which are not indicated by P, i.e., IĜ = I P [6].

Conditional Independence Tests
Conditional independence tests are useful to evaluate whether CI conditions apply either exactly or in the case of applications under a certain bounded error. In order to be able to write down expressions for C.I. tests let us introduce the following conditional kernels [7]: as well as their generalized recursive relations: The conditional probability of X k h given X j i can be thus written as: We can then write down expressions for Markov conditional independence as follows: Following Bayes' theorem, CI conditions -in this case-will be of the form: Equation (42) is useful since in large scale data applications is computationally cheaper to work with joint and marginal probabilities rather than conditionals. Now let us consider the case of conditional independence given several conditional variables. The case for CI given two variables could be written-using conditional kernels-as follows: Hence, Using Bayes' theorem, In order to generalize the previous results to CI relations given an arbitrary set of conditionals, let us consider the following sigma-algebraic approach: Let Σ jk ih be the σ-algebra of all subsets of X that do not contain X j i or X k h . If we consider the contravariant index i ∈ α with i = 1, 2, . . . , N and the covariant index j ∈ β with j = 1, 2, . . . , L, then there are M = NL 2 such σ-algebras in X (let us recall that TMFs are undirected graphical models). A relevant problem for network reconstruction is that of establishing the more general Markov pairwise CI conditions, i.e., the CI relations for every edge not drawn in the graph. Two arbitrary nodes X j i and X k h are conditionally independent given the rest of the graph iff: By using conditional kernels, the recursive relations and Bayes' theorem it is possible to write down M expressions of the form: The family of Equations (48) represent the CI relations for all the non-existing edges in the hypergraphĜ, i.e., every pair of nodes X j i and X k h not-connected inĜ must be conditionally independent given the rest of the nodes in the graph. These expression may serve to implement exact tests or optimization strategies for graph reconstruction and/or graph sparsification in applications considering a mutual information threshold I 0 as in Equation (32). In brief, for every node pair with a mutual information value lesser than I 0 , the presented graph reconstruction approach will not draw an edge, hence implying CI between the two nodes given the rest. Such CI condition may be tested on the data to see whether it holds or the threshold itself can be determined by resorting to optimization schemes (e.g. error bounds) in Equation (48).

Graph Theoretical Features and Multilinear Structure
Once the probabilistic properties of TMFs have been set, it may be fit to briefly present some of their graph theoretical features, as well as some preliminaries as to the reasons to embed hyperadjecency matrices into multilayer adjacency tensors. Given that TMFs are indeed PGMs, some of their graph characteristics will result relevant here.
Since the work by De Domenico and coworkers [5] covers in great detail how the multilinear structure of the multilayer adjacency tensor allows the calculation of these quantities-usually as projection operations-we will only mention connectivity degree vectors since these are related with the size of the TMF dependency neighborhoods.
Let us recall multilayer adjacency tensors, as defined in Equations (34) and (35). To ease presentation, we will work with the unweighted tensor A αγ βδ (Equation (34)). The multidegree centrality vector K α which contains the connectivity degrees of the nodes spanning different layers can be written as follows: Here U δ γ is a rank 2 tensor that contains a 1 in every component and u β is a rank 1 tensor that contains a 1 in every component-these quantities are called 1-tensors by De Domenico and coworkers [5]. It can be shown that K α is indeed given by the sums of the connectivity degree vectors k α corresponding to all different layers: k α (hk) is the vector of connections that nodes in the set α = 1, 2, . . . , N in layer h have to any other nodes in layer k. Whereas K α is the vector with connections in all the layers. Appropriate projections will yield measures such as the size of the neighborhood to a given vertex |N X j i |, the size of its Markov blanquet |mb(X h i )|, or other similar quantities.

Specific Applications
After having considered some of the properties of this class of Tensor Markov Fields, it may become evident that aside from purely theoretical importance, there is a number of important applications that may arise as probabilistic graphical models in tensor valued problems, among the ones that are somewhat evident are the following: • The analysis of multidimensional biomolecular networks such as the ones arising from multi-omic experiments (For a real-life example, see Figure 4) [8][9][10]; • Probabilistic graphical models in computer vision (especially 3D reconstructions and 4D [3D+time] rendering) [11]; • The study of fracture mechanics in continuous deformable media [12]; • Probabilistic network models for seismic dynamics [13]; • Boolean networks in control theory [14].
Some of these problems are being treated indeed as multiple instances of Markov fields or as multipartite graphs or hypergraphs. However, it may become evident that when random variables across layers are interdependent (which is often the case), the definitions of potentials,cliques and partition functions, as well as the conditional statistical independence features become manageable (and in some cases even meaninful) under the presented formalism of Tensor Markov Fields. . Gene and microRNA regulatory network: A Tensor Markov Field depicting the statistical dependence of genome wide gene and microRNA (miR) on a human phenotype. Edge width is given by the mutual information I † (X j i , X k h ) between expression levels of genes (layer j) and miRs (layer k) in a very large corpus of RNASeq samples, vertex size is proportional to the degree, i.e., the size of the node's neighborhood, N X j i .

Conclusions
Here we have presented the definitions and fundamental properties of Tensor Markov Fields, i.e., random Markov fields over tensor spaces. We have proved -by extending the results of Dobruschin, Hammersley and Clifford to such tensor valued fields-that tensor Markov fields are indeed Gibbs fields whenever strictly positive probability measures are considered. We also introduced a class of tensor Markov fields obtained by using information theoretical statistical dependence measures inducing local and global Markov properties, and show how these can be used as probabilistic graphical models in multi-context environments much in the spirit of the so-called multilayer network approach. Finally, we discuss the convenience of embedding tensor Markov fields in the multilinear tensor representation of multilayer networks.
Funding: This research received no external funding.