Partial Exchangeability for Contingency Tables

Diaconis, Persi

doi:10.3390/math10030442



by

Persi Diaconis

Department of Mathematics and Statistics, Sequoia Hall, Stanford University, Stanford, CA 94305, USA

Mathematics 2022, 10(3), 442; https://doi.org/10.3390/math10030442

Submission received: 30 December 2021 / Revised: 19 January 2022 / Accepted: 20 January 2022 / Published: 29 January 2022

(This article belongs to the Special Issue Bayesian Predictive Inference and Related Asymptotics—Festschrift for Eugenio Regazzini's 75th Birthday)


Abstract


A parameter free version of classical models for contingency tables is developed along the lines of de Finetti’s notions of partial exchangeability.

Keywords:

algebraic statistics; contingency tables; de Finetti representation theorem; Markov basis; partial exchangeability

1. Introduction

Consider cross-classified data $X_{1}, X_{2}, \dots, X_{n}$, where $X_{a} = (i_{a}, j_{a})$, $i_{a} \in [I]$, $j_{a} \in [J]$ (for $[I] = \{1, 2, \dots, I\}$). Such data are often presented as an $I \times J$ contingency table $T = (t_{i j})$, where $t_{i j}$ is the number of times $(i, j)$ occurs. Suppose that $X_{1}, \dots, X_{n}$ are exchangeable and extendible. Then, de Finetti’s theorem says:

Theorem 1.

For exchangeable

{X_{i}}_{i = 1}^{\infty}

taking values in

[I] \times [J]

P [X_{1} = (i_{1}, j_{1}), \dots, X_{n} = (i_{n}, j_{n})] = \int_{Δ_{I \times J}} \prod_{i, j} p_{i j}^{t_{i j}} μ (d p),

where

Δ_{I \times J} = {p_{i j} \geq 0, \sum_{i, j} p_{i j} = 1}

. The representing measure μ is unique.

A popular model for cross classified data is

p_{i j} = θ_{i} η_{j} .

Here is a Bayesian, parameter free, description.

Theorem 2.

For exchangeable

{X_{i}}_{i = 1}^{\infty}

taking values in

[I] \times [J]

, a necessary and sufficient condition for the mixing measure μ in Theorem 1 to be supported on

Δ_{I} \times Δ_{J}

(with

Δ_{I} = {p_{1}, \dots, p_{I} : p_{i} \geq 0, \sum_{i} p_{i} = 1}

), so

P [X_{1} = (i_{1}, j_{1}), \dots, X_{n} = (i_{n}, j_{n})] = \int_{Δ_{I} \times Δ_{J}} \prod θ_{i}^{t_{i *}} η_{j}^{t_{* j}} μ (d θ, d η),

is that

\begin{matrix} P [X_{1} = (i_{1}, j_{1}), X_{2} = (i_{2}, j_{2}), X_{3} = (i_{3}, j_{3}), \dots, X_{n} = (i_{n}, j_{n})] = \\ P [X_{1} = (i_{1}, j_{2}), X_{2} = (i_{2}, j_{1}), X_{3} = (i_{3}, j_{3}), \dots, X_{n} = (i_{n}, j_{n})] . \end{matrix}

(1)

Condition (1) is to hold for any

n \geq 2

and any

(i_{a}, j_{a})

1 \leq a \leq n

.

Proof.

Condition (1) implies for all n and

h \geq 1

(suppressing P a.s. throughout)

\begin{matrix} P [X_{1} = (i_{1}, j_{1}), X_{2} = (i_{2}, j_{2}) | X_{n} = (i_{n}, j_{n}), \dots, X_{n + h} = (i_{n + h}, j_{n + h})] = \\ P [X_{1} = (i_{1}, j_{2}), X_{2} = (i_{2}, j_{1}) | X_{n} = (i_{n}, j_{n}), \dots, X_{n + h} = (i_{n + h}, j_{n + h})] . \end{matrix}

(2)

Let

h ↑ \infty

and then

n ↑ \infty

. Let

T

be the tail field of

{X_{i}}_{i = 1}^{\infty}

. Then, Doob’s increasing and decreasing martingale theorems show

P [X_{1} = (i_{1}, j_{1}), X_{2} = (i_{2}, j_{2}) | T] = P [X_{1} = (i_{1}, j_{2}), X_{2} = (i_{2}, j_{1}) | T] .

However, a standard form of de Finetti’s theorem says that, given

T

, the

{X_{i}}_{i = 1}^{\infty}

are i.i.d. with

P [X_{1} = (i, j)] = p_{i j}

. Thus

p_{i j} p_{i^{'} j^{'}} = p_{i j^{'}} p_{i^{'} j} for all i, i^{'}, j, j^{'} .

(3)

Finally, observe that (3) implies (writing

p_{i *} : = \sum_{j} p_{i j}

,

p_{* j} : = \sum_{i} p_{i j}

)

p_{i *} p_{* j} = \sum_{h, l} p_{i h} p_{l j} = \sum_{h l} p_{i j} p_{h l} = p_{i j} .

□
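The sufficiency direction of Theorem 2 is easy to check numerically. The following Python sketch builds a two-point mixture of product models and confirms that exchanging the column labels of the first two observations, as in condition (1), leaves the probability of a sequence unchanged. The mixture weights, the dimensions and all function names are illustrative assumptions, not part of the paper.

```python
import math
import random


def normalize(weights):
    """Scale positive weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def sequence_probability(theta, eta, cells):
    """Probability of the cells (i, j) under the product model p_ij = theta_i * eta_j."""
    return math.prod(theta[i] * eta[j] for i, j in cells)


def mixture_probability(components, cells):
    """Average the product-model probability over a finite mixing measure."""
    return sum(w * sequence_probability(th, et, cells) for w, th, et in components)


def swap_first_two_columns(cells):
    """Exchange j_1 and j_2 while keeping i_1, i_2 and all later cells, as in (1)."""
    (i1, j1), (i2, j2) = cells[0], cells[1]
    return [(i1, j2), (i2, j1)] + list(cells[2:])


rng = random.Random(0)
I, J, n = 3, 4, 6

# Hypothetical mixing measure: two atoms on Delta_I x Delta_J with weights 0.3 and 0.7.
components = [
    (w,
     normalize([rng.random() for _ in range(I)]),
     normalize([rng.random() for _ in range(J)]))
    for w in (0.3, 0.7)
]

# An arbitrary sequence of cells, using 0-based labels.
cells = [(rng.randrange(I), rng.randrange(J)) for _ in range(n)]
lhs = mixture_probability(components, cells)
rhs = mixture_probability(components, swap_first_two_columns(cells))
print(math.isclose(lhs, rhs, rel_tol=1e-9))
```

The check only illustrates that mixtures of product models satisfy (1); the converse is the content of the proof above.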

We remark the following points.

1: If $X_{i} = (Y_{i}, Z_{i})$ condition (2) is equivalent to

$L ((Y_{1}, Z_{1}), (Y_{2}, Z_{2}), \dots, (Y_{n}, Z_{n})) = L ((Y_{1}, Z_{σ (1)}), \dots, (Y_{n}, Z_{σ (n)}))$

for all n and $σ \in S_{n}$ ( $S_{n}$ is the symmetric group over $1, 2, \dots, n$ ). Since ${(Y_{i}, Z_{i})}_{i = 1}^{n}$ are exchangeable this is equivalent to saying the law is invariant under $S_{n} \times S_{n}$ .
2: The mixing measure $μ (d θ, d η)$ allows general dependence between the row parameters $θ$ and column parameters $η$ . Classical Bayesian analysis of contingency tables often chooses $μ$ so that $θ$ and $η$ are independent. A parameter free version is that under P, the row sums $t_{i *}$ and column sums $t_{* j}$ are independent. It is natural to weaken this to “close to independent” along the lines of [1] or [2]. See also [3].
3: Theorems 1 and 2 have been stated for discrete state spaces. By a standard discretization argument, they hold for quite general spaces. For example:

Theorem 3.

Let

X_{i} = (Y_{i}, Z_{i})

be exchangeable with

Y_{i} \in Y

,

Z_{i} \in Z

, complete separable metric spaces,

1 \leq i < \infty

. Suppose

\begin{matrix} P [X_{1} \in (A_{1}, B_{1}), X_{2} \in (A_{2}, B_{2}), \dots, X_{n} \in (A_{n}, B_{n})] = \\ P [X_{1} \in (A_{1}, B_{2}), X_{2} \in (A_{2}, B_{1}), \dots, X_{n} \in (A_{n}, B_{n})] \end{matrix}

for all measurable

A_{i}, B_{i}

and all n. Then,

P (X_{1} \in (A_{1}, B_{1}), \dots, X_{n} \in (A_{n}, B_{n})) = \int_{P (Y) \times P (Z)} \prod_{1}^{n} θ (A_{i}) η (B_{i}) μ (d θ, d η),

with

P (Y), P (Z)

the probabilities on the Borel sets of

Y, Z

. The mixing measure μ is unique.

4: Theorem 2 is closely related to de Finetti’s work in [1,4].
5: De Finetti’s law of large numbers holds as well, in Theorem 3

$\frac{1}{n} \sum δ_{X_{i}} (A \times B) \to μ (θ (A), η (B)) .$

One object of this paper is to develop similar parameter free de Finetti theorems for widely used log-linear models for discrete data. Section 2 begins by relating this to an ongoing conversation with Eugenio Regazzini. Section 3 provides needed background on discrete exponential families and algebraic statistics. Section 4 and Section 5 apply those tools to give de Finetti style partially exchangeable theorems for some widely used hierarchical and graphical models for contingency tables. Section 6 shows how these exponential family tools can be used for other Bayesian tasks: building “de Finetti priors” for “almost exchangeability” and running the “exchange” algorithm for doubly intractable Bayesian computation. Some philosophy and open problems are in the final section.

2. Some History

I was lucky enough to be able to speak at Eugenio Regazzini’s 60th birthday celebration, in Milan, in 2006. My talk began this way:

≪ Hello, my name is Persi and I have a problem. ≫

For those of you not aware of the many “10 step-programs” (alcoholics anonymous, gamblers anonymous, …) they all begin this way, with the participants admitting to having a problem. In my case the problem was this:

(a): After 50 years of thinking about it, I think that the subjectivist approach to probability, induction and statistics is the only thing that works;
(b): At the same time, I have done a lot of work inventing and analyzing various schemes for generating random samples for things like contingency tables with given row and column sums; graphs with given degree sequences; …; Markov Chain Monte Carlo. These are used for things like permutation tests and Fisher’s exact test.

There is a lot of nice mathematics and hard work in (b) but such tests violate the likelihood principle and lead to poor scientific practice. Hence my problem (I still have it): (a) and (b) are incompatible.

There has been some progress. I now see how some of the tools developed for (b) can be usefully employed for natural tasks suggested by (a). Not so many people care about such inferential questions in these ’big data’ days. However, there are also lots of small datasets where the inferential details matter. There are still useful questions for people like Eugenio (and me).

3. Background on Exponential Families and Algebraic Statistics

The following development is closely based on [5], which should be considered for examples, proofs and more details.

Let

X

be a finite set. Consider the exponential family:

p_{θ} (x) = \frac{1}{Z (θ)} e^{θ \cdot T (x)} θ \in R^{d}, x \in X .

(4)

Here,

Z (θ)

is a normalizing constant and

T : X \to N^{d} - {0}

. If

X_{1}, X_{2}, \dots, X_{n}

are independent and identically distributed from (4), the statistic

t = T (X_{1}) + \dots + T (X_{n})

is sufficient for

θ

. Let

Y_{t} = {(x_{1}, \dots, x_{n}) : T (x_{1}) + \dots + T (x_{n}) = t} .

Under (4), the distribution of

X_{1}, \dots, X_{n}

given t is uniform on

Y_{t}

. It is usual to write

t = \sum_{i = 1}^{n} T (X_{i}) = \sum_{X} σ (x) T (x) with σ (x) = # {i : T (X_{i}) = T (x)} .

Let

F_{t} = {f : X \to N : \sum f (x) T (x) = t} .

Example 1.

For contingency tables

X = {(i, j) : 1 \leq i \leq I, 1 \leq j \leq J} .

The usual model for independence has $T (i, j) \in N^{I + J}$, a vector of length $I + J$ with two nonzero entries, both equal to 1. The 1’s in $T (i, j)$ are in the $i^{th}$ place and in position j of the last J places. The sufficient statistic t contains the row and column sums of the contingency table associated to the first n observations. The set $F_{t}$ is the set of all $I \times J$ tables with these row and column sums.

A Markov chain on this

F_{t}

can be based on the following moves: pick

i \neq i^{'}

,

j \neq j^{'}

and change the entries in the current f by adding

\pm 1

in pattern

\begin{matrix} & j & j^{'} \\ i & + & - \\ i^{'} & - & + \end{matrix} \quad \text{or} \quad \begin{matrix} & j & j^{'} \\ i & - & + \\ i^{'} & + & - \end{matrix}

This does not change the row sums and it does not change the column sums. If a move would make some entry negative, just pick new

i, i^{'}, j, j^{'}

. This gives a connected, aperiodic Markov chain on

F_{t}

with a uniform stationary distribution. See [6].
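The walk just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the starting table is made up, the function names are hypothetical, and a proposal that would create a negative entry is treated as a holding step (one reading of "pick new" indices).

```python
import random


def margins(table):
    """Return (row sums, column sums) of a table given as a list of rows."""
    return [sum(row) for row in table], [sum(col) for col in zip(*table)]


def swap_step(table, rng):
    """Propose one +/- move on a random 2 x 2 minor; hold if an entry would go negative."""
    i, i2 = rng.sample(range(len(table)), 2)      # two distinct rows
    j, j2 = rng.sample(range(len(table[0])), 2)   # two distinct columns
    sign = rng.choice([1, -1])                    # which of the two patterns to use
    new = [row[:] for row in table]
    new[i][j] += sign
    new[i2][j2] += sign
    new[i][j2] -= sign
    new[i2][j] -= sign
    if min(new[i][j], new[i2][j2], new[i][j2], new[i2][j]) < 0:
        return table
    return new


rng = random.Random(0)
start = [[3, 1, 2], [0, 4, 1], [2, 2, 5]]  # made-up 3 x 3 table
table = start
for _ in range(1000):
    table = swap_step(table, rng)

print(margins(table) == margins(start))
```

Every visited table has the row and column sums of the starting table, so the walk stays inside the set of tables with the given margins.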

Returning to the general case, an analog of

\begin{matrix} + & - \\ - & + \end{matrix}

moves is given by the following:

Definition 1

(Markov basis). A Markov basis is a set of functions

f_{1}, f_{2}, \dots, f_{L}

from

X

to

Z

such that

\sum_{X} f_{i} (x) T (x) = 0, \quad 1 \leq i \leq L

(5)

and that for any t and

f, f^{'} \in F_{t}

there are

(t_{1}, f_{i_{1}}), \dots, (t_{A}, f_{i_{A}})

with

t_{i} = \pm 1

, such that

f^{'} = f + \sum_{j = 1}^{A} t_{j} f_{i_{j}} \quad \text{and} \quad f + \sum_{j = 1}^{a} t_{j} f_{i_{j}} \geq 0 \quad \text{for } 1 \leq a \leq A .

(6)

This allows the construction of a Markov chain on

F_{t}

: from f, pick

I \in {1, 2, \dots, L}

and

t = \pm 1

at random and consider

f + t f_{I}

. If this is non-negative, move there. If not, stay at f. Assumptions (5) and (6) ensure that this Markov chain is symmetric and ergodic with a uniform stationary distribution. Below, I will use a Markov basis to formulate a de Finetti theorem to characterize mixtures of the model (4).
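A generic version of this chain can be sketched as follows. The toy state space $X = \{0, 1, 2\}$ with $T (x) = (1, x)$, the single basis element, and all names in the code are assumptions chosen for illustration; the sketch only demonstrates that each step preserves the sufficient statistic and non-negativity.

```python
import random


def statistic(f, T):
    """Compute t = sum_x f(x) T(x) for a count function f (dict: state -> count)."""
    d = len(next(iter(T.values())))
    return tuple(sum(f[x] * T[x][k] for x in f) for k in range(d))


def basis_walk_step(f, basis, rng):
    """Pick a basis element and a sign at random; move if the result is non-negative."""
    move = rng.choice(basis)
    sign = rng.choice([1, -1])
    g = {x: f[x] + sign * move.get(x, 0) for x in f}
    return g if min(g.values()) >= 0 else f


# Toy example (an assumption for illustration): X = {0, 1, 2}, T(x) = (1, x).
T = {0: (1, 0), 1: (1, 1), 2: (1, 2)}
basis = [{0: 1, 1: -2, 2: 1}]  # satisfies (5): sum_x f(x) T(x) = (0, 0)

rng = random.Random(0)
start = {0: 3, 1: 4, 2: 3}
f = start
visited = {tuple(sorted(f.items()))}
for _ in range(500):
    f = basis_walk_step(f, basis, rng)
    visited.add(tuple(sorted(f.items())))

print(statistic(f, T) == statistic(start, T), len(visited))
```

For this toy fiber the constraints force $f (0) = f (2)$ and $f (1) = 10 - 2 f (2)$, so there are only six states and the single move connects them.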

One of the main contributions of [5] is a method of effectively constructing Markov bases using polynomial algebra. For each

x \in X

, introduce an indeterminate, also called x. Consider the ring of polynomials

k [X]

in these indeterminates where k is a field, e.g., the complex numbers. A function

g : X \to N

is represented as a monomial

X^{g} = \prod_{X} x^{g (x)}

. The function

T : X \to N^{d}

gives a homomorphism

\begin{matrix} φ_{T} : k [X] & ⟶ k [t_{1}, \dots, t_{d}] \\ x & ⟼ t_{1}^{T_{1} (x)} t_{2}^{T_{2} (x)} \dots t_{d}^{T_{d} (x)}, \end{matrix}

extended linearly and multiplicatively (

φ_{T} (x + y) = φ_{T} (x) + φ_{T} (y)

and

φ_{T} (x^{2}) = φ_{T} {(x)}^{2}

and so on). The basic object of interest is the kernel of

φ_{T}

:

I_{T} = {p \in k [X] : φ_{T} (p) = 0} .

This is an ideal in

k [X]

. A key result of [5] is that a generating set for

I_{T}

is equivalent to a Markov basis. To state this, observe that any

f : X \to Z

can be written

f = f_{+} - f_{-}

with

f_{+} (x) = max (f (x), 0)

and

f_{-} (x) = max (- f (x), 0)

. Observe

\sum f (x) T (x) = 0

iff

X^{f_{+}} - X^{f_{-}} \in I_{T}

. The key result is

Theorem 4.

A collection of functions

f_{1}, f_{2}, \dots, f_{L}

is a Markov basis if and only if the set

X^{f_{i +}} - X^{f_{i -}}, \quad 1 \leq i \leq L

generates the ideal

I_{T}

.

Now, the Hilbert Basis Theorem shows that ideals in

k [X]

have finite bases and modern computer algebra packages give an effective way of finding bases.
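The observation that $\sum f (x) T (x) = 0$ iff $X^{f_{+}} - X^{f_{-}} \in I_{T}$ reduces to comparing exponent vectors, since $φ_{T}$ sends a monomial $X^{g}$ to the monomial in $t_{1}, \dots, t_{d}$ with exponent $\sum g (x) T (x)$. The sketch below checks this for the independence model of Example 1 on a $2 \times 2$ table; it uses 0-based labels and hypothetical function names, and it does not compute a generating set (that requires a computer algebra system).

```python
def independence_statistic(i, j, I, J):
    """T(i, j): a 0/1 vector of length I + J with ones in place i and place I + j."""
    vec = [0] * (I + J)
    vec[i] = 1
    vec[I + j] = 1
    return vec


def phi_exponent(g, I, J):
    """Exponent vector of phi_T(X^g), i.e. the sum over cells of g(x) T(x)."""
    total = [0] * (I + J)
    for (i, j), count in g.items():
        for k, v in enumerate(independence_statistic(i, j, I, J)):
            total[k] += count * v
    return total


def split(f):
    """Write f = f_plus - f_minus with both parts non-negative."""
    f_plus = {x: v for x, v in f.items() if v > 0}
    f_minus = {x: -v for x, v in f.items() if v < 0}
    return f_plus, f_minus


def binomial_in_kernel(f, I, J):
    """True when X^{f_plus} - X^{f_minus} is mapped to zero by phi_T."""
    f_plus, f_minus = split(f)
    return phi_exponent(f_plus, I, J) == phi_exponent(f_minus, I, J)


I, J = 2, 2
swap_move = {(0, 0): 1, (1, 1): 1, (0, 1): -1, (1, 0): -1}
not_a_move = {(0, 0): 1, (0, 1): -1}  # changes a column sum

print(binomial_in_kernel(swap_move, I, J))   # True
print(binomial_in_kernel(not_a_move, I, J))  # False
```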

I do not want (or need) to develop this further. See [5] or the book by Sullivant [7] or Aoki et al. [8]. There is even a Journal of Algebraic Statistics.

I hope that the above gives a flavor for what I mean by “working in (b) is hard honest work”. Most of the applications are for standard frequentist tasks. In the following sections, I will give Bayesian applications.

4. Log Linear Model for Contingency Tables

Log linear models for multiway contingency tables are a healthy part of modern statistics. The index set is

X = \prod_{γ \in Γ} I_{γ}

with

Γ

indexing categories and

I_{γ}

the levels of

γ

. Let

p (x)

be the probability of falling into cell

x \in X

. A log linear model can be specified by writing:

log p (x) = \sum_{a \subseteq Γ} φ_{a} (x) .

The sum ranges over subsets a of

Γ

and

φ_{a} (x)

means a function that only depends on x through the coordinates in a. Thus,

φ_{\emptyset} (x)

is a constant and

φ_{Γ} (x)

is allowed to depend on all coordinates. Specifying

φ_{a} = 0

for some class of sets a determines a model. Background and extensive references are in [9]. If the a with

φ_{a} \neq 0

permitted form a simplicial complex

C

(so

a \in C

and

\emptyset \neq a^{'} \subseteq a \Rightarrow a^{'} \in C

) the model is called hierarchical. If

C

consists of the cliques in a graph, the model is called graphical. If the graph is chordal (every cycle of length

\geq 4

contains a chord) the graphical model is called decomposable.

Example 2

(3 way contingency tables). The graphical models for three way tables are: [Figure not reproduced: the graphs on three vertices defining these models.]

The simplest hierarchical model that is not graphical is the no three way interaction model.

This can be specified by saying ’the odds ratio of any pair of variables does not depend on the third’. Thus,

\frac{p_{i j k} p_{i^{'} j^{'} k}}{p_{i j^{'} k} p_{i^{'} j k}} is constant in k for fixed i, i^{'}, j, j^{'} .

(7)

As one motivation, recall that for two variables, the independence model is specified by

p_{i j} = θ_{i} η_{j} .

For three variables, suppose there are parameters

θ_{i j}, η_{j k}, ψ_{i k}

satisfying:

p_{i j k} = θ_{i j} η_{j k} ψ_{i k} for all i, j, k .

(8)

It is easy to see that (8) entails (7) hence ’no three way interaction’. Cross multiplying (7) entails

p_{i j k} p_{i^{'} j^{'} k} p_{i j^{'} k^{'}} p_{i^{'} j k^{'}} = p_{i j k^{'}} p_{i^{'} j^{'} k^{'}} p_{i j^{'} k} p_{i^{'} j k} .

(9)

This is the form we will work with for the de Finetti theorems below.
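The step from (8) to (9) can be confirmed numerically. The sketch below draws random positive parameters, forms $p_{i j k}$ as in (8) (the normalizing constant can be absorbed into $θ$), and checks (9) for every choice of indices; a generic table is included for contrast. Dimensions, parameter ranges and function names are illustrative assumptions.

```python
import itertools
import math
import random


def satisfies_condition_9(p, I, J, K, rel_tol=1e-9):
    """Check the product identity (9) for all i, i', j, j', k, k'."""
    for i, i2 in itertools.product(range(I), repeat=2):
        for j, j2 in itertools.product(range(J), repeat=2):
            for k, k2 in itertools.product(range(K), repeat=2):
                lhs = p[i, j, k] * p[i2, j2, k] * p[i, j2, k2] * p[i2, j, k2]
                rhs = p[i, j, k2] * p[i2, j2, k2] * p[i, j2, k] * p[i2, j, k]
                if not math.isclose(lhs, rhs, rel_tol=rel_tol):
                    return False
    return True


rng = random.Random(1)
I, J, K = 2, 3, 3
theta = [[rng.uniform(0.5, 2.0) for _ in range(J)] for _ in range(I)]  # theta[i][j]
eta = [[rng.uniform(0.5, 2.0) for _ in range(K)] for _ in range(J)]    # eta[j][k]
psi = [[rng.uniform(0.5, 2.0) for _ in range(K)] for _ in range(I)]    # psi[i][k]

cells = list(itertools.product(range(I), range(J), range(K)))
weights = {(i, j, k): theta[i][j] * eta[j][k] * psi[i][k] for i, j, k in cells}
total = sum(weights.values())
p_model = {c: w / total for c, w in weights.items()}

# A generic positive table, which should violate (9).
p_generic = {c: rng.uniform(0.5, 2.0) for c in cells}

print(satisfies_condition_9(p_model, I, J, K))    # True
print(satisfies_condition_9(p_generic, I, J, K))  # False
```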

For background, history and examples (and some nice theorems) see ([10], Section 8.2) and [11,12]. Simpson’s ’paradox’ [13] is based on understanding the no three way interaction model. Further discussion is in Section 5 below.

5. From Markov Bases to de Finetti Theorems

Suppose

X

is a finite set,

T : X \to N^{d} - {0}

is a statistic and

{f_{i}}_{i = 1}^{L}

is a Markov basis as in Section 3. The following development shows how to translate this into de Finetti theorems for the contingency table examples of Section 4. The first argument abstracts the argument used for Theorem 2 above.

Lemma 1

(Key Lemma). Let

X

be a finite set and

{X_{i}}_{i = 1}^{\infty}

an exchangeable sequence of

X

-valued random variables. Suppose for all

n > m

\begin{matrix} P [X_{1} = x_{1}, \dots, X_{m} = x_{m}, X_{m + 1} = x_{m + 1}, \dots, X_{n} = x_{n}] = \\ P [X_{1} = y_{1}, \dots, X_{m} = y_{m}, X_{m + 1} = x_{m + 1}, \dots, X_{n} = x_{n}] . \end{matrix}

(10)

In (10),

x_{1}, \dots, x_{m}, y_{1}, \dots, y_{m}

are fixed and

x_{m + 1}, \dots, x_{n}

are arbitrary. Then, if

T

is the tail field of

{X_{i}}_{i = 1}^{\infty}

and

p (x) = P [X_{1} = x | T]

,

\prod_{i = 1}^{m} p (x_{i}) = \prod_{i = 1}^{m} p (y_{i}) .

(11)

Proof.

From (10) and exchangeability

\begin{matrix} P [X_{1} = x_{1}, \dots, X_{m} = x_{m}, X_{n + 1} = x_{n + 1}, \dots, X_{n + h} = x_{n + h}] = \\ P [X_{1} = y_{1}, \dots, X_{m} = y_{m}, X_{n + 1} = x_{n + 1}, \dots, X_{n + h} = x_{n + h}] \end{matrix}

so

\begin{matrix} P [X_{1} = x_{1}, \dots, X_{m} = x_{m} | X_{n + 1} = x_{n + 1}, \dots, X_{n + h} = x_{n + h}] = \\ P [X_{1} = y_{1}, \dots, X_{m} = y_{m} | X_{n + 1} = x_{n + 1}, \dots, X_{n + h} = x_{n + h}] . \end{matrix}

Let

h ↑ \infty

and then

n ↑ \infty

, use Doob’s upward and then downward martingale convergence theorems to see:

P [X_{1} = x_{1}, \dots, X_{m} = x_{m} | T] = P [X_{1} = y_{1}, \dots, X_{m} = y_{m} | T] .

Now, de Finetti’s theorem implies (11). □

Remark 1.

The Key Lemma shows that the

p (x)

satisfy certain relations. Using choices of

{x_{i}}, {y_{i}}

derived from a Markov basis will show that

p (x)

satisfy the required independence properties. Suppose that

\sum_{X} f (x) T (x) = 0

,

\sum_{X} f (x) = 0

and

f \in {0, \pm 1}

. Let

S_{+} = {x : f (x) = 1}

,

S_{-} = {y : f (y) = - 1}

. Say

| S_{+} | = | S_{-} | = m

. Enumerate

S_{+} = {x_{1}, \dots, x_{m}}

,

S_{-} = {y_{1}, \dots, y_{m}}

. Assumptions (10) and conclusion (11) will give our theorems.

Example 3

(Independence in a two way table). Let

X = [I] \times [J]

. A minimal basis for the independence model is given by

f_{i, j, i^{'}, j^{'}}

:

\begin{matrix} & j & j^{'} \\ i & + & - \\ i^{'} & - & + \end{matrix} \quad \text{(all other entries = 0)} .

The condition of the Key Lemma becomes:

\begin{matrix} P [X_{1} = (i, j), X_{2} = (i^{'}, j^{'}), X_{3} = (i_{3}, j_{3}), \dots, X_{n} = (i_{n}, j_{n})] = \\ P [X_{1} = (i, j^{'}), X_{2} = (i^{'}, j), X_{3} = (i_{3}, j_{3}), \dots, X_{n} = (i_{n}, j_{n})] . \end{matrix}

Passing to the limit gives

p_{i j} p_{i^{'} j^{'}} = p_{i j^{'}} p_{i^{'} j}

and so

p_{i *} p_{* j} = \sum_{i^{'} j^{'}} p_{i j^{'}} p_{i^{'} j} = p_{i j} .

This is precisely Theorem 2 of the Introduction. □

Example 4

(Complete independence in a three way table). The sufficient statistics are

T_{i * *}, T_{* j *}, T_{* * k}

. From [5], there are two kinds of moves in a minimal basis. Up to symmetries, these are:

Class I	Class II
$\begin{matrix} j & j^{'} \\ i & + & - \\ i^{'} & - & + \end{matrix} \begin{matrix} \end{matrix} \begin{matrix} \end{matrix}$	$\begin{matrix} j & j^{'} \\ i & + & - \end{matrix} \begin{matrix} j & j^{'} \\ i^{'} & - & + \end{matrix} \begin{matrix} \end{matrix}$

Passing to the limit, this entails:

p_{i j k} p_{i^{'} j^{'} k} = p_{i j^{'} k} p_{i^{'} j k} \quad \text{and} \quad p_{i j k} p_{i^{'} j^{'} k^{'}} = p_{i j^{'} k} p_{i^{'} j k^{'}} .

These may be stated as ’the product of any $p_{i j k}$, $p_{i^{'} j^{'} k^{'}}$ remains unchanged if the middle coordinates are exchanged’. By symmetry, this remains true if the two first or last coordinates are exchanged. As above, this entails

p_{i * *} p_{* j *} p_{* * k} = p_{i j k} .

These observations can be rephrased into a statement that looks more similar to the classical de Finetti theorem; using symmetry:

Theorem 5.

Let

{X_{i}}_{i = 1}^{\infty}

be exchangeable, taking values in

[I] \times [J] \times [K]

. Then

\begin{matrix} P [X_{1} = (i_{1}, j_{1}, k_{1}), \dots, X_{n} = (i_{n}, j_{n}, k_{n})] = \\ P [X_{1} = (σ (i_{1}), ζ (j_{1}), η (k_{1})), \dots, X_{n} = (σ (i_{n}), ζ (j_{n}), η (k_{n}))] \end{matrix}

for all n,

{(i_{a}, j_{a}, k_{a})}_{a = 1}^{n}

and

(σ, ζ, η) \in S_{I} \times S_{J} \times S_{K}

is necessary and sufficient for there to exist a unique μ on

Δ_{I} \times Δ_{J} \times Δ_{K}

with

P [X_{a} = (i_{a}, j_{a}, k_{a}), 1 \leq a \leq n] = \int_{Δ_{I} \times Δ_{J} \times Δ_{K}} \prod_{a = 1}^{n} p_{i_{a}} q_{j_{a}} r_{k_{a}} μ (d p, d q, d r) .

□

Example 5

(One variable independent of the other two). Suppose, without loss, that the graph has the vertex i isolated and an edge joining j and k. [Figure not reproduced.]

Identify the pairs

(j, k)

with

{1, 2, \dots, L}

with

L = J K

. The problem reduces to Example 3. A minimal basis consists of (again, up to relabeling)

\begin{matrix} l & l^{'} \\ i & + & - \\ i^{'} & - & + \end{matrix}

We may conclude

Theorem 6.

Let

{X_{i}}_{i = 1}^{\infty}

be exchangeable, taking values in

[I] \times [J] \times [K]

. Then

\begin{matrix} P [X_{1} = (i_{1}, j_{1}, k_{1}), \dots, X_{n} = (i_{n}, j_{n}, k_{n})] = \\ P [X_{1} = (σ (i_{1}), ζ (j_{1}, k_{1})), \dots, X_{n} = (σ (i_{n}), ζ (j_{n}, k_{n}))] \end{matrix}

for all n,

{(i_{a}, j_{a}, k_{a})}_{a = 1}^{n}

and

(σ, ζ) \in S_{I} \times S_{J \times K}

is necessary and sufficient for there to exist a unique μ on

Δ_{I} \times Δ_{J K}

with

P [X_{a} = (i_{a}, j_{a}, k_{a}), 1 \leq a \leq n] = \int_{Δ_{I} \times Δ_{J K}} \prod_{a = 1}^{n} p_{i_{a}} q_{j_{a} k_{a}} μ (d p, d q) .

□

Example 6

(Conditional independence). Suppose variables i and j are conditionally independent given k. [Figure not reproduced: the graph of this model.]

Rewrite the parameter condition of Section 4 as

p_{* * k} p_{i j k} = p_{i * k} p_{* j k} \quad \text{for all } i, j, k .

The sufficient statistics are

{T_{i * k}}_{i, k}, {T_{* j k}}_{j k}

. From [5], a minimal generating set is

\begin{matrix} & j_{k} & j_{k}^{'} \\ i_{k} & + & - \\ i_{k}^{'} & - & + \end{matrix} \qquad K \times \frac{I (I - 1)}{2} \times \frac{J (J - 1)}{2} \text{ moves in all.}

From this, the Key Lemma shows (for all $i, i^{'}, j, j^{'}, k$)

p_{i j k} p_{i^{'} j^{'} k} = p_{i j^{'} k} p_{i^{'} j k} .

This entails:

p_{i * k} p_{* j k} = \sum_{i^{'}, j^{'}} p_{i j^{'} k} p_{i^{'} j k} = \sum_{i^{'} j^{'}} p_{i j k} p_{i^{'} j^{'} k} = p_{i j k} p_{* * k} .

Again, condition (10) can be phrased in terms of symmetry.

Theorem 7.

Let

{X_{i}}_{i = 1}^{\infty}

be exchangeable, taking values in

[I] \times [J] \times [K]

. Then,

\begin{matrix} P [X_{1} = (i_{1}, j_{1}, k_{1}), \dots, X_{n} = (i_{n}, j_{n}, k_{n})] = \\ P [X_{1} = (σ^{k_{1}} (i_{1}), ζ^{k_{1}} (j_{1}), k_{1}), \dots, X_{n} = (σ^{k_{n}} (i_{n}), ζ^{k_{n}} (j_{n}), k_{n})] \end{matrix}

(12)

for all n,

{(i_{a}, j_{a}, k_{a})}_{a = 1}^{n}

and

(σ^{k}, ζ^{k}) \in S_{I} \times S_{J}

,

1 \leq k \leq K

, is necessary and sufficient for there to exist a unique family $μ \times \prod_{b = 1}^{K} μ_{b, r}$ on $Δ_{K} \times {(Δ_{I} \times Δ_{J})}^{K}$ with

\begin{matrix} P [X_{a} = (i_{a}, j_{a}, k_{a}), 1 \leq a \leq n] = \\ \int_{Δ_{K} \times {(Δ_{I} \times Δ_{J})}^{K}} \prod_{a = 1}^{n} r_{k_{a}} p_{i_{a}}^{k_{a}} q_{j_{a}}^{k_{a}} \prod_{b = 1}^{K} μ_{b, r} (d p^{b}, d q^{b}) μ (d r) . \end{matrix}

(13)

Both (12) and (13) have a simple interpretation. For (12),

{X_{i}}_{i = 1}^{n}

are exchangeable 3-vectors. For any k and specified sequence of values

{(i_{a}, j_{a}, k)}_{a = 1}^{n}

the chance of observing these values is unchanged under permuting the

(i_{a}, j_{a}, k)

, by permutations

σ^{k} \in S_{I}, ζ^{k} \in S_{J}

. Here

σ^{k}, ζ^{k}

are allowed to depend on k.

On the right of (13), the mixing measure may be understood as follows. There is a probability

μ

on

Δ_{K}

. Pick

r = (r_{1}, \dots, r_{K}) \in Δ_{K}

. Given this r, pick

(p^{k}, q^{k})

from

μ_{k, r}

on the

k^{t h}

copy of

Δ_{I} \times Δ_{J}

. These choices are allowed to depend on r but are independent, conditional on r,

1 \leq k \leq K

.

All of this simply says that, conditional on the tail field,

P [X_{a} = (i, j, k) | T] P [X_{a} = (*, *, k) | T] = P [X_{a} = (i, *, k) | T] P [X_{a} = (*, j, k) | T] .

The first two coordinates are conditionally independent given the third.
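The two-stage description of the mixing measure can be illustrated with one draw of the parameters. In the sketch below, r is drawn first and then one pair $(p^{k}, q^{k})$ for each k; for simplicity the pairs are drawn without reference to r, which is a special case of what the theorem allows. The resulting cell probabilities are checked against the relation $p_{* * k} p_{i j k} = p_{i * k} p_{* j k}$. All names and dimensions are assumptions for illustration.

```python
import itertools
import math
import random


def random_simplex_point(size, rng):
    """Draw a point of the simplex with strictly positive coordinates."""
    weights = [rng.random() + 0.1 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def marginal(cell, i=None, j=None, k=None):
    """Sum cell probabilities over every coordinate left as None (the '*' positions)."""
    return sum(
        v for (a, b, c), v in cell.items()
        if (i is None or a == i) and (j is None or b == j) and (k is None or c == k)
    )


rng = random.Random(2)
I, J, K = 3, 2, 4
r = random_simplex_point(K, rng)                      # r in Delta_K
p = [random_simplex_point(I, rng) for _ in range(K)]  # p[k] in Delta_I
q = [random_simplex_point(J, rng) for _ in range(K)]  # q[k] in Delta_J

cell = {
    (i, j, k): r[k] * p[k][i] * q[k][j]
    for i, j, k in itertools.product(range(I), range(J), range(K))
}

conditionally_independent = all(
    math.isclose(
        marginal(cell, k=k) * cell[i, j, k],
        marginal(cell, i=i, k=k) * marginal(cell, j=j, k=k),
        rel_tol=1e-9,
    )
    for (i, j, k) in cell
)
print(conditionally_independent)
```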

Example 7

(No three way interaction). The model is described in Section 4. The sufficient statistics are

{T_{i j *}}, {T_{i * k}}, {T_{* j k}} .

Minimal Markov bases have proved intractable. See [5] or [8]. For any fixed

I, J, K

, the computer can produce a Markov basis but these can have a huge number of terms. See [7,8] and their references for a surprisingly rich development.

There is a pleasant surprise. Markov bases are required to connect the associated Markov chain. There is a natural subset, the first moves anyone considers, and these are enough for a satisfactory de Finetti theorem (!).

Described informally, for an

I \times J \times K

array, pick a pair of parallel planes, say the

k, k^{'}

planes in the three dimensional array, and consider moves depicted as

\begin{matrix} \begin{matrix} & j & j^{'} \\ i & + & - \\ i^{'} & - & + \end{matrix} & \begin{matrix} & j & j^{'} \\ i & - & + \\ i^{'} & + & - \end{matrix} \\ k & k^{'} \end{matrix}

These moves preserve all line sums (the sufficient statistics). They are not sufficient to connect any two datasets with the same sufficient statistics. Using the prescription in the Key Lemma, suppose:

\begin{matrix} P [X_{1} = (i, j, k), X_{2} = (i^{'}, j^{'}, k), X_{3} = (i, j^{'}, k^{'}), X_{4} = (i^{'}, j, k^{'}), \\ X_{a} = (i_{a}, j_{a}, k_{a}) 5 \leq a \leq n] = \\ P [X_{1} = (i, j^{'}, k), X_{2} = (i^{'}, j, k), X_{3} = (i, j, k^{'}), X_{4} = (i^{'}, j^{'}, k^{'}), \\ X_{a} = (i_{a}, j_{a}, k_{a}) 5 \leq a \leq n] . \end{matrix}

(14)

Passing to the limit gives

p_{i j k} p_{i^{'} j^{'} k} p_{i j^{'} k^{'}} p_{i^{'} j k^{'}} = p_{i j^{'} k} p_{i^{'} j k} p_{i j k^{'}} p_{i^{'} j^{'} k^{'}} .

(15)

This is exactly the no three way interaction condition. Or, equivalently:

\frac{p_{i j k} p_{i^{'} j^{'} k}}{p_{i j^{'} k} p_{i^{'} j k}} = \frac{p_{i j k^{'}} p_{i^{'} j^{'} k^{'}}}{p_{i j^{'} k^{'}} p_{i^{'} j k^{'}}} .

The odds ratios are constant on the

k^{t h}

and

k^{' t h}

planes (of course, they depend on

i, j, i^{'}, j^{'}

). These considerations imply:

Theorem 8.

Let

{X_{i}}_{i = 1}^{\infty}

be exchangeable, taking values in

[I] \times [J] \times [K]

. Then, condition (14) is necessary and sufficient for the existence of a unique probability μ on

Δ_{I J K}

, supported on the no three way interaction variety (15) satisfying

P [X_{a} = (i_{a}, j_{a}, k_{a}), 1 \leq a \leq n] = \int_{Δ_{I J K}} \prod p_{i j k}^{η_{i j k}} μ (d p_{i j k}) .
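That the basic moves behind condition (14) preserve all line sums can be checked directly. The sketch below applies one such move to a made-up $3 \times 3 \times 2$ table of counts and compares the three families of line sums before and after. The cells receiving $+1$ are those on the left of (14) and the cells receiving $- 1$ are those on the right; labels are 0-based and all names are hypothetical.

```python
import itertools
import random


def line_sums(table, I, J, K):
    """Return the three families of line sums T_{ij*}, T_{i*k}, T_{*jk}."""
    t_ij = {(i, j): sum(table[i, j, k] for k in range(K))
            for i in range(I) for j in range(J)}
    t_ik = {(i, k): sum(table[i, j, k] for j in range(J))
            for i in range(I) for k in range(K)}
    t_jk = {(j, k): sum(table[i, j, k] for i in range(I))
            for j in range(J) for k in range(K)}
    return t_ij, t_ik, t_jk


def basic_move(i, i2, j, j2, k, k2):
    """Cells gaining and losing one count under the move used in (14)."""
    plus = [(i, j, k), (i2, j2, k), (i, j2, k2), (i2, j, k2)]
    minus = [(i, j2, k), (i2, j, k), (i, j, k2), (i2, j2, k2)]
    return plus, minus


def apply_move(table, plus, minus):
    """Return a copy of the table with the move added."""
    new = dict(table)
    for c in plus:
        new[c] += 1
    for c in minus:
        new[c] -= 1
    return new


rng = random.Random(3)
I, J, K = 3, 3, 2
# Made-up table with entries >= 1, so one move cannot create a negative count.
table = {c: rng.randint(1, 5)
         for c in itertools.product(range(I), range(J), range(K))}

plus, minus = basic_move(i=0, i2=2, j=1, j2=2, k=0, k2=1)
moved = apply_move(table, plus, minus)
print(line_sums(moved, I, J, K) == line_sums(table, I, J, K))
```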

We remark on the following points.

It follows from theorems in [12] and [11] that, if all $p_{i j k} > 0$ , condition (15) is equivalent to the unique representation,

$p_{i j k} = r α_{j k} β_{k i} γ_{i j},$

(16)

where $r, α, β, γ$ have positive entries and satisfy

$\sum_{k} α_{j k} = \sum_{i} β_{k i} = \sum_{j} γ_{i j} = 1 for all i, j, k$

and

$r \sum_{i, j, k} α_{j k} β_{k i} γ_{i j} = 1 .$

The integral representation in the theorem can be stated in this parametrization. The condition $p_{i j k} > 0$ is equivalent to $P (X_{1} = (i, j, k)) > 0$ on observables.
Condition (14) does not have an obvious symmetry interpretation.
Conditions (14) and (15) are stated via varying the third variable when $i, j, i^{'}, j^{'}$ are fixed. Because of (16), if they hold in this form, they hold for any two variables fixed as the third varies.
It is possible to go on, but, as John Darroch put it, ’the extensions to higher order interactions… are not likely to be of practical interest’. The most natural development—the generalization to decomposable models—is being developed by Paula Gablenz.
There are many extensions of the Key Lemma above. These allow a similar development for more general log linear models and exponential families.

6. Discussion and Conclusions

The tools of algebraic statistics have been harnessed above to develop partial exchangeability for standard contingency table models. I have used them for two further Bayesian tasks: approximate exchangeability and the problem of ’doubly intractable priors’. As both are developed in papers, I will be brief.

Approximate exchangeability. Consider n men and m women along with a binary outcome. If the men are judged exchangeable (for fixed outcomes for the women) and vice versa, and, if both sequences are extendable, de Finetti [1] shows that there is a unique prior on the unit square

{[0, 1]}^{2}

such that, for any outcomes

t_{1}, \dots, t_{n}, σ_{1}, \dots, σ_{m}

in

{0, 1}

\begin{matrix} P [X_{1} = t_{1}, \dots, X_{n} = t_{n}, Y_{1} = σ_{1}, \dots, Y_{m} = σ_{m}] = \\ \int_{{[0, 1]}^{2}} p^{S} {(1 - p)}^{n - S} θ^{T} {(1 - θ)}^{m - T} μ (d p, d θ), \end{matrix}

with

S = \sum_{i = 1}^{n} t_{i}

,

T = \sum_{j = 1}^{m} σ_{j}

.

If, for the outcome of interest,

{X_{i}, Y_{j}}

were almost fully exchangeable (so the men/women difference is judged practically irrelevant) the prior

μ

would be concentrated near the diagonal of

{[0, 1]}^{2}

. De Finetti suggested implementing this by considering priors of the form

μ (d p, d θ) = Z^{- 1} e^{- A {(p - θ)}^{2}} d p d θ

for A large.
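To see how the constant A controls concentration near the diagonal, the following sketch approximates, by a midpoint rule on a grid, the prior mass of the band $| p - θ | < ε$. The grid size, the value of $ε$ and the function names are arbitrary choices made for illustration.

```python
import math


def unnormalized_density(p, theta, A):
    """exp(-A (p - theta)^2), the de Finetti prior density up to the constant Z."""
    return math.exp(-A * (p - theta) ** 2)


def mass_near_diagonal(A, eps=0.1, m=200):
    """Midpoint-rule estimate of the prior mass of {|p - theta| < eps} on an m x m grid."""
    total = 0.0
    near = 0.0
    for a in range(m):
        p = (a + 0.5) / m
        for b in range(m):
            theta = (b + 0.5) / m
            w = unnormalized_density(p, theta, A)
            total += w
            if abs(p - theta) < eps:
                near += w
    return near / total  # the unknown constant Z cancels in this ratio


for A in (0.0, 10.0, 100.0):
    print(A, round(mass_near_diagonal(A), 3))
```

With $A = 0$ the prior is uniform on the square and the band has mass about $1 - {(1 - ε)}^{2}$; the mass increases with A.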

In joint work with Sergio Bacallado and Susan Holmes [3], multivariate versions of such priors are developed. These are required to concentrate near sub-manifolds of cubes or products of simplices; think about ‘approximate no three way interaction’. We used the tools of algebraic statistics to suggest appropriate many variable polynomials which vanish on the submanifold of interest. Many ad hoc choices were involved. Sampling from such priors or posteriors is a fresh research area. See [2,14,15].

Doubly intractable priors. Consider an exponential family as in Section 3:

p_{θ} (x) = \frac{1}{Z (θ)} e^{θ \cdot T (x)} .

Here

x \in X

a finite set,

T : X \to R^{d}

and

θ \in R^{d}

. In many real examples, the normalizing constant

Z (θ)

will be unknown and unknowable. For a Bayesian treatment, let

Π (d θ)

be a prior distribution on

R^{d}

. For example, the conjugate prior.

If

X_{1}, X_{2}, \dots, X_{n}

is an i.i.d. sample from

p_{θ}

, T is a sufficient statistic and the posterior has the form

Z {(Z^{- 1} (θ))}^{n} e^{θ F} Π (d θ),

with

F = \sum_{i = 1}^{n} T (X_{i})

and Z another normalizing constant. The problem is that

Z^{- 1} (θ)

depends on

θ

and is unknown!

The exchange algorithm and many variants offer a useful solution. See [16,17].

In practical implementations, there is an intermediary step requiring a sample from

p_{θ^{'}}^{T}

, the measure induced by

p_{θ}^{n}

under

\sum_{i}^{n} T (x_{i}) : X^{n} \to R

. This is a discrete sampling task and Markov basis techniques have proved useful. See [16].
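For orientation, here is a minimal sketch of one standard form of the exchange algorithm, with a symmetric random-walk proposal and a standard normal prior; it is not claimed to match the variants in [16,17] in detail. The state space $\{0, 1, 2\}$ with $T (x) = x$ is a toy assumption, small enough that the auxiliary dataset can be drawn exactly by enumeration. In realistic problems that auxiliary draw is the hard step, and it is where Markov basis samplers would enter.

```python
import math
import random

STATES = [0, 1, 2]  # toy finite state space (an assumption for illustration)


def T(x):
    """One-dimensional sufficient statistic."""
    return float(x)


def sample_dataset(theta, n, rng):
    """Exact draw of n i.i.d. points from p_theta, by enumerating the tiny state space."""
    weights = [math.exp(theta * T(x)) for x in STATES]
    return rng.choices(STATES, weights=weights, k=n)


def log_prior(theta):
    """Standard normal prior, up to an additive constant."""
    return -0.5 * theta * theta


def exchange_sampler(data, steps, step_size, rng):
    """Random-walk exchange algorithm; Z(theta) never needs to be evaluated."""
    n = len(data)
    F_x = sum(T(x) for x in data)
    theta = 0.0
    chain = []
    for _ in range(steps):
        proposal = theta + rng.gauss(0.0, step_size)
        auxiliary = sample_dataset(proposal, n, rng)  # fresh data drawn at the proposal
        F_y = sum(T(y) for y in auxiliary)
        # Log acceptance ratio: the normalizing constants cancel between
        # the observed data and the auxiliary data.
        log_ratio = (log_prior(proposal) - log_prior(theta)
                     + (proposal - theta) * F_x
                     + (theta - proposal) * F_y)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        chain.append(theta)
    return chain


observed = [2] * 20  # made-up data concentrated on the largest state
chain = exchange_sampler(observed, steps=2000, step_size=0.5, rng=random.Random(0))
posterior_mean = sum(chain[500:]) / len(chain[500:])
print(round(posterior_mean, 2))
```

Since the made-up data sit on the largest state, the sampled values of $θ$ should settle on the positive side.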

A philosophical comment. The task undertaken above, finding believable Bayesian interpretations for widely used log linear models, goes somewhat against the grain of standard statistical practice. I do not think anyone takes a reasonably complex, high dimensional hierarchical model seriously. They are mostly used as a part of exploratory data analysis; this is not to deny their usefulness. Making any sense of a high dimensional dataset is a difficult task. Practitioners search through huge collections of models in an automated way. Usually, any reflection suggests the underlying data is nothing like a sample from a well specified population. Nonetheless, models are compared using product likelihood criteria. It is a far, far cry from being based on anyone’s reasoned opinion.

I have written elsewhere about finding Bayesian justification for important statistical tasks such as graphical methods or exploratory data analysis [18]. These seem like tasks similar to ’how do you form a prior’, different from the focus of even the most liberal Bayesian thinking.

The sufficiency approach. There is a different approach to extending de Finetti’s theorem. This uses ‘sufficiency’. Consider exchangeable

{X_{i}}_{i = 1}^{\infty}

. For each n, suppose

T_{n} : X^{n} \to Y

is a function. The

{T_{n}}

have to fit together according to simple rules satisfied in all of the examples above. Call

{X_{i}}

partially exchangeable with respect to

T_{n}

if

P [X_{1} = x_{1}, \dots, X_{n} = x_{n} | T_{n} = t_{n}]

is uniform. Then, Diaconis and Freedman [19] show that a version of de Finetti’s theorem holds. The law of

{X_{i}}

is a mixture of i.i.d. laws indexed by extremal laws. In dozens of examples, these extremal laws can be identified with standard exponential families. This last step remains to be carried out in the generality of Section 3 above. What is required is a version of the Koopman–Pitman–Darmois theorem for discrete random variables. This is developed in [19] when

\mathcal{X} \subseteq \mathbb{N}

and

T_{n} (X_{1}, \dots, X_{n}) = X_{1} + \dots + X_{n}

. Passing to interpretation, this version of partial exchangeability has the following form:

\begin{matrix} \text{if } T_{n} (x_{1}, \dots, x_{n}) = T_{n} (y_{1}, \dots, y_{n}), \\ \text{then } P [X_{1} = x_{1}, \dots, X_{n} = x_{n}] = P [X_{1} = y_{1}, \dots, X_{n} = y_{n}] . \end{matrix}

This is neat mathematics (and allows a very general theoretical development). However, it does not seem as easy to think about in natural examples. Exchangeability via symmetry is much easier. The development above is a half-way house between symmetry and sufficiency. A close relative of the sufficiency approach is the topic of ‘extremal models’ as developed by Martin-Löf and Lauritzen. See [20] and its references. Moreover, Refs. [21,22] are recent extensions aimed at contingency tables.
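The sufficiency form of partial exchangeability can be checked numerically in the simplest case mentioned above, where the observations are non-negative integers and T_n is the sum. The sketch below is only an illustration under stated assumptions: the function names (`geometric`, `sequence_prob`, `depends_only_on_sum`) and the particular mixing weights are hypothetical and do not come from [19]. It verifies by exhaustive enumeration that a finite mixture of i.i.d. geometric laws assigns equal probability to any two sequences with the same sum, and that an i.i.d. law with a non-geometric marginal does not.

```python
from fractions import Fraction
from itertools import product


def geometric(r):
    """Return the pmf x -> (1 - r) * r**x on {0, 1, 2, ...} (hypothetical helper)."""
    return lambda x: (1 - r) * r ** x


def sequence_prob(xs, mixing):
    """Probability of the sequence xs under a finite mixture of i.i.d. laws.

    mixing is a list of (weight, pmf) pairs whose weights sum to one.
    """
    total = Fraction(0)
    for weight, pmf in mixing:
        p = Fraction(1)
        for x in xs:
            p *= pmf(x)
        total += weight * p
    return total


def depends_only_on_sum(mixing, n, max_value):
    """Check that P[X_1 = x_1, ..., X_n = x_n] is a function of x_1 + ... + x_n.

    Only sequences with entries in {0, ..., max_value} are enumerated.
    """
    seen = {}
    for xs in product(range(max_value + 1), repeat=n):
        p = sequence_prob(xs, mixing)
        t = sum(xs)
        # Remember the first probability seen for this sum, compare the rest to it.
        if seen.setdefault(t, p) != p:
            return False
    return True


# Illustrative mixture of two geometric laws (weights and ratios are arbitrary choices).
geometric_mixture = [
    (Fraction(1, 3), geometric(Fraction(1, 4))),
    (Fraction(2, 3), geometric(Fraction(3, 5))),
]

# A non-geometric marginal on {0, 1, 2}, used as a counterexample.
non_geometric = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}
counterexample = [(Fraction(1), lambda x: non_geometric.get(x, Fraction(0)))]

print(depends_only_on_sum(geometric_mixture, n=3, max_value=4))
print(depends_only_on_sum(counterexample, n=2, max_value=2))
```

Exact rational arithmetic is used so that the equality test is not affected by floating point rounding. The enumeration is exponential in n and is meant only for tiny cases.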

Classical Bayesian contingency table analysis. There is a healthy development of parametric analysis for the examples of Section 5. This is based on natural conjugate priors. It includes nice theory and R packages to actually carry out calculations in real problems. Papers that I like include [23,24,25,26]. The many wonderful contributions by I.J. Good are still very much worth consulting. See [27] for a survey. Section 5 provides ‘observable characterizations’ of the models. The problem of providing ‘observable characterizations’ of the associated conjugate priors (along the lines of [28]) remains open.
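As a small illustration of conjugate updating, consider the independence model p_{ij} = θ_i η_j from the Introduction. The sketch below assumes independent Dirichlet priors on θ and η with integer hyperparameters; this choice, and the names `posterior_mean_independence`, `alpha` and `beta`, are assumptions made for the example and are not taken from [23,24,25,26]. Because the likelihood factors into a term in the row sums and a term in the column sums, the posterior is again a product of Dirichlet laws, and the posterior mean of p_{ij} is the product of the two posterior means.

```python
from fractions import Fraction


def posterior_mean_independence(table, alpha, beta):
    """Posterior mean of p_ij = theta_i * eta_j for an I x J table of counts.

    Assumes theta ~ Dirichlet(alpha) and eta ~ Dirichlet(beta), independent,
    with integer hyperparameters so that exact fractions can be used.
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]

    # Conjugate update: add the observed margins to the prior parameters.
    alpha_post = [a + r for a, r in zip(alpha, row_sums)]
    beta_post = [b + c for b, c in zip(beta, col_sums)]

    # The mean of a Dirichlet law is its parameter vector normalised to sum to one.
    theta_mean = [Fraction(a, sum(alpha_post)) for a in alpha_post]
    eta_mean = [Fraction(b, sum(beta_post)) for b in beta_post]

    # theta and eta stay independent a posteriori, so E[theta_i eta_j] factors.
    return [[t * e for e in eta_mean] for t in theta_mean]


# Hypothetical 2 x 2 table of counts with uniform Dirichlet(1, 1) priors.
example_table = [[3, 1], [2, 4]]
example_mean = posterior_mean_independence(example_table, alpha=[1, 1], beta=[1, 1])
for row in example_mean:
    print([str(entry) for entry in row])
```

Only the row and column sums of the table enter the update, which reflects the fact that the margins are sufficient for the independence model.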

Funding

This research received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant agreement No 817257 and funding from NSF grant No 1954042.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The author would like to thank Paula Gablenz, Sourav Chatterjee and Emanuele Dolera for help throughout.

Conflicts of Interest

The author declares no conflict of interest.

References

1. de Finetti, B. On the condition of partial exchangeability. Stud. Inductive Log. Probab. 1980, 2, 193–205.
2. Bruno, A. On the notion of partial exchangeability (Italian). In Giornale dell’Istituto Italiano degli Attuari; English Translation in: de Finetti, Probability, Induction and Statistics; International Statistical Institute: Leidschenveen, The Netherlands, 1964; Volume 27, Chapter 10; pp. 174–196.
3. Bacallado, S.; Diaconis, P.; Holmes, S. De Finetti priors using Markov chain Monte Carlo computations. J. Stat. Comput. 2015, 25, 797–808.
4. de Finetti, B. Probability, Induction and Statistics: The Art of Guessing; Wiley: Hoboken, NJ, USA, 1972.
5. Diaconis, P.; Sturmfels, B. Algebraic algorithms for sampling from conditional distributions. Ann. Stat. 1998, 26, 363–397.
6. Diaconis, P.; Gangolli, A. Rectangular arrays with fixed margins. In Discrete Probability and Algorithms; Springer: New York, NY, USA, 1995; Volume 72, pp. 15–41.
7. Sullivant, S. Algebraic Statistics; AMS: Providence, RI, USA, 2018.
8. Aoki, S.; Hara, H.; Takemura, A. Markov Bases in Algebraic Statistics; Springer: Berlin/Heidelberg, Germany, 2012.
9. Lauritzen, S.L. Graphical Models, 2nd ed.; Oxford University Press: Oxford, UK, 2004.
10. Agresti, A. Categorical Data Analysis, 2nd ed.; Wiley: Hoboken, NJ, USA, 2002.
11. Birch, M.W. Maximum likelihood in three-way contingency tables. J. R. Stat. Soc. Ser. B 1963, 25, 220–233.
12. Darroch, J.N. Interactions in multi-factor contingency tables. J. R. Stat. Soc. Ser. B 1962, 24, 251–263.
13. Simpson, E.H. The interpretation of interaction in contingency tables. J. R. Stat. Soc. Ser. B 1951, 13, 238–241.
14. Diaconis, P.; Holmes, S.; Shahshahani, M. Sampling From a Manifold. In Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton; IMS Statistics Collections: Beachwood, OH, USA, 2013; pp. 102–125.
15. Gerencsér, B.; Ottolini, A. Rates of convergence for Gibbs sampling in the analysis of almost exchangeable data. arXiv 2020, arXiv:2010.15539v2.
16. Diaconis, P.; Wang, G. Bayesian goodness of fit tests: A conversation for David Mumford. Ann. Math. Sci. Appl. 2018, 3, 287–308.
17. Wang, G. On the Theoretical Properties of the Exchange Algorithm. arXiv 2021, arXiv:2005.09235v4.
18. Diaconis, P. Theories of data analysis: From magical thinking through classical statistics. In Exploring Data Tables, Trends, and Shapes; Hoaglin, D.C., Mosteller, F., Tukey, J.W., Eds.; Wiley: Hoboken, NJ, USA, 1985; pp. 1–36.
19. Diaconis, P.; Freedman, D. Partial Exchangeability and Sufficiency. In Statistics: Applications and New Directions; Ghosh, K., Roy, F., Eds.; Indian Statistical Institute: Calcutta, India, 1984; pp. 205–236.
20. Lauritzen, S.L. General Exponential Models for Discrete Observations. Scand. J. Stat. 1975, 2, 23–33.
21. Lauritzen, S.L.; Rinaldo, A.; Sadeghi, K. Random Networks, Graphical Models, and Exchangeability. arXiv 2017, arXiv:1701.08420v2.
22. Lauritzen, S.L.; Rinaldo, A.; Sadeghi, K. On exchangeability in network models. J. Algebr. Stat. 2019, 10, 85–114.
23. Albert, J.H.; Gupta, A.K. Mixtures of Dirichlet distributions and estimation in contingency tables. Ann. Stat. 1982, 10, 1261–1268.
24. Murray, I.; Ghahramani, Z.; MacKay, D.J.C. MCMC for doubly-intractable distributions. In Proceedings of the 22nd Conference in Uncertainty in Artificial Intelligence (UAI ’06), Cambridge, MA, USA, 13–16 July 2006.
25. Letac, G.; Massam, H. Bayes factors and the geometry of discrete hierarchical loglinear models. Ann. Stat. 2012, 40, 861–890.
26. Tarantola, C.; Ntzoufras, I. Bayesian Analysis of Graphical Models of Marginal Independence for Three Way Contingency Tables. In Quaderni di Dipartimento from University of Pavia; No 172; Department of Economics and Quantitative Methods, University of Pavia: Pavia, Italy, 2012; Available online: http://dem-web.unipv.it/web/docs/dipeco/quad/ps/RePEc/pav/wpaper/q172.pdf (accessed on 30 December 2021).
27. Diaconis, P.; Efron, B. Testing for independence in a two-way table: New interpretations of the Chi-Square statistic. Ann. Stat. 1985, 13, 845–874.
28. Diaconis, P.; Ylvisaker, D. Quantifying prior opinion. In Bayesian Statistics, II; Bernardo, J., DeGroot, M., Lindley, D., Smith, A.F.M., Eds.; North-Holland: Amsterdam, The Netherlands, 1985; pp. 133–156.

© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
