1. Introduction and Background
It has been a long-standing pursuit in probability theory and its applications to express a random sequence as a mixture of simpler random sequences. The mixing is meant here in the probabilistic sense, that is, we select one among the component sequences via some probability distribution that governs the mixing, and then output the selected sequence in its entirety. Equivalently, the distribution of the resulting sequence (i.e., the joint distribution of its entries) is a convex combination of the distributions of the component sequences. The distribution used for the selection is often referred to as the mixing measure.
Note: when we only want to represent a single random variable as a mixture, it is a much simpler case, discussed in the well-established statistical field of mixture models, see Lindsay [1]. Here we are interested, however, in expressing random sequences, rather than just single random variables.
Which simple sequences can serve best as the components of mixing? Arguably, the simplest possible probabilistic structure that a random sequence can have is being a sequence of independent, identically distributed (i.i.d.) random variables. The mixture of such i.i.d. sequences, however, does not have to remain i.i.d. For example, the identically 0 and identically 1 sequences are both i.i.d., but if we mix them by selecting one of them with probability 1/2, then we get a sequence in which each term is either 0 or 1 with probability 1/2, but all of them are equal, so the entries are clearly not independent.
Since the joint distribution of any i.i.d. sequence is invariant to reordering its terms by any fixed permutation, the mixture must also behave this way. The reason is that it does not matter whether we first apply a permutation to each sequence and then select one of them, or first make the selection and apply the permutation afterward to the selected sequence. The sequences with the property that their joint distribution is invariant to permutations are called exchangeable:
Definition 1 (Exchangeable sequence). A finite sequence $X_1, \ldots, X_n$ of random variables is called exchangeable if its joint distribution is invariant with respect to permutations. That is, for any permutation σ of $\{1, \ldots, n\}$, the joint distribution of $(X_1, \ldots, X_n)$ is the same as the joint distribution of $(X_{\sigma(1)}, \ldots, X_{\sigma(n)})$. An infinite sequence is called exchangeable if every finite initial segment of the sequence is exchangeable.
This means that an exchangeable sequence is stochastically indistinguishable from any permutation of itself. An equivalent definition is that if we pick k entries of the sequence, then their joint distribution depends only on k, not on which k entries are selected or in which order. This also implies (with k = 1) that each individual entry has the same distribution. Sampling tasks often produce exchangeable sequences, since in most cases the order of the samples does not matter.
As a special case, the definition is satisfied by i.i.d. random variables, but not every exchangeable sequence is i.i.d. There are many examples to demonstrate this; a simple one can be obtained from geometric considerations:
Example 1. Take a square in the plane, and divide it into two triangles by one of its diagonals. Select one of the triangles at random with probability 1/2, and then pick n uniformly random points from the selected triangle. These random points constitute an exchangeable sequence, since their joint probability distribution remains the same, regardless of the order they have been produced. Furthermore, each individual point is uniformly distributed over the whole square, because it is uniformly distributed over a triangle, which is selected with equal probability from among the two triangles. On the other hand, the random points are not independent, since if we know that a point falls in the interior of a given one of the two triangles, all the others must fall in the same triangle.
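As a quick illustration, the following minimal Python sketch simulates Example 1 (the helper names and sample sizes are arbitrary choices): it checks numerically that a single point spreads over the whole square, while all points of one generated sequence fall into the same triangle.

```python
import random

def point_in_triangle(lower: bool) -> tuple:
    """Uniform point in one of the two triangles cut from the unit square by the
    diagonal y = x: the lower triangle (y <= x) or the upper one (y >= x)."""
    u, v = random.random(), random.random()
    if v > u:              # reflect across the diagonal if the point is on the wrong side
        u, v = v, u
    return (u, v) if lower else (v, u)

def mixture_sequence(n: int) -> list:
    """Example 1: select one triangle with probability 1/2, then draw n i.i.d. points in it."""
    lower = random.random() < 0.5
    return [point_in_triangle(lower) for _ in range(n)]

if __name__ == "__main__":
    random.seed(0)
    # Marginal of a single point: close to uniform over the square, so about half
    # of the first points should land below the diagonal.
    runs = [mixture_sequence(5) for _ in range(10_000)]
    below = sum(seq[0][1] <= seq[0][0] for seq in runs)
    print("fraction of first points below the diagonal:", below / len(runs))
    # Dependence: within a single sequence, every point lies in the same triangle.
    seq = mixture_sequence(5)
    print("all points of one sequence in the same triangle:",
          len({p[1] <= p[0] for p in seq}) == 1)
```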
As we have argued before Definition 1, the mixing of i.i.d. sequences produces exchangeable sequences. A classical theorem of Bruno de Finetti, originally published in Italian [2] in 1931, says that the converse is also true for infinite binary sequences: every infinite exchangeable sequence of binary random variables can be represented as a mixture of i.i.d. Bernoulli random variables (for short, a Bernoulli i.i.d.-mix). The result can be formally stated in several different ways; here is a frequently used one, which captures the distribution of the exchangeable sequence as a mixture of binomial distributions:
Theorem 1 (de Finetti’s Theorem—distributional form). Let $X_1, X_2, \ldots$ be an infinite sequence of $\{0,1\}$-valued exchangeable random variables. Then there exists a probability measure μ (called the mixing measure) on $[0,1]$, such that for every positive integer n and for any $x_1, \ldots, x_n \in \{0,1\}$ the following holds:
$$\Pr(X_1 = x_1, \ldots, X_n = x_n) = \int_0^1 p^k (1-p)^{n-k}\, d\mu(p), \qquad (1)$$
where $k = x_1 + \cdots + x_n$. Furthermore, the measure μ is uniquely determined.

Note that the reason for using a Stieltjes integral on the right-hand side of (1) is just to express discrete, continuous, and mixed distributions in a unified format. For example, if the mixing measure μ is discrete, taking values $p_1, p_2, \ldots$ with probabilities $q_1, q_2, \ldots$, respectively, then the integral becomes the sum $\sum_i q_i\, p_i^k (1-p_i)^{n-k}$. If the mixing measure is continuous and has a density function f, then the integral becomes the ordinary integral $\int_0^1 p^k (1-p)^{n-k} f(p)\, dp$. The Stieltjes integral expression contains all these special cases in a unified format, including mixed distributions, as well.
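As a small numerical illustration of formula (1), the following sketch evaluates the mixture probabilities for an arbitrarily chosen two-point mixing measure and checks that they depend on $x_1, \ldots, x_n$ only through the number of ones k, as exchangeability requires.

```python
from itertools import product

# Toy discrete mixing measure: p = 0.2 with weight 0.3, and p = 0.7 with weight 0.7.
MIXING_MEASURE = [(0.2, 0.3), (0.7, 0.7)]

def seq_prob(bits):
    """P(X_1 = x_1, ..., X_n = x_n) from formula (1) with a discrete mixing measure:
    the sum of q_i * p_i^k * (1 - p_i)^(n - k) over the mass points p_i."""
    n, k = len(bits), sum(bits)
    return sum(q * p ** k * (1 - p) ** (n - k) for p, q in MIXING_MEASURE)

if __name__ == "__main__":
    n = 4
    probs = {bits: seq_prob(bits) for bits in product((0, 1), repeat=n)}
    print("total probability:", round(sum(probs.values()), 12))       # should print 1.0
    # Exchangeability check: the probability depends only on k = number of ones.
    values_by_k = {}
    for bits, pr in probs.items():
        values_by_k.setdefault(sum(bits), set()).add(round(pr, 12))
    print("probability depends only on k:", all(len(v) == 1 for v in values_by_k.values()))
```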
Another often seen form of the theorem emphasizes that the sequence becomes an i.i.d. Bernoulli sequence whenever we condition on the value $\eta = p$ of a suitable random variable η, as presented below:
Theorem 2 (de Finetti’s Theorem—conditional independence form). Let $X_1, X_2, \ldots$ be an infinite sequence of $\{0,1\}$-valued exchangeable random variables. Then there exists a random variable η, taking values in $[0,1]$, such that for every $p \in [0,1]$, for every positive integer n and for any $x_1, \ldots, x_n \in \{0,1\}$ the following holds:
$$\Pr(X_1 = x_1, \ldots, X_n = x_n \mid \eta = p) = p^k (1-p)^{n-k}, \qquad (2)$$
where $k = x_1 + \cdots + x_n$. Furthermore, η is the limiting fraction of the number of ones in the sequence (the empirical distribution):
$$\eta = \lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} \quad \text{(with probability 1).}$$

It is interesting that the requirement of having an infinite sequence is essential; for the finite case counterexamples are known, see, e.g., Stoyanov [3]. (Note that even though Equations (1) and (2) use a fixed finite n, the theorem requires them to hold for every n.) On the other hand, approximate versions exist for finite sequences, see Section 3. It is also worth noting that the proof is far from easy. An elementary proof was published by Kirsch [4] in 2019, but this happened 88 years after the original paper.
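The following sketch illustrates Theorem 2 with an arbitrarily chosen mixing distribution (Beta(2, 5)): generating the sequence as a Bernoulli i.i.d.-mix, the relative frequency of ones converges to the drawn value of η, which varies from run to run rather than being a fixed constant.

```python
import random

def de_finetti_sample(n: int):
    """Draw eta from a mixing distribution (here Beta(2, 5), an arbitrary choice),
    then flip n i.i.d. coins with heads probability eta."""
    eta = random.betavariate(2, 5)
    flips = [1 if random.random() < eta else 0 for _ in range(n)]
    return eta, flips

if __name__ == "__main__":
    random.seed(1)
    for _ in range(3):
        eta, flips = de_finetti_sample(100_000)
        print(f"eta = {eta:.4f}, empirical fraction of ones = {sum(flips) / len(flips):.4f}")
```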
1.1. Philosophical Interpretation of de Finetti’s Theorem
The concept of probability has several philosophical interpretations (for a survey, see [5]). An appealing aspect of de Finetti’s Theorem is that it builds a bridge between two major conflicting interpretations: the frequentist and the subjective interpretations. (The latter is also known as the Bayesian interpretation.) Let us briefly explain these through the simple experiment of coin flipping.
The frequentist interpretation of probability says that there exists a real number $p \in [0,1]$, such that if we keep flipping the same coin independently, then the relative frequency of heads converges to p, and this value gives us the probability of heads. In this sense, the probability is an objective quantity, even when we may not know its exact value. Most researchers accept this interpretation, since it is in good agreement with experiments, and provides a common-sense, testable concept. In some cases, however, it does not work so well, such as when we deal with a one-time event that cannot be indefinitely repeated. For example, it is hard to assign a precise meaning to a statement like “candidate X will win the election tomorrow with probability 52%”.
In contrast, the subjective (Bayesian) interpretation denies the objective existence of probability. Rather, it says that the concept only expresses one’s subjective expectation that a certain event happens. For example, there is no reason to a priori assume that if among the first 100 coin flips we observed, say, 53 heads, then similar behavior has to be expected among the next 100 flips. If we still assume that the order in which the coin flips are recorded does not matter, then what we see is just an exchangeable sequence of binary values, but possibly no convergence to a constant.
Which interpretation is right? The one that de Finetti favored (see [6]), against the majority view, was the subjective interpretation. Nevertheless, his theorem provides a nice bridge between the two interpretations, in the following way. Consider two experiments:
(1) Bayesian: Just keep flipping a coin and record the results. Do not presuppose the existence of a probability to which the relative frequency of heads converges, but still assume that the order of recording does not matter. Then what we obtain is an exchangeable sequence, but no specific objective probability.
(2) Frequentist: Assume that an objective probability p of heads does exist, but we do not know its value exactly, so we consider it as a random quantity, drawn from some probability distribution μ. Then the experiment will be this: draw p from the distribution μ, fix it, and then keep flipping a coin that has probability p of heads, on the basis that this probability p objectively exists.
Now de Finetti’s Theorem states that the results of the above two experiments are indistinguishable: an exchangeable sequence of coin flips cannot be distinguished from a mix of Bernoulli sequences. In this sense, the conflicting interpretations do not lead to conflicting experimental results, so the theorem indeed builds a bridge between the subjective and frequentist views. This is a reassuring reconciliation between the conflicting interpretations!
We need to note, however, that the above argument is only guaranteed to work if the sequence of coin flips is infinite. As already mentioned earlier, for the finite case, Theorem 1 does not always hold. This anomaly with finite sequences may be explained by the fact that the frequentist probability, as the limiting value of the relative frequency, is only meaningful if we can consider infinite sequences.
2. Generalizations/Modifications of de Finetti’s Theorem
As the original theorem was published almost a century ago and has been regarded as a fundamental result since then, it is not surprising that numerous extensions, generalizations, and modifications were obtained over the decades. Below we briefly survey some of the typical clusters of the development.
2.1. Extending the Result to More General Random Variables
The original theorem, published in 1931, refers to binary random variables. In 1937, de Finetti himself showed [6] that it also holds for real-valued random variables. This was extended to much more general cases in 1955 by Hewitt and Savage [7]. They allow random variables that take values from a variety of very general spaces; one of the most general examples is a Borel measurable space (Borel space, for short; see the definition and explanation of related concepts in Appendix A). This class of spaces includes all cases that are likely to be encountered in applications.
To formally present the generalization of de Finetti’s Theorem in a form similar to Theorem 1, let S denote the space from which the random variables take their values, and let $\mathcal{P}(S)$ denote the family of all probability distributions on S.
Theorem 3 (Hewitt–Savage Theorem). Let $X_1, X_2, \ldots$ be an infinite sequence of S-valued exchangeable random variables, where S is a Borel measurable space. Then there exists a probability measure μ on $\mathcal{P}(S)$, such that for every positive integer n and for any measurable sets $A_1, \ldots, A_n \subseteq S$ the following holds:
$$\Pr(X_1 \in A_1, \ldots, X_n \in A_n) = \int_{\mathcal{P}(S)} \pi(A_1) \cdots \pi(A_n)\, d\mu(\pi),$$
where π denotes a random probability distribution from $\mathcal{P}(S)$, drawn according to μ. Furthermore, the mixing measure μ is uniquely determined.

Less formally, we can state it this way: an infinite sequence of S-valued exchangeable random variables is an S-valued i.i.d.-mix, whenever S is a Borel measurable space. Note that here μ selects a random distribution π from $\mathcal{P}(S)$, which may be a complex object, while in Theorem 1 this random distribution is determined by a single real parameter $p \in [0,1]$.
An interesting related result (which was actually published in the same paper [7]) is called the Hewitt–Savage 0–1 Law. Let $X = (X_1, X_2, \ldots)$ be an infinite i.i.d. sequence. Further, let E be an event that is determined by X. We say that E is symmetric (with respect to X) if the occurrence or non-occurrence of E is not influenced by permuting any finite initial segment of X. For example, the event that “X falls in a given set A infinitely many times” is clearly symmetric, as it is not influenced by permuting any finite initial segment of X. The Hewitt–Savage 0–1 Law says that any such symmetric event has probability either 0 or 1.
As an illustration for Theorem 3, consider the following example:
Example 2. Assume we put a number of balls into an urn. Each ball has a color, one of t possible colors (the number of colors may even be infinite). Let $n_i$ be the initial number of balls of color i in the urn, where the $n_i$ are arbitrary fixed non-negative integers. Consider now the following process: draw a ball randomly from the urn, and let its color be denoted by $C_1$. Then put back two balls of the same color $C_1$ in the urn. Keep repeating this experiment by always drawing a ball randomly from the urn, and each time putting back two balls of the same color as that of the currently drawn ball. Let $C = (C_1, C_2, C_3, \ldots)$ denote the random sequence of obtained colors. This is called a t-color Pólya urn scheme, and it is known that the generated sequence C is exchangeable, see Hill, Lane, and Sudderth [8]. Then, by Theorem 3, the sequence can be represented as an i.i.d.-mix. Note that just from the definition of the urn process this fact may be far from obvious.
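A short simulation sketch of the 2-color special case of Example 2 (the initial counts are arbitrary): the limiting fraction of black draws differs from run to run, which is exactly the mixing phenomenon guaranteed by Theorem 3.

```python
import random

def polya_urn(initial_counts: dict, draws: int) -> list:
    """t-color Polya urn: repeatedly draw a ball uniformly at random and put back
    two balls of the drawn color; return the sequence of drawn colors."""
    urn = dict(initial_counts)
    colors = []
    for _ in range(draws):
        r = random.randrange(sum(urn.values()))
        for color, count in urn.items():   # locate the drawn ball
            if r < count:
                urn[color] += 1            # put back two balls of this color
                colors.append(color)
                break
            r -= count
    return colors

if __name__ == "__main__":
    random.seed(2)
    for _ in range(3):
        seq = polya_urn({"black": 1, "white": 1}, 20_000)
        print("fraction of black draws in this run:", seq.count("black") / len(seq))
```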
In view of the generality of Theorem 3, one may push further: does the result hold for completely arbitrary random variables? After all, it does not seem self-evident why they need to take their values from a Borel measurable space. The most general target space that is allowed for random variables is a general measurable space, see the definition in Appendix A. One may ask: does the theorem remain true for random variables that take their values from any measurable space?
Interestingly, the answer is no. Dubins and Freedman [9] proved that Theorem 3 does not remain true in this completely general case, so some structural restrictions are indeed needed, although these restrictions are highly unlikely to hinder any application. A challenging question, however, still remains: how can we explain the need for such restrictions in the context of the philosophical interpretation outlined in Section 1.1? Let us just mention, without elaborating on details, that restricting the general measurable space to a Borel measurable space is a topological restriction (for background on topological spaces we refer to the literature, see, e.g., Willard [10]). At the same time, topology can be viewed as a (strongly) abstracted version of geometry. In this sense, we can say that de Finetti-style theorems require that, no matter how remotely, we still have to somehow relate to the real world: at least some very abstract version of geometry is indispensable.
2.2. Modifying the Exchangeability Requirement
There are numerous results that prove some variant of de Finetti’s theorem (and of its more general version, the Hewitt–Savage theorem) for random structures that satisfy some symmetry requirement similar to exchangeability. For a survey see Aldous [11] and Kallenberg [12]. Here we present two characteristic examples.
Partially exchangeable arrays. Let $X = (X_{ij})_{i,j \ge 1}$ be a doubly infinite array (infinite matrix) of random variables, taking values from a Borel measurable space S. Let $R_i$ and $C_j$ denote the i-th row and j-th column of X, respectively. We say that X is row-exchangeable if the sequence $(R_1, R_2, \ldots)$ is exchangeable. Similarly, X is column-exchangeable if $(C_1, C_2, \ldots)$ is exchangeable. Finally, X is row and column exchangeable (RCE) if X is both row-exchangeable and column-exchangeable. Observe that RCE is a weaker requirement than demanding that all entries of X, listed as a single sequence, form an exchangeable sequence. For RCE arrays, Aldous [13] proved a characterization, which contains the de Finetti (in fact, the Hewitt–Savage) theorem as a special case. We use the notation $X \overset{d}{=} Y$ to express that the random variables X, Y have the same distribution.
Theorem 4 (Row and column exchangeable (RCE) arrays). If X is an RCE array, then there exist independent random variables $\alpha$, $(\xi_i)_{i \ge 1}$, $(\zeta_j)_{j \ge 1}$, $(\lambda_{ij})_{i,j \ge 1}$, such that all of them are uniformly distributed on $[0,1]$, and there exists a measurable function (see the definition in Appendix A) $f: [0,1]^4 \to S$, such that $X \overset{d}{=} Y$, where $Y = (Y_{ij})$ with $Y_{ij} = f(\alpha, \xi_i, \zeta_j, \lambda_{ij})$.

When the array X consists of a single row or a single column, we get a special case, which is equivalent to the Hewitt–Savage theorem (and includes de Finetti’s theorem):
Theorem 5. An infinite S-valued sequence Z is exchangeable if and only if there exists a measurable function $f: [0,1]^2 \to S$ and i.i.d. random variables $\alpha, \xi_1, \xi_2, \ldots$, all uniformly distributed on $[0,1]$, such that $Z \overset{d}{=} (f(\alpha, \xi_1), f(\alpha, \xi_2), \ldots)$.
Note that for any fixed value $\alpha = a$, the sequence $f(a, \xi_1), f(a, \xi_2), \ldots$ is i.i.d., so with a random α we indeed obtain an i.i.d. mix. Comparing with the formulations of Theorems 1 and 3, observe that here the potentially complicated mixing measure is replaced by the simple random variable α, which is uniform on $[0,1]$. Of course, the potential complexity of the mixing measure does not simply “evaporate,” it is just shifted to the function f.
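As a minimal illustration of Theorem 5, one concrete (illustrative) choice of the function f reproduces the Bernoulli i.i.d.-mix of Theorem 1: take $f(a, u) = 1$ if $u < a$ and 0 otherwise, so that α plays the role of the mixed Bernoulli parameter.

```python
import random

def f(a: float, u: float) -> int:
    """One concrete measurable function f for Theorem 5: with alpha = a,
    each entry f(a, u) is Bernoulli(a) when u is uniform on [0, 1]."""
    return 1 if u < a else 0

def exchangeable_via_f(n: int) -> list:
    """Z_i = f(alpha, xi_i), with alpha, xi_1, ..., xi_n independent uniform on [0, 1]."""
    alpha = random.random()
    return [f(alpha, random.random()) for _ in range(n)]

if __name__ == "__main__":
    random.seed(3)
    # Conditionally on alpha the entries are i.i.d. Bernoulli(alpha); across runs
    # the fraction of ones fluctuates according to the random alpha.
    for _ in range(3):
        z = exchangeable_via_f(50_000)
        print("fraction of ones:", sum(z) / len(z))
```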
de Finetti’s theorem for Markov chains. Diaconis and Freedman [14] created a version of de Finetti’s Theorem for Markov chains. The mixture of Markov chains can be interpreted similarly to other sequences, as a Markov chain is just a special sequence of random variables.
To elaborate the conditions, consider random variables taking values in a countable state space I. Let us call two fixed sequences $a = (a_0, a_1, \ldots, a_n)$ and $b = (b_0, b_1, \ldots, b_n)$ in I equivalent if $a_0 = b_0$, and for every pair of states $i, j \in I$, the number of $i \to j$ transitions occurring in a is the same as the number of $i \to j$ transitions occurring in b.
Let $X = (X_0, X_1, X_2, \ldots)$ be a sequence of random variables over I. We say that X is recurrent if for any starting state $X_0 = i$, the sequence returns to i infinitely many times, with probability 1. Then the Markov chain version of de Finetti’s Theorem, proved by Diaconis and Freedman [14], can be formulated as follows:
Theorem 6 (Markov chain version of de Finetti’s theorem). Let $X = (X_0, X_1, X_2, \ldots)$ be a recurrent sequence of random variables over a countable state space I. If
$$\Pr(X_0 = a_0, X_1 = a_1, \ldots, X_n = a_n) = \Pr(X_0 = b_0, X_1 = b_1, \ldots, X_n = b_n)$$
for any n and for any equivalent sequences $a = (a_0, \ldots, a_n)$, $b = (b_0, \ldots, b_n)$, then X is a mixture of Markov chains. Furthermore, the mixing measure is uniquely determined.

3. The Case of Finite Exchangeable Sequences
As already mentioned in Section 1, de Finetti’s Theorem does not necessarily hold for finite sequences. There exist, however, related results for the finite case, as well. Below we briefly review three fundamental theorems.
3.1. Approximating a Finite Exchangeable Sequence by an i.i.d. Mixture
Even though de Finetti’s Theorem may fail for finite sequences, intuition suggests that a finite, but very long, sequence will likely behave similarly to an infinite one. This intuition is made precise by a result of Diaconis and Freedman [15]. It provides a sharp bound for the distance between the joint distribution of exchangeable random variables $X_1, \ldots, X_k$ and the closest mixture of i.i.d. random variables. The distance is measured by the total variation distance. The total variation distance between distributions P and Q on the same space is defined as
$$\|P - Q\| = 2 \sup_{A} |P(A) - Q(A)|,$$
where the supremum is taken over all measurable sets A.
Theorem 7. Let $X_1, \ldots, X_n$ be an exchangeable sequence of random variables, taking values in an arbitrary measurable space S, and let $k \le n$. Then the total variation distance between the distribution of $(X_1, \ldots, X_k)$ and that of the closest mixture of i.i.d. random variables is at most $2|S|k/n$ if S is finite, and at most $k(k-1)/n$ if S is infinite.
Observe that the distance bound depends on both k and n, and it becomes small only if k is small compared to n. Thus, if the sequence to be approximated is long (i.e., k is large), then this fact in itself does not bring the sequence close to an i.i.d.-mix. In order to claim such a closeness, we need that $X_1, \ldots, X_k$ is extendable to a significantly longer exchangeable sequence $X_1, \ldots, X_n$.
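To get a feel for Theorem 7, the following sketch (an illustration with arbitrary parameters) takes the first k draws without replacement from an urn of n balls, which form an exchangeable sequence extendable to length n, and computes their variation distance from a single i.i.d. Bernoulli product. Since that product is one particular i.i.d. mixture, the computed value is an upper bound on the distance to the closest mixture; it visibly shrinks as n grows with k fixed.

```python
from itertools import product

def without_replacement_law(n: int, K: int, k: int) -> dict:
    """Joint law of the first k draws (1 = black, 0 = white) from an urn with
    n balls, K of them black, drawn without replacement."""
    law = {}
    for bits in product((0, 1), repeat=k):
        p, black_left, total_left = 1.0, K, n
        for b in bits:
            p *= (black_left if b else total_left - black_left) / total_left
            black_left -= b
            total_left -= 1
        law[bits] = p
    return law

def iid_law(p: float, k: int) -> dict:
    """Joint law of k i.i.d. Bernoulli(p) variables."""
    return {bits: p ** sum(bits) * (1 - p) ** (k - sum(bits))
            for bits in product((0, 1), repeat=k)}

def variation_distance(P: dict, Q: dict) -> float:
    """||P - Q|| = 2 * sup_A |P(A) - Q(A)| = sum over atoms of |P(x) - Q(x)|."""
    return sum(abs(P[x] - Q[x]) for x in P)

if __name__ == "__main__":
    k = 4
    for n in (8, 40, 200, 1000):
        K = n // 2
        d = variation_distance(without_replacement_law(n, K, k), iid_law(K / n, k))
        # 2|S|k/n is the finite-alphabet bound of Theorem 7 (|S| = 2 here), for reference.
        print(f"n = {n:4d}: distance to one i.i.d. law = {d:.5f}, theorem bound = {2 * 2 * k / n:.5f}")
```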
3.2. Exact Expression of a Finite Exchangeable Sequence by a Signed Mixture
Another interesting result on the finite case is due to Kerns and Székely [16]. They proved that any finite exchangeable sequence, taking values from an arbitrary measurable space, can always be expressed exactly as an i.i.d. mix. This would not hold in the original setting. However, the twist that Kerns and Székely introduced is that the mixing measure is a so-called signed measure. The latter means that it may also take negative values. In the notation, recall that $\mathcal{P}(S)$ denotes the set of all probability distributions on S.
Theorem 8. Let $X_1, \ldots, X_n$ be a sequence of exchangeable random variables, taking values from an arbitrary measurable space S. Then there exists a signed measure ν on $\mathcal{P}(S)$, such that for any measurable sets $A_1, \ldots, A_n \subseteq S$ the following holds:
$$\Pr(X_1 \in A_1, \ldots, X_n \in A_n) = \int_{\mathcal{P}(S)} \pi(A_1) \cdots \pi(A_n)\, d\nu(\pi), \qquad (3)$$
where π runs over $\mathcal{P}(S)$, integrated according to the signed measure ν. Here the mixing measure does not have to be unique, in contrast to the traditional versions of the theorem.
A harder question, however, is this: comparing with the traditional versions, the right-hand side of (3) means that π is drawn according to a signed measure from $\mathcal{P}(S)$. What does this mean from the probability interpretation point of view?

Formally, the integral on the right-hand side of (3) is just a mixture (linear combination, with weights summing to 1) of the values $\pi(A_1) \cdots \pi(A_n)$, where π runs over $\mathcal{P}(S)$. The only deviation from the classical case is that some π can be weighted with negative weights. Thus, formally, everything is in order: we simply deal with a mixture of probability distributions, allowing negative weights, but insisting that at the end a non-negative function must result. However, if we want to interpret it as a mixture of random sequences, rather than just probability distributions, then the signed measure amounts to a selection via a probability distribution incorporating negative probabilities. What does it mean? How can we pick a value of a random variable with negative probability? To answer this meaningfully is not easy. There are some attempts in the literature to interpret negative probabilities; for a short introduction see Székely [17]. Nevertheless, it appears that negative probabilities are neither widely accepted in probability theory, nor usually adopted in applications, apart from isolated attempts. Therefore, we rather stay with the formal interpretation: “drawing” π according to a signed measure in the integral just means taking a mixture (linear combination) of probability distributions with weights summing to 1, also allowing negative weights, while insisting that the result is still a non-negative probability distribution. This makes Theorem 8 formally correct, avoiding troubles with interpretation. Nevertheless, the interpretation still remains a challenging philosophical problem, given that Theorem 8 has been the only version to date that provides an exact expression of the distribution of any finite exchangeable sequence as a mix of i.i.d. distributions, but it does not correspond to a mixture of random sequences in the usual (convex) sense.
3.3. Exact Finite Representation as a Mixture of Urn Sequences
Another interesting result about finite exchangeable sequences is that they can be expressed as a mixture (in the usual convex sense) of so-called urn sequences, explained below. This seems to provide the most direct analogue of de Finetti’s Theorem for the finite case, yet the result has not received the attention it deserves, as pointed out by Carlier, Friesecke, and Vögler [18]. The idea goes back to de Finetti [19]. Later it was used by several authors at various levels of generality as a proof technique, rather than a target result in itself, see, e.g., Kerns and Székely [16], so it did not become a “named” theorem. Finally, the most general version, which applies to arbitrary random variables, appears in the book of Kallenberg (see [12], Proposition 1.8).
Urn sequences constitute a simple model of generating random sequences. As the most basic version, imagine an urn in which we place N balls, and each ball has a certain color. We randomly draw the balls from the urn one by one and observe the obtained random sequence of colors. We can distinguish two basic variants of the process: after drawing a ball, it is put back in the urn (urn process with replacement), or it is not put back (urn process without replacement).
Consider the following simple example. Let us put N balls in the urn: K black and N − K white balls. If we randomly draw them with replacement, then an i.i.d. sequence is obtained, in which each entry is black with probability K/N, and white with probability 1 − K/N. The length of the sequence can be arbitrary (even infinite), as the drawing can continue indefinitely.
On the other hand, if we do this experiment without replacement, then the maximum length of the obtained sequence is N, since after that we run out of balls. The probability that among the first n draws (without replacement) there are precisely x black balls follows the hypergeometric distribution (see, e.g., Rice [20]), given by
$$\Pr(x \text{ black balls among the first } n \text{ draws}) = \frac{\binom{K}{x} \binom{N-K}{n-x}}{\binom{N}{n}}. \qquad (4)$$
For our purposes the important variant is the case without replacement and with n = N, that is, all the balls are drawn out of the urn. Then the obtained sequence has length N. Note that it cannot be i.i.d., as it contains precisely K black and N − K white balls. However, otherwise it is completely random, so the distribution of the obtained sequence is the same as it would be in an i.i.d. sequence, conditioned on including precisely K black balls.
The number of colors can be more than two, even infinite. The obtained random sequence is still similar to an i.i.d. one, with the difference that each color occurs in it a fixed number of times. We can then formulate the general definition of the urn sequences of interest to us. For a short description, let us first introduce some notations. The set $\{1, \ldots, n\}$ is abbreviated by $[n]$, and the family of all permutations of $[n]$ is denoted by $\mathcal{S}_n$. If a permutation $\sigma \in \mathcal{S}_n$ is applied to a sequence $x = (x_1, \ldots, x_n)$, then the resulting sequence is denoted by $\sigma(x)$, which is an abbreviation of $(x_{\sigma^{-1}(1)}, \ldots, x_{\sigma^{-1}(n)})$, i.e., the entry $x_i$ is moved to position $\sigma(i)$. We also use the following naming convention:
Convention 1 (Uniform random permutation). Let $\sigma \in \mathcal{S}_n$ be a permutation. We say that σ is a uniform random permutation if it is chosen from the uniform distribution over $\mathcal{S}_n$.
Now the urn sequences of interest to us are defined as follows:
Definition 2 (Urn sequence). Let $x = (x_1, \ldots, x_n)$ be a deterministic sequence, each entry taking values from a set S, and let σ be a uniform random permutation. Then $\sigma(x)$ is called an urn sequence.
Here each $x_i$ represents the color of a ball, allowing repeated occurrences. The meaning of $\sigma(x)$ is simply that we list the balls in random order. Note that due to the random permutation, we obtain a random sequence, even though x is deterministic. Now we can state the result, after Kallenberg [12], but using our own notations:
Theorem 9 (Urn representation). Let $X = (X_1, \ldots, X_n)$ be a finite exchangeable sequence of random variables, each taking values in a measurable space S. Then X can be represented as a mixture of urn sequences. Formally, there exists a probability measure μ on $S^n$ (mixing measure), such that for any measurable $A \subseteq S^n$
$$\Pr(X \in A) = \int_{S^n} \Pr(\sigma(x) \in A)\, d\mu(x) \qquad (5)$$
holds, where σ is a uniform random permutation, drawn independently for every x.

Observe that Theorem 9 shows a direct analogy to Theorem 1, replacing the i.i.d. Bernoulli sequence with a finite urn sequence, giving us the finite-length analogue of de Finetti’s Theorem. In the special case when $S = \{0, 1\}$, using the hypergeometric distribution formula (4), we can specialize it to the following result, resembling the conditional independence form of de Finetti’s Theorem, given in Theorem 2:

Theorem 10. Let $X_1, \ldots, X_N$ be a finite sequence of $\{0,1\}$-valued exchangeable random variables. Then there exists a random variable η, taking values in $\{0, 1, \ldots, N\}$, such that for every $K \in \{0, 1, \ldots, N\}$ and every $x_1, \ldots, x_N \in \{0,1\}$ with $x_1 + \cdots + x_N = K$, the following holds:
$$\Pr(X_1 = x_1, \ldots, X_N = x_N \mid \eta = K) = \binom{N}{K}^{-1}.$$
Furthermore, η is given as the number of ones in the sequence, representing the empirical distribution: $\eta = X_1 + \cdots + X_N$.

Theorem 10 says: given that the length-N exchangeable sequence contains K ones, it behaves precisely as an urn sequence that contains K ones. This also provides a simple algorithm to generate the exchangeable sequence: first pick η from its distribution, and whenever $\eta = K$, generate an urn sequence with K ones and N − K zeros. The distribution of η (the mixing measure) can be obtained as the distribution of the number of ones in the original sequence. The sequence generated this way will be statistically indistinguishable from the original.
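The generation procedure described above can be sketched in a few lines (the interface, with the mixing measure passed as a list of weights, is an illustrative choice):

```python
import random

def sample_exchangeable(eta_weights: list) -> list:
    """Generate a {0,1}-valued exchangeable sequence of length N = len(eta_weights) - 1.
    eta_weights[K] is the probability that the sequence contains exactly K ones
    (the mixing measure, i.e., the distribution of eta in Theorem 10)."""
    N = len(eta_weights) - 1
    K = random.choices(range(N + 1), weights=eta_weights, k=1)[0]   # draw eta = K
    seq = [1] * K + [0] * (N - K)    # urn contents: K ones and N - K zeros
    random.shuffle(seq)              # uniform random permutation -> urn sequence
    return seq

if __name__ == "__main__":
    random.seed(4)
    # Mixing measure over {0, ..., 5}: eta is 0 or 5 with probability 1/2 each, which
    # reproduces the "all zeros or all ones" mixture from the introduction.
    print([sample_exchangeable([0.5, 0, 0, 0, 0, 0.5]) for _ in range(5)])
```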
4. A Decomposition Theorem for General Finite Sequences
In all known versions of de Finetti’s Theorem, a sequence with rather special properties is represented as a mixture of simpler sequences. In most cases the target sequence is exchangeable. Although there are some exceptions (some of them are listed in Section 2.2), the target sequence is always assumed to satisfy some rather strong symmetry requirement.
Now we raise the question: is it possible to eliminate all symmetry requirements? That is, can we express an arbitrary sequence of random variables as a mixture of simpler ones? Surprisingly, the answer is in the affirmative, with one condition: our method can only handle finite sequences. The reason is that we use uniform random permutations, and they do not exist over an infinite sequence. On the other hand, we deal with completely arbitrary random variables, taking values in any measurable space.
With a general target sequence, the component sequences clearly cannot be restricted to i.i.d. sequences, or to urn sequences, since these are all exchangeable, and the mixture of exchangeable sequences cannot create non-exchangeable ones. Then which class of sequences should the components be taken from? We introduce a class that we call elementary sequences, which will do the job. In the definition we use the notation $\sigma_2 \circ \sigma_1$ for the superposition (composition) of two permutations, with the meaning $(\sigma_2 \circ \sigma_1)(i) = \sigma_2(\sigma_1(i))$.
Definition 3 (Elementary sequence). Let $x = (x_1, \ldots, x_n)$ be a deterministic sequence, each entry taking values from a set S, and let $\sigma_1, \sigma_2$ be uniform random permutations, possibly not independent of each other. Then $(\sigma_2 \circ \sigma_1)(x)$ is called an elementary sequence.
Observe the similarity to Definition 2. The only difference is that in an elementary sequence the permutation is the composition of two uniform random permutations, while in the urn sequence we only use a single uniform random permutation. Of course, if $\sigma_1$ and $\sigma_2$ in Definition 3 are independent of each other, then their superposition remains a uniform random permutation, giving back Definition 2. On the other hand, if they are not independent, then we may get a sequence that is not an urn sequence.
Let us note that not every sequence is elementary. This follows from the observation that if we fix any $a \in S$, then the number of times a occurs in an elementary sequence is constant (which may be 0). The reason is that permutations do not change the number of occurrences of a, so its occurrence count remains the same as in x, which is constant. On the other hand, in an arbitrary random sequence, this occurrence count is typically random, not constant, so elementary sequences form only a small special subset of all random sequences. In fact, as we prove later in Lemma 3, the constant occurrence counts actually characterize elementary sequences. To formalize this, let us introduce the following definition:
Definition 4 (Occurrence count). Let $X = (X_1, \ldots, X_n)$ be a sequence and $a \in S$. Then $N_a(X)$ denotes the number of times a occurs in X, that is,
$$N_a(X) = |\{ i \in [n] : X_i = a \}|.$$

The next definition deals with the case when a fixed total ordering ≺ is given on S.
Definition 5 (Ordered sub-domain, order respecting measure). The subset of $S^n$ containing all ordered n-entry sequences with respect to some total ordering ≺ on S is called the ordered sub-domain of $S^n$, denoted by $S^n_{\prec}$:
$$S^n_{\prec} = \{ (x_1, \ldots, x_n) \in S^n : x_1 \preceq x_2 \preceq \cdots \preceq x_n \}.$$
A probability measure μ on $S^n$ is called order respecting (for the ordering ≺) if $\mu(A) = 0$ holds for every measurable set $A \subseteq S^n$ whenever $A \cap S^n_{\prec} = \emptyset$.
Now we are ready to state and prove our representation theorem for arbitrary finite sequences of random variables.
Theorem 11. Let $X = (X_1, \ldots, X_n)$ be an arbitrary finite sequence of random variables, each taking values in a measurable space S. Then X can be represented as a mixture of elementary sequences. Formally, there exists a probability measure μ on $S^n$ (mixing measure), such that for any measurable $A \subseteq S^n$
$$\Pr(X \in A) = \int_{S^n} \Pr\bigl((\sigma_2 \circ \sigma_1)(x) \in A\bigr)\, d\mu(x) \qquad (6)$$
holds, where $\sigma_1, \sigma_2$ are uniform random permutations, possibly not independent of each other, and the pair is drawn independently for each x. Furthermore, the claim remains true if the mixing measure μ is restricted to be order respecting for a total ordering ≺ on S (see Definition 5). In that case, the representation is given by the formula
$$\Pr(X \in A) = \int_{S^n_{\prec}} \Pr\bigl((\sigma_2 \circ \sigma_1)(x) \in A\bigr)\, d\mu(x). \qquad (7)$$

For the proof we need two lemmas. The first is a folklore result, stating that if an arbitrary sequence (deterministic or random, with any distribution) is subjected to a uniform random permutation, independent of the sequence, then the sequence becomes exchangeable. We state it below as a lemma for further reference.
Lemma 1. Applying an independent uniform random permutation to an arbitrary finite sequence gives an exchangeable sequence.
Proof. Let $X = (X_1, \ldots, X_n)$ be an arbitrary finite sequence, taking values from a set S. Let $Y = (Y_1, \ldots, Y_n)$ be the sequence obtained by applying an independent uniform random permutation σ to X, i.e., $Y = \sigma(X)$. Pick k distinct indices $i_1, \ldots, i_k$. Then for any values $x_1, \ldots, x_k \in S$ we can write
$$\Pr(Y_{i_1} = x_1, \ldots, Y_{i_k} = x_k) = \frac{1}{n(n-1)\cdots(n-k+1)} \sum_{(j_1, \ldots, j_k)} \Pr(X_{j_1} = x_1, \ldots, X_{j_k} = x_k), \qquad (8)$$
where the sum runs over all sequences of k distinct indices $j_1, \ldots, j_k$. The reason is that under the independent uniform random permutation any sequence of k distinct indices has an equal chance to take the place of $i_1, \ldots, i_k$, and there are $n(n-1)\cdots(n-k+1)$ such sequences. As a result, the average obtained on the right-hand side of (8) does not depend on the specific $i_1, \ldots, i_k$ values, only on k. Therefore, the joint distribution of $Y_{i_1}, \ldots, Y_{i_k}$ depends only on k, but not on $i_1, \ldots, i_k$. This is precisely one of the equivalent definitions of an exchangeable sequence. □
The second lemma expresses the fact that a uniform random permutation can “swallow” any other permutation, making their composition also a uniform random permutation.
Lemma 2. Let σ and γ be two permutations of $[n]$, such that
σ is a uniform random permutation
γ is an arbitrary permutation (deterministic or random, possibly non-uniform, and possibly dependent on the sequence to which it is applied)
σ and γ are independent of each other.
Then $\sigma \circ \gamma$ is a uniform random permutation.
Proof. Let x be the sequence (deterministic or random) to which the permutation $\sigma \circ \gamma$ is applied. Fix an index $k \in [n]$, and let $J = \gamma(k)$ be the index to which γ maps the index k. Note that J may possibly be random, non-uniform, and dependent on x. Let us express the probability that $\sigma \circ \gamma$ maps k into a fixed index i:
$$\Pr\bigl((\sigma \circ \gamma)(k) = i\bigr) = \sum_{s} \sum_{j=1}^{n} \Pr\bigl(\sigma = s,\; J = j\bigr)\, \Pr\bigl(s(j) = i \mid \sigma = s,\; J = j\bigr), \qquad (9)$$
where the summation runs over all fixed permutations s of $[n]$. Observe that $\Pr(\sigma = s) = 1/n!$, as σ is uniform and s is fixed. Furthermore, from the independence of σ and γ (and hence of σ and J), we obtain
$$\Pr(\sigma = s,\; J = j) = \frac{1}{n!}\, \Pr(J = j).$$
In the above expression, the event $\{s(j) = i\}$ involves only fixed values, so it is not random; it happens either with probability 1 or 0, depending solely on whether $s(j) = i$ or not. As such, it is independent of the condition $\{\sigma = s, J = j\}$, so we have $\Pr(s(j) = i \mid \sigma = s, J = j) = \mathbf{1}\{s(j) = i\}$, whenever the conditional probability is defined, i.e., whenever $\Pr(\sigma = s, J = j) > 0$. If $\Pr(\sigma = s, J = j) = 0$, then the conditional probability is undefined, but in this case the term cannot contribute to the sum, being multiplied by 0. Thus, we can continue (9) as
$$\Pr\bigl((\sigma \circ \gamma)(k) = i\bigr) = \frac{1}{n!} \sum_{j=1}^{n} \Pr(J = j) \sum_{s} \mathbf{1}\{s(j) = i\}. \qquad (10)$$
Here the sum $\sum_{s} \mathbf{1}\{s(j) = i\}$ is the number of permutations that map a fixed j into a fixed i. The number of such permutations is $(n-1)!$, as the image of j is fixed at i, and any permutation is allowed on the rest. This yields
$$\Pr\bigl((\sigma \circ \gamma)(k) = i\bigr) = \frac{(n-1)!}{n!} \sum_{j=1}^{n} \Pr(J = j) = \frac{1}{n}.$$
Thus, we obtain $\Pr\bigl((\sigma \circ \gamma)(k) = i\bigr) = 1/n$, which means that the position to which k is mapped by $\sigma \circ \gamma$ is uniformly distributed over $[n]$, no matter how γ was selected, and how it depended on x. This holds for every k. The same computation, applied to the joint event that $\sigma \circ \gamma$ maps $1, \ldots, n$ into any prescribed sequence of distinct positions, gives probability $1/n!$, making $\sigma \circ \gamma$ a uniform random permutation. □
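A quick empirical check of Lemma 2 (an illustrative simulation): γ is chosen to depend on the underlying sequence in a decidedly non-uniform way, yet composing it with an independent uniform σ makes every one of the n! possible composed permutations appear with roughly equal frequency.

```python
import random
from collections import Counter
from itertools import permutations

def compose(sigma, gamma):
    """(sigma o gamma)(i) = sigma[gamma[i]]; permutations are tuples over 0..n-1."""
    return tuple(sigma[gamma[i]] for i in range(len(sigma)))

def data_dependent_permutation(x):
    """A gamma derived from sorting x: deterministic given x, far from uniform."""
    return tuple(sorted(range(len(x)), key=lambda i: x[i]))

if __name__ == "__main__":
    random.seed(5)
    n, trials = 3, 60_000
    counts = Counter()
    for _ in range(trials):
        x = [random.random() for _ in range(n)]        # the underlying sequence
        gamma = data_dependent_permutation(x)          # depends on x, non-uniform
        sigma = list(range(n)); random.shuffle(sigma)  # uniform, independent of x and gamma
        counts[compose(tuple(sigma), gamma)] += 1
    for perm in permutations(range(n)):
        print(perm, round(counts[perm] / trials, 4))   # each frequency should be near 1/6
```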
Before turning to the proof of Theorem 11, let us point out that a consequence of the above lemma is interesting in its own right:
Corollary 1. Any permutation (deterministic or random) can be represented as the composition of two uniform random permutations. Formally, let γ be an arbitrary permutation, deterministic or random; if random, then drawn from an arbitrary distribution. Then there exist two uniform random permutations $\sigma_1, \sigma_2$ (possibly not independent of each other), such that $\gamma = \sigma_2 \circ \sigma_1$.
Proof. Let π be a uniform random permutation, independent of γ. Then by Lemma 2, the permutation $\pi \circ \gamma$ becomes a uniform random permutation; set $\sigma_1 = \pi \circ \gamma$. Set further $\sigma_2 = \pi^{-1}$, which is also a uniform random permutation. Further, let id denote the identity permutation that keeps everything in place. Then we can write
$$\sigma_2 \circ \sigma_1 = \pi^{-1} \circ \pi \circ \gamma = \mathrm{id} \circ \gamma = \gamma,$$
yielding $\gamma = \sigma_2 \circ \sigma_1$. As $\sigma_1, \sigma_2$ are both uniform random permutations (possibly not independent of each other), this proves the claim. □
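The constructive step in the proof of Corollary 1 can be mirrored directly in code (an illustrative sketch): given a fixed γ, set $\sigma_1 = \pi \circ \gamma$ and $\sigma_2 = \pi^{-1}$ for an independent uniform π; the composition always recovers γ, while each factor on its own is uniformly distributed.

```python
import random
from collections import Counter

def compose(a, b):
    """(a o b)(i) = a[b[i]]."""
    return tuple(a[b[i]] for i in range(len(a)))

def inverse(p):
    inv = [0] * len(p)
    for i, j in enumerate(p):
        inv[j] = i
    return tuple(inv)

def decompose(gamma):
    """Corollary 1: gamma = sigma2 o sigma1 with sigma1 = pi o gamma and sigma2 = pi^(-1),
    where pi is a uniform random permutation independent of gamma."""
    pi = list(range(len(gamma)))
    random.shuffle(pi)
    pi = tuple(pi)
    return inverse(pi), compose(pi, gamma)    # (sigma2, sigma1)

if __name__ == "__main__":
    random.seed(6)
    gamma = (2, 0, 1)                         # an arbitrary fixed permutation of {0, 1, 2}
    freq1, freq2 = Counter(), Counter()
    for _ in range(60_000):
        sigma2, sigma1 = decompose(gamma)
        assert compose(sigma2, sigma1) == gamma      # the composition always recovers gamma
        freq1[sigma1] += 1
        freq2[sigma2] += 1
    print("sigma1 frequencies:", {p: round(c / 60_000, 3) for p, c in sorted(freq1.items())})
    print("sigma2 frequencies:", {p: round(c / 60_000, 3) for p, c in sorted(freq2.items())})
```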
The above corollary also provides an opportunity to characterize elementary sequences:
Lemma 3 (Characterization of elementary sequences). A sequence $X = (X_1, \ldots, X_n)$, taking values in a set S, is elementary if and only if for any $a \in S$ the occurrence count $N_a(X)$ (see Definition 4) is constant.
Proof. If X is elementary, then, by definition, it can be represented as $X = (\sigma_2 \circ \sigma_1)(x)$, where $\sigma_1, \sigma_2$ are uniform random permutations (possibly not independent), and x is a deterministic sequence. Since no permutation can change occurrence counts, and $N_a(x)$ is constant, due to x being deterministic, therefore $N_a(X)$ remains constant for any $a \in S$.
Conversely, assume $N_a(X)$ is constant for any $a \in S$. Let $a_1, \ldots, a_m$ be the distinct elements for which $N_{a_i}(X) > 0$. Clearly, $m \le n$, since there can be at most n distinct elements in X, and the identity of these elements is fixed, due to the constant value of $N_a(X)$ for any $a \in S$. Let y be the deterministic sequence that contains $a_1, \ldots, a_m$, each one repeated $N_{a_i}(X)$ times. That is,
$$y = (\underbrace{a_1, \ldots, a_1}_{N_{a_1}(X)}, \underbrace{a_2, \ldots, a_2}_{N_{a_2}(X)}, \ldots, \underbrace{a_m, \ldots, a_m}_{N_{a_m}(X)}).$$
Then we have $N_a(y) = N_a(X)$ for every $a \in S$. Thus, X and y contain the same elements, with the same multiplicities, just possibly in a different order. That is, X is a permutation of y, possibly a random permutation, which may depend on X itself. Let γ be the permutation that implements $X = \gamma(y)$. Then by Corollary 1, the permutation γ can be represented as $\gamma = \sigma_2 \circ \sigma_1$, where $\sigma_1, \sigma_2$ are uniform random permutations, possibly not independent of each other, and they may also depend on X. However, no matter what dependencies exist, Corollary 1 provides that X can be represented as $X = (\sigma_2 \circ \sigma_1)(y)$ for some uniform random permutations $\sigma_1, \sigma_2$ and a deterministic sequence y, proving that X is indeed elementary. □
Proof of Theorem 11. Let us apply a uniform random permutation π to X, such that π and X are independent. This results in a new sequence $Y = \pi(X)$. By Lemma 1, the obtained Y is an exchangeable sequence. Then by Theorem 9 we have that Y can be represented as a mixture of urn sequences. That is, there exists a probability measure μ on $S^n$, such that for any measurable $A \subseteq S^n$
$$\Pr(Y \in A) = \int_{S^n} \Pr(\sigma(x) \in A)\, d\mu(x) \qquad (11)$$
holds, where σ is a uniform random permutation, drawn independently for every x. This representation means that Y can be produced by drawing x from the mixing measure μ, drawing a uniform random permutation σ, and then outputting $\sigma(x)$.
Now, instead of outputting $\sigma(x)$, let us first permute it by $\pi^{-1}$. Thus, we output $\pi^{-1}(\sigma(x)) = (\pi^{-1} \circ \sigma)(x)$. Observe that if π is a uniform random permutation, then so is $\pi^{-1}$, which we denote by $\sigma_2$. This makes the resulting $(\sigma_2 \circ \sigma)(x)$ an elementary sequence. Applying $\sigma_2$ to the mixture means that each component sequence is permuted by $\sigma_2$. However, then the result of the mixing is also permuted by $\sigma_2$, since it does not matter whether the components are permuted first, and then one of them is selected, or the selection is made first and the result is permuted afterward with the same permutation.
Applying $\sigma_2 = \pi^{-1}$ in the above way, we obtain the sequence $\pi^{-1}(Y)$ as the result. Thus we can re-write (11) as
$$\Pr\bigl(\pi^{-1}(Y) \in A\bigr) = \int_{S^n} \Pr\bigl((\sigma_2 \circ \sigma)(x) \in A\bigr)\, d\mu(x). \qquad (12)$$
Now we observe that $\pi^{-1}(Y) = \pi^{-1}(\pi(X)) = X$. Then we can continue (12) as
$$\Pr(X \in A) = \int_{S^n} \Pr\bigl((\sigma_2 \circ \sigma)(x) \in A\bigr)\, d\mu(x),$$
which is precisely the formula (6) we wanted to prove, just using the notation σ instead of $\sigma_1$.
Consider now the case when the mixing measure is required to be order respecting for some ordering ≺ on S. For a sequence $x \in S^n$, let $\gamma_x$ be the permutation that orders x according to ≺, that is,
$$\gamma_x(x) = x^{\prec},$$
where $x^{\prec}$ denotes the ordered version of x. Let σ be a uniform random permutation, chosen independently of $\gamma_x$. Then σ and $\gamma_x$ satisfy the conditions of Lemma 2; therefore, by Lemma 2, $\sigma_1 = \sigma \circ \gamma_x$ is a uniform random permutation. Since $\sigma_1(x) = \sigma(\gamma_x(x)) = \sigma(x^{\prec})$, a component sequence built on x with the uniform random permutation $\sigma_1$ is the same as a component sequence built on the ordered version $x^{\prec}$ with the uniform random permutation σ. Therefore, in the already proven formula (6) we may replace every x by $x^{\prec}$, that is, replace μ by its push-forward under the map $x \mapsto x^{\prec}$, without changing the represented distribution; here $\sigma_1, \sigma_2$ are uniform random permutations, chosen independently for each x. The resulting mixing measure is order respecting (see Definition 5), so it is enough to restrict the integration to the set $S^n_{\prec}$, giving us the formula (7). This completes the proof. □
5. Application of de Finetti Style Theorems in Random Network Analysis
Large random networks, such as wireless ad hoc networks, are often described by various types of random graphs, primarily by geometric random graphs. A frequently used model is when each node of the network is represented as a random point in some planar domain, and two such nodes are connected by an edge (a network link) if they are within a given distance from each other. This basic model has many variants: various domains may occur, different probability distributions of the node positions within the domain may be used, a variety of distance metrics is possible, etc. Note that it falls in the category of static random graph models, which is our focus here, in contrast to evolving ones (for a survey of random graph models, see, e.g., Drobyshevskiy and Turdakov [21]). Let us now formalize what we mean by a general random graph model.
Definition 6 (Random graph models). Let $X = (X_1, X_2, \ldots)$ be an infinite sequence of random variables, each taking its values from a fixed domain S, which is an arbitrary measurable space. A random graph model over S is a function $\mathcal{G}$ that maps X into a sequence of graphs:
$$\mathcal{G}(X) = (G_1, G_2, G_3, \ldots).$$
If X is restricted to a subset $C \subseteq S^{\infty}$, then we talk about a conditional random graph model, denoted by $\mathcal{G}(X \mid C)$.
Note that even though the random graph model depends on the infinite sequence X, the individual graphs typically depend only on an initial segment of X; for example, $G_n$ may depend only on $X_1, \ldots, X_n$.
Regarding the condition C, a very simple variant is when $C = C_1 \times C_2 \times \cdots$, where $C_i \subseteq S$, and we independently restrict each $X_i$ to fall into $C_i$. Note, however, that C may be much more complicated, possibly not reducible to individual restrictions on each $X_i$.
A most frequently occurring case is when the points (the components of X) are selected from the same distribution independently, that is, they are i.i.d. The reason is that allowing dependencies makes the analysis too messy. To this end, let us define i.i.d.-based random graph models:
Definition 7 (i.i.d.-based random graph models). If the entries of X in Definition 6 are i.i.d. random variables, then we call $\mathcal{G}(X)$ an i.i.d.-based random graph model over S, and $\mathcal{G}(X \mid C)$ is called an i.i.d.-based conditional random graph model over S.
The most commonly used and analyzed static random graphs are easily seen to fall in the category of i.i.d.-based random graph models. Typical examples are Erdős–Rényi random graphs (when each edge is added independently with some probability p), different variants of geometric random graphs, random intersection graphs, and many others. On the other hand, sometimes the application provides natural reasons for considering dependent points, as shown by the following example.
Example 3. Consider a wireless ad hoc network. Let each node be a point drawn independently and uniformly from the unit square. Specify a transmission radius r, and connect two nodes whenever they are within distance r. (Note: r may depend on the number of nodes.) However, allow only those systems of points for which the arising graph has diameter (in terms of graph distance) of at most some value, which may again depend on n. The conditioning makes the points dependent. Nevertheless, the restriction is reasonable if we want the network to experience limited delays in end-to-end transmissions.
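The conditional model of Example 3 can be sketched with rejection sampling (an illustrative implementation using the networkx library; the parameter values are arbitrary): points are placed i.i.d. uniformly in the unit square, nodes within distance r are joined, and samples are rejected until the diameter condition holds, realizing the conditioning on a positive-probability event.

```python
import networkx as nx

def conditional_geometric_graph(n: int, r: float, max_diam: int, max_tries: int = 10_000):
    """Example 3 as rejection sampling: n i.i.d. uniform points in the unit square,
    edges between points within distance r, accepted only if the graph is connected
    and its diameter is at most max_diam (the conditioning event)."""
    for _ in range(max_tries):
        G = nx.random_geometric_graph(n, r)       # i.i.d. uniform node positions
        if nx.is_connected(G) and nx.diameter(G) <= max_diam:
            return G
    raise RuntimeError("condition too restrictive for these parameters")

if __name__ == "__main__":
    G = conditional_geometric_graph(n=100, r=0.2, max_diam=10)
    degrees = [d for _, d in G.degree()]
    print("nodes:", G.number_of_nodes(),
          "edges:", G.number_of_edges(),
          "average degree:", round(sum(degrees) / len(degrees), 2))
```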
This example (and many possible similar ones) shows that there can be good reasons to deviate from the standard i.i.d. assumption. On the other hand, most of the analysis results build on the i.i.d. assumption. How can we bridge this gap? Below we show an approach that is grounded in de Finetti style theorems, and provides a tool that can come in handy in the analysis of conditional random graph models.
Theorem 12. Let S be a Borel measurable space, and let $\mathcal{A}$ be a property of random graph models. Fix an i.i.d.-based random graph model $\mathcal{G}(X)$ over S, and assume that $\mathcal{G}(X)$ has property $\mathcal{A}$, regardless of the value of X, with probability 1. Let C represent a condition with $\Pr(X \in C) > 0$. Then $\mathcal{G}(X \mid C)$ also has property $\mathcal{A}$, regardless of the value of X, with probability 1.
Proof. Let Y be a random variable that has the conditional distribution of X, given C. That is, for every measurable set A,
$$\Pr(Y \in A) = \frac{\Pr(X \in A \cap C)}{\Pr(X \in C)}. \qquad (13)$$
Note that Y may not remain i.i.d. However, we show that Y is still exchangeable. Let σ be any permutation (of finitely many indices). Then we can write
$$\Pr\bigl(\sigma(Y) \in A\bigr) = \frac{\Pr\bigl(\sigma(X) \in A \cap C\bigr)}{\Pr(X \in C)}. \qquad (14)$$
Since X is i.i.d., we have $\sigma(X) \overset{d}{=} X$, i.e., they have the same distribution, making them statistically indistinguishable. This implies that for any measurable set B the equality $\Pr(\sigma(X) \in B) = \Pr(X \in B)$ holds. Using it in (14) with $B = A \cap C$, we get that $\Pr(\sigma(Y) \in A) = \Pr(Y \in A)$ for any permutation σ, which means that Y is exchangeable. Here we also used that $\Pr(X \in C) > 0$, so the denominator does not become 0.
Recall now that by the Hewitt–Savage Theorem, an infinite sequence Y of S-valued exchangeable random variables is an S-valued i.i.d.-mix, whenever S is a Borel measurable space. For each i.i.d. component X of this mix we can apply the function $\mathcal{G}$ to obtain a random graph model $\mathcal{G}(X)$. After taking the mixture, this results in $\mathcal{G}(Y)$. The reason is that applying the function $\mathcal{G}$ to each component sequence first and then selecting one of them must yield the same result as first selecting one of them (by the same mixing measure), and applying the function $\mathcal{G}$ to the selected sequence Y.
As a result of the above reasoning, we get that $\mathcal{G}(Y)$ and the mixture of the i.i.d.-based models $\mathcal{G}(X)$ have the same distribution. However, Y was chosen such that it has the conditional distribution of X, given C. Therefore, we have $\mathcal{G}(Y) \overset{d}{=} \mathcal{G}(X \mid C)$. Thus, we obtain that $\mathcal{G}(X \mid C)$ is also a mixture of i.i.d.-based random graph models. Since, by assumption, $\mathcal{G}(X)$ has property $\mathcal{A}$, regardless of the value of X, with probability 1, therefore $\mathcal{G}(X \mid C)$ also has property $\mathcal{A}$, regardless of the value of X, with probability 1. The reason we need that $\mathcal{A}$ does not depend on X (with probability 1) is that when we mix various realizations of X, they should all come with the same property $\mathcal{A}$; otherwise, a mixture of properties would result. This completes the proof. □
The above result may sound very abstract, so let us illustrate it with two examples.
Example 4. It follows from the results of Faragó [22] that every i.i.d.-based geometric random graph has the following property: if the graph is asymptotically connected (that is, the probability of being connected approaches 1 as the number of nodes tends to infinity), then the average degree must tend to infinity.

Let us choose this as property $\mathcal{A}$. One may ask: does this property remain valid in conditional models over the same geometric domain? We may want to know this in more sophisticated models, such as the one presented in Example 3. Observe that the above property $\mathcal{A}$ satisfies the condition that it holds regardless of the value of X (with probability 1), where X represents the random points on which the geometric random graph model $\mathcal{G}(X)$ is built. Therefore, by Theorem 12, the property remains valid for $\mathcal{G}(X \mid C)$ as well, no matter how tricky and complicated a condition is introduced, as long as the condition holds with positive probability (even when this probability is very small). Note that this cuts through a lot of complexity that may otherwise arise if we want to prove the same claim directly from the specifics of the model.
Example 5. Consider the variant of Erdős–Rényi random graphs where each edge is added independently with some probability p. These random graphs are often denoted by $G(n, p)$, where n is the number of vertices. For constant p, they fit in our general random graph model concept, choosing X now as an i.i.d. sequence of Bernoulli random variables, representing the edge indicators. Let $\kappa(G)$, $\lambda(G)$, and $\delta(G)$ denote the vertex connectivity, edge connectivity, and minimum degree of $G = G(n, p)$, respectively. All these graph parameters become random variables in a random graph. A nice (and quite non-trivial) result from the theory of random graphs (see Bollobás [23]) is that for any p, the following holds:
$$\lim_{n \to \infty} \Pr\bigl(\kappa(G) = \lambda(G) = \delta(G)\bigr) = 1. \qquad (15)$$

The intuitive meaning of (15) is that asymptotically both types of connectivity parameters are determined solely by the minimum degree. The minimum degree always provides a trivial upper bound both for $\kappa(G)$ and $\lambda(G)$, and in a random graph asymptotically they both indeed reach this bound, with probability 1.
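The relationship (15) is easy to probe empirically; the following small sketch uses the networkx library (the parameter values are arbitrary), and the fraction of sampled $G(n, p)$ graphs satisfying $\kappa = \lambda = \delta$ is seen to approach 1 as n grows.

```python
import networkx as nx

def fraction_with_equal_connectivities(n: int, p: float, samples: int) -> float:
    """Fraction of sampled G(n, p) graphs with kappa(G) = lambda(G) = delta(G)."""
    hits = 0
    for _ in range(samples):
        G = nx.gnp_random_graph(n, p)
        delta = min(d for _, d in G.degree())
        if nx.node_connectivity(G) == nx.edge_connectivity(G) == delta:
            hits += 1
    return hits / samples

if __name__ == "__main__":
    for n in (20, 50, 100):
        print(f"n = {n:3d}: fraction with kappa = lambda = delta =",
              fraction_with_equal_connectivities(n, p=0.3, samples=20))
```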
Now we may ask: what happens if we introduce some condition? Let $\mathcal{B}_n$ be a set of graphs that represents a condition that $G(n, p)$ satisfies with at least some constant probability $q > 0$, for every n. That is, $\Pr(G(n, p) \in \mathcal{B}_n) \ge q$ for every n. Observe that (15) holds regardless of the value of X, because (15) is valid for every p. Therefore, it can be used as property $\mathcal{A}$ in Theorem 12. Thus, if we condition on $G(n, p)$ falling in $\mathcal{B}_n$, the relationship (15) still remains true, by Theorem 12.
Note that if the condition $\mathcal{B}_n$ is complicated, it may be very hard to prove directly from the model that (15) remains true under the condition. Fortunately, our result cuts through this complexity. It is also interesting to note that in this case X is a Bernoulli sequence, so for this case it would be enough to use the original de Finetti Theorem in the proof, rather than the more powerful Hewitt–Savage Theorem.