Understanding Hierarchical Processes

Hierarchical stochastic processes, such as the hierarchical Dirichlet process, hold an important position as a modelling tool in statistical machine learning, and are even used in deep neural networks. They allow, for instance, networks of probability vectors to be used in general statistical modelling, intrinsically supporting information sharing through the network. This paper presents a general theory of hierarchical stochastic processes and illustrates its use on the gamma process and the generalised gamma process. In general, most of the convenient properties of hierarchical Dirichlet processes extend to the broader family. The main construction for this corresponds to estimating the moments of an infinitely divisible distribution based on its cumulants. Various equivalences and relationships can then be applied to networks of hierarchical processes. Examples given demonstrate the duplication in non-parametric research, and presents plots of the Pitman–Yor distribution.


Introduction
The hierarchical Pitman-Yor process (HPYP) was first presented as a solution to ngram language models [1] where it mimics the behavior of the Kneser-Ney algorithm [2]. It is an extension of the hierarchical Dirichlet process (HDP) [3]. The HPYP has since been used in a wide variety of ways, including for previously state-of-the-art and competitive algorithms for topic models [4] and text compression [5]. The HDP has been used for previously state-of-the-art and competitive algorithms for tweet clustering [6] and document segmentation [7]. Many more novel and creative uses of these processes exist, for instance, hierarchical topic models [8]. More general reviews are given by Teh and Jordan [9] and Jordan [10]. The gamma process can also be used hierarchically [11] and provides an alternative scheme for handling the HDP. The notion of hierarchical models fits in well with the computational approach to statistical modelling adopted in the machine learning community.
However, what exactly is the HPYP? A key concept for understanding the HDP and the HPYP is the notion of a discrete base probability measure. The base measure is a source measure for sampling points of the HDP or HPYP. These are discrete just when they have a countable number of possible points (the set on which the measure is based is countable). When finite, the base probability measure is just a probability distribution, usually represented as a vector. However, in non-parametric modelling, we seek to model structured objects for which the dimension may be unknown ahead of time: the number of clusters for points, the depth of a tree, the number of atoms in a molecule, the number of words in a sentence. Allowing the base measure to be countably infinite is a useful abstraction in this situation. Moreover, being able to generate an infinite discrete base probability measure provides us with the ability to model prior distributions for our structured objects without fixing dimensions ahead of time. The above models for text and clustering give examples.
It is known that the hierarchical Dirichlet process, when applied to a finite discrete base distribution, is just a Dirichlet distribution. Indeed, this property is the axiomatic definition of the process [12]. So, applications of and inference with the HDP are really just using hierarchical Dirichlet distributions, requiring no non-parametric theory to describe, although algorithms may be using non-parametric methods.
So, there is a clear concept of what the HDP model is. What is the corresponding result for the hierarchical Pitman-Yor process? For all the algorithms using the HPYP, it would be nice to know what their actual model is! Teh first referred to the hierarchical version of the PYP as the Pitman-Yor distribution [in talks accompanying] [1], saying it has "no known analytical form". Moreover, is there a more general theory of hierarchical processes, and why does this case (the HDP) come out so neatly? These questions for hierarchical processes have been addressed in recent theory [13][14][15].
Note the Bayesian theory of non-hierarchical processes is extensive. A comprehensive analysis of different processes is developed by James [16], in the more general context of the generalised Indian buffet process [17]. The general posterior analysis of their normalised versions, including the DP, is developed by James et al. [18]. A useful review of theory and a slice sampler for the case of the normalised generalised gamma is given in Lomeli et al. [19]. A study of some of the processes considered here can also be found in Zhou and Carin [11], focusing on gamma processes and their relationships.
However, these treatments are grounded in extensive probability theory and assume the reader is already familiar with Poisson point processes, Lévy processes, subordinators and other advanced areas [20,21]. Some of these details are not strictly necessary for the understanding of the basic ideas. This paper presents the relevant background theory in a self-contained way to develop models for hierarchical processes generally based on the theory of subordinators and completely random measures [20,21]. The theory for the most part reinterprets results from the Bayesian non-parametric and statistical communities [18,22,23], though some related ideas can also be found in machine learning [11]. However, the answers to the questions about the nature of the HPYP and general application to hierarchical processes, networks of hierarchical processes and generalised Chinese restaurants are not well-known outside the Bayesian non-parametric community, so we present them here in a unified manner.

Background Theory
A formal theory of Poisson point processes (PPP), Lévy processes and completely random measures (CRMs) with treatment of measure theory is needed to rigorously cover this area [20,21]. Here, an informal summary is given, though trying to maintain a degree of precision, for instance keeping adequate rigor in the statement of results.

Completely Random Measures
A CRM is a discrete measure µ(dx) on a space X constructed as where the x i ∈ X are called atoms and are assumed distinct, the λ i ∈ R + called jumps, and the background constant C 0 is zero in our use. This means that µ(x i ) = λ i , evaluated at atoms, and µ(x i ) = 0 otherwise. The (λ i , x i ) are mutually independent random variables, and a finite number of the x i can also be fixed. These conditions ensure the measure is completely random, that is for This yields discrete probability distributions on X represented as µ(x)/µ(X ). These are referred to as normalised random measures with independent increments (NRMIs) [18], a concept developed by Kingman [24], and are a general class of discrete probability distributions.

Poisson Point Process
A Poisson point process (PPP) is a stochastic process whose samples represent sets of independent events on a measurable space X . For a sample, the count of events in A ⊆ X is denoted N(A) ∈ N . Events are considered to be a countable subset of X , only significant if X is not countable, for instance the real line. The PPP has complete independence, so for . The sample is specified by a rate ρ(dx) which is any measure on X . In PPP theory, the rate is referred to as a Lévy measure. The PPP has the defining property that N(A) ∼ Poisson(ρ(A)), and samples can be generated from this by working with an ever finer partition of the space X .
A special class of PPP can be used as a family of priors for a CRM. Assume a PPP has rate ρ(dλ)µ(dx) for λ ∈ R + and x ∈ X . This is called homogeneous because the terms in λ and x are independent [18]. In the case considered here, the µ(dx) is a measure on X called a base measure, and the rate ρ(dλ) has the condition ∞ 0 min(1, λ)ρ(dλ) < ∞ to make everything work neatly [20], as follows: This condition is equivalent to ∞ 0 min( , λ)ρ(dλ) < ∞ for any 0 < < ∞. As a consequence, ρ([ , ∞)) is bounded, meaning there will be a finite number of points with λ > in the sample of the PPP (within a finitely measured subset of X ) and 0 λρ(dλ) is bounded, meaning the sum of the λ's less the in the sample of the PPP (within a finitely measured subset of X ) will be finite even if there is an infinite number of them. Then, a sample from the PPP is a countable set of points which can be used to constuct a CRM.

Example Processes
Consider a number of standard PPPs used to construct CRMs [21]: the generalised (three-parameter) beta process [25], the generalised (three-parameter) gamma process [26] and the stable process. These have the forms given in Table 1, where M is a constant background rate. They are given without specifying a base measure on X , which could be given as a final parameter.
The Poisson process and the negative binomial process [11] are also included in Table 1. Both are used in the hierarchical context in Section 4.
The first three processes in Table 1 are widely used in various forms in the nonparametric Bayesian and machine learning communities. From a Bayesian perspective, they are best thought of as improper priors corresponding to the beta, gamma and gamma distributions, respectively. This analysis is presented later in Section 3.4.
NRMIs can be created by normalising CRMs. These are sometimes generated directly from distributions consisting of a normalised discrete set of weights as probabilities. So, generating the λ according to a generalised (or three parameter) gamma process, GGP(M, α, β), and then normalising yields, what is called a normalised generalised gamma process (NGG). The normalised generalised gamma process (NGG) is constructed analogously to the Dirichlet process, which normalises the gamma process. They represent the main examples of NRMIs. These NRMIs, however, are not paired with base measures when forming a discrete process on X , rather they need to be paired with base distributions Pr(x) since only one point is generated per sample. Denote the NGG process as NGG(α, β, M) or NGG(α, β, M, h(·)), where α, β, M are as described for the GGP, line 3 of Table 1, and h(·) is a base distribution. The DP is effectively the case when α = 0.
The Pitman-Yor process itself was developed by Pitman and Yor [28], and a general scheme for developing related models is by Pitman [29], called Poisson-Kingman models. However, as to be shown, the hierarchical PYP is very different from the PYP, so this theory is not entirely relevant for the hierarchical case. Alternatively, in Pitman and Yor [28] ([Proposition 21]), it was shown that a PYP can be developed by marginalising out a parameter of the NGG as follows.
The result is presented rather indirectly in Pitman and has been re-expressed by several authors [23] ([Section 3.1.1]), [30] ([Corollary 1]), and leads to a different class of models to the Poisson-Kingman models called Poisson-gamma models [23].
Notice the lemma restricts the PYP to the case where the concentration is positive. More generally, PYPs can have concentration β > −α. When β = 0 and α > 0, then the PYP is formed from normalising a positive stable distribution.

Defining Processes Axiomatically
This section gathers together some definitions and theory in order to present a general class of processes built on CRMs that can be treated hierarchically analogous to the Dirichlet process.

Subordinators
A simple useful case of these PPPs has the domain X being R + , the positive real line, and is constant for X , so the rate is ρ(dλ) for λ, x ∈ R + . For this, define a new process for our case C 0 = 0 given by the cumulative values, So, σ 0 = 0 and σ t increases in steps as each distinct x i is passed. This σ t corresponds to the class of so-called pure jump driftless subordinators, which are a kind of nondecreasing Lévy process, which in turn are processes with stationary independent increments [20].
The key relationship that underlies the general theory of these processes is that σ t is distributed according to a particular infinitely divisible non-negative distribution, explained in Theorem 1. Examples are given in Table 1. So, for instance, for the generalised gamma process with parameters (M, α, β), the total σ 1 = ∑ ∞ i=1 λ i δ x i ≤1 is distributed as a Tweedie distribution with parameters (α, M 1/α , β).
The basic connection is given as follows, a special case of the Lévy-Khintchine formula for subordinators. This uses the Laplace exponent of a 1D random variable y defined as the function (of u) E [e −uy ], which is related to the characteristic function. Theorem 1. Consider σ t defined as previously by a PPP with rate ρ(dλ) for λ, x ∈ R + and ρ(dλ) satisfying ∞ 0 min(1, λ)ρ(dλ) < ∞. The Laplace exponent of σ t is given by where ψ(u) = (0,∞) (1 − e uλ )ρ(dλ). This form means that σ t has an infinitely divisible nonnegative distribution. The t here can be referred to as the parameter for divisibility, occurring in any infinitely divisible distribution.
Thus, given a rate ρ(dλ) defining a particular σ t , one can derive its Laplace exponent ψ(u) and then infer the distribution on σ t (where analytically possible). Note the scaling term M in Table 1 plays the role of t.
Some instances of this pairing, an infinitely divisible non-negative distribution with a corresponding rate are given in the last two columns of Table 1. Note that distributions corresponding to the generalised beta process are not well-known. Other distributions that could be included in the table are the inverse beta distribution (the beta distribution is not infinitely divisible but its inverse is), which includes the Pareto and F-distributions, and the generalised inverse gamma distribution [31].

Axiomatic Definitions
To extend Theorem 1 to broader classes of base distributions on general domains X , not just the positive real line with constant measure used in subordinators, one can give an axiomatic definition of a process based on an infinitely divisible non-negative distribution:

1.
The derived process is a CRM, 2.
The process behaves like the given infinitely divisible distribution on subsets of X .

Definition 1.
(Axiomatic definition of a CRM process) Consider an infinitely divisible nonnegative distribution G(µ), where µ is the parameter for divisibility. Further assume its Laplace exponent has zero drift. Given a measurable space X , positive intensity M and measure h(dx) on X , consider a stochastic process denoted GP(M, h(·)) induced by G(µ) as follows. X ∼ GP(M, h(·)) yields a CRM on X such that 1.
For A ⊆ X , X(A) ∼ G(M h(A)).
The first condition implies that the measures are CRMs as per Equation (1). The second condition implies one can construct the discrete measures iteratively, on an ever finer, nested sequence of partitions using the distribution G(). Alternatively, one can use the Lévy-Khintchine formula of Theorem 1 to show the existence of a corresponding rate yielding a CRM with rate M h(d x)ρ(d λ) which must then satisfy the conditions. Note that the Dirichlet process can be defined axiomatically [12], akin to Definition 1 with the Dirichlet distribution used instead of the gamma distribution, and base probability distribution used instead of a base measure. This axiomatic construction generalises for any infinitely divisible non-negative distribution as follows: (Axiomatic definition of an NRMI process) Consider an infinitely divisible non-negative distribution G(µ), where µ is the parameter for divisibility. Further assume its Laplace exponent has zero drift. Consider as well the distribution on probability vectors induced by generating K values ζ k ∼ G(µ k ) and normalising to obtain Denote this distribution by NG K ( µ), where µ is the vector of K value µ k , given a measurable space X , positive intensity M and probability distribution h(dx) on X . A process denoted NGP(M, h(·)), developed from G(µ), is defined as follows. It is a stochastic process whose sample is a probability measure on X such that if C ∼ NGP(M, h(·)) then for any finite partition A 1 , . . . , A K of X , and count N > 0, (C(A 1 ), . . . , C(A K )) ∼ multinomial(N, NG K (Mh(A 1 ), . . . , Mh(A K ))).

On the Tweedie Distribution
From Table 1, the marginal distribution for the generalised gamma process is the Tweedie distribution [32] with exponent α, or sometimes expressed as index p = 1 + 1 1−α which has p > 2 necessarily. For α = 0, the Tweedie distribution becomes a gamma distribution.
The Tweedie distribution with exponent 0 < α < 1 is formed from a positive stable distribution defined in terms of the stable distribution with characteristic exponent α, scale parameter s = M 1/α location zero and symmetry one [33]. This distribution, denoted as pstable(α, s), has the functional form [adding a scale to the standard formula of] [34] given by the remarkable formula which yields a simple ingenious sampling formula [34]. To obtain a Tweedie distribution, "exponentially tilt" the pstable(α, s), calculated by multiplying by e −βx and renormalising. The construction of exponentially tilting the distribution (see for instance Pitman [29]) gives the following: Here, the term e (sβ) α −βx is added to achieve normalisation.

Bayesian Analysis
A complete Bayesian analysis of CRMs and NRMIs has been developed by James [16] and James et al. [18], respectively, in the non-hierarchical context. This models the standard framework in which hierarchical DPs or hierarchical PYPs are used, but also applies to the Indian buffet process [17]. This is informally developed below so that their theoretical results can be subsequently used. By Bayesian analysis, the following is meant: one has an infinitely divisible distribution suitable for use with Theorem 1. One samples a CRM from this with unknown parameters of rates λ and atoms x i . Now, hierarchically sample sets of atoms from this CRM using a PPP. Each hierarchical sample from the CRM is a discrete set A ⊆ X , and multiple samples are drawn. Then, the task is to estimate the parameters of the parent CRM.
A CRM is represented in the form where the x i are distinct and is generated according to a homogeneous PPP with rate ρ(dλ)ω(dx) where ρ(dλ) is a rate satisfying the conditions of Theorem 1. One then takes J samples from this according to a PPP, so n j ∼ PPP(µ(·)) for j = 1, . . . , J. Each sample will be a finite subset of the atoms, some possibly occurring multiple times. For representational purposes, post hoc reorder the atoms of µ(x) so that only the first I have non-zero counts. So, for I < i ≤ ∞, none of the samples n j contain x i . The count of atom x i in sample j is represented as n j,i , so the condition n 1:J,i = 0 means that at least one of the J samples contains an atom x i .
The following informal analysis is offered as an explanation, but formal proofs are in James [16]. To make analysis feasible, we have to convert the rate ρ(dλ) to one with finite total measure. James [16] ingeniously and elegantly presents this by viewing the posterior for µ i after seeing the evidence of having at least one non-zero value in the J values, so n 1:J,i = 0. For the particular sampling distribution of n j,i , in our case a Poisson(λ i ), which has a term in λ i so the posterior rate Pr( n 1:J,i = 0 | λ i )ρ(dλ i ) obtains finite total measure. Denote this total by Ψ J = Pr( n 1:J,i = 0 | λ i )ρ(dλ i ). Then, working entirely with finite PPPs, one can compute the marginal. First, we generate the number of non-zero atoms I (for the given sample count J) by a Poisson and then generate the vector of counts for each atom n 1:J,i , like so where the term I! can be removed if one considers that the atoms are ordered. With similar reasoning, one obtains: the posterior rate of λ i : for i ≤ I has rate Pr( n 1: the posterior rate of the remainder CRM: the total rate of the remainder CRM: T R = ∑ ∞ i=I+1 λ i as given by Theorem 1. The key formula for this kind of analysis is given in our context in Table 2. Table 2. Key formula for posterior analysis of CRMs, Ψ J = Pr( n 1:J = 0 | λ)ρ(dλ), and the distribution on the remainder The first line, the beP-BP case, is the three parameter beta process with Bernoulli data, which is the three parameter Indian buffet process. The second line is the gamma process with Poisson data. Note the data marginals Pr( n 1:J,i | λ)ρ(dλ) in our context can be obtained more directly, developed in Section 4.2, so formulas are not given.

Using Discrete Base Distributions
It is important to understand what happens when you use a discrete distribution as a base distribution to a CRM, since this is what happens when hierarchical constructions of these processes are made. Let the base measure on X have the form , and the CRM is constructed using a homogeneous PPP with rate ρ(dλ)ω(dx). What happens? This section considers various implications of this. Note different but more extensive treatment of this scenario for the results on moments, Section 4.2, and the generalised Chinese restaurant process, Section 4.4, is given by Camerlenghi et al. [14], Argiento et al. [15]. They also include example MCMC sampling algorithms.

General Results
Superposition of PPPs says to decompose a discrete CRM into a union of trivial PPPs each with rate in the form µ i ρ(λ)δ x i , so the X component is a delta function. The resultant CRM is also trivial and takes the form, using Definition 1, Λδ x i , where Λ is the total of the λ k generated using the rate µ i ρ(λ). This total is distributed as the corresponding marginal distribution for the subordinator with intensity parameter µ i , as per Theorem 1.
where the x i are distinct, and a homogeneous CRM is constructed by sampling using a PPP with rate ρ(dλ)µ(dx) on R + × X . Let Γ(t) be the marginal total distribution for the corresponding subordinator, where t is the parameter of divisability. Then, the CRM has the form where the random variable γ i ∼ Γ(µ i ), and the x i are inherited from µ(·).
The CRM µ(·) when used as a base distribution for a PPP is mapped element-wise to form a new CRM γ(·). So, no PPP modelling is required if you know the form of the element-wise distribution.
There are a number of very convenient and well-known properties of the Dirichlet that allow it to be used in hierarchical contexts. As it happens, most of these properties also hold for other NRMIs with discrete base measures, and some for CRMs, so these results are developed here. The first property is aggregation. This has that if (x 1 , , and this applies for a Dirichlet of any dimension. The second property is renormalisation and has that if (x 1 , . Both properties clearly follow from the fact that a Dirichlet is a normalised Gamma, and by analogy hold for NRMIs too.
The aggregation property can be used to form arbitrary groupings of the dimensions.

Definition 4. (Renormalisation property)
Consider a process that takes a measure as an input parameter and outputs a probability measure. The process has the renormalisation property if when The renormalisation property then yields probability measures on subsets of the discrete domain, so it can be used for incremental sampling. The results for the PYP can be developed using Lemma 1. The aggregation and renormalisation properties together mean that efficient size-biased samplers can be developed for NRMIs by sampling one dimension at a time according to a two-dimensional version of the NRMI, which is effectively the stick breaking construction (although, only a few explicit cases of this are known). Alternatively, one can sample the underlying CRM according to its corresponding infinitely divisible distribution.
A third property of the Dirichlet is neutrality, which applies in the context of renormalisation and requires that the part taken away is independent of the remainder: if Definition 5. (Neutrality property) Consider a process that outputs a finite discrete probability measure, and without loss of generality let ∑ I i=1 γ i δ x i (x) be a sample from the process where the x i are distinct. The process is completely neutral if there exists mutually independent non-negative variables λ 1 , . . . , λ I such that (γ 1 , . . . , γ K ) and λ 1 , It is known that the only distribution on finite probability vectors with complete neutrality is the Dirichlet distribution [35].

Results on Moments
Moments of CRMs are critical quantities for their posterior analysis [18,36] to be developed in Section 5 and seen in Section 3.4. The generalised version is derived by unfolding the recursion that relates the moments of a distribution to its cumulants. In the context of Lemma 2, where γ i ∼ Γ(µ i ), various moments such as E [γ n i | µ i ] and E [γ n i e −Uγ i | µ i ] can be computed recursively from the moments of the PPP rate ρ(dλ) [22] ([Section 1.3]) and its exponentially titled form. Note these moments compute the marginals one needs for multinomial and Poisson data, respectively, hence their importance.
In the theorem, the notation P n is used to represent all possible non-empty partitions of n items, the set {1, . . . , n}. As an example, P 3 is the set so it contains the partition {{1}, {2, 3}} as an element, for instance. Moreover, P n K ⊆ P n are all members are of size K, so |P n 1 | = |P n n | = 1 and |P 3 2 | = 3. The following Lemma is a corollary the major result by Pitman [22], and some related results appear in Camerlenghi et al. [14], as proven in Appendix A. Let κ n = ∞ 0 λ n ρ(d λ) be the n-th moment for rate ρ(λ), where it exists for n ∈ N + . Let ψ(t) be the Laplace exponent for the rate. Then, the n-th cumulant of γ i can be re-expressed as a moment of the original rate ρ(λ), and the n-th moment of γ i is computed recursively from it.
Note the recursion for T n K starts at T n 1 = κ n derived from the non-recursive form. Thus, if the Laplace exponent is known, one can usually compute the moments of the process and hence the cumulants and evidence terms for its corresponding infinitely divisible distribution. When one has Poisson data, required moments need to include an exponential term, as proven in Appendix B.

Corollary 1. (Adding an exponential term) Consider the context of Lemma 4 with rate ρ(λ).
To obtain exponentiated moments of the form E [γ n i e −Uγ i | µ i ], complete the following steps. 1.
Use rate e −Uλ ρ(λ), and the Laplace exponent is given by ψ(U + t) − ψ(U), so the corresponding moments are given by Obtain the corresponding T n K using Equation (8) with the κ n,U , denoted T n K,U .

3.
Consequently, The components from Lemma 4 for the processes in Table 1 are given in Table 3. These appear in various places in the broader statistical literature. The Laplace exponent is usually computed using integration by parts. The form S n s,α is the second order generalised Stirling number used in PYP inference [1,37], a generalized Stirling number of type (−1, −d, 0) [38]. It can be verified using its recursion [37] with Equation (8).
Note the general beta process has no simple analytic form for either ψ(t) or its marginal distribution. Fortunately, is is difficult to envisage a situation where it would be used hierarchically.

The Gamma Process
Let us consider the simple example of a gamma process, GP(M, β) and assume data yields Poisson likelihoods in the form ∏ I i=1 γ n i i e −Uγ i for dimensions i = 1, . . . , I in the context of Lemma 2. The marginal likelihood then, for the data = { n i , x i : i = 1, . . . , I } is given by where the expectation is taken with respect to γ(·) ∼ GP(M, β, µ(·)), which has γ i ∼ gamma (Mµ i , β). Note, in this case, the exact solution is known since the data marginals of the gamma distribution have a simple closed form, Consider, however, using Corollary 1. In this case, moments including e −Uγ i are found to be κ n = ∞ 0 γ n e −Uγ ρ(d γ) = M Γ(n) (U+β) n , and the Laplace exponent can be obtained using integration by parts as M log(1 + t/β). One can confirm that the corresponding index T n K = 1 (U+β) n S n K M K where S n K is an unsigned Stirling number of the first kind, an index that is found in collapsed versions of the CRP. Equation (8) yields the standard recurrence for it. So, by Equation (7), and adding back the term e −µ i ψ(U) = β U+β Mµ i as per Corollary 1, obtain for atom index i the moment The sum can be converted using a standard identity [37] ([Lemma 16]) to get back The sum in Equation (10) has an interpretation as a form of Chinese restaurant process for the dimension i. Each partition of the set {1, . . . , n i }, given by Π i ∈ P n i corresponds to a configuration of the n i data in |Π i | tables. For any table with participants C ∈ Π i , the probability of the table is Mµ i

Γ(|C|)
(U+β) |C| . The probability of this configuration Π i is then ∏ C∈Π i Mµ i Γ(|C|) (U+β) |C| . So, introducing the partition Π i or its size as an additional variable, The second form, the probability of all configurations of size K(= |Π i |), follows from Equation (8).

General Chinese Restaurant Processes
Motivated by the gamma process example just given, now construct a generalised CRP interpretation of the results in Section 4.2. The marginals have an interpretation as generalised versions of Chinese restaurants, including the more efficient collapsed versions [6], both developed in this section. This is intended to complement the comprehensive Bayesian analysis already developed for the non-hierarchical cases by [16,18].
The significance of the formula in Lemma 4 is that the sum in Equation (6) is over partitions Π of the n data points, and κ |C| represents the probability of generating a single element C of size |C| (in the partition Π) according to the rate ρ(λ). The sum in Equation (7) is now over partition sizes K, and T n K is the probability of generating a partition of K non-empty sets according to the rate ρ(λ).

Lemma 5. (General Chinese restaurant processes for CRMs)
Consider the posterior data marginal for γ(·), as in Corollary 2, where data is in the form of a Poisson likelihood with counts n i > 0 at each atom x i : One can treat Π i ∈ P n i as a latent variable, which represents the seating configuration for instances of the atom. Then, the data marginal using Π 1 , . . . , Π I takes the form: Moreover, for any j (including j > I), where the convention is used that Π j = ∅ for j > I (when there is no data). Alternatively, if K i , the number of tables for atom index i is handled as a latent variable, then the data marginal given table numbers takes the form: Equation (12) is related to the generalized Blackwell-MacQueen sampling scheme by James et al. [18] [Section 3.3]. The data marginals in Equations (11) and (13) have a simple Poisson likelihood in µ. Thus, a CRP interpretation of a Gamma process can be used for hierarchical inference with a Gamma distribution, as used by Zhou and Carin [11], for instance.
To develop a corresponding formula for NRMIs where they are generated by normalising a CRM, we use an ingenious technique for normalising a CRM within a posterior analysis from [18] The basic idea is to convert multinomial sampling into Poisson sampling (without normalisation) but require some post manipulation to derive the results. A generative variation of this goes as follows: 1.
For each multinomial n according to the unnormalised values λ, introduce a scale-free latent relative mass denoted U, with the scale-invariant improper prior d U U .

2.
Generate the data needed according to Poisson n i ∼ Poisson(Uλ i ) for i = 1, . . . , ∞, noting that n i = 0 for i > I.

3.
Then, the joint posterior on n, λ, U becomes quite concentrated for U and can be marginalised out.

4.
To correct the formulas, multiply the marginal by N = ∑ I i=1 n i to obtain a conversion to a multinomial.
To see that this indeed does what is required, one needs to verify the following identity.

Corollary 2. (General Chinese restaurant processes for NRMIs)
Consider the posterior data marginal for γ(·) as given in Lemma 2, where data is in the form of a multinomial likelihood with counts n i > 0 at each atom x i : , and let N = ∑ I i=1 n i be the total count. Let U ∼ gamma(N, ∑ ∞ i=1 γ i ). Then, the data marginal using Π 1 , . . . , Π I , similarly to Lemma 5, takes the form: Moreover, for any j (including j > I), Alternatively, if each K i is handled as a latent variable, then the data marginal given table numbers takes the form: Note, to complete the analysis, one needs to model the unseen parts of the processes. So, while it is assumed µ i for i = 1, . . . , I is being sampled or estimated, of µ i and γ i for i = I + 1, . . . , ∞ only a finite number, if any, can be sampled or estimated. Handling these is illustrated in Section 5 using a remainder term µ R = ∑ ∞ j=I+1 µ j . In general, then, there are two different levels of inference one can use when the marginal does not have a simple closed form and must instead be computed using the latent forms in Lemma 5 or Corollary 2:

Sampling over table configurations:
For the DP, this is exhibited by the standard CRP. One can see from Equations (6) and (12) that to resample which table a point belongs to, one would use the following proportionalities: Sampling over table sizes: For the PYP, this is demonstrated by table indicator sampling methods [6,39] and "direct" Gibbs sampling of Gasthaus and Teh [5], though subsequently not used because in their context they needed to constantly resample discount α. This is a collapsed sampler that instead samples K, the number of tables using Equation (7): This is only efficient when T n K can be tabulated. In the general case, this requires O(n 2 K) steps to follow using Equation (8) and O(nK) for cases such as the gamma process above where a simpler double recursion is available for T n K since they are generalised second-order Stirling numbers.

Variants of the Generalised Gamma Process
In this section, we develop both the CRM and NRMI variants of the generalised gamma process in the hierarchical context. Using the generalised gamma process in an NRMI yields an NGG or a PYP. When the NGG process and the PYP are supplied discrete base distributions as input, they behave analogously to the Dirichlet distribution, as illustrated with Lemma 3. In this discrete context, refer to the corresponding distributions as the NGG distribution and the Pitman-Yor distribution (PYD). Here analytical forms of the PY distribution are developed.

The Hierarchical Context
Consider an NRMI in the context of the base distribution µ(x), as before. Suppose multinomial type data is observed in the form of counts n k associated with the atoms x k of µ(·) for k = 1, . . . , K, with total count N = ∑ K k=1 n k , where all others are zero. The latent relative mass trick of James et al. [18] can be used to include U as a latent variable in the likelihood for the NGG and the PYD. Setting U = 1 and dividing by N in this case restores the posterior to the original Poisson version. The likelihood for a PYD also includes M (via Lemma 1). To express this, the remainder terms for both the base distribution and the CRM need to be represented.
The joint posterior for the NGG is now where the second line is obtained by applying the exponential tilting formula. Note, Lemma 2 means element-wise application of a distribution to the parameter vector µ inside µ(·). Forms for the PYD are obtained by adding the prior for M. For the normalised stable process, denoted NSP, one obtains Pr({λ k , n k , x k : k = 1, . . . , K}, U, λ R | NSP, M, α, N, µ(·)) From this, one can derive an integral formula for the PYD. Details are in Appendix C, and the result is original. Lemma 6. (Integral formula for the PY distribution) Let µ be a K-dimensional non-zero probability vector. Then, consider θ ∼ PYD(α, β, µ) for α > 0 and β ≥ 0. To express the probability of θ, introduce corresponding latent variables ν = (ν 1 , . . . , ν K ) ∈ [0, π] K : This can be readily evaluated using numerical integration for small K. Plots of the marginal for θ 1 for different parameter settings are given in Figures 1 and 2.
Due to the aggregation property of the PYD, these are representative marginals of the distribution for all dimensions. One can see the distributions becoming increasingly skewed as α increases.

Networks of Processes
The next natural question to consider is how the above results apply to networks of processes. Several general schemes have been developed for inference in more general networks [3,6,11,19,39,40]. General networks for HPYPs have been demonstrated to scale [4,6], in contrast to earlier Gibbs schemes [3,40], and arguably the HGP has advantages over the HDP [11]. This section is a review of related material with regards to hierarchical processes.

Identifiability
One important question is the issue of statistical identifiability, and an underlying issue here is whether the parametric structure admits a unique representation [41]. In our case, some simple classes of non-uniqueness are easily identified and avoided. For instance, in Poisson matrix factorisation, if the matrix entry x i,j ∼ Poisson ∑ K k=1 θ i,k φ k,j , then one can insist that the scale of one of the matrices Θ or Φ (comprising the entries θ i,k and φ k,j respectively) needs to be anchored somehow so that the scale of the Poisson parameter is uniquely determined by just the other one. So, the rows of one of the matrices should normalise.

Equivalences
Another issue is that in some cases, networks can be transformed from one case to another. For instance, Zhou and Carin [11] ([Section VB]) show that a Poisson gammagamma process construction is equivalent to a HDP construction with an independent Poisson-gamma on the total. Given that there are significant differences between the corresponding algorithms in this case, and there are many more in the literature, what other equivalences are there?
Normalising processes are conducted to convert a CRM into an NRMI and in some cases, independence between the parts yields an equivalence between the CRM form and the NRMI form augmented with a total. This has major implications to networks of such processes, presented in the following subsection, so the results are summarised here.
The first results are on discrete processes and are well-known, some for instance reproduced by Zhou and Carin [11]. • NBP as a Poisson-gamma mixture, • DCMP, given X ∈ N + , as a multinomial-Dirichlet mixture, • Conditioning the NBP, The conditioned versions of the PPP and NBP are used to decompose a likelihood into a total count and the vector of counts for atoms, given the total. Notice, while the conditioned version of the PPP yields a likelihood where the normalised measure ( µ) and its total (M) are independent, the same does not hold for the conditioned NBP.

Normalisation and Independence
On non-discrete processes, some independences apply. • For the generalised gamma process where 0 < α < 1, marginalising M , δ), where Λ and λ/Λ are independent. Moreover, the gamma process is the only case of such independence possible for pure NRMIs (this excludes the second case as it is marginalised).
That the gamma process is the only independence case for CRMs and their NRMIs is a result by Perman et al. [27] ([Corollary 2.3]). This is equivalent to the neutrality of the Dirichlet distribution, again the only distribution on probability vectors exhibiting neutrality. Neutrality and independence in this case can be shown to be equivalent properties. Independence in both these cases is also a consequence of the fact that so-called sized-biased sampling for the cases is independent of the total [27,29]. Independence properties such as in Lemma 8 do not hold generally, as indeed sized-biased sampling is not generally independent of the total.

Modelling LDA Using HDP
Consider models for the HDP variant of LDA [3], called HDP-LDA, which has been the subject of extensive research. There is a wide variation in the literature of how these are to be represented by graphical model and for statistical inference. Figure 3 shows two equivalent models for HDP-LDA. Figure 3a gives the original model as formulated by Teh et al. [3], and Figure 3b shows the modification used here. Authors sometimes use a more complicated formulation in terms of the underlying stick-breaking model.  In this problem, there are D documents and N d words in each document for d = 1, . . . , D, where the words w d,n are modelled with an admixture. The probabilistic specification for the corresponding models are given in Figure 4.  Figure 4a shows the probabilistic specification with full base distributions. While this follows the theory directly, it is a fairly large departure from the original representation of LDA. The reformulation in Figure 4b is a direct analogue of the original representation of LDA with two modifications essential for the treatment of a HDP, discussed below as the root node and the non-root node.
The root node of the DP hierarchy is represented as a GEM, which generates the infinite vector. In practice, this can be represented using size-biased sampling [27] formulations, and in the simplest and popular cases this corresponds to stick-breaking methods [42]. In implementation, however, there is no need for this as posterior formulations for the processes are well understood and require no implicit ordering constraints as in stick-breaking.
Non-root nodes down the hierarchy are represented using their underlying infinitely divisible non-negative distribution, in this case the Dirichlet. Note, however, this extends the standard definition of a Dirichlet as the input parameter is an infinite dimensional vector. In implementation, this is no impediment as only a finite amount of data is ever dealt with, although it does require modelling the current number of non-empty dimensions. This can be readily handled using standard parametric techniques [6] or by using truncation [4].
Note Figure 4a also uses a nested construction [43] with the expression DP(c α , Dirichlet(c β β)). Here a distribution, in this case a Dirichlet, but it could also be a GP, a DP or any other process, is used as the base distribution. This nesting construction is exactly what is needed to model matrix and tensor factorisation using hierarchical processes.
The nested, hierarchical equivalent to Figure 5b is as follows: The background word probabilities β are generated, then used as the base distribution for a PYD which then creates variants φ k as each atom of the gamma process G 0 . The mixture weights of G 0 correspond to α from Figure 5b. Variants of this, G d , are then created which modify the mixture weights α but leave the atoms constant. So, G d is a weighted sum of the original φ k , as is the case in Figure 5b. This is very elegant, but Figure 5b better exposes the detail needed for implementation.

Example Equivalences with Non-Parametric LDA
Consider extending HDP-LDA to include a Pitman-Yor distribution on the word side. This model, termed NP-LDA Buntine and Mishra [4], has been demonstrated using a truncated approximation. To bring out equivalences, the multinomial form of the topic model is given, and both are defined in Figure 5.
The gamma scale parameter on α 0 is one as it has an equivalent affect to c θ . So, it needs to be made a constant for identifiability. The equivalence is obtained by noting, from Lemmas 7 and 8, and many such results exist for the finite case, for instance by [44]. One can introduce a total rate for documents, Θ d , and model the count, N d , entirely independently: If the concentration parameters are estimated during learning, which is the common case, and recommended for topic models, then equivalence does not hold.
Experimental evidence [4] shows the following: • The topic side, θ d , is best not modelled using PYPs because experiments indicate that this gives no performance improvement. The non-Zipfian DPs work best, probably because of the smaller dimensions for number of topics. • Modelling the word side, φ k , using PYPs systematically outperforms HDP-LDA by a moderate margin in perplexity and yields more explainable topics because the overall "background" words are separately modelled using β.
Several model equivalences hold with regard to these kinds of models.
• The asymmetric-symmetric version of LDA [45] is a truncated version, not well understood in the community. • The asymmetric-asymmetric version of LDA, evaluated by Wallach et al. [45], is a truncated version of the model in Figure 5a. • Hierarchical Poisson factorisation [46] (HPF) is a non-parametric formulation of Poisson-gamma matrix factorisation using stick-breaking, and thus is equivalent to HDP-LDA above (when augmented with a gamma model of the total counts). • Robust (negative binomial) Poisson factorisation by Zhou et al. [47] is related (ignoring some issue with hyperparameters) to bursty topic models by Doyle and Elkan [48], which has a non-parametric extension in Buntine and Mishra [4].

Conclusions
Discrete base distributions make CRMs behave like vectors of infinitely divisible distributions, where application is element-wise without the non-parametrics. So, the gamma process becomes an element-wise gamma distribution, and the generalised gamma process becomes an element-wise Tweedie distribution. This was presented in Lemma 2, Lemma 4 and Corollary 1 and accompanying tables. Similarly, discrete base distributions make NRMIs and related processes behave as normalised versions of the above, sharing some properties of the DP such as renormalisation. So, the HPYP becomes the PY distribution, whose form was developed in Section 5.
If closed forms for analysis of the infinitely divisible distributions do not exist, the generalised versions of Chinese restaurant process (CRP) sampling, given in Equation (17), can be used instead, including versions of the more recent, efficient collapsed samplers for CRPs [6,39], given in Equation (18). Similar formulations also appear in [14,15]. Note many of these quantities, for instance in Table 1, can be derived from the Laplace exponent of the CRM, so a convenient form of the distribution is not needed. The CRPs come about when unfolding the recursion that relates the cumulants of a distribution to the moments of the distribution, a simple result in basic statistics. In this way, known CRPs for the gamma process follow a general scheme that also applies for the generalised gamma process, the generalised beta process and others.
While most of these results follow fairly simply from general results in the nonparametric Bayesian community, some have not yet seen use in the Bayesian machine learning community.
As a specific example of hierarchical distributions, it was also shown in Section 5 that the NGG and PY distributions, for the case where discount α > 0 and concentration β > 0, are behaving like normalised Tweedie variables, and for the case where concentration β = 0 like normalised positive stable variables. Moments of the Tweedie distribution show how the standard hierarchical likelihood for the HPYP used to date [5,37] can be directly derived from this framework without considering non-parametric theory. A novel integral expression for the PY distribution for discount α > 0 and concentration β ≥ 0 was also developed in Equation (21). This answers the question, "what is a hierarchical PYP"?
There are a rich number of variations of matrix factorisation and topic models that exist, for instance, see ( [11] Table 1) with seven different versions of negative binomial matrix factorisation, and the software used in Buntine and Mishra [4] has seven different non-parametric versions of LDA. This is ignoring the extensions of the model where the problem is changed significantly: document segmentation [7], hierarchical topics [8], supervised topic models, etc., and these extensions no doubt have their own rich variety of versions and equivalences. Moreover, some of the known equivalences between processes, when applied in the hierarchical case, yield relationships between models and algorithms in the machine learning community that deserve further investigation, discussed in Section 6. This is confounded by the fact that variants are evaluated using significantly different methodologies; compare, for instance, topic modelling evaluation with recommender systems evaluation. It is an open question as to what other significant equivalences exist in the literature, and the implications this has to the algorithms one can use. Proof. The major result is by Pitman [22]. Equation (5) is obtained by differentiating inside the integral of the Laplace exponent. Note that when they both exist, cumulants κ n and central moments c n are related by the following recursive formula c n = κ n + n−1 ∑ k=1 n − 1 k − 1 κ k c n−k .
One can expand this iteratively to remove the recursion on moments. While P n represents the set of all non-empty partitions of n objects, let S n denote the set of all vectors representing the sizes of non-empty partitions of n. So, if m ∈ S n then m l > 0 for l = 1, . . . , | m| and ∑ | m| l=1 m l = n. One obtain s the following: This is the same form of expression used in defining the generalised Stirling numbers [37] ([ Lemma 16]). The significance is that the sum is over the sizes of the partitions of n, and the product of choose expressions represents the number of partitions with those sizes. Thus, this can be re-expressed as Equation (6). The recursion of Equation (8) can be obtained from the original recursion on c n and reformulation.
The derivation for the β = 0 case is similar, starting from Equation (20), but again there is no data and U = 0. Introduce the integral expression for the pstable(α, s), perform a change of variables from (λ 0 , λ 1 , . . . , λ K ) to (Λ, θ 1 , . . . , θ K ) and then marginalise out Λ. At this point, the terms in M will cancel.