In this section, we demonstrate two applications of RRH under assumptions of categorical (Section 4.1) and continuous (Section 4.2) latent spaces. First, Section 4.1 uses a simple closed-form system consisting of a mixture of two beta distributions on the (0,1) interval to give exact comparisons of the behavior of RRH against that of existing non-categorical heterogeneity indices (Section 2.2). This experiment provides evidence that existing non-categorical heterogeneity indices can demonstrate counterintuitive behavior under various circumstances. Second, Section 4.2 demonstrates that RRH can yield heterogeneity measurements that are sensible and tractably computed, even for highly complex mappings $f:\mathcal{X}\to \mathcal{P}\left(\mathcal{Z}\right)$. There, we use a deep neural network to compute the effective number of observations in a database of handwritten images with respect to compressed latent representations on a continuous space.

#### 4.1. Comparison of Heterogeneity Indices Under a Mixture of Beta Distributions

Consider a system X with event space $\mathcal{X}$ on the open interval $(0,1)$, containing an embedded, unobservable, categorical structure represented by the latent system Z with event space $\mathcal{Z}=\left\{1,2\right\}$. The systems’ collective behavior is governed by the joint distribution of a beta mixture model (BMM), where ${\mathrm{Beta}}_{\alpha ,\beta}\left(x\right)$ is the probability density function for a beta distribution with shape parameters $\alpha ,\beta $, and $\mathit{\theta}=\left({\theta}_{1},{\theta}_{2},{\theta}_{3}\right)$ are parameters. The indicator function $\mathbb{1}[\cdot]$ evaluates to 1 if its argument is true, and to 0 otherwise. The prior distribution and the marginal probability of observable data follow from these definitions (see Figure 4 for illustrations).
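To make the setup concrete, the BMM can be simulated numerically. The sketch below is a minimal illustration that assumes the two mixture components are ${\mathrm{Beta}}_{{\theta}_{2},{\theta}_{3}}$ and ${\mathrm{Beta}}_{{\theta}_{3},{\theta}_{2}}$, with prior weight ${\theta}_{1}$ on the first; this parameterization is consistent with the components overlapping completely when ${\theta}_{2}={\theta}_{3}$, but it is an assumption for illustration, not the paper's exact specification:

```python
import numpy as np
from scipy.stats import beta

def sample_bmm(theta, n, rng=None):
    """Draw n samples (x, z) from a two-component beta mixture.

    theta = (theta1, theta2, theta3). Component z=1 is Beta(theta2, theta3)
    with prior weight theta1; component z=2 is Beta(theta3, theta2).
    (This component parameterization is assumed for illustration.)
    """
    rng = np.random.default_rng(rng)
    z = np.where(rng.random(n) < theta[0], 1, 2)
    a = np.where(z == 1, theta[1], theta[2])
    b = np.where(z == 1, theta[2], theta[1])
    x = rng.beta(a, b)
    return x, z

def marginal_pdf(x, theta):
    """Marginal density p(x) = theta1*Beta_{t2,t3}(x) + (1-theta1)*Beta_{t3,t2}(x)."""
    return (theta[0] * beta.pdf(x, theta[1], theta[2])
            + (1 - theta[0]) * beta.pdf(x, theta[2], theta[1]))

theta = (0.7, 5.0, 20.0)  # illustrative parameter values
x, z = sample_bmm(theta, 10_000, rng=0)
```

Setting ${\theta}_{2}={\theta}_{3}$ in this sketch collapses both components onto the same density, reproducing the complete-overlap case discussed below.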

To facilitate exact comparisons between heterogeneity indices, let us assume below that we have a model $f:\mathcal{X}\to \mathcal{P}\left(\mathcal{Z}\right)$ that maps an observation $x\in \mathcal{X}$ onto a degenerate distribution over $\mathcal{Z}$. The subscripting of ${f}_{\theta}$ denotes that the model is optimized such that the threshold $0\le \tau \left(\theta \right)\le 1$ is the solution of the model's optimality condition, which admits a closed form.
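Numerically, $\tau \left(\theta \right)$ can also be recovered by root-finding. The sketch below assumes, for illustration, that the components are ${\mathrm{Beta}}_{{\theta}_{2},{\theta}_{3}}$ and ${\mathrm{Beta}}_{{\theta}_{3},{\theta}_{2}}$ and that the threshold is the point where the posterior $p(z=2|x)$ crosses 1/2 (i.e., where the weighted component densities are equal); both details are assumptions standing in for the closed form:

```python
from scipy.optimize import brentq
from scipy.stats import beta

def tau(theta):
    """Threshold where the posterior p(z=2|x) crosses 1/2, i.e., where
    theta1 * Beta_{t2,t3}(x) = (1 - theta1) * Beta_{t3,t2}(x).

    Assumes (illustratively) components Beta(t2, t3) and Beta(t3, t2),
    with t2 < t3 so that component 1 concentrates at low x.
    """
    t1, t2, t3 = theta
    g = lambda x: t1 * beta.pdf(x, t2, t3) - (1 - t1) * beta.pdf(x, t3, t2)
    return brentq(g, 1e-6, 1 - 1e-6)  # bracket the sign change on (0, 1)
```

By symmetry of the two assumed components, equal mixture weights give a threshold at $x=1/2$, and increasing the weight on the first component pushes the threshold rightward.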

Under this model, the categorical RRH at any point $x\in \mathcal{X}$ is unity, since ${f}_{\theta}$ maps each observation onto a degenerate (single-category) distribution.

The expected value of ${f}_{\theta}(z=2|x)$ with respect to the data generating distribution (Equation (47)) can be written in terms of ${I}_{{x}_{0}}^{{x}_{1}}(a,b)$, the generalized regularized incomplete beta function (the `BetaRegularized[${x}_{0},{x}_{1},a,b$]` command in the Wolfram Language, and `betainc($a,b,{x}_{0},{x}_{1}$,regularized=True)` in Python’s `mpmath` package). Equation (52) implies that ${\overline{f}}_{\theta}(z=1)=1-{\overline{f}}_{\theta}(z=2)$. The pooled heterogeneity can thus be expressed as a function of $\theta $.

As a function of $\theta $, the within-group heterogeneity is unity (each observation's representation is degenerate), and therefore the between-group heterogeneity is ${\mathsf{\Pi}}_{q}^{\mathrm{B}}\left(\mathit{\theta}\right)={\mathsf{\Pi}}_{q}^{\mathrm{P}}\left(\mathit{\theta}\right)$.
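As a quick numerical sanity check, the regularized incomplete beta function referenced above can be evaluated directly with `mpmath`; the shape parameters and threshold below are illustrative values, not the paper's fitted quantities:

```python
from mpmath import mp, betainc

mp.dps = 30  # working precision in decimal digits

a, b, tau = 5, 20, 0.35  # illustrative shape parameters and threshold

# Generalized regularized incomplete beta I_{x0}^{x1}(a, b),
# e.g., as used in the expectation of f_theta(z = 2 | x)
lower = betainc(a, b, 0, tau, regularized=True)
upper = betainc(a, b, tau, 1, regularized=True)
```

The two pieces partition the unit interval, so `lower + upper` recovers 1, mirroring the complement identity ${\overline{f}}_{\theta}(z=1)=1-{\overline{f}}_{\theta}(z=2)$.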

Analytic expressions for the existing non-categorical heterogeneity indices ${\widehat{Q}}_{e}$ (Equation (17)), ${F}_{q}$ (Equation (21)), and ${L}_{q}$ (Equation (23)) were computed as “best-case” scenarios, as follows. First, the probability distribution over states for all expressions was the true prior distribution (Equation (46)). Distance matrices (and, by extension, the similarity matrix for ${L}_{q}$) were computed using the closed-form expectation of the absolute distance between two beta-distributed random variables (see Appendix B and the Supplementary Materials).

Figure 5 compares the categorical RRH against ${\widehat{Q}}_{e}$, ${F}_{q}$, and ${L}_{q}$ for BMM distributions of varying degrees of separation, and across different mixture component weights ($0.5\le {\theta}_{1}<1$). Without significant loss of generality, we show only the comparisons at $q=1$ (which excludes the numbers equivalent quadratic entropy) and at $q=2$.

The most salient differences between these indices occur when the BMM mixture components completely overlap (i.e., at ${\theta}_{2}={\theta}_{3}$). The RRH correctly identifies that there is effectively only one component, regardless of mixture weights. Only the Leinster–Cobbold index showed invariance to the mixture weights when ${\theta}_{2}={\theta}_{3}$, but it could not correctly identify that data were effectively unimodal.

The other stark difference arose when the mixture components were furthest apart (here, when ${\theta}_{2}=5$ and ${\theta}_{3}=20$). At this setting, the functional Hill numbers showed a paradoxical increase in the heterogeneity estimate as the prior distribution on components was skewed. The Leinster–Cobbold index was appropriately concave throughout the range of prior weights, but it never reached a value of 2 at its peak (as expected based on the predictions outlined in Section 2.2.3). Conversely, the RRH was always concave and reached a peak of 2 when both mixture components were equally probable.

#### 4.2. Representational Rényi Heterogeneity is Scalable to Deep Learning Models

In this example, the observable system X is that of images of handwritten digits defined on an event space $\mathcal{X}={[0,1]}^{784}$ of dimension ${n}_{x}=784$ (the black and white images are flattened from $28\times 28$ pixel matrices into 784-dimensional vectors). Our sample $\mathbf{X}={\left({x}_{ij}\right)}_{i=1,2,\dots ,N}^{j=1,2,\dots ,784}$ from this space is the familiar MNIST training dataset [22] (Figure 6), which consists of $N=60,000$ images distributed roughly evenly across the digit classes $\{0,1,\dots ,9\}$ (approximately 10% of the images per class). We assume each image carries equal importance, given by a weight vector $\mathbf{w}={\left({N}^{-1}\right)}_{i=1,2,\dots ,N}$. We are interested in measuring the heterogeneity of X with respect to a continuous latent representation Z defined on event space $\mathcal{Z}={\mathbb{R}}^{2}$. In the present example, this space is simply the continuous 2-dimensional compression of an image that best facilitates its reconstruction. We choose a latent dimensionality of ${n}_{z}=2$ in order to facilitate a pedagogically useful visualization of the latent feature representation, below. Unlike Section 4.1, in the present case we have no explicit representation of the true marginal distribution over the data, $p\left(\mathbf{x}\right)$.

Having defined the observable and latent spaces, measuring RRH now requires a model $f:\mathcal{X}\to \mathcal{P}\left(\mathcal{Z}\right)$ that maps a (flattened) image vector ${\mathbf{x}}_{i}\in \mathcal{X}$ onto a probability distribution over the latent space. Our chosen model is the encoder module of a pre-trained convolutional variational autoencoder (cVAE) provided by the Smart Geometry Processing Group at University College London (https://colab.research.google.com/github/smartgeometry-ucl/dl4g/blob/master/variational_autoencoder.ipynb) (Figure 7) [23,24]. Here, $\varphi $ denotes the encoder’s parameters, which specify a convolutional neural network (CNN) whose output layer returns a $2\times 1$ mean vector $\mathbf{m}\left({\mathbf{x}}_{i}\right)$ and a $2\times 1$ log-variance vector $\mathbf{s}\left({\mathbf{x}}_{i}\right)$ given ${\mathbf{x}}_{i}$. For simplicity, we denote the latter as the $2\times 2$ diagonal covariance matrix $\mathbf{C}\left({\mathbf{x}}_{i}\right)={\left({e}^{{s}_{j}\left({\mathbf{x}}_{i}\right)}{\delta}_{jk}\right)}_{j=1,2}^{k=1,2}$. Further details of the cVAE and its training can be found in Kingma and Welling [23,24]. Briefly, the cVAE learns to generate a compressed latent representation (via the encoder ${f}_{\varphi}$, which is an approximate posterior distribution) containing enough information about the input ${\mathbf{x}}_{i}$ to facilitate its reconstruction by a “decoder” module. The objective function is a lower bound on the model evidence $p\left(\mathbf{x}\right)$; maximizing it is equivalent to minimizing the Kullback–Leibler divergence between the approximate posterior ${f}_{\varphi}$ and the true (but unknown) posterior $p\left(\mathbf{z}|\mathbf{x}\right)$.

The continuous RRH under the model in Equation (55) for a single example ${\mathbf{x}}_{i}\in \mathcal{X}$ can be computed by merely evaluating the Rényi heterogeneity of a multivariate Gaussian (Equation (43) in Example 2) at the covariance matrix $\mathbf{C}\left({\mathbf{x}}_{i}\right)$. This is interpreted as the effective area of the 2-dimensional latent space consumed by the representation of ${\mathbf{x}}_{i}$.
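Concretely, this per-image computation is small. The sketch below assumes Equation (43) takes the standard form for the Rényi heterogeneity of an $n$-dimensional Gaussian, ${\mathsf{\Pi}}_{q}\left(\mathbf{C}\right)={(2\pi )}^{n/2}{|\mathbf{C}|}^{1/2}{q}^{n/(2(q-1))}$, with limit ${(2\pi e)}^{n/2}{|\mathbf{C}|}^{1/2}$ as $q\to 1$; the encoder output values shown are hypothetical:

```python
import numpy as np

def gaussian_renyi_heterogeneity(C, q):
    """Effective size (numbers equivalent) of an n-dim Gaussian with
    covariance C. Assumes the standard form
    Pi_q = (2*pi)^(n/2) * |C|^(1/2) * q^(n/(2*(q-1))),
    with the q -> 1 limit (2*pi*e)^(n/2) * |C|^(1/2).
    """
    C = np.atleast_2d(C)
    n = C.shape[0]
    det = np.linalg.det(C)
    if q == 1:
        return (2 * np.pi * np.e) ** (n / 2) * np.sqrt(det)
    return (2 * np.pi) ** (n / 2) * np.sqrt(det) * q ** (n / (2 * (q - 1)))

# Hypothetical encoder output: log-variance vector s -> C = diag(exp(s))
s = np.array([-2.0, -1.5])
C = np.diag(np.exp(s))
area = gaussian_renyi_heterogeneity(C, q=1)  # effective latent area for one image
```

For a 2-dimensional latent space the determinant term makes the result an area, matching the interpretation above.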

Since the handwritten digit images belong to groups of “Zeros, Ones, Twos, …, Nines,” this section will call the quantity ${\mathsf{\Pi}}_{q}^{\mathrm{W}}$ the within-observation heterogeneity (rather than the “within-group” heterogeneity) in order to avoid its interpretation as measuring the heterogeneity of a group of digits. Rather, it is interpreted as the effective area of latent space consumed by the representation of a single observation $\mathbf{x}\in \mathcal{X}$, on average. It is computed by evaluating Equation (44) at $\mathbf{C}\left(\mathbf{X}\right)={\left\{\mathbf{C}\left({\mathbf{x}}_{i}\right)\right\}}_{i=1,2,\dots ,N}$, given uniform weights on samples.

Finally, to compute the pooled heterogeneity ${\mathsf{\Pi}}_{q}^{\mathrm{P}}$, we use the parametric pooling approach detailed in Example 2, wherein the pooled distribution is a multivariate Gaussian with mean and covariance given by Equations (41) and (42), respectively. The pooled heterogeneity is then merely Equation (43) evaluated at ${\mathbf{C}}_{\ast}\left(\mathbf{X}\right)$, and represents the total area of the latent space consumed by the representation of X under ${f}_{\varphi}$. The effective number of observations in X with respect to the continuous latent representation Z is therefore given by the between-observation heterogeneity:

Equation (56) gives the effective number of observations in X because it uses the entire sample $\mathbf{X}$ (assuming, of course, that $\mathbf{X}$ provides adequate coverage of the observable event space). However, one could compute the effective number of observations in a subset of $\mathbf{X}$, if necessary. Let ${\mathbf{X}}^{\left(j\right)}={\left({\mathbf{x}}_{k}\right)}_{k=1,2,\dots ,{N}_{j}}$ be the subset of ${N}_{j}$ points in $\mathbf{X}$ found in the observable subspace ${\mathcal{X}}_{j}\subset \mathcal{X}$ (such as the subspace of MNIST digits corresponding to a given digit class). Given corresponding weights ${\mathbf{w}}^{\left(j\right)}={\left({N}_{j}^{-1}\right)}_{k=1,2,\dots ,{N}_{j}}$, Equation (56) is then simply restricted to ${\mathbf{X}}^{\left(j\right)}$, yielding Equation (57).
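The pipeline described above (pooling the per-observation Gaussians, then dividing pooled by within-observation heterogeneity) can be sketched end to end. The sketch assumes Equations (41) and (42) are the standard moment-matched mixture mean and covariance, that Equation (43) at $q=1$ equals ${(2\pi e)}^{n/2}{|\mathbf{C}|}^{1/2}$, and that the within-observation term at $q=1$ reduces to a weighted geometric mean; the encoder outputs are simulated placeholders, not MNIST values:

```python
import numpy as np

def pooled_moments(M, Cs, w):
    """Moment-matched pooled Gaussian for a mixture of Gaussians.

    M  : (N, n) per-observation means m(x_i)
    Cs : (N, n, n) per-observation covariances C(x_i)
    w  : (N,) weights summing to 1

    Assumes the standard mixture moments:
      m* = sum_i w_i m_i
      C* = sum_i w_i (C_i + m_i m_i^T) - m* m*^T
    """
    m_star = w @ M
    C_star = (np.einsum('i,ijk->jk', w, Cs)
              + np.einsum('i,ij,ik->jk', w, M, M)
              - np.outer(m_star, m_star))
    return m_star, C_star

def het(C):
    # q = 1 Rényi heterogeneity of a Gaussian: (2*pi*e)^(n/2) * |C|^(1/2)
    n = C.shape[0]
    return (2 * np.pi * np.e) ** (n / 2) * np.sqrt(np.linalg.det(C))

rng = np.random.default_rng(0)
N, n = 500, 2
M = rng.normal(size=(N, n))                        # placeholder encoder means
Cs = np.array([np.diag(np.exp(rng.normal(-2, 0.3, n))) for _ in range(N)])
w = np.full(N, 1.0 / N)                            # uniform observation weights

_, C_star = pooled_moments(M, Cs, w)
pooled = het(C_star)                               # Pi^P
within = np.exp(w @ np.log([het(C) for C in Cs]))  # Pi^W at q = 1
effective_n = pooled / within                      # Pi^B: effective number of observations
```

Restricting `M`, `Cs`, and `w` to the rows belonging to a single digit class gives the subset version corresponding to Equation (57).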

Figure 8 shows the effective number of observations in the subsets of MNIST images belonging to each image class, under the continuous representation learned by the cVAE. One can appreciate that the MNIST class of “Ones” (in the training set) has the smallest effective number of observations. Subjective visual inspection of the MNIST samples in Figure 6 may suggest that the Ones are indeed relatively more homogeneous as a group than the other digits (this claim is given further objective support in Appendix C, based on deep similarity metric learning [47,48]).

Figure 9 demonstrates the correspondence between the between-observation heterogeneity (i.e., the effective number of observations) and the visual diversity of different samples from the latent space of our cVAE model. For each image in the MNIST training dataset, we computed the effective location of its latent representation: $\mathbf{m}\left({\mathbf{x}}_{i}\right)$ for $i\in \{1,2,\dots ,N\}$. For each of these image representations, we defined a “neighborhood” including the 49 other images whose latent coordinates were closest in Euclidean distance (which is sensible on the latent space, given the Gaussian prior). For all such neighborhoods, we then reconstructed the corresponding images on $\mathcal{X}$ and computed their between-observation heterogeneity using Equation (57).
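The neighborhood construction itself can be sketched with a brute-force nearest-neighbor search; the latent coordinates below are random placeholders standing in for the cVAE means $\mathbf{m}\left({\mathbf{x}}_{i}\right)$:

```python
import numpy as np

def knn_indices(Z, k=49):
    """Indices of the k nearest other points (Euclidean) for each row of Z.

    Z : (N, 2) latent coordinates, e.g., the encoder means m(x_i).
    Returns an (N, k) integer array; brute force, fine for small N.
    """
    # Pairwise squared Euclidean distances
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # exclude each point from its own neighborhood
    return np.argsort(d2, axis=1)[:, :k]

# Placeholder coordinates standing in for the cVAE latent means
rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 2))
nbrs = knn_indices(Z, k=49)  # each row lists a point's 49 nearest neighbors
```

Each row of `nbrs` indexes one neighborhood, whose reconstructed images would then be scored with the subset heterogeneity of Equation (57).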

Figure 9b shows the estimated effective number of observations for the latent neighborhoods with the greatest and least heterogeneity. One can appreciate that neighborhoods with ${\mathsf{\Pi}}_{q}^{\mathrm{B}}$ close to 1 include images with considerably less diversity than neighborhoods with ${\mathsf{\Pi}}_{q}^{\mathrm{B}}$ closer to the upper limit of 49. These data suggest that the between-observation heterogeneity (the effective number of observations in X with respect to the latent features learned by a cVAE) can indeed correspond to visually appreciable sample diversity.