Representational Rényi Heterogeneity

A discrete system’s heterogeneity is measured by the Rényi heterogeneity family of indices (also known as Hill numbers or Hannah–Kay indices), whose units are the numbers equivalent. Unfortunately, numbers equivalent heterogeneity measures for non-categorical data require a priori (A) categorical partitioning and (B) pairwise distance measurement on the observable data space, thereby precluding application to problems with ill-defined categories or where semantically relevant features must be learned as abstractions from some data. We thus introduce representational Rényi heterogeneity (RRH), which transforms an observable domain onto a latent space upon which the Rényi heterogeneity is both tractable and semantically relevant. This method requires neither a priori binning nor definition of a distance function on the observable space. We show that RRH can generalize existing biodiversity and economic equality indices. Compared with existing indices on a beta-mixture distribution, we show that RRH responds more appropriately to changes in mixture component separation and weighting. Finally, we demonstrate the measurement of RRH in a set of natural images, with respect to abstract representations learned by a deep neural network. The RRH approach will further enable heterogeneity measurement in disciplines whose data do not easily conform to the assumptions of existing indices.


I. INTRODUCTION
Measuring heterogeneity is of broad scientific importance, with applications in biodiversity (ecology and microbiology) [1,2], resource concentration (economics) [3], and the consistency of clinical trial results (biostatistics) [4], to name a few. In most of these cases, one measures the heterogeneity of a discrete system equipped with a probability mass function.
Discrete systems assume that all observations of a given state are identical (zero distance), and that all pairwise distances between states are permutation invariant. This assumption is violated when relative distances between states are important. For example, an ecosystem is not biodiverse if all species serve the same functional role [5]. Although species are categorical labels, their pairwise differences in terms of ecological function vary, thus violating the discrete-space assumptions. Mathematical ecologists have therefore developed heterogeneity measures for non-categorical systems, which they generally call "functional diversity indices" [6-11]. These indices typically require a priori discretization and specification of a distance function on the observable space.
The requirement to define the state space a priori is problematic when the states are incompletely observable: that is, when they may be noisy, unreliable, or invalid. For example, consider sampling a patient from a population of individuals with psychiatric disorders and assigning a categorical state label corresponding to his or her diagnosis according to standard definitions [12]. Given that psychiatric conditions are not defined by objective biomarkers, the individual's diagnostic state will be uncertain. Indeed, many of these conditions are inconsistently diagnosed across raters [13], and there is no guarantee that they correspond to valid biological processes. Alternatively, it is possible that variation within some categorical diagnostic groups is simply related to diagnostic "noise," or nuisance variation, while variation within other diagnostic groups constitutes the presence of sub-strata. Appropriate measurement of heterogeneity in such disciplines requires freedom from the discretization requirement of existing non-categorical heterogeneity measures.
Pre-specified distance functions may fail to capture semantically relevant geometry in the raw feature space. For example, the Euclidean distance between Edmonton and Johannesburg is of little use, since the straight-line path cannot be traversed. Rather, the appropriate distances between points must account for the data's underlying manifold of support. Representation learning addresses this problem by learning a latent embedding upon which distances are of greater semantic relevance [14]. Indeed, we have observed superior clustering of natural images embedded on Riemannian manifolds [15] (but also see Shao, Kumar, & Fletcher [16]), and preservation of semantic hierarchies when linguistic data are embedded in a hyperbolic space [17].
We therefore seek non-categorical heterogeneity indices that require neither a priori definition of categorical state labels nor a distance function. The present study proposes a solution to these problems based on measuring heterogeneity on learned latent representations, rather than on raw observable data.
Our method, representational Rényi heterogeneity (RRH), involves learning a mapping from the space of observable data to a latent space upon which an existing measure (the Rényi heterogeneity [18], also known as the Hill numbers [19] or Hannah-Kay indices [20]) is meaningful and tractable. The original categorical formulation of Rényi heterogeneity and several non-categorical extensions [8,10,21] are reviewed in Section II. Section III introduces RRH. Section III A compares RRH with existing non-categorical heterogeneity indices when the latent space is a categorical set learned by a beta-mixture model. Section III B illustrates our method's performance when the latent space is a Gaussian embedding of images (the MNIST dataset [22]) learned by a deep generative model, and shows that RRH is particularly sensitive to the distinctiveness of components in a mixture distribution.

II. EXISTING HETEROGENEITY INDICES

A. Rényi Heterogeneity
Given a categorical set X = {1, 2, . . ., n} with probability distribution p = (p_i)_{i=1,2,...,n}, Jost [1] has argued that heterogeneity is best measured using the exponential of the Rényi entropy [18,19],

Π_q(p) = ( Σ_{i=1}^{n} p_i^q )^{1/(1-q)},    (1)

whose parameter 0 ≤ q specifies insensitivity to rare classes. Special cases of Equation 1 include the observed richness at q = 0, the exponential of Shannon entropy [23] (the perplexity) at q → 1, the inverse Simpson concentration or effective number of parties [24,25] at q = 2, and the Berger-Parker diversity index as q → ∞. There are two unique properties of the family defined by Equation 1. First is satisfaction of the replication principle [6], which states that if M equally heterogeneous systems with non-overlapping domains are pooled, the aggregated system's heterogeneity is simply M-fold larger than that of one of the subsystems.
Second, this family is measured in terms of the size of a system's event space or domain of support: a set of units known as "numbers equivalent" [26]. The numbers equivalent of a system A is the number of partitions in an equally heterogeneous but uniformly distributed system B. These units are always interpretable in terms of physical sizes, rather than the various alternative interpretations of other indices [27-29]. This may facilitate comparison across studies. Equation 1 may also be transformed into many other indices of heterogeneity [30] and inequality [31,32]. Thus, Rényi heterogeneity is an interpretable and rich family of measures.
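Equation 1 is straightforward to evaluate numerically. Below is a minimal NumPy sketch (function name and library choice are ours, not the paper's), covering the special cases noted above:

```python
import numpy as np

def renyi_heterogeneity(p, q):
    """Numbers-equivalent Rényi heterogeneity of order q for a discrete
    distribution p (also known as the Hill number or Hannah-Kay index)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # zero-probability classes carry no mass
    if np.isclose(q, 1.0):             # q -> 1: perplexity (exp of Shannon entropy)
        return float(np.exp(-np.sum(p * np.log(p))))
    if np.isinf(q):                    # q -> inf: Berger-Parker index
        return float(1.0 / p.max())
    return float(np.sum(p ** q) ** (1.0 / (1.0 - q)))

p = np.array([0.5, 0.3, 0.2])
print(renyi_heterogeneity(p, 0))   # observed richness: 3.0
print(renyi_heterogeneity(p, 2))   # inverse Simpson concentration: 1/0.38
```

For a uniform distribution over n classes, every order q returns exactly n effective classes, which is the replication-principle behaviour described above.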

B. Numbers Equivalent Heterogeneity Indices for Non-Categorical Spaces

Many real-world datasets are not discrete. Several ecological measures of non-categorical heterogeneity thus attempt to relax the restrictive discrete metric assumption. On account primarily of their interpretability and satisfaction of the replication principle (Section II A), we focus only on those indices with units of numbers equivalent.

Preliminaries
The measures reviewed in this section necessitate discretization of the observable space, with specification of a corresponding discrete probability distribution p = (p_i)_{i=1,2,...,n} over n ∈ N+ categories, and an n × n matrix of pairwise distances D = (d_ij) between categories. For the examples in this section, we are required to model metric and ultrametric distances, which we do for a simple system with n = 3 states: two states separated by a base distance of 1, and a third state at height h above the base's midpoint, giving the parametric distance matrix

D(h) = [ 0, 1, s(h); 1, 0, s(h); s(h), s(h), 0 ],  with  s(h) = √(h² + 1/4),

FIG. 1. Simulated distance function and probability distribution for a 3-category system. Panel A: Demonstration of the fact that the ultrametric triangle inequality is satisfied only by isosceles triangles (here h is the triangle height; base width is constant at 1). Panel B: A one-parameter probability mass function with skewness parameter 1 ≤ κ. At κ = 1, all discrete states are equiprobable. For κ > 1, probability mass is increasingly skewed toward a single state.
and it will be ultrametric when √3/2 ≤ h, i.e., when the two equal sides are at least as long as the base (Fig. 1). The probability distribution over states is a one-parameter mass function in which 1 ≤ κ governs skewness (Fig. 1). Each of the following heterogeneity indices is analyzed in closed form with respect to distance h and inequality level κ.
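The three-state geometry can be checked numerically. The sketch below (function names ours) builds the distance matrix from the triangle described in Fig. 1 (base width 1, apex height h, assumed from the caption) and tests the ultrametric condition directly:

```python
import numpy as np

def distance_matrix(h, b=1.0):
    """Pairwise distances for the three-state system: two states at the
    ends of a base of width b, the third at height h above the midpoint."""
    s = np.sqrt(h ** 2 + (b / 2) ** 2)   # the two equal isosceles sides
    return np.array([[0.0, b, s],
                     [b, 0.0, s],
                     [s, s, 0.0]])

def is_ultrametric(D, tol=1e-12):
    """Check d(x, z) <= max{d(x, y), d(y, z)} over all triples of states."""
    n = len(D)
    return all(D[i, k] <= max(D[i, j], D[j, k]) + tol
               for i in range(n) for j in range(n) for k in range(n))

# The ultrametric inequality holds once h >= sqrt(3)/2, where the two
# equal sides become at least as long as the base.
print(is_ultrametric(distance_matrix(0.5)))   # False
print(is_ultrametric(distance_matrix(1.0)))   # True
```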

Numbers Equivalent Quadratic Entropy
Rao [33] introduced a quadratic entropy that was later generalized into a power mean by Chiu and Chao [8]:

Q(p, D) = Σ_{i,j} d_ij p_i p_j.

The RQE is the expected pairwise distance between two states drawn independently from p, given matrix D. The units of RQE are distance, and it is unbounded with respect to increases in D.
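As a quadratic form, the RQE is one line of NumPy; a minimal sketch (function name ours):

```python
import numpy as np

def rao_quadratic_entropy(p, D):
    """Rao's quadratic entropy: the expected pairwise distance between
    two states drawn independently from p."""
    p = np.asarray(p, dtype=float)
    return float(p @ np.asarray(D, dtype=float) @ p)

# Under the discrete metric (all off-diagonal distances 1), the RQE
# reduces to the Gini-Simpson index 1 - sum(p^2).
p = np.array([0.5, 0.3, 0.2])
D = 1.0 - np.eye(3)
print(rao_quadratic_entropy(p, D))   # 0.62 = 1 - 0.38
```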
Ricotta and Szeidl [21] derived an expression of RQE in numbers equivalent based on the discrete metric d_ij = 1 − δ_ij, where δ_ij is Kronecker's delta. Simplifying Equation 9 under this metric yields the Gini-Simpson index, which can be converted into Rényi heterogeneity by substitution into Equation 4.
We then generalize the distance matrix from 1 − I, rescaling it such that 0 ≤ d_ij ≤ 1 for all (i, j), as in the categorical case. One feature of Qe that suggests good interpretability is that its maximal value in the present case reaches n = 3 (occurring at the ultrametric transition point). Furthermore, Qe approaches 1 as κ → ∞. However, these benefits are offset by some limitations. First, in the limit of distance in the present example, Qe → 4/9, whereas intuition suggests that as one vertex of a triangle is pulled further away, the effective number of states should approach 2 (since the two other states become ever closer). Finally, Fig. 2 shows that numbers equivalent RQE behaves in a categorically different fashion depending on whether the distance function is ultrametric. This is problematic if the ultrametric property cannot be guaranteed.

FIG. 2. Effect of inequality (κ; Eq. 7) and distance (h; Fig. 1) on numbers equivalent quadratic entropy (Qe) for a simple three-state categorical system. Recall that this system forms a triangular graphical orientation; we hold the width (distance between two of the three nodes) fixed at 1, and h ∈ R≥0 (x-axis) denotes the distance of the third point from the remaining two. Each line denotes a different value of the abundance inequality parameter κ, where higher values denote a more skewed distribution (which should reduce heterogeneity). For the range of h shown as solid lines, the system's distance matrix satisfies the ultrametric triangle inequality (d(x, z) ≤ max{d(x, y), d(y, z)}). The dashed region corresponds to values of h where the ultrametric triangle inequality is not satisfied.

Functional Hill Numbers
Chiu and Chao [8] derived the functional Hill numbers

F_q(D, p) = [ Σ_{i,j} (d_ij / Q) (p_i p_j)^q ]^{1/(2(1-q))},

where Q is the RQE. This index can be particularly useful when the categorical partitioning on the observable space is reliable (such as species labels in ecological samples). However, when p_i = p_j for all (i, j), then F_q(D, p) = n. That is, when all states are equally likely, F_q is insensitive to their dissimilarities (Fig. 3). A notable benefit of F_q in comparison to Qe is that F_q behaves consistently regardless of whether distance is ultrametric. However, Fig. 3 shows additional drawbacks. First, F_q often paradoxically increases as the state probability distribution becomes more unequal (most notably in F_0 and F_1; behaviour opposite to that required of heterogeneity measures [27,34,35]). Moreover, this increase results in F_q(κ, h) > 3, which paradoxically occurs as one state is being pushed closer to the others. To summarize, the functional Hill numbers estimate more states than are really present, despite reduction in between-state distances and greater inequality in the probability mass function.
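The equiprobable-state insensitivity is easy to verify numerically. The sketch below (implementation details ours) follows the Chiu-Chao form for q ≠ 1, in which pairwise distances are weighted relative to the RQE:

```python
import numpy as np

def functional_hill(p, D, q):
    """Functional Hill numbers of Chiu and Chao (q != 1)."""
    p = np.asarray(p, dtype=float)
    D = np.asarray(D, dtype=float)
    Q = p @ D @ p                               # Rao's quadratic entropy
    pp = np.outer(p, p)
    total = np.sum((D / Q) * pp ** q)
    return float(total ** (1.0 / (2.0 * (1.0 - q))))

# When all states are equiprobable, F_q = n regardless of the distances,
# illustrating the insensitivity discussed above.
D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 5.0],
              [5.0, 5.0, 0.0]])
print(functional_hill(np.ones(3) / 3, D, q=2))   # 3.0
```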

Leinster-Cobbold Index
The index derived by Leinster and Cobbold [10] (L_q) is defined as

L_q(S, p) = [ Σ_i p_i (Sp)_i^{q-1} ]^{1/(1-q)},

where S is an n × n positive semidefinite similarity matrix, here obtained by the transformation S_ij = e^{-u d_ij}, where u ∈ R≥0 is a scaling factor. When u = 0, S_ij = 1 everywhere. Conversely, when u → ∞, we obtain the identity matrix and L_q recovers the Rényi heterogeneity. The Leinster-Cobbold index compares favourably to F_q in that L_q does not lose sensitivity to dissimilarity when p_i = p_j for all (i, j). However, the Leinster-Cobbold index is particularly sensitive to the form of the similarity transformation. In the present case, the maximal value of L_q gradually approaches 3 as u grows (and only when u → ∞ does it reach 3), while progressively losing sensitivity to distance. In our opinion, this property somewhat defeats the purpose of non-categorical heterogeneity measurement, since the correct number of states can only be identified when we return to imposition of the discrete metric.
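The two limiting behaviours of the scaling factor u can be sketched directly (implementation ours, for q ≠ 1):

```python
import numpy as np

def leinster_cobbold(p, D, q, u):
    """Leinster-Cobbold similarity-sensitive diversity (q != 1),
    with similarity S_ij = exp(-u * d_ij)."""
    p = np.asarray(p, dtype=float)
    S = np.exp(-u * np.asarray(D, dtype=float))
    Sp = S @ p      # expected similarity of each state to a random individual
    return float(np.sum(p * Sp ** (q - 1.0)) ** (1.0 / (1.0 - q)))

p = np.array([0.5, 0.3, 0.2])
D = 1.0 - np.eye(3)
print(leinster_cobbold(p, D, q=2, u=0.0))   # 1.0: all states treated as identical
print(leinster_cobbold(p, D, q=2, u=1e6))   # ~2.63: recovers inverse Simpson (1/0.38)
```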

III. REPRESENTATIONAL RÉNYI HETEROGENEITY
The indices reviewed in Section II B were non-parametric generalizations of a categorical heterogeneity measure onto non-categorical spaces. In this section we propose and evaluate two methods that measure Rényi heterogeneity on learned latent representations, rather than on the observable space. There are two main approaches:

A. Learning a categorical representation to which we can apply the standard Rényi heterogeneity.

B. Deriving parametric forms of Rényi heterogeneity for learned non-categorical representations.
Both methods employ the same logic, illustrated in Fig. 4. Essentially, we learn a model for a posterior distribution on some latent variable z ∈ Z given observable data x ∈ X such that the Rényi heterogeneity is either more scientifically relevant or easier to measure on the latent space.

A. Categorical Representational Rényi Heterogeneity
For each of n_s ∈ N+ observations of data X = (x_ij), this approach assumes that there exists a latent categorical representation z = (z_i)_{i=1,2,...,n_s} ∈ Z = {1, 2, . . ., n_c} to which each point in X can be mapped (and vice versa). Unlike the indices presented in Section II B, we do not presume that the mapping X → Z is known; rather, we seek to learn a parameterized posterior distribution p_θ(z|X) with which we may compute the Rényi heterogeneity as

Π_q(z|X) = [ Σ_{i=1}^{n_c} p_θ^q(z = i|X) ]^{1/(1-q)}.

FIG. 3. Effect of inequality (κ; Eq. 7) and distance (h; Fig. 1) on the functional Hill numbers (F_q). Each plot depicts the functional Hill numbers at a different value of the parameter q for the three-category system depicted in Fig. 1. Axes, colours, and line styles are defined exactly as described in the caption for Fig. 2.
FIG. 4. Graphical illustration of the two main approaches for computing representational Rényi heterogeneity. In both cases, we map sampled points on an observable space X onto a latent space Z, upon which we apply the Rényi heterogeneity measure. The mapping is illustrated by the curved arrows, and should yield a posterior distribution over the latent space. Panel A shows the case in which the latent space is categorical (for example, discrete components of a mixture distribution on a continuous space). Panel B illustrates the case in which Z is non-categorical. An example of the latter case would be dimensionality reduction with an embedding technique such as principal components analysis, where the latent space is continuous.
In cases where Z is continuous, we derive a parametric form for the Rényi heterogeneity,

Π_q(z|X) = [ ∫_Z p_θ^q(z|X) dz ]^{1/(1-q)}.

In this section, we contrast this approach with the Qe, F_q, and L_q indices introduced in Section II B using a beta mixture model (BMM). We show key elements of the model here, with a more thorough treatment offered in Appendix B.
For the purposes of this paper, we define a simple BMM by the joint distribution p(x, z) = p(z) p(x|z) over x ∈ (0, 1) and z ∈ {1, 2}, where c = p(z = 2), and

Beta_{α,β}(x) = x^{α-1} (1 - x)^{β-1} / B(α, β)

is the probability density function for the beta distribution with label z = 1, B is the beta function, and (α, β) are pseudo-count parameters. For notational and computational parsimony, we define p(x|z = 2) = Beta_{β,α}(x).
Assuming that the α, β, and c parameters have been learned for a BMM given data X, we can compute the posterior probability of cluster assignment for a point x ∈ (0, 1) as

p_{α,β,c}(z = 1|x) = (1 - c) Beta_{α,β}(x) / [ (1 - c) Beta_{α,β}(x) + c Beta_{β,α}(x) ].

One approach to computing the RRH in this case would be to use the p_{α,β,c}(z|x) values across x ∈ (0, 1) to perform the decomposition

Π_q^P(z|X) = Π_q^W(z|X) Π_q^B(z|X),

which follows from the original derivation by Jost [36]. Here, Π_q^W(z|X) can be interpreted as the average effective number of mixture components per x ∈ X, Π_q^P(z|X) is the overall effective number of mixture components, and Π_q^B(z|X) is the effective number of distinct components (factoring out the within-point uncertainty). However, the integral in the denominator may be expensive to compute. Another approach is to assume the model assigns "hard" component labels over the domain of X, based on estimation of an optimal assignment threshold τ_{α,β,c} on X for which p(z = 2|x) > p(z = 1|x) for all x > τ_{α,β,c}. In the case of the two-component BMM (with β > α), the assignment threshold is computed deterministically as

τ_{α,β,c} = [ 1 + (c / (1 - c))^{1/(β-α)} ]^{-1}.

The posterior marginal distribution on Z is then (1 - ψ_{α,β,c}(τ), ψ_{α,β,c}(τ)), where ψ_{α,β,c}(τ) = ∫_τ^1 p(x) dx is the survival function of the BMM marginal distribution on the observable space X. The resulting RRH is thus

Π_q(z|X) = [ (1 - ψ(τ))^q + ψ(τ)^q ]^{1/(1-q)},

where we drop the subscripts on ψ_{α,β,c}(·) for notational parsimony. Figure 5 plots the relationship between threshold values and the RRH for two BMM distributions (one where the components overlap completely, and one with clearer separation). As expected, when the component distributions are completely overlapping (i.e., where one category assignment is always more probable than the other), the RRH is 1. This provides an important sanity check, since it suggests that RRH in this scenario will not falsely overestimate the number of categories. With greater separation of the component distributions, the RRH increases, but remains within the upper bound of 2.
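The hard-threshold computation can be sketched end-to-end. The code below is our sketch, not the paper's implementation: it derives the threshold from the mirrored-component posterior odds and approximates the survival function with a Riemann sum:

```python
import numpy as np
from math import lgamma

def beta_pdf(x, a, b):
    """Beta density computed via log-gamma to avoid overflow."""
    logB = lgamma(a) + lgamma(b) - lgamma(a + b)
    return np.exp((a - 1) * np.log(x) + (b - 1) * np.log(1 - x) - logB)

def hard_threshold_rrh(alpha, beta, c, q=2, n_grid=200_000):
    """Categorical RRH of a two-component beta mixture with components
    Beta(alpha, beta) and Beta(beta, alpha), using hard labels.  For
    beta > alpha, p(z=2|x) > p(z=1|x) iff (x/(1-x))**(beta-alpha) > (1-c)/c,
    which yields the deterministic threshold tau below."""
    x = np.linspace(1e-9, 1 - 1e-9, n_grid)
    mix = (1 - c) * beta_pdf(x, alpha, beta) + c * beta_pdf(x, beta, alpha)
    r = ((1 - c) / c) ** (1.0 / (beta - alpha))
    tau = r / (1.0 + r)
    psi = float(np.sum(mix[x > tau]) * (x[1] - x[0]))  # mass assigned to z = 2
    pz = np.array([1.0 - psi, psi])
    return float(np.sum(pz ** q) ** (1.0 / (1.0 - q)))

# Well-separated mirrored components with equal weights: by symmetry,
# half the mass falls on each side of tau = 0.5, so the RRH is ~2.
print(hard_threshold_rrh(alpha=2.0, beta=8.0, c=0.5))   # ~2.0
```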

Comparison with Existing Heterogeneity Indices
Figure 6 compares the categorical RRH (using the hard thresholding method) against Qe, F_q, and L_q for BMM distributions of varying degrees of separation, and across different mixture component weights (0 < c < 1). Without significant loss of generality, we show only those comparisons at q = 2.
The most salient differences between these indices occur when the BMM mixture components completely overlap (i.e., at α = β). The RRH correctly identifies that there is effectively only one component, regardless of mixture weights. Only the Leinster-Cobbold index showed invariance to the mixture weights when α = β, but it could not correctly identify that the data were effectively unimodal.
The other stark difference arose when the mixture components were furthest apart (here when α = 0.1 and β = 10). At this setting, the functional Hill numbers showed a paradoxical increase in the heterogeneity estimate as the prior distribution on components was skewed. The Leinster-Cobbold index was appropriately concave throughout the range of prior weights, but it never reached a value of 2 at its peak (as expected based on the predictions outlined in Section II B 4). Conversely, the RRH was always concave and reached a peak of 2 when both mixture components were equally probable.

B. Non-Categorical Representational Rényi Heterogeneity
In many cases, the most appropriate latent representation for our observable data will be continuous, in which case the Rényi heterogeneity is computed as

Π_q(z|X) = [ ∫_Z p_θ^q(z|X) dz ]^{1/(1-q)}.

In the two-component BMM example shown in the previous section, we were able to simplify the computation by mapping observations onto the latent space using a hard threshold. This is not possible in the non-categorical setting. Instead, we present a method based on the Rényi heterogeneity decomposition procedure outlined by Jost [36]. As an example system that is decidedly more complex than the two-component BMM, we consider a convolutional variational autoencoder (cVAE; [37]) applied to the MNIST dataset of handwritten digit images [22]. However, the procedure we employ herein may be generalized to other distributions.
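The continuous form can be checked numerically. Below is a sketch (names ours) that recovers the closed-form value for a standard normal density at q = 2, where the heterogeneity is an effective support length of 2√π σ:

```python
import numpy as np

def continuous_renyi_heterogeneity(pdf, grid, q):
    """Rényi heterogeneity of a continuous density on a uniform grid:
    Pi_q = (integral of f^q)^(1/(1-q)), by Riemann sum."""
    dx = grid[1] - grid[0]
    f = pdf(grid)
    return float((np.sum(f ** q) * dx) ** (1.0 / (1.0 - q)))

grid = np.linspace(-10.0, 10.0, 100_001)
phi = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
print(continuous_renyi_heterogeneity(phi, grid, q=2))   # ~3.5449 = 2*sqrt(pi)
```

The units here are those of the observable axis (a length), which is the "effective support or domain size" interpretation discussed below.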
A VAE is made up of an encoder and a decoder (Fig. 7), which together aim to learn a compressed latent representation that enables reconstruction of the respective input data. In the convolutional VAE model, the encoder is a convolutional neural network whose output layer encodes the mean (μ(X)) and diagonal covariance (Σ(X)) functions for a Gaussian distribution over latent representation vectors z in the n_z-dimensional latent space Z. The dimension of Z is typically much smaller than that of the input space X. The decoder of a cVAE is a neural network whose input is a latent representation vector z, and whose output is a reconstruction of some input data. For more details, see Kingma and Welling [37]. In the present study, the input data consist of 28-by-28 binary images, and the latent space is set to a dimension of n_z = 2 for illustrative purposes. Computing the RRH requires us to derive a parametric form for Π_q(z|X) on the latent space, which for the cVAE translates into computing Π_q(Encoder(X)).

Deriving Continuous Representational Rényi Heterogeneity for the cVAE
Conceptually, the Rényi heterogeneity of the space captured by our dataset X = (x_i)_{i=1,2,...,n_s} corresponds to the effective number of observations. Here, greater variation in X corresponds to an increase in the effective number of completely distinct observations. Since the encoder generates a Gaussian distribution for each observation x_i, the latent representation of the whole dataset is a Gaussian mixture model with equal weights (since we assume a uniform distribution over observations). One can show that the Rényi heterogeneity for a multivariate Gaussian is

Π_q(Σ_i) = (2π)^{n_z/2} q^{n_z/(2(q-1))} |Σ_i|^{1/2},

where Σ_i is the covariance function given observation x_i. The average Rényi heterogeneity for a single observation in the set i ∈ {1, 2, . . ., n_s} can be computed as the power mean

Π_q^W = [ (1/n_s) Σ_{i=1}^{n_s} Π_q(Σ_i)^{1-q} ]^{1/(1-q)}.

The projection of all observations onto the latent space results in a mixture of Gaussians with pooled covariance matrix

Σ̃ = ⟨Σ⟩ + ⟨μ μ^T⟩ − ⟨μ⟩⟨μ⟩^T,

where μ is the mean function given some input data (indices omitted for parsimony), and ⟨·⟩ denotes expectation over all observations. The RRH over the pooled dataset is simply Π_q(Σ̃). As proven by Jost [36], the pooled Rényi heterogeneity can be decomposed into within-observation and between-observation factors:

Π_q(Σ̃) = Π_q^W × Π_q^B.    (27)

The units of the overall and within-observation heterogeneity values generalize numbers equivalent to continuous spaces, and are perhaps better referred to as effective support or domain size (length, area, volume). We are interested primarily in the between-observation value, since it returns the effective number of observations. Conversely, the within-observation heterogeneity is more a description of the model's uncertainty regarding the latent representation of the input data. For the mixture of Gaussians model, the between-observation heterogeneity can be easily computed as

Π_q^B = Π_q(Σ̃) / Π_q^W.    (28)

FIG. 7. Illustration of the convolutional variational autoencoder [37]. The computational graph is depicted from top to bottom. An n_x-dimensional input datum x_i (white rectangle) is passed through an encoder (in our experiment a convolutional neural network, CNN), which parameterizes an n_z-dimensional multivariate Gaussian over the coordinates z_i for the image's embedding on the latent space Z. The latent embedding can then be passed through a decoder (blue rectangle), a neural network employing transposed convolutions, to yield a reconstruction of the original input datum. The loss function for this network is a variational lower bound on the model evidence of the input data (see Kingma and Welling [37] for details).
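The Gaussian heterogeneity and its within/between decomposition can be sketched as follows (a sketch under the equal-weight assumption above; function names and the toy inputs are ours):

```python
import numpy as np

def gaussian_heterogeneity(cov, q):
    """Pi_q of a multivariate Gaussian:
    (2*pi)^(n/2) * q^(n/(2(q-1))) * |Sigma|^(1/2)."""
    n = cov.shape[0]
    return float((2 * np.pi) ** (n / 2) * q ** (n / (2 * (q - 1)))
                 * np.sqrt(np.linalg.det(cov)))

def rrh_decomposition(mus, covs, q=2):
    """Pooled, within-, and between-observation heterogeneity for an
    equal-weight mixture of Gaussians (sketch of Jost's decomposition)."""
    mus, covs = np.asarray(mus, float), np.asarray(covs, float)
    # Pooled covariance via the law of total covariance (equal weights).
    mbar = mus.mean(axis=0)
    pooled_cov = covs.mean(axis=0) + (mus - mbar).T @ (mus - mbar) / len(mus)
    pooled = gaussian_heterogeneity(pooled_cov, q)
    # Within: power mean of order (1 - q) of the per-observation values.
    per_obs = np.array([gaussian_heterogeneity(c, q) for c in covs])
    within = float(np.mean(per_obs ** (1 - q)) ** (1.0 / (1 - q)))
    return pooled, within, pooled / within     # between = pooled / within

# If every observation maps to the same Gaussian, the dataset is
# effectively one observation: between-observation heterogeneity is 1.
print(rrh_decomposition([[0.0, 0.0]] * 5, [np.eye(2)] * 5)[2])   # 1.0
# Spreading the means apart increases the between-observation value.
print(rrh_decomposition([[-5.0, 0.0], [5.0, 0.0]],
                        [np.eye(2) * 0.01] * 2)[2])
```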

Empirical Evaluation on MNIST Dataset
Code for these analyses is available in a project repository (https://github.com/abrahamnunes/RRH). We began by projecting the 60,000 MNIST training images (samples shown in Fig. 8(a)) into the latent space and computing the pooled, within-observation, and between-observation Rényi heterogeneity according to handwritten digit label (Fig. 8(b)). To evaluate each digit's contribution to the RRH of the aggregated dataset, we recomputed the Rényi heterogeneity on the aggregated data with each of the digit classes left out (Fig. 8(c)).
Figures 8(b) and 8(c) demonstrate some important properties of the RRH. First and foremost, they highlight the importance of using between-observation heterogeneity as the statistic of interest. In the MNIST dataset, the class of handwritten Ones is generally the most homogeneous (by both visual inspection and empirical evaluation; Appendix E). As such, measuring heterogeneity using the pooled or within-observation indices would be a mistake, since the within-observation heterogeneity is driven primarily by uncertainty in the model's mapping of the image to the latent space. The pooled heterogeneity combines the number of observations mapped to the latent space with the heterogeneity of each observation's mapping. Using a physical analogy, one may consider the pooled heterogeneity as the total "volume" of latent space occupied by the data embeddings, and the within-observation heterogeneity as the average volume per observation embedding. Dividing the total volume by the average volume per observation yields the effective number of observations in the dataset (here, the between-observation heterogeneity).
Second, and somewhat paradoxically, Fig. 8(c) shows that removal of images of Ones (the class with the smallest effective number of observations in Fig. 8(b)) from the dataset results in one of the greatest reductions in the between-observation heterogeneity of the remaining data. Conversely, removing the images of Fives (the class with the largest effective number of observations in Fig. 8(b)) results in a small increase in overall heterogeneity. How can removal of an effectively large subset of images (i.e., one with high between-observation heterogeneity) increase the heterogeneity of the remaining sample? Likewise, how can removal of the effectively smallest subset of images result in one of the largest reductions of heterogeneity in the overall sample?

Mapping Heterogeneity on the Latent Space
Figure 9(a) shows a visualization of the 2-dimensional latent space in our cVAE, together with samples from regions with different levels of between-observation heterogeneity. The digit classes whose exclusion from the aggregate dataset results in increased between-observation heterogeneity can clearly be seen to occupy the peripheries of the latent space (i.e., the "tails" of the latent multivariate Gaussian).
We then sought to "map" the relative amounts of pooled, within-, and between-observation heterogeneity encoded in the latent space. This was done by (A) reconstructing M images for each square M × M neighbourhood of latent coordinates, then (B) projecting those samples back into the latent space, where the between-observation heterogeneity was recalculated. Figure 9(a) shows eight such instances, wherein one can appreciate that the between-observation heterogeneity indeed tracks visually appreciable sample diversity. For example, the 7 × 7 patches with the lowest between-observation heterogeneity depict 49 digits that are almost indistinguishable copies of each other. However, the 7 × 7 patches with higher between-observation heterogeneity show greater variation in the digit classes depicted, as well as in the graphical features within digit classes.
The resulting latent-space heterogeneity maps are shown in Fig. 9(b). Here, one can appreciate that the cVAE encodes the bulk of its sample diversity in the center of the latent space, with more homogeneous subgroups encoded in the peripheries (i.e., the tails, since the latent distribution is centered at [0,0]). These data suggest a potential mechanism for the paradox observed in Fig. 8(c): that the continuous RRH is driven by the presence of distinct subset components in the mixture distribution. In our MNIST example, we can observe that the Ones, Sixes, and Zeros are sufficiently distinct from the other digits that they are pushed to the latent distribution's tails, which increases the overall between-observation heterogeneity. Removal of the Fives, which are embedded more centrally in the latent space, will further accentuate the tail modes and increase the between-observation heterogeneity. In sum, we hypothesize that the RRH primarily captures the degree of distinct multimodality in the system. Consider a mixture of three univariate Gaussians, each separated from its nearest neighbour by a distance of µ ∈ R≥0, with marginal distribution

p(x) = (1/3) [ N(x; −µ, σ_t²) + N(x; 0, σ_c²) + N(x; µ, σ_t²) ],    (29)

where σ_c is the standard deviation of the "central" mixture component and σ_t is the standard deviation of the "tail" mixture components. We are interested in the effect of pruning either the central or a tail component on the between-observation heterogeneity.

FIG. 9. Panel A: Visualization of the 2-dimensional latent space, with samples of 49 digits from areas of the latent space with varying levels of between-observation heterogeneity; these samples illustrate that the representational Rényi heterogeneity indeed captures visually appreciable sample diversity. Panel B: Heterogeneity levels (pooled, within-, and between-observation components) across the latent space. Since the pooled, within-, and between-observation heterogeneity values have different scales, and here only their relative magnitudes matter, we represent the colormap scale merely as high-low to simplify the illustration. The x and y axes show the respective dimensions of the latent space. Each row of plots corresponds to the number of neighbouring points on the latent space over which the representational heterogeneity values were computed.
Let us first focus on the scenario in which we prune a tail mode. For simplicity, we allow σ_t > 0 to remain free, but set σ = σ_c = 1. Equation 29 in this case becomes the mixture denoted GMM+T, where +T simply denotes the case in which the tail mode is present; after pruning the tail mode, Equation 31 becomes the corresponding two-component mixture. The change in between-observation RRH after tail-mode pruning, ∆(Tail), is evaluated here at q = 1 without loss of generality. One can easily show that the point σ_t = 1, µ = 0 is a global maximum, at which the value of ∆(Tail) is 0. Thus, if we prune a component whose distribution is identical to all others (same mean and standard deviation), the heterogeneity of the system should not change. Furthermore, pruning a tail mode will never increase the between-observation RRH (Fig. 10).
The analogous case for central mode pruning assumes that σ = σ_t = 1, and leaves σ_c free. The change in between-observation RRH, ∆(Center), has its sole extremum at σ_c = 1, µ = 0, which is a saddle point (Fig. 10). Thus, removing a central mode can result in both increases and decreases in the overall between-observation heterogeneity of the mixture distribution. Figure 11 plots contours of ∆(Center) along with the plane curve over µ and σ_c at which ∆(Center) = 0. Figure 16 in Appendix D shows specific cases of ∆(Center) across different combinations of (µ, σ_c). Overall, these data suggest that pruning a central component will increase RRH if it results in a clearer distinction between the remaining mixture components. Conversely, if the central component is highly distinct relative to the other components (e.g., where it has lower variance), pruning it will decrease heterogeneity.
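The tail-versus-center pruning asymmetry can be illustrated numerically. The sketch below (parameter values and names ours) treats each mixture component as one equal-weight "observation" and approximates the pooled distribution by a single Gaussian, as in the decomposition above:

```python
import numpy as np

def between_1d(mus, sigmas, q=2):
    """Between-observation RRH for an equal-weight 1-D Gaussian mixture,
    approximating the pooled distribution by a single Gaussian.  The
    (2*pi)^(1/2) and q-dependent factors cancel in the pooled/within ratio,
    leaving a ratio of standard deviations."""
    mus, sigmas = np.asarray(mus, float), np.asarray(sigmas, float)
    pooled_sd = np.sqrt(np.mean(sigmas ** 2) + np.var(mus))   # law of total variance
    within = np.mean(sigmas ** (1 - q)) ** (1.0 / (1 - q))    # power mean of sigmas
    return float(pooled_sd / within)

# Three modes at -mu, 0, +mu: unit-variance center, wider tails.
mu, sigma_t = 3.0, 1.5
full = between_1d([-mu, 0.0, mu], [sigma_t, 1.0, sigma_t])
no_tail = between_1d([0.0, mu], [1.0, sigma_t])           # one tail mode pruned
no_center = between_1d([-mu, mu], [sigma_t, sigma_t])     # central mode pruned
print(full, no_tail, no_center)   # tail pruning lowers it; central pruning raises it here
```

Consistent with the analysis above, pruning the central mode here sharpens the bimodality and raises the between-observation value, while pruning a tail mode lowers it.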
We then simulated existing indices' (Section II B) responses to mode pruning, in comparison to that of the RRH.First, we arranged 20 bivariate Gaussian distributions (with equal variance) along a straight line.The resulting mixture distribution's heterogeneity was measured in response to pruning mixture components either (A) from the center of the distribution outward, or (B) from the tails inward (illustrated in Fig. 12(a)).
As predicted by the theory, pruning equal-variance mixture components from the center outward increased the RRH, while the opposite effect was observed when mixture components were pruned from the tails inward. Recall that Fig. 10 and Equation 32 suggest that tail pruning will always reduce heterogeneity unless all mixture components are identical. Equation 33 and Fig. 10 (and Fig. 16 in Appendix D) also showed that pruning a central mode can increase heterogeneity, but likely only if it leaves the resulting mixture distribution with more distinct components. Conversely, the functional Hill number decreased by a constant amount for every mode removed, regardless of whether pruning occurred centrally or at the tails. A similar monotonic decrease was observed for the Leinster-Cobbold index, although tail pruning resulted in a greater reduction of heterogeneity than central mode pruning. Interestingly, the numbers equivalent RQE increased with both central and tail mode pruning. Thus, in the present case, only the RRH captured the heterogeneity arising from distinct mixture components or modes.
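These pruning effects can be reproduced numerically. The following sketch is a one-dimensional analogue of the experiment above (our own construction: five equal-weight unit-variance components spaced one standard deviation apart, with q = 2, rather than the paper's 20-component bivariate setup). It computes the continuous Rényi heterogeneity Π_q = (∫ f(x)^q dx)^{1/(1−q)} of the mixture before and after pruning a central or a tail component:

```python
import numpy as np

def mixture_pdf(x, means, sd=1.0):
    """Equal-weight Gaussian mixture density evaluated on the grid x."""
    means = np.asarray(means, dtype=float)
    comps = np.exp(-0.5 * ((x[:, None] - means[None, :]) / sd) ** 2)
    comps /= sd * np.sqrt(2.0 * np.pi)
    return comps.mean(axis=1)  # equal weights 1/M

def renyi_heterogeneity(means, q=2.0):
    """Pi_q = (integral of f^q dx)^(1/(1-q)), via Riemann sum on a fine grid."""
    x = np.linspace(min(means) - 8.0, max(means) + 8.0, 20001)
    f = mixture_pdf(x, means)
    integral = np.sum(f ** q) * (x[1] - x[0])
    return integral ** (1.0 / (1.0 - q))

# Five overlapping components on a line, spaced one SD apart.
full          = renyi_heterogeneity([0, 1, 2, 3, 4])
center_pruned = renyi_heterogeneity([0, 1, 3, 4])   # remove the middle mode
tail_pruned   = renyi_heterogeneity([0, 1, 2, 3])   # remove a tail mode
# Central pruning raises Pi_q (remaining modes become more distinct);
# tail pruning lowers it.
```

With heavily overlapping components, `center_pruned > full > tail_pruned`, matching the qualitative pattern described above; with well-separated components, both prunings reduce Π_q, also as the theory predicts.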

IV. DISCUSSION
We have introduced RRH for quantifying heterogeneity in arbitrary datasets. Representational Rényi heterogeneity satisfies the replication principle [1,38,39] and is decomposable [36], while requiring neither a priori (A) categorical partitioning nor (B) specification of a distance function on the input space. Rather, the experimenter is free to define a model that maps observable data onto a semantically relevant domain upon which Rényi heterogeneity may be tractably computed, and where a distance function need not be explicitly specified. These properties facilitate heterogeneity measurement for several new applications. Compared with state-of-the-art comparator indices under a beta mixture distribution, RRH more reliably quantified the number of unique mixture components (Section III A 1), and under a deep generative model of image data, RRH was able to measure the heterogeneity of continuous abstract feature embeddings (Section III B 2). Finally, we found that RRH can uniquely measure heterogeneity caused by the distinctiveness of different components in a mixture distribution. In this section, we further synthesize our conclusions, discuss their implications, and highlight open questions for future research.
The main problem we set out to address was that all state-of-the-art numbers equivalent heterogeneity measures (Section II B) require a priori specification of a distance function and categorical partitioning on the observable space. To this end, we showed that RRH does not require categorical partitioning of the input space (Section III). Although our analysis under the two-component BMM assumed that the number of components was known, RRH was the only index able to accurately identify an effectively singular cluster (i.e., where the mixture components overlapped; Fig. 6). We also showed that the categorical RRH did not violate the principle of transfers [34,35] (i.e., it was strictly concave with respect to the mixture component weights), unlike the functional Hill numbers (Fig. 6). Future studies should extend this evaluation to mixtures of other distributional forms in order to better characterize the generalizability of our conclusions.
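As context for the transfer-principle comparison above: on a categorical space, the Rényi heterogeneity is simply the Hill number, Π_q(p) = (Σ_i p_i^q)^{1/(1−q)}. A minimal sketch (our own illustration, using q = 2 as the working example) verifies the principle of transfers numerically:

```python
import numpy as np

def hill_number(p, q=2.0):
    """Categorical Renyi heterogeneity (Hill number) of order q:
    Pi_q(p) = (sum_i p_i^q)^(1/(1-q)); the q -> 1 limit is the
    exponential of the Shannon entropy (perplexity)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    if np.isclose(q, 1.0):
        nz = p[p > 0]
        return float(np.exp(-np.sum(nz * np.log(nz))))
    return float(np.sum(p ** q) ** (1.0 / (1.0 - q)))

# Principle of transfers: shifting probability mass from a more abundant
# to a less abundant category (without reversing their ranks) increases
# heterogeneity; a uniform distribution over n categories attains n.
assert hill_number([0.6, 0.2, 0.2]) > hill_number([0.7, 0.2, 0.1])
assert np.isclose(hill_number([0.25] * 4, q=1.0), 4.0)
```

The units are "numbers equivalent": a value of n means the system is as heterogeneous as n equally abundant, completely distinct categories.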
Sections III A and III B both showed that RRH does not require specification of a distance function on the observable space. Instead, one must specify a model with which a latent representation of the input space may be learned. This is beneficial because input-space distances are often irrelevant or misleading. For example, latent representations of image data learned by a convolutional neural network will be robust to translations of the inputs, since convolution is translation invariant. However, pairwise distances on the observable space will be exquisitely sensitive to semantically irrelevant translations of the input data. Furthermore, semantically relevant information must often be learned from raw data by hierarchical abstraction. Ultimately, when (A) predefined distance metrics are sensitive to noisy perturbations of the input space, or (B) the relevant semantic content of some input data is best captured by a latent abstraction, the RRH measure will be particularly useful.
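The point about translations can be made concrete with a toy example (our own construction, not an image from the paper's experiments): a small bright patch translated by a few pixels is semantically unchanged, yet its pixel-space Euclidean distance to the original can exceed the distance to an entirely blank image.

```python
import numpy as np

# Toy 16x16 "image": a bright 4x4 patch on a dark background.
img = np.zeros((16, 16))
img[2:6, 2:6] = 1.0

# Translate the patch three pixels to the right: identical content.
shifted = np.roll(img, shift=3, axis=1)

d_translate = np.linalg.norm(img - shifted)         # distance to its own shifted copy
d_blank = np.linalg.norm(img - np.zeros_like(img))  # distance to an empty image

# In pixel space the shifted copy is *farther* than the blank image,
# even though only the blank image differs semantically.
```

A translation-invariant latent representation would instead place `img` and `shifted` at (near-)zero distance, which is the behavior the RRH framework exploits.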
Our measure also shows excellent sensitivity to multimodality on non-categorical spaces. Section III B showed that heterogeneity will grow if one adds distinct components either (A) to the tails of a mixture distribution (i.e., "extreme" groups) or (B) to the center of the distribution, provided the latter components are sufficiently distinct from the existing distribution. These features motivate the exploration of Rényi heterogeneity-based tests for out-of-distribution sample detection. Such problems include outlier detection and evaluating whether two datasets (of potentially high dimension, with abstract features) are drawn from distributions with overlapping domains.
In conclusion, we have introduced an approach for measuring heterogeneity in arbitrary datasets that requires neither (A) categorical partitioning nor (B) a distance measure on the observable space. Our approach enables the measurement of heterogeneity in disciplines where categorical entities are unreliably defined, or where the relevant semantic content of some data is best captured by a hierarchical abstraction. Future work should evaluate the RRH in practice and under alternative distributions and model architectures.

The corresponding expression for the aggregation of M subsystems follows analogously. The replication principle asserts the equality of these quantities, which simple algebra shows to be true.

To compute the numbers equivalent RQE (Qe), the functional Hill numbers (Fq), and the Leinster-Cobbold index (Lq) under the beta mixture model, we must derive an analytical expression for the distance matrix. This involves computing the expected absolute distance between the components, with densities f(x) = Beta α,β (x) and g(y) = Beta β,α (y). By exploiting a standard identity and expanding, the integral is greatly simplified and admits a closed-form solution in terms of regularized hypergeometric functions Φ, where x = α1/(α1 + β1) and y = α2/(α2 + β2). Figure 15 provides numerical verification of this result. One simply uses Equation C3 to compute the analytic distance matrix which, together with the component probabilities p(z) = {1 − c, c}, can be used to compute Qe, Fq, and Lq via the formulas shown in the main body.

In our evaluation of the non-categorical RRH using the MNIST data, we asserted that the class of handwritten Ones was relatively more homogeneous than the other digit classes. Our initial statement was based simply on visual inspection of samples from the dataset, wherein the Ones ostensibly demonstrate fewer relevant feature variations than the other classes. To test this hypothesis more objectively, however, we conducted an empirical evaluation using similarity metric learning.
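Returning to the beta-mixture distance derivation above: the expected absolute distance E|X − Y| between beta-distributed variables can also be sanity-checked by simulation, independently of the hypergeometric closed form. The sketch below is our own Monte Carlo estimator, not the paper's verification code for Fig. 15:

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_abs_distance(a1, b1, a2, b2, n=200_000):
    """Monte Carlo estimate of E|X - Y| for independent
    X ~ Beta(a1, b1) and Y ~ Beta(a2, b2)."""
    x = rng.beta(a1, b1, size=n)
    y = rng.beta(a2, b2, size=n)
    return float(np.mean(np.abs(x - y)))

# Sanity check against a case with a known exact value: for independent
# X, Y ~ Beta(1, 1) (uniform on (0, 1)), E|X - Y| = 1/3.
est = expected_abs_distance(1, 1, 1, 1)
```

Estimates of this form can be compared against the analytic Equation C3 across parameter settings, mirroring the verification shown in Fig. 15.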
We implemented a deep neural network architecture known as a "siamese network" [40] to learn a latent distance metric on the MNIST classes. Our siamese network architecture is depicted in Fig. 17. Training was conducted by sampling batches of 10,000 image pairs from the MNIST test set, where 5,000 pairs were drawn from the same class (e.g., a pair of Fives or a pair of Threes) and 5,000 pairs were drawn from different classes (e.g., the pairs [2, 3] or [1, 7]). The siamese network was then optimized with gradient-based methods over 100 epochs using the contrastive loss function ([41]; Fig. 17). Code for this analysis can be found in our paper's GitHub repository (https://github.com/abrahamnunes/RRH).
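The contrastive loss [41] has a standard form; a minimal numpy sketch follows, assuming the usual margin formulation of Hadsell et al. (the margin value and any scaling constant used in the paper's actual training run are not stated here, so those are assumptions):

```python
import numpy as np

def contrastive_loss(d, same, margin=1.0):
    """Contrastive loss (Hadsell et al. [41]) over a batch of embedding
    distances d, where same[i] = 1 if pair i shares a class, else 0.
    Same-class pairs are pulled together (d^2 term); different-class
    pairs are pushed apart until their distance exceeds the margin."""
    d = np.asarray(d, dtype=float)
    same = np.asarray(same, dtype=float)
    return float(np.mean(same * d ** 2
                         + (1.0 - same) * np.maximum(0.0, margin - d) ** 2))
```

Some formulations include an overall factor of 1/2; this does not affect the location of the optimum.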
After training, we sampled same-class pairs (n = 25,000) and different-class pairs (n = 25,000) from the MNIST training set (which contains 60,000 images). Pairwise distances for each sample were computed using the trained siamese network. If the Ones are indeed the most homogeneous class, they should demonstrate generally smaller pairwise distances than the other digit-class pairs. We evaluated this hypothesis by comparing empirical cumulative distribution functions (CDFs) of the class-pair distances (Fig. 18). Our results show that the empirical CDF for "1-1" image pairs dominates those of all other class pairs (i.e., the distances between pairs of Ones are lower).
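The dominance check above amounts to comparing empirical CDFs pointwise. A minimal sketch of that comparison follows, with synthetic gamma-distributed "distances" standing in for the siamese-network outputs (the gamma parameters are arbitrary illustrations, not fitted values from the paper):

```python
import numpy as np

def ecdf(samples, thresholds):
    """Empirical CDF of `samples` evaluated at each value in `thresholds`."""
    samples = np.sort(np.asarray(samples, dtype=float))
    return np.searchsorted(samples, thresholds, side="right") / samples.size

rng = np.random.default_rng(1)
d_ones = rng.gamma(2.0, 0.3, size=5000)    # stand-in for "1-1" pair distances
d_other = rng.gamma(2.0, 0.6, size=5000)   # stand-in for other class pairs

# The "1-1" CDF dominates when it lies at or above the other CDF at
# every threshold, i.e. small distances are uniformly more frequent.
t = np.linspace(0.5, 5.0, 100)
dominates = bool(np.all(ecdf(d_ones, t) >= ecdf(d_other, t)))
```

CDF dominance is a stronger statement than a difference in mean distance: it says the homogeneous class accumulates small pairwise distances more quickly at every threshold.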

FIG. 2. Effects of abundance inequality (κ; Equation 7) and distance (h; Fig. 1) on the numbers equivalent quadratic entropy (Qe) for a simple three-state categorical system. Recall that this system forms a triangular graphical orientation; we hold the width (the distance between two of the three nodes) fixed at 1, and h ∈ R≥0 (x-axis) denotes the distance of the third point from the remaining two. Each line denotes a different value of the abundance inequality parameter κ, where higher values denote a more skewed distribution (which should reduce heterogeneity). For the range of h shown as solid lines, the system's distance matrix satisfies the ultrametric triangle inequality (d(x, z) ≤ max{d(x, y), d(y, z)}). However, the dashed region corresponds to values of h for which the ultrametric triangle inequality is not satisfied.

FIG. 3. Effects of abundance inequality (κ; Equation 7) and distance (h; Fig. 1) on the functional Hill numbers (Fq). Each plot depicts the functional Hill numbers at a different value of the parameter q for the three-category system depicted in Fig. 1. Axes, colours, and line styles are defined exactly as described in the caption of Fig. 2.

FIG. 5. Relationship between the optimal assignment threshold and categorical representational Rényi heterogeneity for two beta mixture models: BMM(9, 9, 0.1) (complete overlap case) and BMM(2, 9, 0.8) (partial overlap case). In all plots, the x-axes depict the beta distribution's domain x ∈ (0, 1). The top row of plots shows the representational Rényi heterogeneity (at different values of q, shown as different line colors) across different category assignment thresholds for the beta mixture models shown in the bottom row. The gray vertical line denotes the optimal assignment boundary given the respective mixture distribution's parameterization. Black dots highlight the resulting values of Rényi heterogeneity on the categorical space. The bottom row of plots shows the probability density functions of the two mixture components, Beta[α, β, c] (shown in blue) and Beta[β, α, c] (shown in red), for different parameterizations of the pseudocounts (α, β) and the mixture component weights 0 ≤ c ≤ 1. The different parameterizations are organized along the columns of the plot grid. The bottom row also shows the marginal survival function of the respective beta mixtures, ψα,β,c[x] (green lines).

FIG. 8. Representational Rényi heterogeneity for digit classes in MNIST. Panel A: Illustrative samples from different digit classes in the MNIST dataset. Panel B: Heterogeneity for each digit class projected alone onto the latent space of the convolutional variational autoencoder (cVAE). Panel C: Heterogeneity of the overall dataset (ALL) and with the images of each individual digit class left out of the cVAE latent space.

FIG. 9. Mapping digit embeddings and heterogeneity levels on the cVAE latent space. Panel A: Visualization of digit embeddings on the 2-dimensional latent space of the cVAE. We also show samples of 49 digits from areas of the latent space with varying levels of between-observation heterogeneity; these samples illustrate that the representational Rényi heterogeneity indeed captures visually appreciable sample diversity. Panel B: Heterogeneity levels (pooled, within-, and between-observation components) across the latent space. Since the pooled, within-, and between-observation heterogeneity values have different scales, and only their relative magnitudes matter here, we represent the colormap scale simply as high vs. low to simplify the illustration. The x and y axes show the respective dimensions of the latent space. Each row of plots corresponds to the number of neighbouring points on the latent space over which the representational heterogeneity values were computed.

FIG. 12. Setup and results of the mode-pruning experiment. Panel A: Depiction of the mode-pruning experimental protocol. Panel B: Effect of central vs. tail-mode pruning on representational Rényi heterogeneity. Panel C: Effect of central vs. tail-mode pruning on the existing heterogeneity indices from Section II B.

FIG. 14. Comparison of the between-observation representational Rényi heterogeneity on the beta mixture model using hard or soft cluster-assignment procedures.

FIG. 15. Numerical verification of the analytical expression for the expected absolute distance between two Beta(α, β)-distributed random variables. Solid lines are the theoretical predictions. Ribbons show the bounds between the 2.5th and 97.5th percentiles of the simulated values.
Figure 16 demonstrates several specific examples of the change in heterogeneity of a mixture distribution with pruning.

Appendix E: Evidence Supporting Relative Homogeneity of MNIST "Ones"

FIG. 17. Depiction of a siamese network architecture. At iteration k, each of two samples, XA and XB, is passed through a convolutional neural network to yield embeddings zA and zB, respectively. The class labels for samples A and B are denoted yA and yB, respectively. The L2 norm of the difference between these embeddings is computed as DAB. The network is optimized on the contrastive loss [41], L. Here I[•] is an indicator function.

FIG. 18. Empirical cumulative distribution functions (CDF) for pairwise distances between images of the listed classes under the siamese network model. The x-axis plots the L2 norm between embedding vectors produced by the siamese network. The y-axis shows the proportion of samples in the respective group (denoted by line colour) whose embedded L2 norms were less than the specified threshold on the x-axis. For instance, "0-0" refers to pairs where each image is a "zero." We combine all disjoint class pairs, for example "0-8" or "3-4," into a single empirical CDF denoted "A ≠ B."