Alternative Dirichlet Priors for Estimating Entropy Via a Power Sum Functional

Abstract: Entropy is a functional of probability and a measurement of the information contained in a system; however, the practical problem of estimating entropy in applied settings remains challenging and relevant. The Dirichlet prior is a popular choice in the Bayesian framework for the estimation of entropy under a multinomial likelihood. In this work, previously unconsidered Dirichlet-type priors are introduced and studied. These priors include a class of Dirichlet generators as well as a noncentral Dirichlet construction, and in both cases the usual Dirichlet arises as a special case. These constructions allow for flexible behaviour and can account for both negative and positive correlation. Resultant estimators for a particular functional, the power sum, under these priors and assuming squared error loss, are derived and represented in terms of the product moments of the posterior. This representation facilitates closed-form estimators for the Tsallis entropy, and thus expedites computation of this generalised Shannon form. Select cases of these proposed priors are considered to investigate their impact on the estimation of Tsallis entropy under different parameter scenarios.


Introduction
Shannon entropy and related information measures are functionals of probability that quantify the information contained in a system; they arise in information theory, machine learning and text modelling, amongst others. Ref. [1] discussed applications ranging from quantifying the information carried by neural signals to estimating dependency structure and inferring causal relations, as well as measuring uncertainty and dispersion in statistics applied to fields such as molecular biology. Other interests range from measuring the complexity of dynamics in physics to measuring diversity in ecology and genetics, as well as coding theory and cryptography [2], financial analysis and data compression [3]. Numerous inferential tasks rely on data-driven procedures to estimate these quantities. In these settings, researchers are often confronted with data arising from an unknown discrete distribution and seek to estimate its entropy. This, coupled with the current data-driven and computing-rich era, motivates sustained research interest in entropy estimation for practitioners.
Entropy estimation remains an open challenge. Ref. [4] investigated the performance of the maximum likelihood estimator (MLE); this approach is also referred to as the plug-in principle in functional estimation, where a point estimate of the parameter is used to build an estimate of a functional of the parameter. The classical asymptotic theory of MLEs does not adequately address high-dimensional settings in this data-driven era [4], and high-dimensional statistics demand theoretical tools tailored to such settings. Ref. [5] investigated 18 different estimation measures, whose suitability was determined experimentally based on the bias and the mean squared error. This work takes a Bayesian approach to entropy estimation, building upon work by [1,4,6,7].
Multivariate count data constrained to sum to a fixed constant are commonly modelled using the multinomial distribution. This distribution is widely used in modelling categorical data, whose features could be, for example, words in the case of textual documents or visual words in the case of images. The Dirichlet distribution, closely related to the probabilistic behaviour of the multinomial distribution, is a conjugate prior for the multinomial distribution from a Bayesian perspective. Ref. [8] highlights how the use of prior distributions in a Bayesian framework makes it possible to work with very limited data sets. Ref. [9] underscores the superior performance of the hierarchical approach underlying the construction of the statistical model. Some meaningful studies include [1,4,6,10]. Ref. [11] also showed how using different Dirichlet distributions in the bivariate case gives one the opportunity to include prior information and expert opinion to obtain more realistic results in certain situations. Ref. [4] also experimented with the estimation of entropy, which triggered further exploration of alternative priors. Experimentation on diverse data sets might necessitate parameter-rich priors; therefore, this study proposes alternative Dirichlet priors to address this potential challenge.
This paper illustrates how a Bayesian approach is applied in a multinomial-Dirichlet setup, which yields a posterior distribution from which explicit expressions for the Tsallis entropy can be derived, by focussing on the product moment for the power sum functional and assuming squared error loss. The first of the two main contributions of this paper is the addition of flexible priors from a Dirichlet family, utilised within an information-theoretic setting, which also allow for positive correlation in addition to the usual negative correlation characteristic. The second is to show that elegant constructs of the complete product moments of the posteriors give one the comparative advantage of obtaining explicit estimators for entropy under these Dirichlet priors. Ref. [8] echoes how computation via moments accelerates the estimation of entropy.
The paper is outlined as follows. In Section 2, the essential components that are used in the paper are outlined. In Section 3, alternative Dirichlet priors will be introduced and studied, as candidates for the Bayesian analysis of entropy. In Section 4, analytical expressions for the entropy expressions under consideration will be derived and studied. Section 5 contains conclusions and final thoughts.

Essential Components
The countably discrete model under consideration in this paper is the well-motivated multinomial distribution. A discrete random variable X = (X_1, ..., X_K) follows the multinomial distribution of order K (i.e., with K distinct classes of interest) with parameters p = (p_1, p_2, ..., p_K) and n > 0 if its probability mass function (pmf) is given by

f(x|p) = \frac{n!}{x_1! \cdots x_K!} \prod_{i=1}^{K} p_i^{x_i}, \qquad \sum_{i=1}^{K} x_i = n. (1)

The Dirichlet distribution (of type 1, see [12]) of order K ≥ 2 and parameters Π = (π_1, π_2, ..., π_{K+1}) for π_i > 0, i = 1, ..., K+1, has a probability density function (pdf) with respect to the Lebesgue measure on the Euclidean space R^K given by

h(p) = \frac{\Gamma\left(\sum_{i=1}^{K+1} \pi_i\right)}{\prod_{i=1}^{K+1} \Gamma(\pi_i)} \prod_{i=1}^{K+1} p_i^{\pi_i - 1} (2)

on the K-dimensional simplex defined by p_1, p_2, ..., p_K > 0, p_1 + p_2 + ... + p_K < 1 and p_{K+1} = 1 − p_1 − ... − p_K, where Γ(·) denotes the usual gamma function (the space and constraints of this K-dimensional simplex are denoted by A).
To derive a Bayesian engine, we need the likelihood function f(x|p) in addition to a suitable prior distribution h(p). The fundamental relationship between the likelihood function and the prior distribution that forms the posterior distribution f(p|x) is given by

f(p|x) = \frac{f(x|p) h(p)}{\int_A f(x|p) h(p) \, dp}.

The most popular form of entropy is that of Shannon:

H(p) = -\sum_{i=1}^{K+1} p_i \ln p_i. (3)

Various generalised cases of this entropy exist, which rely on the power sum

F_\alpha(p) = \sum_{i=1}^{K+1} p_i^\alpha, (4)

where α > 0. The power sum functional occurs in various operational problems [4]. Under the assumption of squared error loss within Bayes estimation, the estimates of both these quantities are given by their expected values under the posterior; thus, it is of value to consider the expected value of p_i^α for all values of i. Since there are cases which cannot be fully explained by Shannon entropy, such as non-extensive systems like alignment processing (namely registration, which has complex behaviours associated with the phenomena of radar-imaging systems [13]), other generalised forms were designed. The Tsallis entropy considered in this paper, a popular generalised entropy, tends to Shannon entropy as α tends to 1 [14] and is given by

S_\alpha(p) = \frac{1}{\alpha - 1}\left(1 - \sum_{i=1}^{K+1} p_i^\alpha\right). (5)

The estimate of this generalisation can be written in terms of the estimate of the power sum:

\hat{S}_\alpha = \frac{1 - E(F_\alpha(p) \,|\, x)}{\alpha - 1}.

Since the power sum is easier to estimate than the Shannon entropy, the power sum is used in our case. We consider the estimate as the expectation under the posterior distribution, i.e., under squared error loss.
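To make these quantities concrete for the usual Dirichlet case (the special case of the priors studied below), note that under a Dirichlet(Π) posterior the moment E(p_i^α | x) = Γ(π_0)Γ(π_i + α)/(Γ(π_0 + α)Γ(π_i)) with π_0 = Σ_i π_i, from which the Tsallis estimate follows directly. A minimal Python sketch (function names are our own, not from the paper):

```python
import numpy as np
from scipy.special import gammaln

def power_sum_posterior_mean(pi, alpha):
    """E[sum_i p_i^alpha | x] under a Dirichlet(pi) posterior (closed form)."""
    pi = np.asarray(pi, dtype=float)
    pi0 = pi.sum()
    # log of E[p_i^alpha] = Gamma(pi0)Gamma(pi_i+alpha) / (Gamma(pi0+alpha)Gamma(pi_i))
    log_m = gammaln(pi0) + gammaln(pi + alpha) - gammaln(pi0 + alpha) - gammaln(pi)
    return float(np.exp(log_m).sum())

def tsallis_estimate(pi, alpha):
    """Bayes estimate of Tsallis entropy under squared error loss, via (5)."""
    return (1.0 - power_sum_posterior_mean(pi, alpha)) / (alpha - 1.0)

# The posterior for multinomial counts x under a Dirichlet(prior) is Dirichlet(prior + x)
x = np.array([14, 29, 7])          # hypothetical counts
prior = np.array([1.0, 1.0, 1.0])  # flat Dirichlet prior
print(tsallis_estimate(prior + x, alpha=2.0))
```

For α = 2 the moment reduces to π_i(π_i + 1)/(π_0(π_0 + 1)), which provides a quick sanity check of the implementation.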

Alternative Dirichlet Priors
In this section, two previously unconsidered Dirichlet priors, namely the Dirichlet generator prior and the noncentral Dirichlet prior, will be proposed. Positive correlation can be observed for special cases of the Dirichlet generator prior, which is a benefit of this generator form. These new contributions add to the field of generative models for count data and have not been previously considered for entropy.

Dirichlet Generator Prior
In this section, Dirichlet generator distributions are proposed as alternative candidates. From this form, numerous flexible candidates can be "generated".

Definition 1. Suppose p is Dirichlet-generator distributed. Then, its pdf is given by

h(p) = C \, g(\theta p_{K+1}) \prod_{i=1}^{K+1} p_i^{\pi_i - 1}, (6)

with C a normalising constant such that \int_A h(p) \, dp = 1. The vector p ∈ A is thus a Dirichlet generator variate with parameters Π = (π_1, ..., π_{K+1}), θ ∈ R, and whichever additional parameters g(·) imposes. The following conditions also apply:
1. g(·) ensures that the pdf h(·) is non-negative;
2. g(·) admits a Taylor series expansion;
3. the normalising constant C exists (i.e., the integral of the unnormalised density over A is finite).
The usual Dirichlet distribution with pdf (2) is thus a special case of (6) when θ = 0.
For illustration of the implementation of the Dirichlet generator prior, we focus on g(·) = {}_pF_q(a_1, ..., a_p; b_1, ..., b_q; ·), where {}_pF_q(·) denotes the generalised hypergeometric function (see [15]),

{}_pF_q(a_1, \ldots, a_p; b_1, \ldots, b_q; x) = \sum_{k=0}^{\infty} \frac{(a_1)_k \cdots (a_p)_k}{(b_1)_k \cdots (b_q)_k} \frac{x^k}{k!},

and (a)_k = Γ(a + k)/Γ(a) is the Pochhammer symbol. The prior distribution (6) then takes on the following form, with pdf

h(p) = C \, {}_pF_q(a_1, \ldots, a_p; b_1, \ldots, b_q; \theta p_{K+1}) \prod_{i=1}^{K+1} p_i^{\pi_i - 1}. (8)

In this paper, three hypergeometric functions are considered (0F0, 0F1 and 1F1), since these are commonly considered functions, representing the exponential function, the confluent hypergeometric limit function, and the confluent hypergeometric (Kummer) function, respectively. For illustrative investigation, bivariate observations from the corresponding distributions were simulated using Algorithm 1 and the associated pdfs are overlaid and presented in Figures 1-3. The data were simulated from (8) using an acceptance/rejection method: candidate points y_i ∈ (0, 1) of size n for i = 1, 2 are proposed, and each candidate is accepted with probability proportional to the ratio of (8) to the proposal density.

If
Figures 1 and 2 illustrate the three chosen hypergeometric functions for two choices of θ and for three different sets of Πs if K = 2, with a_1 = 4 and b_1 = 5 for 0F1 and 1F1, respectively. This firstly illustrates the difference between the different hypergeometric candidates as well as the effect a change in π_1 has on these three functions (note that a symmetric observation would be made for π_2). The difference between Figures 1 and 2 shows the effect that θ has on these different combinations, with Figure 1 having a very small (almost negligible) θ, while Figure 2 increases the value of θ. An increase in π_1 results in a more densely concentrated pdf for corresponding values of p_1 and p_2. This is observed for all three considered hypergeometric candidates, as seen in Figures 1 and 2, also for an increase in θ. For Figure 3, a single set of Πs was selected with θ = 0.1 to showcase the effect that the parameters a and b of the hypergeometric function have on the 0F1 and 1F1 functions. As a increases, an increased mass is observed closer to the restriction p_1 + p_2 < 1, while an increase in b results in a lower pdf volume. Next, the posterior distribution is derived, assuming the Dirichlet generator prior (8) together with a multinomial likelihood (1).

Theorem 1. Suppose the likelihood of x|p is given by (1) and the prior distribution for p is given by (8). Then, the pdf of the posterior distribution is given by

h(p|x) = C^* \, {}_pF_q(a_1, \ldots, a_p; b_1, \ldots, b_q; \theta p_{K+1}) \, p_{K+1}^{\pi_{K+1} - 1} \prod_{i=1}^{K} p_i^{\pi_i + x_i - 1}, (10)

where C^* denotes the corresponding normalising constant.
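The acceptance/rejection step of Algorithm 1 can be sketched as follows for the 1F1 generator, under our reading of (8) in which the generator acts on the argument θp_{K+1}; the function and parameter names are illustrative, not the paper's. A Dirichlet proposal is used, and since 1F1(a; b; ·) is increasing on [0, θ] for a, b, θ > 0, its value at θ bounds the acceptance ratio:

```python
import numpy as np
from scipy.special import hyp1f1  # Kummer confluent hypergeometric 1F1

def sample_generator_prior(n, pi=(4.0, 5.0, 3.0), theta=0.1, a=4.0, b=5.0,
                           rng=None):
    """Acceptance/rejection draws from the (assumed) generator density
    h(p) proportional to 1F1(a; b; theta*p3) * p1^{pi1-1} p2^{pi2-1} p3^{pi3-1},
    with p3 = 1 - p1 - p2 (bivariate case, K = 2). Returns (p1, p2) pairs."""
    rng = np.random.default_rng() if rng is None else rng
    pi = np.asarray(pi, dtype=float)
    # Dirichlet(pi) proposal; the 1F1 factor is bounded above by its value at
    # theta, since 1F1 is increasing in its argument for a, b, theta > 0
    M = hyp1f1(a, b, theta)
    out = []
    while len(out) < n:
        p = rng.dirichlet(pi)
        if rng.uniform() < hyp1f1(a, b, theta * p[2]) / M:
            out.append(p[:2])
    return np.array(out)
```

The default parameter values mirror those used in the figures (a = 4, b = 5, θ = 0.1); at θ = 0 every candidate is accepted and the scheme reduces to plain Dirichlet sampling, matching the special case noted after Definition 1.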
The complete product moment of the Dirichlet generator posterior (10) is of interest for the power sum (4); thus, we are interested in E(\prod_{i=1}^{K+1} p_i^{k_i} \,|\, x).

Theorem 2. Suppose that p|x follows a Dirichlet generator posterior distribution with pdf given in (10). Then, the complete product moment E(\prod_{i=1}^{K+1} p_i^{k_i} \,|\, x) is given by (11).

Special cases of the above expression include setting k_{K+1} = 0 to obtain an expression for the usual product moment of the Dirichlet generator distribution under investigation in this paper.
Proof. See Appendix A for the proof.

Noncentral Dirichlet Prior
In this section, a noncentral Dirichlet distribution will be constructed via the use of Poisson weights. Ref. [16] explored the use of a compounding method as a distributional building tool to obtain bivariate noncentral distributions and showed how this form of the distribution isolates the noncentrality parameters by retaining them in a Poisson probability form, hence introducing mathematical convenience. Ref. [17] extended this work by introducing new bivariate gamma distributions emanating from a scale mixture of normal class.

Theorem 3. Suppose p is Dirichlet distributed with pdf given by (2). Then, a noncentral Dirichlet distribution can be constructed in the following manner:

h(p; \Pi, \Lambda) = \sum_{\phi} \left[ \prod_{i=1}^{K+1} \frac{e^{-\lambda_i} \lambda_i^{j_i}}{j_i!} \right] h(p \,|\, j_1, \ldots, j_{K+1}), (12)

where h(p|j_1, ..., j_{K+1}) denotes the conditional (central) Dirichlet distribution (see (2)) with parameters Π* = (π_1 + j_1, ..., π_{K+1} + j_{K+1}), Λ denotes the vector of noncentrality parameters (λ_1, ..., λ_{K+1}), h(p; Π) denotes the (unconditional) Dirichlet distribution (see (2)) with parameter Π, and where ∑_φ = ∑_{j_1=0}^∞ · · · ∑_{j_{K+1}=0}^∞.

Remark 1. (13) reflects a parametrisation of the noncentral Dirichlet distribution of [12] and can be represented via the confluent hypergeometric function of several variables.
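The compounding construction of Theorem 3 suggests an exact sampling scheme, sketched below as an alternative to acceptance/rejection (the function names are ours): draw the Poisson weights first, then draw from the resulting conditional (central) Dirichlet distribution.

```python
import numpy as np

def sample_noncentral_dirichlet(n, pi, lam, rng=None):
    """Draw from the noncentral Dirichlet of Theorem 3 by compounding:
    J_i ~ Poisson(lam_i) independently, then p | J ~ Dirichlet(pi + J)."""
    rng = np.random.default_rng() if rng is None else rng
    pi = np.asarray(pi, dtype=float)
    lam = np.asarray(lam, dtype=float)
    # one vector of Poisson weights per draw, then the conditional Dirichlet
    J = rng.poisson(lam, size=(n, len(pi)))
    return np.array([rng.dirichlet(pi + j) for j in J])

p = sample_noncentral_dirichlet(5, pi=[4., 5., 3.], lam=[2., 0., 0.])
```

Setting Λ = 0 forces every J_i = 0, so the scheme reduces to sampling the usual (central) Dirichlet, consistent with the special-case behaviour of the construction.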
Bivariate observations from the corresponding distributions were simulated using Algorithm 1 and the associated pdfs are overlaid and presented in Figure 5 for different values of λ_1 and three combinations of Πs. These results showcase the effect that λ_1 has on the distribution; Figure 5 clearly demonstrates the movement of the centroid of the contour plot. Next, the posterior distribution is derived, assuming the noncentral Dirichlet prior (12) together with a multinomial likelihood (1).

Theorem 4. Suppose the likelihood of x|p is given by (1) and the prior distribution for p is given by (12). Then, the posterior distribution has the pdf given in (14).

Remark 2. Note that (14) can be represented using the confluent hypergeometric function of several variables from Remark 1.

Theorem 5. Suppose that p|x follows a noncentral Dirichlet posterior distribution with pdf given in (14). Then, the complete product moment E(\prod_{i=1}^{K+1} p_i^{k_i} \,|\, x) admits a closed-form expression.

Proof. See Appendix B for the proof.
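Independently of the closed form derived in Appendix B, the Poisson-mixture form (12) already yields a direct numerical route to the moment E(p_i^α): average the central Dirichlet moment over J_i ~ Poisson(λ_i) and the independent remainder R = Σ_{k≠i} J_k, truncating the sums. A hedged sketch (the truncation level and names are ours):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import poisson

def noncentral_moment(pi, lam, alpha, i, jmax=60):
    """E[p_i^alpha] under the Poisson-compounded (noncentral) Dirichlet prior:
    average the central Dirichlet moment over J_i ~ Poisson(lam_i) and the
    independent remainder R ~ Poisson(sum_{k != i} lam_k), truncated at jmax."""
    pi = np.asarray(pi, dtype=float)
    lam = np.asarray(lam, dtype=float)
    pi0, lam_rest = pi.sum(), lam.sum() - lam[i]
    j = np.arange(jmax + 1)
    wj, wr = poisson.pmf(j, lam[i]), poisson.pmf(j, lam_rest)
    J, R = np.meshgrid(j, j, indexing="ij")
    # central Dirichlet moment with parameters shifted to (pi + J) as in (12)
    logm = (gammaln(pi0 + J + R) + gammaln(pi[i] + J + alpha)
            - gammaln(pi0 + J + R + alpha) - gammaln(pi[i] + J))
    return float((wj[:, None] * wr[None, :] * np.exp(logm)).sum())
```

With Λ = 0 the Poisson weights collapse onto J = 0 and the expression recovers the central Dirichlet moment, which gives a convenient correctness check.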

Entropy Estimates
In this section, the Bayesian estimators (16) and (17), based on the posterior distributions (10) and (14), are derived for the power sum (4).

Numerical Experiments of Entropy
The following steps (Algorithm 2) illustrate the empirical behaviour of the Tsallis entropy for the alternative priors under consideration.

1. Simulate p_1 and p_2 from the posterior distributions given by (10) and (14) using the acceptance/rejection method as described earlier, for n = 50;
2. Calculate p_3 = 1 − p_1 − p_2;
3. Calculate the p_i^α values for all the samples;
4. Calculate ∑_{i=1}^{3} p_i^α for each sample;
5. Calculate the median of the sample of quantities in Step 4 (note that ∑_{i=1}^{3} p_i^α might not be symmetrically distributed, thus the median is used).
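The steps above can be condensed into a short routine once posterior draws of (p_1, p_2) are available from either sampler; we illustrate with stand-in central Dirichlet draws, since the routine itself is indifferent to the source of the draws (names are ours):

```python
import numpy as np

def tsallis_mc_median(samples_p12, alpha):
    """Monte Carlo summary of Tsallis entropy over posterior draws:
    each row of samples_p12 holds (p1, p2); p3 = 1 - p1 - p2 is completed.
    The median is reported since sum_i p_i^alpha need not be symmetric."""
    p = np.column_stack([samples_p12, 1.0 - samples_p12.sum(axis=1)])
    power_sum = (p ** alpha).sum(axis=1)          # Step 4: the power sum (4)
    return float(np.median((1.0 - power_sum) / (alpha - 1.0)))

rng = np.random.default_rng(0)
draws = rng.dirichlet([5., 7., 4.], 10000)[:, :2]  # stand-in posterior draws
print(tsallis_mc_median(draws, alpha=2.0))
```

For three categories and α = 2 the per-draw Tsallis entropy is bounded above by 2/3 (attained at the uniform p), which bounds the reported median as well.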

Conclusions
This study focussed on the power sum functional and its estimation as a key tool to model a generalised entropy form, namely Tsallis entropy, via a Bayesian approach. In particular, previously unconsidered Dirichlet priors have been proposed and studied, offering the practitioner more pliable options given experimental data. Specific choices of the proposed Dirichlet family allow for positive correlation in addition to the usual negative correlation characteristic. An example illustrated that the theoretical results accurately describe the empirical entropy. Future work could include further investigations into generalised functionals and their modelling in this information-theoretic environment.

Acknowledgments: The authors would like to thank the anonymous reviewers for their insightful comments which led to the improvement of this paper. The support of the Department of Statistics at the University of Pretoria is acknowledged.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: