Article

Several Basic Elements of Entropic Statistics

Zhiyi Zhang
Department of Mathematics and Statistics, UNC Charlotte, Charlotte, NC 28223, USA
Entropy 2023, 25(7), 1060; https://doi.org/10.3390/e25071060
Submission received: 14 June 2023 / Revised: 11 July 2023 / Accepted: 12 July 2023 / Published: 13 July 2023
(This article belongs to the Special Issue Entropy-Based Statistics and Their Applications)

Abstract
Inspired by the development in modern data science, a shift is increasingly visible in the foundation of statistical inference, away from a real space, where random variables reside, toward a nonmetrized and nonordinal alphabet, where more general random elements reside. While statistical inferences based on random variables are theoretically well supported in the rich literature of probability and statistics, inferences on alphabets, mostly by way of various entropies and their estimation, are less systematically supported in theory. Without the familiar notions of neighborhood, real or complex moments, tails, et cetera, associated with random variables, probability and statistics based on random elements on alphabets need more attention to foster a sound framework for rigorous development of entropy-based statistical exercises. In this article, several basic elements of entropic statistics are introduced and discussed, including notions of general entropies, entropic sample spaces, entropic distributions, entropic statistics, entropic multinomial distributions, entropic moments, and entropic basis, among other entropic objects. In particular, an entropic-moment-generating function is defined and it is shown to uniquely characterize the underlying distribution in entropic perspective, and, hence, all entropies. An entropic version of the Glivenko–Cantelli convergence theorem is also established.

1. Introduction and Summary

Let $\mathscr{X} = \{\ell_k; k \ge 1\}$ be a countable alphabet and let $\mathbf{p} = \{p_k; k \ge 1\}$ be a probability distribution on $\mathscr{X}$. Let $\mathcal{P}$ be the collection of all probability distributions on $\mathscr{X}$. Let $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$ be the nonincreasingly rearranged $\mathbf{p}$, that is, $p_{(k)} \ge p_{(k+1)}$ for every $k \ge 1$. Let $\mathcal{P}^{\downarrow}$ be the collection of all possible $\mathbf{p}^{\downarrow}$. It follows that $\mathcal{P}^{\downarrow} \subset \mathcal{P}$ is an aggregated version of $\mathcal{P}$ in the sense that $\mathcal{P}$ is partitioned and represented by $\mathbf{p}^{\downarrow} \in \mathcal{P}^{\downarrow}$.
Across a wide spectrum of scientific investigation, a random system is often described as a probability distribution on a countable alphabet, $\{\mathscr{X}, \mathbf{p}\}$; however, many complex system properties of interest, such as those studied in information theory and statistical mechanics, are often described by functions of $\mathbf{p}$, for example, the Shannon entropy
$$H = -\sum_{k \ge 1} p_k \ln p_k$$
as in [1], the members of the Rényi entropy family
$$R_{\alpha} = \ln\Big(\sum_{k \ge 1} p_k^{\alpha}\Big)\Big/(1-\alpha)$$
where $\alpha \in (0,1) \cup (1,\infty)$, as in [2], and the members of the Tsallis entropy family
$$T_{\alpha} = \Big(1 - \sum_{k \ge 1} p_k^{\alpha}\Big)\Big/(\alpha - 1)$$
where $\alpha \in (-\infty,1) \cup (1,\infty)$, as in [3]. Other similar functions come under the names of diversity indices, for example, the Gini–Simpson index
$$\zeta = 1 - \sum_{k \ge 1} p_k^2$$
as in [4], the generalized Simpson's indices
$$\zeta_{u,v} = \sum_{k \ge 1} p_k^u (1 - p_k)^v$$
where $u \ge 1$ and $v \ge 0$ are integers, as described in [5], Hill's diversity numbers
$$H_{\alpha} = \Big(\sum_{k \ge 1} p_k^{\alpha}\Big)^{1/(1-\alpha)}$$
where $\alpha \in (0,1) \cup (1,\infty)$, as in [6], Emlen's index
$$D = \sum_{k \ge 1} p_k e^{-p_k},$$
as in [7], and the richness index
$$K = \sum_{k \ge 1} 1[p_k > 0],$$
where $1[\cdot]$ is the indicator function. While all the abovementioned functions each have their unique significance in their respective fields of study, they share one characteristic in common: they are all functions of $\mathbf{p}$.
The word entropy has ancient Greek roots, en and tropē, meaning "inward" and "change", respectively, or collectively "internal change". As such, it is a label-independent concept. For generality and conciseness of the presentation in this article, let the following definition be adopted.
Definition 1.
Let $f(\mathbf{p})$ be a function defined for every $\mathbf{p} \in \mathcal{P}$. The function $f(\mathbf{p})$ is referred to as an entropy if $f(\mathbf{p})$ depends on $\mathbf{p}$ only through $\mathbf{p}^{\downarrow}$, that is, $f(\mathbf{p}) = f(\mathbf{p}^{\downarrow})$.
By Definition 1, all entropies and diversity indices mentioned above are indeed entropies. In addition, $p_{(1)}$, or more generally $p_{(k)}$ for any positive integer $k$, is an entropy, and therefore $\mathbf{p}^{\downarrow}$ is an array of entropies. One important property to be noted about entropies is that $\mathbf{p}^{\downarrow}$ is independent of the labels of the alphabet, $\{\ell_k; k \ge 1\}$. Another fact to be noted is that all entropies are uniquely determined by $\mathbf{p}^{\downarrow}$. For clarity of terminology throughout this article, let it be noted that any properties of the underlying random system that are described by one or more entropies are referred to as entropic properties. Furthermore, $\mathbf{p}$ is referred to as the underlying probability distribution, or simply the distribution, of a random system, and $\mathbf{p}^{\downarrow}$ is referred to as the entropic distribution associated with $\mathbf{p}$. It is also to be noted that $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$ is not a probability distribution in the usual sense since it is not associated with any specific probability experiment. It is merely an array of nonincreasingly ordered positive parameters that sum to one.
Let $\{X_1, \dots, X_n\}$, drawn from $\mathscr{X}$ according to $\mathbf{p}$, be a random sample of size $n$. The sample may be summarized into $\mathbf{Y} = \{Y_k; k \ge 1\}$, where $Y_k$ is the observed frequency of letter $\ell_k$, or into $\hat{\mathbf{p}} = \{\hat{p}_k = Y_k/n; k \ge 1\}$. Let $\mathbf{Y}^{\downarrow} = \{Y_{(k)}; k \ge 1\}$ and $\hat{\mathbf{p}}^{\downarrow} = \{\hat{p}_{(k)} = Y_{(k)}/n; k \ge 1\}$ be the nonincreasingly rearranged $\mathbf{Y}$ and $\hat{\mathbf{p}}$, respectively, where $Y_{(k)} \ge Y_{(k+1)}$ and $\hat{p}_{(k)} \ge \hat{p}_{(k+1)}$ for every $k$. Under the assumption that the study interest of the underlying random system lies only with the properties described by indices of the form $f(\mathbf{p}^{\downarrow})$, that is, entropies by Definition 1, there are two conceptual perspectives on the associated statistical inference. The first is a framework of estimating $f(\mathbf{p})$ based on $\hat{\mathbf{p}}$, and the second is one of estimating $f(\mathbf{p}) = f(\mathbf{p}^{\downarrow})$ based on $\hat{\mathbf{p}}^{\downarrow}$. For lack of better terms, let the first framework be referred to as the classical statistics and the second framework as the entropic statistics. These two frameworks are not equivalent and, in particular, the entropic framework has its own special and useful implications.
The literature on statistical estimation of entropies, mostly in the specific form of the Shannon entropy, begins with the early works in [8,9,10] and expands in width and depth in works by, for example, [11,12,13]. Many other worthy references on entropy estimation may be found in the literature review in [14]. The general entropies of Definition 1, however, allow a discussion of the foundational elements of statistics in the entropic perspective, or entropic statistics, in a broader sense. This article focuses on three basic issues.
First, a notion of an entropic sample space is introduced in Section 2 below. An entropic sample space is an aggregated sample space that registers not a single data point but an ensemble of data points. It is a sample space for the entropic statistics, $\mathbf{Y}^{\downarrow}$ or $\hat{\mathbf{p}}^{\downarrow}$, and hence is label-independent. The said label-independence in turn allows an entropic sample space to accommodate statistical sampling into a population that is not necessarily prescribed, that is, the labels of the alphabet $\mathscr{X}$ need not be completely specified a priori. This property of an entropic sample space gives new meaning to statistical learning and lends foundational support for statistical exploration into an unknown, or partially known, universe.
Second, an entropic characteristic function, $\phi(t) = \sum_{k \ge 1} p_{(k)}^t$ for $t \ge 1$, is introduced. It is obvious that $\phi(t)$ is an entropy by Definition 1 and that it always exists. It is established in Section 3 that $\phi(t)$ in an arbitrarily small neighborhood of any interior point of $[1, \infty)$ uniquely determines $\mathbf{p}^{\downarrow} \in \mathcal{P}^{\downarrow}$ and vice versa. Therefore, it is immediately implied that any and all entropic properties of a random system, including statistical inferences, may be approached by way of $\phi(t)$.
Third, it is established in Section 4 that the entropic statistics converge almost surely and uniformly to the underlying entropic distribution, that is, $\hat{\mathbf{p}}^{\downarrow} \xrightarrow{a.s.} \mathbf{p}^{\downarrow}$ uniformly, for any $\mathbf{p} \in \mathcal{P}$. In light of the entropic sample space and an entropic characterization of the associated entropic sampling distribution, this Glivenko–Cantelli-like convergence theorem provides fundamental theoretical support for exercises in entropic statistics.
The article ends with an appendix where a lengthy proof is found.

2. Things Entropic

2.1. Sample Spaces in Different Resolutions

Consider the experiment of randomly drawing a marble from urn 1, which contains marbles of $K = 3$ known colors, red, white, and blue. In anticipating the outcome of the experiment, one may introduce an index $k$, $k = 1, 2, 3$, to label the possible outcomes by $\ell_1 = \text{red}$, $\ell_2 = \text{white}$, and $\ell_3 = \text{blue}$, and denote the corresponding proportions by $p_1$, $p_2$, and $p_3$. In this case, the sample space is
$$\Omega_1 = \{\ell_1, \ell_2, \ell_3\},$$
the event space is $\mathcal{B} = \{\emptyset, \{\ell_1\}, \{\ell_2\}, \{\ell_3\}, \{\ell_1,\ell_2\}, \{\ell_1,\ell_3\}, \{\ell_2,\ell_3\}, \{\ell_1,\ell_2,\ell_3\}\}$, and the point mass probability measure $\mu(\cdot)$ assigns $p_1$ to $\ell_1$, $p_2$ to $\ell_2$, and $p_3$ to $\ell_3$. Let $X$ denote the random outcome of the experiment. The following model of probability distribution,
$$\begin{array}{c|ccc} X & \ell_1 & \ell_2 & \ell_3 \\ \hline P(x) & p_1 & p_2 & p_3 \end{array}$$
or in a different form $\mathbf{p} = \{p_1, p_2, p_3\}$ on $\mathscr{X} = \Omega_1 = \{\ell_1, \ell_2, \ell_3\}$, is well defined with three parameters, $p_1$, $p_2$, and $p_3$, subject to the constraints $0 \le p_k \le 1$ for each $k$ and $\sum_{k=1}^{3} p_k = 1$. The result of drawing $n = 1$ marble from the urn may also be represented by a triplet of random variables $\mathbf{Y} = \{1[X = \ell_1], 1[X = \ell_2], 1[X = \ell_3]\}$. If $\mathbf{Y}$ is used to represent the outcome of the experiment, the sample space may be denoted as $\Omega_1 = \{\{1,0,0\}, \{0,1,0\}, \{0,0,1\}\}$ with corresponding probability distribution $P(\mathbf{Y} = \{1,0,0\}) = p_1$, $P(\mathbf{Y} = \{0,1,0\}) = p_2$, and $P(\mathbf{Y} = \{0,0,1\}) = p_3$. For clarity in terminology, $X$ is referred to as a random element while $\mathbf{Y}$ is a set of random variables. In general, random results of an experiment that are represented by numerical values are referred to as random variables, and those represented by non-numerical symbols are random elements.
For a given experiment, the sample space may be chosen at different levels of resolution depending on the experimenter’s interest in the study. Suppose the experimenter is to randomly draw n = 3 marbles from urn 1 with replacement in sequence, resulting in X = { X 1 , X 2 , X 3 } where X i , i = 1 , 2 , 3 , is the color of the ith marble drawn in the sequence. The sample space associated with X may be represented by
Ω s = { 1 , 1 , 1 } , { 2 , 2 , 2 } , { 3 , 3 , 3 } , { 1 , 1 , 2 } , { 2 , 1 , 1 } , { 1 , 2 , 1 } , { 1 , 1 , 3 } , { 3 , 1 , 1 } , { 1 , 3 , 1 } , { 2 , 2 , 1 } , { 1 , 2 , 2 } , { 2 , 1 , 2 } , { 2 , 2 , 3 } , { 3 , 2 , 2 } , { 2 , 3 , 2 } , { 3 , 3 , 1 } , { 1 , 3 , 3 } , { 3 , 1 , 3 } , { 3 , 3 , 2 } , { 2 , 3 , 3 } , { 3 , 2 , 3 } , { 1 , 2 , 3 } , { 1 , 3 , 2 } , { 2 , 1 , 3 } , { 2 , 3 , 1 } , { 3 , 1 , 2 } , { 3 , 2 , 1 } ,
where the subscript "s" stands for sequential. There are 27 distinct elements in (3). In this case, the sample space may also be expressed as $\Omega_s = \{1, 2, 3\}^3$. This sample space may be adopted if the order of the $n = 3$ observations is observable and is of interest.
Suppose in the above experiment the order of the observations is not observable or not of interest. Then the relevant information in $X = \{X_1, X_2, X_3\}$ may be represented in the form of $\mathbf{Y} = \{Y_1, Y_2, Y_3\}$, where $Y_k$, $k = 1, 2, 3$, is the number of times the $k$th color is observed in the sample. The sample space associated with $\mathbf{Y}$ is
Ω m = { 3 , 0 , 0 } , { 0 , 3 , 0 } , { 0 , 0 , 3 } , { 2 , 1 , 0 } , { 2 , 0 , 1 } , { 0 , 2 , 1 } , { 1 , 2 , 0 } , { 0 , 1 , 2 } , { 1 , 0 , 2 } , { 1 , 1 , 1 } ,
where the subscript “m” stands for multinomial. There are 10 distinct elements in (4). In fact, Y = { Y 1 , Y 2 , Y 3 } is the usual multinomial random vector with K = 3 categories and category probabilities p 1 , p 2 , and p 3 .
The two sample spaces, Ω s and Ω m , serve different statistical interests in various situations. Ω s is well defined if X = { X 1 , X 2 , X 3 } is observable. Ω m is well defined if X = { X 1 , X 2 , X 3 } is observable or only Y = { Y 1 , Y 2 , Y 3 } is observable. Noting that X = { X 1 , X 2 , X 3 } implies Y = { Y 1 , Y 2 , Y 3 } , a lower-resolution sample space may always be adopted if a higher-resolution sample space may, but not vice versa. For example, if the order of the draws is not observable, then only Ω m is appropriate since Y = { Y 1 , Y 2 , Y 3 } is not linked uniquely to the elements of Ω s .
Ω m is an aggregated form of Ω s and is hence of lower resolution; however, Ω m may be further reduced in resolution. Let
$$\mathbf{Y}^{\downarrow} = \{Y_{(1)}, Y_{(2)}, Y_{(3)}\},$$
where $Y_{(1)}, Y_{(2)}, Y_{(3)}$ are the nonincreasingly ordered observed frequencies of the three colors. The sample space associated with $\mathbf{Y}^{\downarrow}$ is
$$\Omega_e = \big\{\{3,0,0\}, \{2,1,0\}, \{1,1,1\}\big\},$$
where the subscript "e" stands for entropic. $\Omega_e$ is yet an aggregated form of $\Omega_m$ and hence of lower resolution still than that of $\Omega_m$. Noting that $\Omega_e$ is label-independent, it is an example of an entropic sample space.
It is easily verified that the probability distribution of $\mathbf{Y}^{\downarrow}$ may be expressed in terms of $\mathbf{p}^{\downarrow}$ as follows.
$$\begin{array}{c|ccc} \mathbf{Y}^{\downarrow} & \{3,0,0\} & \{2,1,0\} & \{1,1,1\} \\ \hline P(\mathbf{y}^{\downarrow}) & \sum_{k \ge 1} p_{(k)}^3 & 3\sum_{k \ge 1} p_{(k)}^2 (1 - p_{(k)}) & 6\, p_{(1)} p_{(2)} p_{(3)} \end{array}$$
Let it be noted that all the probabilities in (7) are label-independent, and therefore they are entropies by Definition 1.
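As an illustration of the three resolutions, the following Python sketch enumerates $\Omega_s$, $\Omega_m$, and $\Omega_e$ for the three-color urn with $n = 3$ and aggregates the sequence probabilities into the entropic distribution of (7). The color probabilities are hypothetical and the variable names are ours.

```python
from itertools import product
from collections import Counter
from math import prod

p = {1: 0.5, 2: 0.3, 3: 0.2}          # hypothetical color probabilities p1, p2, p3
letters, n = sorted(p), 3

# Sequential sample space: all ordered triples of colors (27 elements).
omega_s = list(product(letters, repeat=n))

# Multinomial sample space: frequency vectors (y1, y2, y3) (10 elements).
omega_m = sorted({tuple(Counter(x).get(k, 0) for k in letters) for x in omega_s})

# Entropic sample space: nonincreasingly sorted frequency vectors (3 elements).
omega_e = sorted({tuple(sorted(y, reverse=True)) for y in omega_m})

# Entropic distribution: aggregate sequence probabilities by sorted frequencies.
P_e = Counter()
for x in omega_s:
    c = Counter(x)
    y_sorted = tuple(sorted(c.values(), reverse=True)) + (0,) * (len(letters) - len(c))
    P_e[y_sorted] += prod(p[color] for color in x)

print(len(omega_s), len(omega_m), len(omega_e))   # 27 10 3
print(dict(P_e))   # matches sum p_k^3, 3*sum p_k^2(1-p_k), and 6*p1*p2*p3 in (7)
```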
In the case of sampling n = 3 marbles from urn 1 in sequence, a subscription to the entropic sample space, Ω e , is by choice since both Ω m and Ω e are available. There are situations when the subscription to an entropic sample space may be by necessity.
Consider the experiment of randomly drawing $n = 3$ marbles in sequence from urn 2, which contains marbles of $K = 3$ unknown but distinguishable colors. In this case, the sample spaces $\Omega_1$ of (1), $\Omega_s$ of (3), and $\Omega_m$ of (4) are not well defined due to the lack of knowledge of the color labels. However, the entropic sample space, $\Omega_e$, is available for subscription regardless of what the colors are, known or unknown, as long as they are distinguishable.
In general, consider drawing a random sample of size $n$ from $\mathscr{X} = \{\ell_k; k \ge 1\}$ under $\mathbf{p} = \{p_k; k \ge 1\}$ in sequence. The sequential sample space is of the form $\Omega_s = \mathscr{X}^n$. The aggregated sample space,
$$\Omega_m = \Big\{\{y_k; k \ge 1\}: y_k \ge 0 \ \text{for every}\ k \ge 1\ \text{and}\ \sum_{k \ge 1} y_k = n\Big\},$$
is that of the multinomial array, $\mathbf{Y} = \{Y_k; k \ge 1\}$, with probability mass function
$$P(\{y_k; k \ge 1\}) = \frac{n!}{\prod_{k \ge 1} y_k!} \prod_{k \ge 1} p_k^{y_k}$$
where $0 \le y_k \le n$ for every $k \ge 1$ and $\sum_{k \ge 1} y_k = n$. Moreover, $\Omega_m$ may be further aggregated into a sample space, $\Omega_e$, for $\mathbf{Y}^{\downarrow} = \{Y_{(k)}; k \ge 1\}$, that is,
$$\Omega_e = \Big\{\{y_{(k)}; k \ge 1\}: y_{(k)} \ge 0\ \text{and}\ y_{(k)} \ge y_{(k+1)}\ \text{for every}\ k \ge 1,\ \text{and}\ \sum_{k \ge 1} y_{(k)} = n\Big\}.$$
Let $\Omega_e$ of (10) be referred to as the entropic sample space. The associated probability distribution is
$$P(\{y_{(k)}; k \ge 1\}) = \sum\nolimits^{*} P(\{y_k; k \ge 1\})$$
where $\sum^{*}$ is the summation of (9) over all $\{y_k; k \ge 1\}$s in $\Omega_m$ sharing the same given $\{y_{(k)}; k \ge 1\}$.
Given a $\mathbf{y}^{\downarrow} = \{y_{(k)}; k \ge 1\}$, (11) is an entropy. This may be seen in two steps. First, let $\hat{K} = \sum_{k \ge 1} 1[y_{(k)} \ge 1]$ be the number of distinct letters of $\mathscr{X}$ represented in a sample of size $n$, and let $\mathbf{z} = \{z_1, \dots, z_{\hat{K}}\}$ be the set of the $\hat{K}$ positive integer values of $\mathbf{y}^{\downarrow}$. $\hat{K}$ is a positive finite integer. Let the cardinality of $\mathscr{X}$ be denoted as $K = \sum_{k \ge 1} 1[p_k > 0]$. $K \ge 1$ may be finite or countably infinite. Consider an array $\mathbf{a}(\mathbf{y}^{\downarrow}) = \{a_k(\mathbf{y}^{\downarrow}); k \ge 1\}$ of length $K$ whose entries are a particular allocation of the $\hat{K}$ values $z_j$, $j = 1, \dots, \hat{K}$, with the other $K - \hat{K}$ values of $\mathbf{a}(\mathbf{y}^{\downarrow})$ being zeros. Let $\mathcal{A}(\mathbf{y}^{\downarrow})$ be the complete collection of all such distinct $\mathbf{a}(\mathbf{y}^{\downarrow})$s. Then it is clear that $\mathbf{y}^{\downarrow}$ uniquely implies $\mathcal{A}(\mathbf{y}^{\downarrow})$.
Second, the probability in (11) may be re-expressed as
$$P(\{y_{(k)}; k \ge 1\}) = \frac{n!}{\prod_{k \ge 1} y_{(k)}!} \sum\nolimits^{**} \prod_{k \ge 1} p_{(k)}^{a_k(\mathbf{y}^{\downarrow})}$$
where $\sum^{**}$ is the summation over all $\mathbf{a}(\mathbf{y}^{\downarrow}) \in \mathcal{A}(\mathbf{y}^{\downarrow})$, given a $\mathbf{y}^{\downarrow} = \{y_{(k)}; k \ge 1\}$. Equation (12) implies that $P(\{y_{(k)}; k \ge 1\})$ is a function of $\mathbf{p}^{\downarrow}$ and hence an entropy. Let $P(\{y_{(k)}; k \ge 1\})$ of (11) or (12) be referred to as the entropic distribution associated with the entropic sample space, $\Omega_e$.
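The aggregation in (11) and its re-expression (12) can be mimicked numerically. The sketch below is written for a finite alphabet with all $p_k > 0$ (an assumption made only for simplicity; names are ours): it computes $P(\{y_{(k)}\})$ by summing $\prod_k p_k^{a_k}$ over all distinct allocations of the sorted frequencies, which is exactly the inner sum in (12).

```python
from itertools import permutations
from math import factorial, prod

def entropic_prob(y_sorted, p):
    """Entropic probability of the sorted frequency vector y_sorted under p
    (both of length K), following (12): the multinomial coefficient times the
    sum of prod_k p_k^{a_k} over all distinct allocations a of the frequencies."""
    n = sum(y_sorted)
    coef = factorial(n) / prod(factorial(y) for y in y_sorted)
    allocations = set(permutations(y_sorted))   # the collection A(y) of distinct allocations
    return coef * sum(prod(pk ** a for pk, a in zip(p, alloc)) for alloc in allocations)

p = [0.4, 0.3, 0.2, 0.1]                 # hypothetical finite-alphabet distribution
print(entropic_prob((2, 1, 0, 0), p))    # probability of observing sorted frequencies {2,1,0,0} in n = 3 draws
```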

2.2. Entropic Objects

Let the adjective “entropic” be used to describe objects that are label-independent. Several such objects are defined or summarized below.
  • A function $f$ with $f(\mathbf{p}) = f(\mathbf{p}^{\downarrow})$ for all $\mathbf{p} \in \mathcal{P}$ is an entropy.
  • The elements of $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$ are the entropic parameters, as compared to the elements of $\mathbf{p} = \{p_k; k \ge 1\}$, which are multinomial parameters.
  • The elements of $\mathbf{Y}^{\downarrow} = \{Y_{(k)}; k \ge 1\}$, or equivalently of $\hat{\mathbf{p}}^{\downarrow} = \{\hat{p}_{(k)}; k \ge 1\}$, are entropic statistics, as compared to the elements of $\mathbf{Y} = \{Y_k; k \ge 1\}$, or equivalently $\hat{\mathbf{p}} = \{\hat{p}_k; k \ge 1\}$, which are multinomial statistics.
  • Ω e of (10) is the entropic (multinomial) sample space, as compared to Ω m of (8), which is the multinomial sample space.
  • The distribution P ( { y ( k ) ; k 1 } ) of (11) or (12), is the entropic probability distribution, while P ( { y k ; k 1 } ) of (9) is the multinomial probability distribution.
  • Entropic statistics is the collection of statistical methodologies that help to make inference on the characteristics of a random system exclusively via entropies.
In addition, there are several other useful entropic objects. First, letting $\zeta_v = \sum_{k \ge 1} p_k (1 - p_k)^v$ for all non-negative integers $v \ge 0$, $\boldsymbol{\zeta} = \{\zeta_v; v \ge 0\}$ is referred to as the entropic basis. The name comes from the fact that, for any well-behaved function $h(p)$ for $p \in [0,1]$, an entropy of the form $H = \sum_{k \ge 1} p_k h(p_k)$ may be expressed as a linear combination $H = \sum_{v \ge 1} w(v)\, \zeta_v$. For example, the Shannon entropy, provided that it is finite, may be written as
$$H = -\sum_{k \ge 1} p_k \ln p_k = \sum_{v \ge 1} \frac{1}{v}\, \zeta_v.$$
The entropic basis is useful because it unfolds many entropies into simple and linearly additive forms.
Second, letting $\eta_u = \sum_{k \ge 1} p_k^u$ for all positive integers $u \ge 1$, $\boldsymbol{\eta} = \{\eta_u; u \ge 1\}$ is often referred to as the entropic moment. The elements of both $\boldsymbol{\zeta}$ and $\boldsymbol{\eta}$ have good estimators. A detailed discussion may be found in [14].
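A quick numerical check of the Shannon-entropy expansion over the entropic basis stated above, under the definitions $\zeta_v = \sum_k p_k(1-p_k)^v$ and $\eta_u = \sum_k p_k^u$; the helper names and the test distribution are ours.

```python
import numpy as np

def zeta(p, v):
    """Entropic-basis element zeta_v = sum_k p_k (1 - p_k)^v."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p) ** v))

def eta(p, u):
    """Entropic moment eta_u = sum_k p_k^u."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p ** u))

p = np.array([0.4, 0.3, 0.2, 0.1])
shannon = float(-np.sum(p * np.log(p)))
# Partial sums of sum_{v>=1} zeta_v / v converge to the Shannon entropy.
approx = sum(zeta(p, v) / v for v in range(1, 2000))
print(shannon, approx)      # the two values agree to several decimal places
print(eta(p, 2))            # second entropic moment, sum_k p_k^2
```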
Definition 2.
Let $X$ be a random element on a countable alphabet $\mathscr{X} = \{\ell_k; k \ge 1\}$ with a corresponding probability distribution $\mathbf{p} \in \mathcal{P}$ and its associated entropic distribution $\mathbf{p}^{\downarrow} \in \mathcal{P}^{\downarrow}$. The function,
$$\phi(t) = \sum_{k \ge 1} p_k^t, \quad \text{for } t \ge 0,$$
is referred to as the entropic-moment-generating function of $X$, of $\mathbf{p}$, or of $\mathbf{p}^{\downarrow}$. The two complementary parts of its domain, $[1, \infty)$ and $[0, 1)$, are, respectively, referred to as the primary domain and the secondary domain of the entropic-moment-generating function.
Depending on context, $\phi(t)$ may be denoted as $\phi_X(t)$, $\phi_{\mathbf{p}}(t)$, or $\phi_{\mathbf{p}^{\downarrow}}(t)$ whenever appropriate. Obviously, $\phi(t)$ is uniformly bounded above by one for all $\mathbf{p} \in \mathcal{P}$ in the primary domain but is not necessarily finitely defined in the secondary domain. However, in the case of a finite alphabet, that is, $K = \sum_{k \ge 1} 1[p_k > 0] < \infty$, $\phi(t)$ is finitely defined for each and every $t \in \mathbb{R}$, in particular for $t \ge 0$. The characteristic utility of $\phi(t)$ is further explored in Section 3 below.
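The label-independence of $\phi(t)$ in its primary domain is easy to see numerically. In the sketch below (distributions chosen only for illustration), a relabeling of $\mathbf{p}$ leaves $\phi(t)$ unchanged for every $t \ge 1$, while a distribution with a different entropic distribution yields different values, in line with the characterization developed in Section 3.

```python
import numpy as np

def phi(p, t):
    """Entropic-moment-generating function phi(t) = sum_k p_k^t, t >= 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p ** t))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]          # same entropic distribution (a relabeling of p)
r = [0.6, 0.3, 0.1]          # a genuinely different entropic distribution

ts = np.linspace(1.0, 4.0, 7)
print([phi(p, t) - phi(q, t) for t in ts])   # all zeros: phi depends on p only through p-down
print([phi(p, t) - phi(r, t) for t in ts])   # nonzero: different entropic distributions differ
```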

2.3. Examples of Entropic Statistics

Example 1.
Consider the Bernoulli experiment of tossing a coin, where $P(h) = p$ and $P(t) = 1 - p$. The question of whether the coin is fair may be formulated in the usual classical sense, that is, whether $p = 0.5$. The question may be approached by estimating $p$ based on a sample proportion, $\hat{p}$, if it is observable which trials lead to "h" and which lead to "t". The question may alternatively be formulated by an equivalent entropic statement, for example, whether $H = p(1-p) = 0.25$. More generally, if $K = \sum_{k \ge 1} 1[p_k > 0]$ is finite and known, then the uniformity of $\mathbf{p}$ on $\mathscr{X}$ may be formulated entropically by, for example, $H = \sum_{k \ge 1} p_k^2 = 1/K$, $H = \sum_{k \ge 1} p_k(1 - p_k) = (K-1)/K$, or $H = -\sum_{k \ge 1} p_k \ln p_k = \ln K$. The validity of these entropic statements may then be gauged statistically.
Example 2.
Consider a two-stage sampling scheme: a random sample of size $n$, $\{X_1, \dots, X_n\}$, and then a single extra observation $X_{n+1}$ are taken. The sample of size $n$ may be summarized into letter frequencies, $\mathbf{Y} = \{Y_k; k \ge 1\}$. Let $\pi_0 = \sum_{k \ge 1} p_k 1[Y_k = 0]$. Clearly, $\pi_0$ is label-independent and therefore an entropic random variable. Given the sample of size $n$, $\pi_0$ may be thought of as the probability that $X_{n+1}$ assumes a letter in $\mathscr{X}$ that is not represented in the sample of size $n$. In some contexts, $\pi_0$ may be thought of as the probability of new discovery. Let $N_1 = \sum_{k \ge 1} 1[Y_k = 1]$ and $T_n = N_1/n$. $T_n$ is commonly known as Turing's formula, introduced in [15], but credited largely to Alan Turing. It is to be noted that $N_1$ is label-independent and, therefore, so is $T_n$. $T_n$ is a good estimator of $\pi_0$, and a discussion of many of its statistical properties may be found in [14].
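A small simulation sketch of Example 2 follows; the distribution, sample size, and random seed are ours. It compares the unobservable quantity $\pi_0$ with Turing's formula $T_n = N_1/n$ on a single sample.

```python
import numpy as np

rng = np.random.default_rng(7)
p = np.array([0.3, 0.2, 0.1] + [0.4 / 50] * 50)   # hypothetical distribution with many rare letters
n = 200

sample = rng.choice(len(p), size=n, p=p)
counts = np.bincount(sample, minlength=len(p))

pi0 = p[counts == 0].sum()        # unobservable in practice: total probability of unseen letters
turing = (counts == 1).sum() / n  # Turing's formula T_n = N_1 / n

print(pi0, turing)                # the two values are typically close
```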
Example 3.
In developing a decision tree classifier, the data space is partitioned into an ensemble of small subspaces, in each of which a local classification rule is sought. The central spirit of every local classification may be described by a two-step scheme.
1. 
First, a random sample of size n, { X 1 , , X n } , is taken from X = { k ; k 1 } , under an unknown p = { p k ; k 1 } , which is summarized into Y = { Y k ; k 1 } .
2. 
The data-based local classification rule is as follows: the next observation, $X_{n+1}$, is predicted to be the letter that is observed most frequently in the sample of size $n$. For simplicity, let it be assumed that $p_{(1)} > p_{(2)}$ and that the letter with the sample maximum frequency is unique (if not, some randomization may be employed).
Obviously, the letter designated based on a sample is not necessarily the letter corresponding to the maximum of the $p_k$s. In such a setup, the performance of the tree classifier may be gauged by evaluating (calculating or estimating) the probability of the event that "the designated letter is the same letter of $\mathscr{X}$ with probability $p_{(1)}$", that is,
$$P\Big(\arg\max_{\ell_k;\, k \ge 1}\{p_k; k \ge 1\} = \arg\max_{\ell_k;\, k \ge 1}\{\hat{p}_k; k \ge 1\}\Big).$$
Note that the event in (14) is label-independent and hence the probability is an entropy, which may be estimated. The probability in (14) may reasonably be called the confidence level of the simple classifier.
For illustration purposes, consider the special case of a binary $\mathscr{X}$, with $n = 2m + 1$ for some positive integer $m$. For simplicity, $n$ is chosen to be odd here so that $Y_{(1)} > Y_{(2)}$ always holds true. Suppose that $p_1 = p_{(1)} > p_{(2)} = 1 - p_{(1)}$. The event that a classifier based on the sample of size $n$ correctly identifies the letter of maximum probability may be equivalently expressed as $Y_1 \ge m + 1$. The probability of such an event, (14), is
$$P\Big(\ell_1 = \arg\max_{\{\ell_1, \ell_2\}}\{\hat{p}_1, 1 - \hat{p}_1\}\Big) = P(Y_1 \ge m + 1) = \sum_{y \ge m+1} \frac{n!}{y!(n-y)!}\, p_{(1)}^y (1 - p_{(1)})^{n-y},$$
which is independent of the assumption that $p_1 > p_2 = 1 - p_1$ and is, therefore, an entropy. More specifically, (15) is computed for several combinations of $n$ and $\mathbf{p}^{\downarrow}$ and the resulting values are tabulated in Table 1. Table 1, and its likes, may be used in two different ways. First, given a fixed $\mathbf{p}^{\downarrow}$, it indicates how large a sample is needed to assure a reliability level of the classifier. On the other hand, at a given level of $n$ and a particular $\mathbf{p}^{\downarrow}$, the classifier may be evaluated by the probabilities in the table. In practice, $\mathbf{p}^{\downarrow}$ is unknown but may be estimated.
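The probabilities in Table 1 are straightforward binomial tail sums; the following sketch (function name ours) reproduces them from (15).

```python
from math import comb

def confidence(p1, n):
    """P(Y1 >= m+1) for n = 2m+1 Bernoulli(p1) trials: the probability that the
    sample-majority letter is the letter with the larger probability p_(1)."""
    m = (n - 1) // 2
    return sum(comb(n, y) * p1 ** y * (1 - p1) ** (n - y) for y in range(m + 1, n + 1))

for p1 in (1/2, 2/3, 3/4, 4/5, 5/6):
    print([round(confidence(p1, n), 4) for n in (3, 5, 7)])
# reproduces the rows of Table 1, e.g. 0.7407, 0.7901, 0.8267 for p = (2/3, 1/3)
```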

3. Entropic Characterization

Entropic statistics focuses on making inference via entropies; it is therefore of interest to find a function that characterizes $\mathbf{p}^{\downarrow} \in \mathcal{P}^{\downarrow}$. Since the function $\phi(t)$ in its primary domain and $\boldsymbol{\eta} = \{\eta_u; u \ge 1\}$, where $\eta_u = \sum_{k \ge 1} p_k^u$ and $u$ is an integer, imply each other (see Lemma 1 below), it follows immediately that (13) uniquely determines all entropies. However, the following theorem claims that the characteristic property of the entropic-moment-generating function, $\phi(t)$, remains intact in any arbitrarily small neighborhood of any $t \in (1, \infty)$.
Theorem 1.
Let $\mathbf{p} = \{p_k; k \ge 1\}$ and $\mathbf{q} = \{q_k; k \ge 1\}$ be two probability distributions on a same countable alphabet, $\mathscr{X} = \{\ell_k; k \ge 1\}$. Let $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$ and $\mathbf{q}^{\downarrow} = \{q_{(k)}; k \ge 1\}$ be the respective corresponding entropic distributions of $\mathbf{p}$ and $\mathbf{q}$. Then $\mathbf{p}^{\downarrow} = \mathbf{q}^{\downarrow}$ if and only if $\phi_{\mathbf{p}}(t) = \phi_{\mathbf{q}}(t)$ for all $t \in (a,b)$, where $(a,b)$ is an arbitrary interval such that $1 \le a < b < \infty$.
Lemma 1.
Let $\mathbf{p} = \{p_k; k \ge 1\}$ and $\mathbf{q} = \{q_k; k \ge 1\}$ be two probability distributions in $\mathcal{P}$ with two corresponding associated entropic distributions $\mathbf{p}^{\downarrow}$ and $\mathbf{q}^{\downarrow}$ in $\mathcal{P}^{\downarrow}$. Then $\mathbf{p}^{\downarrow} = \mathbf{q}^{\downarrow}$ if and only if $\sum_{k \ge 1} p_k^n = \sum_{k \ge 1} q_k^n$ for all positive integers $n \ge 1$.
A proof of Lemma 1 may be found on pages 50 and 51 in [14]. To prove Theorem 1, it suffices to show that $\phi(t)$ in an arbitrarily small neighborhood of any interior point of $[1, \infty)$ determines the function globally.
Proof of Theorem 1.
If $\mathbf{p}^{\downarrow} = \mathbf{q}^{\downarrow}$, then it immediately follows that $\phi_{\mathbf{p}}(t) = \phi_{\mathbf{q}}(t)$ for all $t \in [1, \infty)$ and, therefore, for $t \in (a,b)$ specifically. To prove the theorem, it suffices to show the converse.
Consider the series
$$f(z) = \sum_{k=1}^{\infty} p_k^z$$
where $z \in \mathbb{C}$ is a complex variable. Denote the real and the imaginary parts of a complex value $z$ by $\mathrm{Re}(z)$ and $\mathrm{Im}(z)$, respectively.
Let $D = \{z: \mathrm{Re}(z) > 1\}$ be the subset of $\mathbb{C}$ such that the real part of $z$ is greater than 1. For every $z \in D$, since $p_k^{\mathrm{Re}(z-1)} \le 1$ and $|p_k^{i\,\mathrm{Im}(z)}| = 1$ for every $k$, where $|z|$ denotes the modulus of $z$, it follows that
$$f(z) = \sum_{k=1}^{\infty} p_k\, p_k^{z-1} = \sum_{k=1}^{\infty} p_k\, p_k^{\mathrm{Re}(z-1)}\, p_k^{i\,\mathrm{Im}(z)}$$
and
$$|f(z)| \le \sum_{k=1}^{\infty} p_k.$$
Letting $\alpha_k = \ln(1/p_k)$,
$$f(z) = \sum_{k=1}^{\infty} e^{-\alpha_k z},$$
and the functions $e^{-\alpha_k z}$, $k \ge 1$, are analytic on $\mathbb{C}$.
Since the series in (17), for $z \in D$, is dominated by the convergent series $\sum_{k \ge 1} p_k$ as in (16), by the Weierstrass uniform convergence theorem, $f(z)$ is analytic on $D$. By a similar argument, $g(z) = \sum_{k=1}^{\infty} q_k^z$ is also analytic on $D$.
Assuming that $\phi_{\mathbf{p}}(t) = \phi_{\mathbf{q}}(t)$ for $t \in (a,b)$ where $1 \le a < b < \infty$, there exists a convergent sequence $\{z_n; n \ge 1\}$ in $(a,b)$ such that $\lim_{n \to \infty} z_n = z_0 \in (a,b)$. Noting that $(a,b) \subset D$ and $f(z_n) = \phi_{\mathbf{p}}(z_n) = \phi_{\mathbf{q}}(z_n) = g(z_n)$ for $n \ge 0$, by the identity theorem for analytic functions, $f(z) = g(z)$ for all $z \in D$. It follows that $\phi_{\mathbf{p}}(t) = f(t) = g(t) = \phi_{\mathbf{q}}(t)$ for all $t \in [1, \infty)$, specifically, $\sum_{k \ge 1} p_k^n = \sum_{k \ge 1} q_k^n$ for all $n \ge 1$. Finally, by Lemma 1, $\mathbf{p}^{\downarrow} = \mathbf{q}^{\downarrow}$. □
Theorem 1 immediately implies that a subfamily of the Rényi entropy $R_{\alpha}$ with $\alpha \in (a,b) \subset (1, \infty)$, a subfamily of the Tsallis entropy $T_{\alpha}$ with $\alpha \in (a,b) \subset (1, \infty)$, and a subfamily of Hill's diversity numbers $H_{\alpha}$ with $\alpha \in (a,b) \subset (1, \infty)$, each characterizes $\mathbf{p}^{\downarrow}$ and, hence, all entropies.
The characterization of $\mathbf{p}^{\downarrow}$ in Theorem 1 may be equivalently stated using only a countably infinite subset of $(a,b)$.
Corollary 1.
Let $\mathbf{p}$ and $\mathbf{q}$ be two probability distributions on a same countable alphabet, $\mathscr{X}$. Let $\mathbf{p}^{\downarrow}$ and $\mathbf{q}^{\downarrow}$ be the corresponding entropic distributions of $\mathbf{p}$ and $\mathbf{q}$, respectively. Then $\mathbf{p}^{\downarrow} = \mathbf{q}^{\downarrow}$ if and only if $\phi_{\mathbf{p}}(t) = \phi_{\mathbf{q}}(t)$ on any infinite sequence of distinct values, $\{t_n; n \ge 1\}$, such that $\lim_{n \to \infty} t_n = c \in (1, \infty)$.
Proof. 
Both $\phi_{\mathbf{p}}(t)$ and $\phi_{\mathbf{q}}(t)$ are analytic at $t = c$, and therefore $h(t) = \phi_{\mathbf{p}}(t) - \phi_{\mathbf{q}}(t)$ is analytic at $t = c$. Let it first be shown, by induction, that all derivatives of $h(t)$ at $t = c$ are zero, that is, $h^{(m)}(c) = 0$ for $m \ge 0$. Note first that $h(c) = h^{(0)}(c) = 0$ by the fact that both $\phi_{\mathbf{p}}(t)$ and $\phi_{\mathbf{q}}(t)$ are continuous and $\lim_{n \to \infty} \phi_{\mathbf{p}}(t_n) = \phi_{\mathbf{p}}(c) = \phi_{\mathbf{q}}(c) = \lim_{n \to \infty} \phi_{\mathbf{q}}(t_n)$. Suppose that $h^{(0)}(c) = h^{(1)}(c) = h^{(2)}(c) = \cdots = h^{(m)}(c) = 0$ but $h^{(m+1)}(c) \ne 0$. Then there exists an interval $(c - \varepsilon, c + \varepsilon)$ such that $h(t) \ne 0$ for $t \in (c - \varepsilon, c + \varepsilon)$ with $t \ne c$. However, there is at least one such $t_n \in (c - \varepsilon, c + \varepsilon)$, $t_n \ne c$, with $h(t_n) = 0$ by assumption. This is a contradiction, and therefore $h^{(m)}(c) = 0$ for all $m \ge 1$. Since $h(t)$ is analytic at $t = c$ with all derivatives vanishing at $c$, $h(t) = 0$ on an interval containing $c$, that is, $\phi_{\mathbf{p}}(t) = \phi_{\mathbf{q}}(t)$ on an interval $(a,b) \subset (1, \infty)$, and the corollary follows from Theorem 1. □
Corollary 2.
Let $\mathbf{p}$ and $\mathbf{q}$ be two probability distributions on a same countable alphabet, $\mathscr{X}$. Let $\mathbf{p}^{\downarrow}$ and $\mathbf{q}^{\downarrow}$ be the corresponding entropic distributions of $\mathbf{p}$ and $\mathbf{q}$, respectively. Then $\mathbf{p}^{\downarrow} = \mathbf{q}^{\downarrow}$ if and only if $\phi_{\mathbf{p}}(t) = \phi_{\mathbf{q}}(t)$ on any infinite sequence of distinct values, $\{t_n; n \ge 1\} \subset (a,b)$, where $1 \le a < b < \infty$.
Proof. 
Noting that the infinitely many $t_n$s are in a bounded interval, there exists an infinite subsequence of $\{t_n; n \ge 1\}$ that converges to a constant $c \in [a,b]$. The corollary follows from Corollary 1. □
Consider a pair of random elements, $(X, Y)$, on a countable joint alphabet, $\mathscr{X} \times \mathscr{Y} = \{(\ell_i, m_j); i \ge 1, j \ge 1\}$, with a corresponding joint probability distribution, $\mathbf{p}_{X,Y} = \{p_{i,j}; i \ge 1, j \ge 1\}$. Let $\mathbf{p}_X = \{p_{i,\cdot}; i \ge 1\}$ and $\mathbf{p}_Y = \{p_{\cdot,j}; j \ge 1\}$, where $p_{i,\cdot} = \sum_{j \ge 1} p_{i,j}$ and $p_{\cdot,j} = \sum_{i \ge 1} p_{i,j}$, be the two marginal probability distributions of $X$ and $Y$, respectively.
Corollary 3.
X and Y are independent if and only if
$$\phi_{X,Y}(t) = \phi_X(t) \times \phi_Y(t)$$
for all $t \in (a,b)$, where $a$ and $b$ are two arbitrary real numbers such that $1 \le a < b < \infty$.
Proof. 
If $X$ and $Y$ are independent, then (18) follows immediately. Conversely, suppose that (18) holds. Consider another pair of independent random elements, $(U, V)$, on the same countable joint alphabet $\mathscr{X} \times \mathscr{Y}$ and with marginal distributions identical to those of $(X, Y)$, that is, $\mathbf{p}_X$ and $\mathbf{p}_Y$. It then follows, by (18) and Theorem 1, that $\mathbf{p}_{U,V} = \mathbf{p}_{X,Y}$, which in turn implies that $X$ and $Y$ are independent. □
Corollary 3 provides a characterization of independence on a general countable joint alphabet, and its utility may be explored further.

4. A Basic Convergence Theorem

From an entropic perspective, the convergence of $\hat{\mathbf{p}}^{\downarrow}$ to $\mathbf{p}^{\downarrow}$, to be distinguished from that of $\hat{\mathbf{p}}$ to $\mathbf{p}$, is of fundamental interest.
For clarity of presentation in this section, let it be noted that, whenever necessary, the subindex $n$ may be added to $\mathbf{Y}$, $Y_k$, $\hat{\mathbf{p}}$, $\hat{p}_k$, $\hat{\mathbf{p}}^{\downarrow}$, and $\hat{p}_{(k)}$ to highlight the dynamic nature of these previously defined quantities as $n$ changes, that is, $\mathbf{Y} = \mathbf{Y}_n$, $Y_k = Y_{k,n}$, $\hat{\mathbf{p}} = \hat{\mathbf{p}}_n$, $\hat{p}_k = \hat{p}_{k,n}$, $\hat{\mathbf{p}}^{\downarrow} = \hat{\mathbf{p}}^{\downarrow}_n$, and $\hat{p}_{(k)} = \hat{p}_{(k),n}$, respectively.
The main result established in this section is the uniform almost-sure convergence of $\hat{\mathbf{p}}^{\downarrow}$ to $\mathbf{p}^{\downarrow}$, which is made more precise in Theorem 2 below.
Consider the experiment of repeatedly and independently drawing a letter from X under p , resulting in a sequence of randomly selected letters, ω = { x 1 , x 2 , } . Let the collection of all possible such sequences or paths be denoted Ω . A sample of size n is a partial sequence of the first n randomly selected letters in an ω , { x 1 , , x n } .
Let $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$ and $\hat{\mathbf{p}}^{\downarrow} = \{\hat{p}_{(k)}; k \ge 1\}$ be defined as above. It is to be specifically noted that the rearrangement of the observed relative frequencies, $\hat{\mathbf{p}}^{\downarrow}$, is performed based on the observed values of $\hat{p}_k$ for all $k \ge 1$ alone, with no regard to the arrangement of the probabilities, $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$. Consequently, the letter whose relative frequency is observed as $\hat{p}_{(k)}$ is not necessarily the same letter with which the probability $p_{(k)}$ is associated. This is, in fact, the essence of the entropic perspective.
Theorem 2.
For any $\mathbf{p} \in \mathcal{P}$, let $\mathbf{p}^{\downarrow}$, $\hat{\mathbf{p}}$ and $\hat{\mathbf{p}}^{\downarrow}$ be defined as above. Then
$$\max_{k \ge 1}\left|\hat{p}_{(k)} - p_{(k)}\right| \xrightarrow{a.s.} 0.$$
A proof of Theorem 2 requires Lemmas 2 and 3 below.
Lemma 2.
For any $\mathbf{p} \in \mathcal{P}$, let $\hat{\mathbf{p}}$ be as defined above. Then
$$\max_{k \ge 1}\left|\hat{p}_k - p_k\right| \xrightarrow{a.s.} 0.$$
Proof. 
For each $k$, by the strong law of large numbers, $\hat{p}_k \xrightarrow{a.s.} p_k$, or equivalently $|\hat{p}_k - p_k| \xrightarrow{a.s.} 0$. Let the collection of paths $\omega = \{x_1, x_2, \dots\}$ in $\Omega$ that satisfy $\lim_{n \to \infty}|\hat{p}_k - p_k| = 0$ be denoted as $\Omega_k \subset \Omega$. It follows that $P(\Omega_k) = 1$, that the complement of $\Omega_k$, $\Omega_k^c$, is of probability zero, that $\cup_{k \ge 1}\Omega_k^c$ is of probability zero, and that, letting $\Omega^* = \cap_{k \ge 1}\Omega_k$, $P(\Omega^*) = 1 - P(\cup_{k \ge 1}\Omega_k^c) = 1$.
For each and every path $\omega \in \Omega^*$ and every $k$, $\lim_{n \to \infty}|\hat{p}_k - p_k| = 0$. Noting the fact that $|\hat{p}_k - p_k| \le \hat{p}_k + p_k$ and, therefore, $\sum_{k \ge 1}|\hat{p}_k - p_k| \le \sum_{k \ge 1}(\hat{p}_k + p_k) = 2$, by the bounded convergence theorem,
$$\lim_{n \to \infty}\sum_{k \ge 1}|\hat{p}_k - p_k| = \sum_{k \ge 1}\lim_{n \to \infty}|\hat{p}_k - p_k| = 0,$$
that is, $\sum_{k \ge 1}|\hat{p}_k - p_k| \xrightarrow{a.s.} 0$. By (21), the lemma follows from the fact that
$$\max_{k \ge 1}|\hat{p}_k - p_k| \le \sum_{k \ge 1}|\hat{p}_k - p_k| \xrightarrow{a.s.} 0.$$
 □
Lemma 2 may be viewed as a version of the Glivenko–Cantelli theorem on countable alphabets with respect to observed data from a classical multinomial experiment. The uniformity of the convergence in (20) is of essential importance in the proof of Theorem 2, which is given below by way of Lemma 3.
Lemma 3.
For each $k \ge 1$,
$$\hat{p}_{(k)} - p_{(k)} \xrightarrow{a.s.} 0.$$
A proof of Lemma 3 is given in Appendix A. Let it be noted that $\Omega$ is the sample space of a perpetual multinomial iid sampling scheme on $\mathscr{X}$ under a probability distribution $\mathbf{p} \in \mathcal{P}$. Each path in $\Omega$ may be represented by $\{\hat{\mathbf{p}}_n; n \ge 1\}$ where $\hat{\mathbf{p}}_n = \{\hat{p}_{k,n}; k \ge 1\}$. For each such path $\{\hat{\mathbf{p}}_n; n \ge 1\} \in \Omega$, there exists a corresponding path $\{\hat{\mathbf{p}}^{\downarrow}_n; n \ge 1\}$, which is the rearranged $\{\hat{\mathbf{p}}_n; n \ge 1\}$ over all $k$ for every $n$. Let the total collection of all rearranged paths of $\Omega$ be denoted as $\Omega^{\downarrow}$, and let the collection of all rearranged paths of $\Omega^*$ be denoted as $\Omega^{*\downarrow}$. It follows that $P(\Omega^{*\downarrow}) = P(\Omega^*) = 1$. Lemma 3 states that, in each path of $\Omega^{*\downarrow}$, the $k$th component of $\hat{\mathbf{p}}^{\downarrow}_n$ converges to the $k$th component of $\mathbf{p}^{\downarrow}$, namely, $p_{(k)}$, for each $k$.
Proof of Theorem 2.
For any $\omega \in \Omega^{*\downarrow}$, note that $\sum_{k \ge 1}|\hat{p}_{(k)} - p_{(k)}| \le 2$; by the bounded convergence theorem and Lemma 3, $\lim_{n \to \infty}\max_{k \ge 1}|\hat{p}_{(k)} - p_{(k)}| \le \lim_{n \to \infty}\sum_{k \ge 1}|\hat{p}_{(k)} - p_{(k)}| = \sum_{k \ge 1}\lim_{n \to \infty}|\hat{p}_{(k)} - p_{(k)}| = 0$. The theorem follows from the fact that $P(\Omega^{*\downarrow}) = 1$. □
Theorem 2 may be viewed as a version of the Glivenko–Cantelli theorem on countable alphabets with respect to observed data from an entropic multinomial experiment. Theorem 2 immediately implies almost sure convergence for estimators of several key quantities in classification procedures.
Example 4.
$$\hat{p}_{(1)} \xrightarrow{a.s.} p_{(1)}.$$
Example 5.
Suppose that $p_{(1)} > p_{(2)}$, that is, there exists a unique letter in $\mathscr{X}$, denoted $\ell_0$, associated with probability $p_{(1)}$. Then the probability of a correct classification, that is, $\ell_0 = \arg\max_{\mathscr{X}}\{\hat{p}_k; k \ge 1\}$, converges almost surely to one. This is so because, for any path in $\Omega^*$ and any $\varepsilon < (p_{(1)} - p_{(2)})/2$, there exists an $N$ such that, for any $n > N$, $|\hat{p}_{(1)} - p_{(1)}| < \varepsilon$ and $|\hat{p}_{(1)} - p_{(k)}| > \varepsilon$ for all $k \ge 2$.
The results of Examples 4 and 5 lend fundamental support for classification algorithms based on maximum observed frequency, used widely in exercises of modern data science, for example, decision trees, as mentioned in Example 3.
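A simulation sketch consistent with Theorem 2 is given below; the distribution is chosen only for illustration. The sup-distance between the sorted sample relative frequencies and the sorted probabilities shrinks as $n$ grows, even though the sorting of $\hat{\mathbf{p}}$ is done with no regard to which letter is which.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.35, 0.25, 0.15, 0.10, 0.10, 0.05])
p_down = np.sort(p)[::-1]                 # the entropic distribution p_(1) >= p_(2) >= ...

for n in (100, 1000, 10000, 100000):
    sample = rng.choice(len(p), size=n, p=p)
    p_hat = np.bincount(sample, minlength=len(p)) / n
    p_hat_down = np.sort(p_hat)[::-1]     # sorted observed relative frequencies
    print(n, np.max(np.abs(p_hat_down - p_down)))   # sup-distance shrinks as n grows
```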
Many entropies of interest across a wide spectrum of studies are of the additive form, $H(\mathbf{p}) = \sum_{k \ge 1} g(p_{(k)}) h(p_{(k)})$, where $g(p) \ge 0$ and $h(p) \ge 0$ are functions of $p \in [0,1]$. The almost-sure convergence of Theorem 2 may be passed on to the plug-in estimators of some such entropies by way of a rather trivial statement in the proposition below.
Proposition 1.
Let $H(\mathbf{p}) = \sum_{k \ge 1} g(p_{(k)}) h(p_{(k)})$ where $g(p) \ge 0$ and $0 \le h(p) \le M$ for some $M > 0$ are continuous functions of $p \in I = [0,1]$. Suppose that $\mathbf{p} \in \mathcal{P}$ is such that
1. $\sum_{k \ge 1} g(p_{(k)}) < \infty$, and
2. $\sum_{k \ge 1} g(\hat{p}_{(k),n}) \xrightarrow{a.s.} \sum_{k \ge 1} g(p_{(k)})$.
Then $H(\hat{\mathbf{p}}) \xrightarrow{a.s.} H(\mathbf{p})$.
Proof. 
Noting that $H(\mathbf{p}) = \sum_{k \ge 1} g(p_{(k)}) h(p_{(k)}) \le M \sum_{k \ge 1} g(p_{(k)}) < \infty$, it follows by Conditions 1 and 2 that
$$|H(\hat{\mathbf{p}}) - H(\mathbf{p})| \le M \sum_{k \ge 1} g(\hat{p}_{(k)}) + M \sum_{k \ge 1} g(p_{(k)}) < \infty.$$
Let $\Omega^{**} \subset \Omega^{\downarrow}$ be the total collection of paths such that Condition 2 holds. For each path, $\{\hat{\mathbf{p}}^{\downarrow}_n; n \ge 1\} \in \Omega^{**}$, by (23), the proposition follows by the bounded convergence theorem and the fact that $P(\Omega^{**}) = 1$. □
Example 6.
Let $H(\mathbf{p}) = \sum_{k \ge 1} p_{(k)}^s (1 - p_{(k)})^t$ where $s \ge 1$ and $t \ge 0$ are two real constants. In the setup of Proposition 1, $h(p) = (1-p)^t \le 1$ on $I = [0,1]$, and $g(p) = p^s$ satisfies $\sum_{k \ge 1} p_{(k)}^s \le 1$ for all $\mathbf{p} \in \mathcal{P}$ without qualification, and therefore also for $\hat{\mathbf{p}} \in \mathcal{P}$, that is, $\sum_{k \ge 1}\hat{p}_{(k)}^s \le 1$, which implies, by the bounded convergence theorem, $\sum_{k \ge 1}\hat{p}_{(k)}^s \to \sum_{k \ge 1} p_{(k)}^s$ along each and every path in $\Omega^{*\downarrow}$. By Proposition 1, $\sum_{k \ge 1}\hat{p}_{(k)}^s(1 - \hat{p}_{(k)})^t \xrightarrow{a.s.} \sum_{k \ge 1} p_{(k)}^s(1 - p_{(k)})^t$. More specifically, when $s$ and $t$ take integer values $u \ge 1$ and $v \ge 0$, the plug-in estimator of the generalized Simpson's diversity index $H(\mathbf{p}) = \sum_{k \ge 1} p_{(k)}^u (1 - p_{(k)})^v$ (see [5,16]) converges almost surely.
Example 6 implies that the plug-in estimator of $H(\mathbf{p}) = \sum_{k \ge 1} p_{(k)}^s$, where $s \ge 1$, converges almost surely, which in turn implies that the plug-in estimators of the members of the Rényi entropy family and the Tsallis entropy family converge almost surely for all $\mathbf{p} \in \mathcal{P}$ without qualification when $\alpha > 1$. However, it is not known whether the plug-in estimators of the members of the families with $\alpha \in (0,1)$ converge almost surely when $\mathbf{p} \in \mathcal{P}$ without other qualification (also, see [17]).
Example 7.
The plug-in estimator of the Shannon entropy, $H(\mathbf{p}) = -\sum_{k \ge 1} p_{(k)} \ln p_{(k)}$, converges almost surely when $\mathbf{p}$ is such that $K = \sum_{k \ge 1} 1[p_{(k)} > 0] < \infty$. In this case, even though $-\ln p$ is not bounded above on $I = [0,1]$, $h(p) = -p^{\alpha} \ln p \le 1/(\alpha e)$ is, for any $\alpha \in (0,1)$. Writing $H(\mathbf{p}) = \sum_{k \ge 1} g(p_{(k)}) h(p_{(k)})$ where $g(p) = p^{1-\alpha}$ and $h(p) = -p^{\alpha} \ln p$, it suffices to show that $\sum_{k \ge 1}\hat{p}_{(k)}^{1-\alpha}$ converges almost surely. However, this is the case since, by Theorem 2, for every path $\{\hat{\mathbf{p}}^{\downarrow}_n; n \ge 1\} \in \Omega^{*\downarrow}$, $\sum_{k \ge 1}\hat{p}_{(k)}^{1-\alpha} \to \sum_{k \ge 1} p_{(k)}^{1-\alpha}$, due to the fact that $K < \infty$ and $P(\Omega^{*\downarrow}) = 1$.
It is not known whether the plug-in estimator of the Shannon entropy converges almost surely when $\mathbf{p} \in \mathcal{P}$ without further qualification.
The Shannon entropy has utilities across a wide spectrum of scientific investigations (see [18]). However, it is not finitely defined for all distributions in $\mathcal{P}$. A family of generalized Shannon entropies, for any $\mathbf{p} \in \mathcal{P}$, is proposed as
$$H_m(\mathbf{p}) = -\sum_{k \ge 1} \frac{p_{(k)}^m}{\sum_{j \ge 1} p_{(j)}^m} \ln\left(\frac{p_{(k)}^m}{\sum_{j \ge 1} p_{(j)}^m}\right)$$
in [19], where $m \ge 1$ is an integer. The Shannon entropy is the special family member corresponding to $m = 1$. It may be verified that each member of the family, except the Shannon entropy, is finitely defined for all $\mathbf{p} \in \mathcal{P}$ and offers all the important utilities that the Shannon entropy offers, including the fact that the mutual information derived based on each member with $m \ge 2$ is zero if and only if the two underlying random elements are independent.
Example 8.
The plug-in estimator of (24) converges almost surely for any $\mathbf{p} \in \mathcal{P}$ whenever $m \ge 2$. To see this, let it first be noted that the plug-in estimator of $-\sum_{k \ge 1} p_k^m \ln p_k$ converges almost surely. This fact follows from Proposition 1 with $g(p) = p$ and $h(p) = -p^{m-1} \ln p$, which is uniformly bounded above on $I = [0,1]$. The claimed almost-sure convergence then follows from the re-expression of (24) below,
$$H_m(\mathbf{p}) = \frac{1}{\sum_{j \ge 1} p_{(j)}^m}\left(-m \sum_{k \ge 1} p_{(k)}^m \ln p_{(k)} + \sum_{k \ge 1} p_{(k)}^m \ln\Big(\sum_{j \ge 1} p_{(j)}^m\Big)\right),$$
and the fact that the plug-in estimator of each of the four series involved converges almost surely.
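As a numerical companion to Example 8, the sketch below (helper name and test distribution ours) computes the plug-in estimator of $H_m$ for $m = 2$ and shows it approaching the population value as the sample size grows.

```python
import numpy as np

def generalized_shannon(p, m):
    """H_m(p) = -sum_k w_k ln w_k with w_k = p_k^m / sum_j p_j^m (Shannon entropy when m = 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    w = p ** m / np.sum(p ** m)
    return float(-np.sum(w * np.log(w)))

rng = np.random.default_rng(1)
p = np.array([0.4, 0.3, 0.2, 0.1])
for n in (100, 10000, 1000000):
    p_hat = np.bincount(rng.choice(len(p), size=n, p=p), minlength=len(p)) / n
    print(n, generalized_shannon(p_hat, 2), generalized_shannon(p, 2))  # plug-in approaches the target value
```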

5. Conclusions and Discussion

This article introduces a perspective termed entropic statistics. One of the motivations of the perspective is to accommodate probability experiments on sample spaces which may include outcomes that are known to exist (and therefore are prescribed) and those whose existence is not known (and therefore not prescribable). Such a framework allows statistical exploration into a general population with possibly infinitely many previously unobserved and unknown outcomes, or new discoveries. The key concept fostering such a framework is label-independence, that is, all parameters and statistics do not depend on the labels of an alphabet as long as the labels are distinguishable. Consequently, in this article an array of label-independent objects are defined and termed entropic objects. In particular, a general entropy, entropic parameters, entropic statistics, entropic sample spaces, entropic probability distributions, and an entropic-moment-generating function are defined.
Based on the defined entropic objects, two basic theorems are established. Theorem 1 provides a characterization of the entropic probability distribution on the alphabet via the entropic-moment-generating function, and Theorem 2 establishes the almost-sure convergence of the entropic statistics to the entropic parameters and, hence, provides a foundational support to the entropic framework.
On the other hand, this article merely provides a few basic results in entropic statistics. On a broader spectrum, many other issues may be fruitfully considered on at least three fronts, namely, fundamental, probabilistic, and statistical. To begin with, the fundamental question of what constitutes entropy may be explored in many directions. One of the most cited sets of axioms is that discussed by Khinchin [20], under which the Shannon entropy is proved to be unique. However under slightly less restrictive axioms, many other entropies exist and enjoy almost all the desirable utilities of the Shannon entropy; for example, see [19]. The existing literature on generalization of entropy is extensive in physics and information theory; for example, see [21,22]. The collective effort to better understand what entropy is and how it may help to describe an underlying random system is ongoing. Further research in understanding generalized entropies and their implications could greatly enrich the framework of entropic statistics.
Entropy in general is often thought of as a summary of a profile state, however measured numerically, of inner energy or chaos within a random system. As such, it is independent of any labeling system, regardless of whether the state is observable or not. A key conceptual shift introduced in this article is from statistical inference on $\mathbf{p}$ (or a function of $\mathbf{p}$) based on the multinomial frequencies $\mathbf{Y}$ to that on $\mathbf{p}^{\downarrow}$ (or a function of $\mathbf{p}^{\downarrow}$) based on the entropic frequencies $\mathbf{Y}^{\downarrow}$. Such a framework shift, by necessity or by choice, triggers a long array of basic probability and statistics questions, under different degrees of model restriction, ranging from parametric forms of $p_k = p(k, \theta)$ for some parameter $\theta$ to the nonparametric form, $\{p_{(k)}; k \ge 1\}$. It may be interesting to note that even for the nonparametric form, there are several qualitatively different cases: that of a known $K = \sum_{k \ge 1} 1[p_{(k)} > 0] < \infty$, that of an unknown $K = \sum_{k \ge 1} 1[p_{(k)} > 0] < \infty$, and that of $K = \sum_{k \ge 1} 1[p_{(k)} > 0] = \infty$. Each of these model classes could imply a very different stochastic behavior of $\mathbf{Y}^{\downarrow}$ as the sample size $n$ increases. Even long before the notion of information entropy was coined by Shannon in [1], the behavior of $\mathbf{Y}^{\downarrow}$ had been discussed in the literature by, for example, Auerbach [23] and Zipf [24]. More recently, several articles [25,26] discussed domains of attraction in the total collection of all distributions on a countable alphabet by a tail index, $\tau_n = n \sum_{k \ge 1} p_{(k)}(1 - p_{(k)})^n$. Each domain characterizes the decay rate of the tail of the underlying entropic distribution and, in turn, dictates the rates of convergence of various statistical estimators of various entropies. Further advances on that front would enhance the understanding of the probabilistic behavior of the entropic statistics and, in turn, of the estimated entropies of interest.
In terms of statistical estimation, a large proportion of the existing literature mainly focuses on the Shannon entropy and variations of the plug-in estimators under various conditions, most of which are described and referenced in [14]. There are also non-plug-in estimators of different types, for example, the Bayes estimators [27,28,29], the hierarchical Bayes estimators [30], the James–Stein estimators [31], the coverage-adjusted estimators [32,33,34], and an unbiased estimator based on sequential data proposed by Montgomery-Smith and Schürmann. In general, the asymptotic distributions of the plug-in estimators and their variants seem to have been studied and described to some extent; for example, see [12,35,36,37,38]. However, it is fair to say that many, if not most, of the proposed estimators of various types have not yet been assigned asymptotic distributions. Any advances in that direction could much benefit applications of these estimators.
In short, the landscape of entropic statistics is quite porous in comparison to that of richly supported classical statistics. Many basic and important questions are yet to be answered, from the axiomatic foundation, to the definitions of basic elements, to the theoretical supporting architecture, and to the relevance in applications. However, the same said porosity also offers opportunities for interesting contemplation.

Funding

This research received no external funding.

Data Availability Statement

This research involved no data.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A

Proof of Lemma 3.
For clarity, the proof of (22) is given, respectively, in five progressively more general cases: (1) $p_1 = 1$; (2) $K = \sum_{k \ge 1} 1[p_k > 0] < \infty$ and all positive $p_k$'s are distinct; (3) $K < \infty$; (4) $K$ is infinite and all positive $p_k$'s are distinct; (5) $K$ is infinite.
For notational simplicity in all cases, let it be assumed without loss of generality that $\mathbf{p} = \{p_k; k \ge 1\}$ is nonincreasingly arranged to begin with, that is, $p_k \ge p_{k+1}$ for every $k$. With this assumption, the only rearranged object is $\hat{\mathbf{p}}^{\downarrow} = \{\hat{p}_{(k)}; k \ge 1\}$ with $\hat{p}_{(k)} \ge \hat{p}_{(k+1)}$ for every $k$.
In Case 1, the statement of (22) is trivial.
In Case 2, let $p_0 = 1$ and $p_{K+1} = 0$. It follows that
$$1 = p_0 > p_1 > p_2 > \cdots > p_{k-1} > p_k > p_{k+1} > \cdots > p_K > p_{K+1} = 0.$$
For each sequence $\omega \in \Omega^*$ as defined in the proof of Lemma 2, the uniformity of (20) implies that for any $\varepsilon > 0$, there exists an $N$ such that for all $n > N$, $-\varepsilon < \hat{p}_k - p_k < \varepsilon$ for all $k \ge 1$. Specifically, let
$$\varepsilon_0 = \min\{(p_k - p_{k+1})/2;\ 1 \le k \le K\} > 0.$$
There exists an $N$ such that for all $n > N$, $\max\{|\hat{p}_k - p_k|;\ 1 \le k \le K\} < \varepsilon_0$, which has the following two implications.
  • $\cap_{k=1}^{K}(p_k - \varepsilon_0, p_k + \varepsilon_0) = \emptyset$, that is, the intervals $p_k \pm \varepsilon_0$ are disjoint for all $k = 1, \dots, K$.
  • For every $k$, $k = 1, \dots, K$, $\hat{p}_k \in p_k \pm \varepsilon_0$ and it is the only observed relative frequency in $p_k \pm \varepsilon_0$.
Combining the above two implications, it follows that $\hat{p}_k = \hat{p}_{(k)}$, that is, $|\hat{p}_{(k)} - p_k| \to 0$, for every $k$, $k = 1, \dots, K$. Since $\Omega^*$ is of probability one, (22) is established.
In Case 3, it is allowed that several consecutive probabilities in $\mathbf{p} = \{p_k; 1 \le k \le K\}$, where $K = \sum_{k \ge 1} 1[p_k > 0] < \infty$, are identical. It follows that
$$1 = p_0 \ge p_1 \ge p_2 \ge \cdots \ge p_{k-1} \ge p_k \ge p_{k+1} \ge \cdots \ge p_K > p_{K+1} = 0.$$
Noting that $\mathbf{p} = \{p_k; k \ge 1\}$ is a finite sequence of runs of identical values, collecting the first value in each run and retaining its index value, a subset of $\{p_k; k \ge 1\}$ is obtained, namely, $\{p_{k_i}; i = 1, \dots, I\}$, where $I$ is the number of distinct values in $\mathbf{p}$. Let $r_i$ be the multiplicity of $p_{k_i}$ in $\mathbf{p}$, $i = 1, \dots, I$. It follows that
$$1 = p_0 = p_{k_0} > p_{k_1} > p_{k_2} > \cdots > p_{k_I} > p_{K+1} = 0.$$
For each sequence $\omega \in \Omega^*$ as defined in the proof of Lemma 2, the uniformity of (20) implies that for any $\varepsilon > 0$, there exists an $N$ such that for all $n > N$, $-\varepsilon < \hat{p}_k - p_k < \varepsilon$ for all $k$, $k = 1, \dots, K$. Specifically, let
$$\varepsilon_1 = \min\{(p_{k_i} - p_{k_{i+1}})/2;\ i = 0, \dots, I\} > 0$$
where $p_{k_0} = p_0 = 1$ and $p_{k_{I+1}} = p_{K+1} = 0$. There exists an $N_1$ such that for all $n > N_1$, $\max\{|\hat{p}_k - p_k|;\ k \ge 1\} < \varepsilon_1$, which has the following implications.
  • $\cap_{i=1}^{I}(p_{k_i} - \varepsilon_1, p_{k_i} + \varepsilon_1) = \emptyset$, that is, the intervals $p_{k_i} \pm \varepsilon_1$ are disjoint for all $i$, $i = 1, \dots, I$.
  • For every given $k$, and therefore an implied $i$, there are exactly $r_i$ relative frequencies among $\{\hat{p}_k; 1 \le k \le K\}$ found in $p_{k_i} \pm \varepsilon_1$.
It then follows that, for each given $k$,
$$\min\{\hat{p}_{(k_i + j)};\ j = 0, \dots, r_i - 1\} \le \hat{p}_{(k)} \le \max\{\hat{p}_{(k_i + j)};\ j = 0, \dots, r_i - 1\},$$
and hence $\hat{p}_{(k)} \to p_{(k)} = p_k$. Finally, (22) follows from the fact that $P(\Omega^*) = 1$.
In Case 4, $p_k > 0$ for all $k \ge 1$ and all probabilities are distinct. Letting $p_0 = 1$ and $p_{\infty} = 0$,
$$1 = p_0 > p_1 > p_2 > \cdots > p_{k-1} > p_k > p_{k+1} > \cdots > p_{\infty} = 0.$$
For every fixed $k'$ such that $p_{k'} \in (0,1)$, let $m \ge 1$ be an integer such that
$$1 - \sum_{k=1}^{m} p_k < p_{k'+1}, \ \text{and}$$
$$m \ge k' + 1.$$
Such an $m$ exists for any given $\mathbf{p}$ with an infinite $K$ and a fixed $k' \ge 1$.
For each sequence $\omega \in \Omega^*$, as defined in the proof of Lemma 2, the uniformity of (20) implies that for any $\varepsilon > 0$, there exists an $N$ such that for all $n > N$, $-\varepsilon < \hat{p}_k - p_k < \varepsilon$ for all $k$, $k \ge 1$. Specifically, let
$$\varepsilon_2 = \min\{(p_k - p_{k+1})/2;\ k = 0, \dots, m\} > 0.$$
There exists an $N_2$ such that for all $n > N_2$, $\max\{|\hat{p}_k - p_k|;\ k \ge 1\} < \varepsilon_2$, which implies the following.
  • The first $m$ probabilities of $\mathbf{p}$, $p_1, \dots, p_m$, are covered, respectively, by $m$ disjoint intervals, $p_k \pm \varepsilon_2$, $k = 1, \dots, m$.
  • The relative frequencies corresponding to $\{p_1, \dots, p_m\}$, namely, $\{\hat{p}_1, \dots, \hat{p}_m\}$, are also covered, respectively, by the same disjoint intervals, $p_k \pm \varepsilon_2$, $k = 1, \dots, m$.
On the other hand, noting the strict inequality in (A3) and the fact that $k'$ is a fixed integer, there exists a sufficiently small $\varepsilon_3$ such that
$$1 - \sum_{k=1}^{m} p_k + m\varepsilon_3 < p_{k'+1}$$
or, equivalently,
$$1 - \sum_{k=1}^{m}(p_k - \varepsilon_3) < p_{k'+1}.$$
Let $\varepsilon_4 = \min\{\varepsilon_2, \varepsilon_3\}$. By Lemma 2, there exists an $N_4$ such that for all $n > N_4$,
$$p_k - \varepsilon_4 < \hat{p}_k < p_k + \varepsilon_4,$$
for all $k$, $k = 1, \dots, m$, and that the updated (A5) and (A6) hold, namely,
$$1 - \sum_{k=1}^{m} p_k + m\varepsilon_4 < p_{k'+1}$$
or, equivalently,
$$1 - \sum_{k=1}^{m}(p_k - \varepsilon_4) < p_{k'+1}.$$
That is, in each of the disjoint intervals of (A7), there is at least one relative frequency. In particular, $\hat{p}_k$ is covered in $(p_k - \varepsilon_4, p_k + \varepsilon_4)$ for each $k$, $k = 1, \dots, k'$, noting $k' < m$ by (A4).
Next it is necessary to show that there may not be more than one relative frequency in $(p_k - \varepsilon_4, p_k + \varepsilon_4)$ for each $k$, $k = 1, \dots, k'$. Toward that end, consider the total mass of $100\%$ distributed among $\hat{p}_k$, $k \ge 1$, given $n$. From interval $(p_1 - \varepsilon_4, p_1 + \varepsilon_4)$ to interval $(p_m - \varepsilon_4, p_m + \varepsilon_4)$, the total collective mass covered is at least $\sum_{k=1}^{m}\hat{p}_k$; however, by (A7) and (A8),
$$\sum_{k=1}^{m}\hat{p}_k = \sum_{k=1}^{m}(\hat{p}_k + \varepsilon_4) - m\varepsilon_4 > \sum_{k=1}^{m} p_k - m\varepsilon_4 = \sum_{k=1}^{m}(p_k - \varepsilon_4) > 1 - p_{k'+1}$$
and the remainder of the mass is
$$1 - \sum_{k=1}^{m}\hat{p}_k < p_{k'+1} < p_{k'+1} + \varepsilon_4.$$
Regardless of whether the mass, $1 - \sum_{k=1}^{m}\hat{p}_k$, on the left side of (A9) is allocated to one or more than one letter other than those in $\{\ell_1, \dots, \ell_m\}$, the corresponding $\hat{p}_k$, $k \ge m+1$, could not possibly be sufficiently large to exceed $p_{k'+1} + \varepsilon_4$, nor, therefore, $p_{k'} - \varepsilon_4$. That implies that, along the path of that selected $\omega \in \Omega^*$, for any $n > N_4$, $\hat{p}_k$ and $\hat{p}_k$ alone is covered in $(p_k - \varepsilon_4, p_k + \varepsilon_4)$ for each $k$, $k = 1, \dots, k'$. This immediately implies that $\hat{p}_k = \hat{p}_{(k)}$ for all $k$, $k = 1, \dots, k'$, and in particular $\hat{p}_{k'} = \hat{p}_{(k')}$. $\hat{p}_{(k')} \to p_{k'}$ since $\hat{p}_{k'} \to p_{k'}$. Finally, (22) follows from the fact that $P(\Omega^*) = 1$.
In Case 5, $p_k > 0$ for all $k \ge 1$ but the probabilities in $\mathbf{p} = \{p_k; k \ge 1\}$ are allowed to have multiplicities. Letting $p_0 = 1$ and $p_{\infty} = 0$,
$$1 = p_0 > p_1 \ge p_2 \ge \cdots \ge p_k \ge \cdots > p_{\infty} = 0.$$
$\mathbf{p} = \{p_k; k \ge 1\}$ has a special pattern: its maximum value runs for $r_1$ times; then its second largest value runs for $r_2$ times, and so on and so forth. In general, its $i$th largest value runs for $r_i$ times, followed by a run of its $(i+1)$st largest value. Collect the first value in each run and record its index, $k_i$, $i \ge 1$, resulting in a strictly decreasing subsequence, $\{p_{k_i}; i \ge 1\}$. Letting $k_0 = 0$ and $k_{\infty} = \infty$,
$$1 = p_0 = p_{k_0} > p_{k_1} > p_{k_2} > \cdots > p_{k_i} > \cdots > p_{k_{\infty}} = p_{\infty} = 0.$$
Consequently, $\mathbf{p} = \{p_k; k \ge 1\}$ may be viewed as a sequence containing $p_{k_i}$ for $i \ge 1$ with $r_i - 1$ copies of $p_{k_i}$ between $p_{k_i}$ and $p_{k_{i+1}}$.
Given a value of $k$, say $k'$, there is an $i'$ such that $p_{k'} = p_{k_{i'}}$, and $k'$ must be one of the values from the list $\{k_{i'}, k_{i'}+1, \dots, k_{i'}+r_{i'}-1\}$, noting $p_{k_{i'}+r_{i'}} = p_{k_{i'+1}} < p_{k'}$. Let $m$ be such that
$$1 - \sum_{i=1}^{m} r_i p_{k_i} < p_{k_{i'+1}}, \ \text{and}$$
$$\sum_{i=1}^{m} r_i \ge k_{i'+1}.$$
Such an $m$ exists for any given $\mathbf{p}$ and a fixed $k' \ge 1$, which fixes an $i'$.
For each sequence $\omega \in \Omega^*$ as defined in the proof of Lemma 2, the uniformity of (20) implies that for any $\varepsilon > 0$, there exists an $N$ such that for all $n > N$, $-\varepsilon < \hat{p}_k - p_k < \varepsilon$ for all $k$, $k \ge 1$. Specifically, let
$$\varepsilon_5 = \min\{(p_{k_i} - p_{k_{i+1}})/2;\ i = 0, \dots, m\} > 0.$$
There exists an $N_5$ such that for all $n > N_5$, $\max\{|\hat{p}_k - p_k|;\ k \ge 1\} < \varepsilon_5$, which has the following two implications.
  • The first $\sum_{i=1}^{m} r_i$ probabilities of $\mathbf{p}$, $p_1, \dots, p_{\sum_{i=1}^{m} r_i}$, are covered, respectively, by the $m$ disjoint intervals, $p_{k_i} \pm \varepsilon_5$, $i = 1, \dots, m$.
  • The relative frequencies corresponding to $\{p_1, \dots, p_{\sum_{i=1}^{m} r_i}\}$, namely, $\{\hat{p}_1, \dots, \hat{p}_{\sum_{i=1}^{m} r_i}\}$, are also covered, respectively, by the same disjoint intervals, $p_{k_i} \pm \varepsilon_5$, $i = 1, \dots, m$.
On the other hand, noting the strict inequality in (A11) and the fact that $k'$ is a fixed integer, there exists a sufficiently small $\varepsilon_6$ such that
$$1 - \sum_{i=1}^{m} r_i p_{k_i} + \varepsilon_6 \sum_{i=1}^{m} r_i < p_{k_{i'+1}}$$
or, equivalently,
$$1 - \sum_{i=1}^{m} r_i (p_{k_i} - \varepsilon_6) < p_{k_{i'+1}}.$$
Let $\varepsilon_7 = \min\{\varepsilon_5, \varepsilon_6\}$. By Lemma 2, there exists an $N_7$ such that for all $n > N_7$, all relative frequencies sharing the same $p_{k_i}$, namely, $\hat{p}_{k_i}, \hat{p}_{k_i+1}, \dots, \hat{p}_{k_i+r_i-1}$, are found in
$$(p_{k_i} - \varepsilon_7, p_{k_i} + \varepsilon_7)$$
for all $i$, $i = 1, \dots, m$, and the updated (A13) and (A14) are
$$1 - \sum_{i=1}^{m} r_i p_{k_i} + \varepsilon_7 \sum_{i=1}^{m} r_i < p_{k_{i'+1}}$$
or, equivalently,
$$1 - \sum_{i=1}^{m} r_i (p_{k_i} - \varepsilon_7) < p_{k_{i'+1}}.$$
That is, in each of the disjoint intervals of (A15), there are at least $r_i$ relative frequencies. In particular, the $r_i$ relative frequencies, $\{\hat{p}_{k_i}, \hat{p}_{k_i+1}, \dots, \hat{p}_{k_i+r_i-1}\}$, are covered in $(p_{k_i} - \varepsilon_7, p_{k_i} + \varepsilon_7)$ for each $i$, $i = 1, \dots, i' \le m$, by (A12).
Next it is necessary to show that there may not be more than $r_i$ relative frequencies in $(p_{k_i} - \varepsilon_7, p_{k_i} + \varepsilon_7)$ for each $i$, $i = 1, \dots, i'$. Toward that end, consider the total mass of $100\%$ distributed among $\hat{p}_k$, $k \ge 1$, given $n$. From interval $(p_1 - \varepsilon_7, p_1 + \varepsilon_7)$ to interval $(p_{\sum_{i=1}^{m} r_i} - \varepsilon_7, p_{\sum_{i=1}^{m} r_i} + \varepsilon_7)$, the total collective mass covered is at least $\sum_{i=1}^{m} r_i \hat{p}_{k_i}$; however, by (A15) and (A16),
$$\sum_{i=1}^{m} r_i \hat{p}_{k_i} > \sum_{i=1}^{m} r_i (p_{k_i} - \varepsilon_7) > 1 - p_{k_{i'+1}}$$
and the remainder of the mass is
$$1 - \sum_{i=1}^{m} r_i \hat{p}_{k_i} < p_{k_{i'+1}} < p_{k_{i'+1}} + \varepsilon_7.$$
Regardless of whether the mass on the left side of (A17) is allocated to one or more than one letter other than those in $\{\ell_1, \dots, \ell_{\sum_{i=1}^{m} r_i}\}$, the corresponding $\hat{p}_k$, $k \ge \sum_{i=1}^{m} r_i + 1$, could not possibly be sufficiently large to exceed $p_{k_{i'+1}} + \varepsilon_7$, nor, therefore, $p_{k'} - \varepsilon_7$. That implies that, along the path of that selected $\omega \in \Omega^*$, for any $n > N_7$, $\{\hat{p}_{k_i}, \hat{p}_{k_i+1}, \dots, \hat{p}_{k_i+r_i-1}\}$ and only $\{\hat{p}_{k_i}, \hat{p}_{k_i+1}, \dots, \hat{p}_{k_i+r_i-1}\}$ are covered in $(p_{k_i} - \varepsilon_7, p_{k_i} + \varepsilon_7)$ for each $i$, $i = 1, \dots, i'$. This immediately implies that
  • $\{\hat{p}_{k_{i'}}, \hat{p}_{k_{i'}+1}, \dots, \hat{p}_{k_{i'}+r_{i'}-1}\} = \{\hat{p}_{(k_{i'})}, \hat{p}_{(k_{i'}+1)}, \dots, \hat{p}_{(k_{i'}+r_{i'}-1)}\}$ as sets, but not necessarily equal component-wise;
  • $|\hat{p}_{(k_{i'}+j)} - p_{k'}| < \varepsilon_7$ for all $j = 0, 1, \dots, r_{i'}-1$;
  • in particular, $|\hat{p}_{(k')} - p_{k'}| < \varepsilon_7$.
Finally, (22) follows from the fact that $P(\Omega^*) = 1$. □

References

  1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656.
  2. Rényi, A. On measures of information and entropy. In Proceedings of the Fourth Berkeley Symposium on Mathematics, Statistics and Probability, Berkeley, CA, USA, 20–30 June 1961; pp. 547–561.
  3. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
  4. Simpson, E.H. Measurement of diversity. Nature 1949, 163, 688.
  5. Zhang, Z.; Zhou, J. Re-parameterization of multinomial distribution and diversity indices. J. Stat. Plan. Inference 2010, 140, 1731–1738.
  6. Hill, M.O. Diversity and evenness: A unifying notation and its consequences. Ecology 1973, 54, 427–432.
  7. Emlen, J.M. Ecology: An Evolutionary Approach; Addison-Wesley: Reading, MA, USA, 1973.
  8. Miller, G.A.; Madow, W.G. On the Maximum Likelihood Estimate of the Shannon-Weaver Measure of Information; Air Force Cambridge Research Center Technical Report AFCRC-TR-54-75; Operational Applications Laboratory, Air Force, Cambridge Research Center, Air Research and Development Command: New York, NY, USA, 1954.
  9. Miller, G.A. Note on the bias of information estimates. Inf. Theory Psychol. Probl. Methods 1955, 11-B, 95–100.
  10. Harris, B. The Statistical Estimation of Entropy in the Non-Parametric Case; Wisconsin University—Madison Mathematics Research Center: Madison, WI, USA, 1975.
  11. Antos, A.; Kontoyiannis, I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algorithms 2001, 19, 163–193.
  12. Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1253.
  13. Silva, J.F. Shannon entropy estimation in ∞-alphabets from convergence results: Studying plug-in estimators. Entropy 2018, 20, 397.
  14. Zhang, Z. Statistical Implications of Turing's Formula; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2017.
  15. Good, I.J. The population frequencies of species and estimation of population parameters. Biometrika 1953, 40, 237–264.
  16. Grabchak, M.; Marcon, G.; Lang, G.; Zhang, Z. The generalized Simpson's entropy is a measure of biodiversity. PLoS ONE 2017, 12, e0173305.
  17. Contreras-Reyes, J.E. Mutual information matrix based on Rényi entropy and application. Nonlinear Dyn. 2022, 110, 623–633.
  18. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley & Son, Inc.: New York, NY, USA, 2006.
  19. Zhang, Z. Generalized Mutual Information. Stats 2020, 3, 158–165.
  20. Khinchin, A.I. Mathematical Foundations of Information Theory; Dover Publications: New York, NY, USA, 1957.
  21. Amigó, J.M.; Balogh, S.G.; Hernández, S. A Brief Review of Generalized Entropies. Entropy 2018, 20, 813.
  22. Ilić, V.M.; Korbel, J.; Gupta, G.; Scarfone, A.M. An overview of generalized entropic forms. Europhys. Lett. 2021, 133, 50005.
  23. Auerbach, F. Das Gesetz der Bevölkerungskonzentration. Petermann's Geogr. Mitteilungen 1913, 59, 74–76.
  24. Zipf, G.K. Selected Studies of the Principle of Relative Frequency in Language; Harvard University Press: Cambridge, MA, USA; London, UK, 1932.
  25. Zhang, Z. Domains of attraction on countable alphabets. Bernoulli 2018, 24, 873–894.
  26. Molchanov, S.; Zhang, Z.; Zheng, L. Entropic Moments and Domains of Attraction on Countable Alphabets. Math. Meth. Stat. 2018, 27, 60–70.
  27. Krichevsky, R.E.; Trofimov, V.K. The Performance of Universal Encoding. IEEE Trans. Inf. Theory 1981, 27, 199–207.
  28. Holste, D.; Große, I.; Herzel, H. Bayes' estimators of generalized entropies. J. Phys. A Math. Gen. 1998, 31, 2551–2566.
  29. Schürmann, T.; Grassberger, P. Entropy estimation of symbol sequences. Chaos 1996, 6, 414–427.
  30. Nemenman, I.; Shafee, F.; Bialek, W. Entropy and inference, revisited. In Advances in Neural Information Processing Systems; Dietterich, T.G., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, USA, 2002; Volume 14.
  31. Hausser, J.; Strimmer, K. Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks. J. Mach. Learn. Res. 2009, 10, 1469–1484.
  32. Chao, A.; Shen, T.-J. Non-parametric estimation of Shannon's Index of diversity when there are unseen species in sample. Environ. Ecol. Stat. 2003, 10, 429–443.
  33. Vu, V.Q.; Yu, B.; Kass, R.E. Coverage-adjusted entropy estimation. Stat. Med. 2007, 26, 4039–4060.
  34. Zhang, Z. Entropy estimation in Turing's perspective. Neural Comput. 2012, 24, 1368–1389.
  35. Zhang, Z.; Zhang, X. A normal law for the plug-in estimator of entropy. IEEE Trans. Inf. Theory 2012, 58, 2745–2747.
  36. Zhang, Z. Asymptotic normality of an entropy estimator with exponentially decaying bias. IEEE Trans. Inf. Theory 2013, 59, 504–508.
  37. Chen, C.; Grabchak, M.; Stewart, A.; Zhang, J.; Zhang, Z. Normal Laws for Two Entropy Estimators on Infinite Alphabets. Entropy 2018, 20, 371.
  38. Grabchak, M.; Zhang, Z. Asymptotic Normality for Plug-in Estimators of Diversity Indices on Countable Alphabet. J. Nonparametric Stat. 2018, 30, 774–795.
Table 1. Confidence Levels of Simple Binary Classifier.
$\mathbf{p}^{\downarrow}$      n = 3     n = 5     n = 7
(1/2, 1/2)    0.5000    0.5000    0.5000
(2/3, 1/3)    0.7407    0.7901    0.8267
(3/4, 1/4)    0.8438    0.8965    0.9294
(4/5, 1/5)    0.8960    0.9421    0.9667
(5/6, 1/6)    0.9259    0.9645    0.9824
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
