Abstract
In this paper, we study the information lost when a real-valued statistic is used to reduce or summarize sample data from a discrete random variable with a one-dimensional parameter. We compare the probability that a random sample gives a particular data set to the probability of the statistic’s value for this data set. We focus on sufficient statistics for the parameter of interest and develop a general formula independent of the parameter for the Shannon information lost when a data sample is reduced to such a summary statistic. We also develop a measure of entropy for this lost information that depends only on the real-valued statistic but neither the parameter nor the data. Our approach would also work for non-sufficient statistics, but the lost information and associated entropy would involve the parameter. The method is applied to three well-known discrete distributions to illustrate its implementation.
1. Introduction
We consider the data sample $\mathbf{x} = (x_1, \ldots, x_n)$ from a random sample $\mathbf{X} = (X_1, \ldots, X_n)$ for a discrete random variable $X$ with sample space $S$ and one-dimensional parameter $\theta$. A statistic $T(\mathbf{X})$ is a function of the random sample for any fixed but arbitrary value of the parameter $\theta$ associated with the underlying $X$. Thus a statistic is a random variable itself. Here we consider only real-valued statistics $T$ that reduce the data sample $\mathbf{x}$ to a number $t = T(\mathbf{x})$ that might be used to summarize or characterize $\mathbf{x}$. However, data reduction is an irreversible process [1] and always involves some information loss. For instance, if $T(\mathbf{x})$ is the sample mean $\bar{x}$, the original measurements $x_1, \ldots, x_n$ cannot be reconstructed from $\bar{x}$, and some information about $\mathbf{x}$ is lost. Nonetheless, such data reduction is frequently used to make inferences.
More explicitly, our motivation for considering such situations is that $t = T(\mathbf{x})$ is usually communicated in practice as a summary for the data $\mathbf{x}$ but without the actual data. The question then naturally arises: how much information about a data sample $\mathbf{x}$ is lost to someone when only the value $t = T(\mathbf{x})$ is available, but not the data itself? To answer this question, we develop a theoretical framework for determining how much information is lost about a given data set $\mathbf{x}$ by knowing only the value of $T(\mathbf{x})$ but neither $\mathbf{x}$ itself nor the parameter $\theta$. Our information-theoretic approach to data reduction generalizes the observation in [2] that a binomial random variable loses all the information about the order of successes in the associated sequence of Bernoulli trials. In other words, for a series of n Bernoulli trials, one cannot recreate the order of successes by knowing only the number that occurred.
For any real-valued statistic $T$ and given sample data $\mathbf{x}$, we decompose the total information available in $\mathbf{x}$ into the sum of (a) the information available in the reduced data $t = T(\mathbf{x})$ and (b) the information lost in the process of data reduction. When $T$ is a sufficient statistic for $\theta$, this lost information is independent of $\theta$. Moreover, by taking the expected value of this lost information over all possible data sets, we define an associated entropy measure that depends on $T$ but neither $\mathbf{x}$ nor $\theta$. Our approach also works for non-sufficient statistics, but the lost information and associated entropy would then involve $\theta$. Thus $\theta$ must be estimated before computing these quantities.
The paper is organized as follows. In Section 2, we present the necessary definitions, notation, and preliminary results. In Section 3, we decompose the total information available in a data sample $\mathbf{x}$ and give various expressions for the Shannon information lost by reducing $\mathbf{x}$ to $T(\mathbf{x})$. In Section 4, we develop an entropy measure associated with this lost information. In Section 5, we present examples of our results for some standard discrete distributions and several sufficient statistics for $\theta$. Conclusions are offered in Section 6.
2. Preliminaries
Standard definitions, notation, and results to be used in our development are now presented for completeness and accessibility. In addition, some new definitions and results are established for subsequent use. Definition 1, Result 1, Definition 2, and Definition 3 can be found in [3,4,5] and elsewhere. The notion of a sufficient statistic is first defined.
Definition 1 (Sufficient Statistic [3]).
A statistic $T(\mathbf{X})$ is a sufficient statistic (SS) for the parameter $\theta$ if the probability

$P(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = T(\mathbf{x}))$ (1)

is independent of $\theta$.
Note that $P$ instead of $P_{\theta}$ is used in (1) since this probability is independent of $\theta$. In addition, observe that (1) is not a joint conditional probability distribution for $\mathbf{X}$ since its condition changes with $\mathbf{x}$. This observation is significant in Section 4. The fact that (1) does not involve $\theta$ can be used to prove the Fisher Factorization Theorem (FFT) below, which is the usual method for determining whether a statistic is an SS for $\theta$. In the FFT, we use the notation $f(\mathbf{x} \mid \theta)$ to denote the joint probability mass function (pmf) of $\mathbf{X}$ evaluated at the variable $\mathbf{x}$ for a fixed value of $\theta$.
Result 1 (FFT [3]).
The real-valued statistic $T(\mathbf{X})$ is sufficient for $\theta$ if and only if there exist functions $g$ and $h$ such that for any sample data $\mathbf{x}$ and for all values of $\theta$, the joint pmf $f(\mathbf{x} \mid \theta)$ of $\mathbf{X}$ can be factored as

$f(\mathbf{x} \mid \theta) = g(T(\mathbf{x}) \mid \theta)\, h(\mathbf{x})$ (2)

for real-valued, nonnegative functions $g$ on the range of $T$ and $h$ on $S^n$. The function $h(\mathbf{x})$ does not depend on $\theta$, while $g(T(\mathbf{x}) \mid \theta)$ does depend on $\theta$, but on $\mathbf{x}$ only through $T(\mathbf{x})$.
We focus on a sufficient statistic for $\theta$ in Section 3, where we need the notion of a partition [5] defined next.
Definition 2 (Partition [3]).
Let $S$ be the denumerable sample space of the discrete random variable $X$, so that $S^n$ is the denumerable sample space of the random sample $\mathbf{X}$. For any statistic $T$, let $\mathcal{T}$ be the denumerable set of values $t$ for which $T(\mathbf{x}) = t$ for some $\mathbf{x} \in S^n$, which is the range of $T$. Then $T$ partitions the sample space $S^n$ into the mutually exclusive and collectively exhaustive partition sets $A_t = \{\mathbf{x} \in S^n : T(\mathbf{x}) = t\}$, $t \in \mathcal{T}$.
Figure 1 illustrates Definition 2.
Figure 1.
Partition Sets.
We also use the well-known likelihood function.
Definition 3 (Likelihood Function [3,4,5]).
Let $\mathbf{x}$ be sample data from a random sample $\mathbf{X}$ from a discrete random variable $X$ with sample space $S$ and real-valued parameter $\theta$, and let $f(\mathbf{x} \mid \theta)$ denote the joint pmf of the random sample. For any sample data $\mathbf{x}$, the likelihood function of $\theta$ is defined as

$L(\theta \mid \mathbf{x}) = f(\mathbf{x} \mid \theta)$. (3)
The likelihood function in (3) is a function of the variable $\theta$ for given data $\mathbf{x}$. However, the joint pmf as a function of $\mathbf{x}$ for fixed $\theta$ is frequently called the likelihood function as well. In this case, we also write the joint pmf as $L(\theta \mid \mathbf{x})$. We distinguish the two cases since $L(\theta \mid \mathbf{x})$ as a function of $\theta$ is not a statistic, but as a function of $\mathbf{x}$ it is one that incorporates all available information about $\theta$. Moreover, $L(\theta \mid \mathbf{x})$ is an SS for $\theta$ [4] and uniquely determines an associated SS called the likelihood kernel to be used in subsequent examples.

We next define this new concept, the likelihood kernel. As a function of $\mathbf{x}$ for fixed $\theta$, it is shown below to be a sufficient statistic for $\theta$ and is used in Section 5 to facilitate the computation of lost information associated with other sufficient statistics $T$. As a function of $\theta$ for fixed $\mathbf{x}$, a possibility not considered here, the likelihood kernel may be useful in applying the likelihood principle [3,4] to make inferences about $\theta$ without resorting to the notion of equivalence classes. It would be the "simplest" factor of $L(\theta \mid \mathbf{x})$ that can be used in a likelihood ratio comparing two values of $\theta$.
Definition 4 (Likelihood kernel).
Let $S^n$ be the sample space of $\mathbf{X}$. For fixed but arbitrary $\mathbf{x} \in S^n$, suppose that $L(\theta \mid \mathbf{x})$ can be factored as

$L(\theta \mid \mathbf{x}) = k(\theta \mid \mathbf{x})\, r(\mathbf{x})$, (4)

where $k(\theta \mid \mathbf{x})$ and $r(\mathbf{x})$ have the following properties.
- (a) Every nonnumerical factor of $k(\theta \mid \mathbf{x})$ contains $\theta$.
- (b) $r(\mathbf{x})$ does not contain $\theta$.
- (c) For every $\theta$, both $k(\theta \mid \mathbf{x}) \neq 0$ and $r(\mathbf{x}) \neq 0$.
- (d) The numerical factor of $k(\theta \mid \mathbf{x})$ is not divisible by any positive number except 1.

Then $k(\theta \mid \mathbf{x})$ is defined as the likelihood kernel of $L(\theta \mid \mathbf{x})$ and $r(\mathbf{x})$ as the residue of $L(\theta \mid \mathbf{x})$.
Theorem 1.
The likelihood kernel $k(\theta \mid \mathbf{x})$ has the following properties.
- (i) $k(\theta \mid \mathbf{x})$ exists uniquely.
- (ii) As a function of $\mathbf{x}$ for fixed $\theta$, $k(\theta \mid \mathbf{x})$ is an SS for $\theta$.
- (iii) For any $\theta_1$ and $\theta_2$, the likelihood ratio $L(\theta_1 \mid \mathbf{x}) / L(\theta_2 \mid \mathbf{x})$ equals $k(\theta_1 \mid \mathbf{x}) / k(\theta_2 \mid \mathbf{x})$.
Proof.
To prove (i), for fixed we first show that the likelihood kernel of Definition 4 exists by construction. Since the formula for must explicitly contain the parameter cannot appear only in the range of . Hence as a function of can be factored into , satisfying (a) and (b) of Definition 4, where and the numerical factor of is either or Then since and Thus (c) is satisfied. Finally, the only positive integer that evenly divides or is so (d) holds. It follows that the and its associated in Definition 4 are well defined and exist.
We next show that as constructed above is unique. Let with residue and with both satisfy Definition 4. Thus for does not contain , while every nonnumerical factor of does contain It follows that and must be identical or else be a positive multiple of one another. Assume that for some If is divisible by a positive number other than to contradict (d). Thus is unique.
To prove (ii), we show that this unique is an SS for For let and in (2). Then, = Thus is an SS by the FFT of Result 1. Finally, (iii) follows from Definition 4 and the fact that the joint pmf for □
We next discuss the notion of information to be used. Actually, probability itself is a measure of information in the sense that it captures the surprise level of an event. An observer obtains more information, i.e., surprise, if an unlikely event occurs than if a likely one does. Instead of probability, however, we use the additive measure known as Shannon information [6,7], defined as follows.
Definition 5 (Shannon Information [6,7]).
Let $\mathbf{x}$ be sample data for the random sample $\mathbf{X}$ from the discrete random variable $X$ with a one-dimensional parameter $\theta$, and let $f(\mathbf{x} \mid \theta)$ be the joint pmf of $\mathbf{X}$ at $\mathbf{x}$. The Shannon information obtained from the sample data $\mathbf{x}$ is defined as

$I(\mathbf{x} \mid \theta) = -\log f(\mathbf{x} \mid \theta)$, (5)

where the units of $I(\mathbf{x} \mid \theta)$ are bits if the base of the logarithm is 2, which is to be used here.
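As a simple illustration of Definition 5, the following Python sketch computes $I(\mathbf{x} \mid \theta)$ for an i.i.d. sample by summing the per-observation informations; the Poisson pmf, the parameter value, and the data are our own hypothetical choices, not taken from the paper.

```python
import math

def shannon_information(pmf, x, base=2.0):
    """Shannon information I(x) = -log of the joint pmf of the i.i.d. sample x.

    `pmf` gives the probability of a single observation for a fixed parameter
    value; independence lets us add the per-observation informations.
    """
    return -sum(math.log(pmf(xi), base) for xi in x)

# Hypothetical Poisson(lam = 2) data sample; the result is in bits (base 2).
lam = 2.0
poisson_pmf = lambda k: math.exp(-lam) * lam**k / math.factorial(k)
print(shannon_information(poisson_pmf, [1, 3, 0, 2]))
```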
Other definitions for information have been proposed. For example, Vigo [8,9] has defined a measure of representational information. Further details on different types of information can be found in [10,11,12,13,14,15,16]. For Shannon information, we use its expected value over $S^n$.
Definition 6 (Entropy [17,18,19,20]).
Under the conditions of Definition 5, the Shannon entropy $H$ is defined as the expected value of $I(\mathbf{X} \mid \theta)$, i.e.,

$H = E[I(\mathbf{X} \mid \theta)] = -\sum_{\mathbf{x} \in S^n} f(\mathbf{x} \mid \theta) \log f(\mathbf{x} \mid \theta)$. (6)

The general properties of Shannon entropy are given in [17,18,19,20], for example. Since entropy is the expected information over all possible random samples, it can be argued that entropy is a better measure of the available information about $\mathbf{X}$ than the Shannon information for a single data set $\mathbf{x}$, which might not be typical [18]. We next give a method to obtain the information loss that occurs when a data set $\mathbf{x}$ is reduced to $t = T(\mathbf{x})$. In our approach, we focus on a sufficient statistic $T$, so there will be no $\theta$ in (5) for the lost information below.
3. Information Decomposition under Data Reduction by a Real-Valued Statistic
We now develop a procedure to determine how much of the information contained in a data set $\mathbf{x}$ is lost when the data is reduced to $t = T(\mathbf{x})$ by the sufficient statistic $T$. Consider the joint conditional probability

$P_{\theta}(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = T(\mathbf{x}))$, (7)

which is identified with the probabilistic information lost about the event $\{\mathbf{X} = \mathbf{x}\}$ by the data reduction of $\mathbf{x}$ to $T(\mathbf{x})$. The notation $P_{\theta}$ refers to the fact that the discrete probability (7), in general, involves the parameter $\theta$. We next express (7) using the definition of conditional probability to obtain the basis of our development. Result 2 is given in ([3], p. 273) and proven below to illustrate the reasoning.
Result 2 [3].
Let $\mathbf{x}$ be sample data for a random sample $\mathbf{X}$ from a discrete random variable $X$ with sample space $S$ and real-valued parameter $\theta$, and let $T$ be any real-valued statistic. Then

$P_{\theta}(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = T(\mathbf{x})) = \dfrac{f(\mathbf{x} \mid \theta)}{P_{\theta}(T(\mathbf{X}) = T(\mathbf{x}))}$. (8)
Proof.
Using the definition of conditional probability, rewrite (7) as

$P_{\theta}(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = T(\mathbf{x})) = \dfrac{P_{\theta}(\mathbf{X} = \mathbf{x},\, T(\mathbf{X}) = T(\mathbf{x}))}{P_{\theta}(T(\mathbf{X}) = T(\mathbf{x}))}$. (9)

However, $P_{\theta}(\mathbf{X} = \mathbf{x},\, T(\mathbf{X}) = T(\mathbf{x})) = P_{\theta}(\mathbf{X} = \mathbf{x}) = f(\mathbf{x} \mid \theta)$ since $\{\mathbf{X} = \mathbf{x}\} \subseteq \{T(\mathbf{X}) = T(\mathbf{x})\}$, so (8) follows. □
Observe that if $T$ is an SS for $\theta$, the left side of (8) is independent of $\theta$ by the FFT, and hence so is the right. Taking the negative logarithm of (8) and rearranging terms gives

$-\log_2 f(\mathbf{x} \mid \theta) = -\log_2 P_{\theta}(T(\mathbf{X}) = T(\mathbf{x})) - \log_2 P_{\theta}(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = T(\mathbf{x}))$. (10)

From (8), note that $-\log_2 P_{\theta}(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = T(\mathbf{x})) \geq 0$ since $f(\mathbf{x} \mid \theta) \leq P_{\theta}(T(\mathbf{X}) = T(\mathbf{x}))$, so the quotient in (8) is at most one. Similarly, $-\log_2 P_{\theta}(T(\mathbf{X}) = T(\mathbf{x})) \geq 0$. These facts suggest that the left side of (10) is the total Shannon information in bits contained in the sample data $\mathbf{x}$. On the right side of (10), the term $-\log_2 P_{\theta}(T(\mathbf{X}) = T(\mathbf{x}))$ is considered the information contained in the reduced data summary $t = T(\mathbf{x})$, and the term $-\log_2 P_{\theta}(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = T(\mathbf{x}))$ is identified as the information that has been lost as the result of the data reduction by $T$.

In particular, this lost information represents a combinatorial loss in the sense that multiple $\mathbf{x}$'s may give the same value $t = T(\mathbf{x})$, as depicted in Figure 1 above. In other words, the lost information is a measure of the knowledge unavailable about the data sample $\mathbf{x}$ when only the reduced data summary $t$ is known but not $\mathbf{x}$ itself. For a sufficient statistic $T$ for $\theta$, this lost information is independent of $\theta$. It is a characteristic of $T$ for the given data sample $\mathbf{x}$.

In terms of Figure 1, (10) may be described as follows. On the left of the figure is the sample space $S^n$ over which probabilities on $\mathbf{X}$ are computed. On the right is the range $\mathcal{T}$ of $T$, over which the probabilities of $T(\mathbf{X})$ are computed. $T$ reduces the data sample $\mathbf{x}$ into $t = T(\mathbf{x})$, where multiple $\mathbf{x}$'s may give the same $t$. In Figure 1, several distinct data samples are reduced into the same value of $T$. However, knowing that value for some data sample $\mathbf{x}$ does not provide sufficient information to identify $\mathbf{x}$ unequivocally. Information is lost in the reduction. One can also say that the total information obtained from the left side of Figure 1 is reduced to that obtained from the right. The reduction in information from the left to the right side is precisely the lost information of (10). For fixed $t$, it is lost due to the ambiguity as to which data sample on the left actually gave $t$. There is no such ambiguity when $T$ is one-to-one.
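To make the decomposition in (10) concrete, the following brute-force Python sketch enumerates a small Bernoulli sample space in the spirit of [2]; the sample size, success probability, and data are hypothetical, and the variable names are ours. Note that the lost information here reduces to $\log_2 \binom{4}{2}$ bits, independent of $p$, since $T$ is sufficient.

```python
import itertools
import math

n, p = 4, 0.3                    # hypothetical sample size and success probability
T = sum                          # sufficient statistic: number of successes

def joint_pmf(x):                # i.i.d. Bernoulli(p) joint pmf
    t = sum(x)
    return p**t * (1 - p)**(n - t)

x = (1, 0, 1, 0)                 # hypothetical data sample
t = T(x)

total = -math.log2(joint_pmf(x))                                    # left side of (10)
p_t = sum(joint_pmf(y) for y in itertools.product((0, 1), repeat=n) if T(y) == t)
reduced = -math.log2(p_t)                                           # information in t alone
lost = -math.log2(joint_pmf(x) / p_t)                               # lost information, log2 C(4,2) here

print(total, reduced, lost, reduced + lost)     # total equals reduced + lost
```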
The general decomposition of information in (10) is next summarized in Definition 7, where $T$ does not need to be sufficient for $\theta$.
Definition 7 (Information Decomposition).
Let $\mathbf{x}$ be sample data for a random sample $\mathbf{X}$ from a discrete random variable $X$ with sample space $S$ and real-valued parameter $\theta$. For any real-valued statistic $T$ with $t = T(\mathbf{x})$, the Shannon information obtained from the sample data $\mathbf{x}$ can be decomposed as

$I(\mathbf{x} \mid \theta) = I_{\mathrm{red}}^{T}(t \mid \theta) + I_{\mathrm{lost}}^{T}(\mathbf{x} \mid \theta)$, (11)

where

$I(\mathbf{x} \mid \theta) = -\log_2 f(\mathbf{x} \mid \theta)$ (12)

and

$I_{\mathrm{red}}^{T}(t \mid \theta) = -\log_2 P_{\theta}(T(\mathbf{X}) = t)$. (13)
Definition 7 generalizes the information decomposition of [2] for a data sample $\mathbf{x}$ of size n from a Bernoulli random variable, in which $\theta$ in [2] is the probability 0.5 of success on a single Bernoulli trial and $T(\mathbf{x})$ is the number of successes. It should be noted that the notation in [2] that refers to compressed information corresponds to $I_{\mathrm{red}}^{T}$ in Equation (13). We use the term "data reduction" as described in [3], as opposed to "data compression", to prevent misinterpretation. In computer science, data compression refers to encoding information using fewer bits than the original representation and is often lossless.

Both Result 2 and Definition 7 are valid for any real-valued statistic $T$ for $\theta$. The notation $I(\mathbf{x} \mid \theta)$ indicates that the Shannon information is a function of the sample data $\mathbf{x}$ for a fixed but arbitrary parameter value $\theta$. Similarly, both $I_{\mathrm{red}}^{T}(t \mid \theta)$ and $I_{\mathrm{lost}}^{T}(\mathbf{x} \mid \theta)$ are functions of $\mathbf{x}$ for fixed $\theta$ and $T$. However, in this paper we focus on sufficient statistics, which provide a simpler expression for the lost information that does not involve $\theta$. For a sufficient statistic $T$ for $\theta$, we use the notation

$I_{\mathrm{lost}}^{T}(\mathbf{x}) = -\log_2 P(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = T(\mathbf{x}))$ (14)

for the lost information, though $I(\mathbf{x} \mid \theta)$ and $I_{\mathrm{red}}^{T}(t \mid \theta)$ still require $\theta$. The next result is an application of the FFT of Result 1.
Theorem 2 (Lost Information for an SS).
Let $\mathbf{x}$ be sample data for a random sample $\mathbf{X}$ from a discrete random variable $X$ with sample space $S$ and real-valued parameter $\theta$. Let $T$ be an SS for $\theta$, let $f(\mathbf{x} \mid \theta)$ be the joint pmf of $\mathbf{X}$, and write $f(\mathbf{x} \mid \theta) = g(T(\mathbf{x}) \mid \theta)\, h(\mathbf{x})$ as in Result 1. Then for all $\mathbf{x} \in S^n$,

$I_{\mathrm{lost}}^{T}(\mathbf{x}) = \log_2 \left( \dfrac{\sum_{\mathbf{y} \in A_t} h(\mathbf{y})}{h(\mathbf{x})} \right)$, (15)

where $A_t$ is defined in Definition 2 for $t = T(\mathbf{x})$.
Proof.
Let $t = T(\mathbf{x})$. Then $\mathbf{x} \in A_t$ since $t$ is a realization of $T$. Because $T$ is an SS, we write (7) without $\theta$. It now suffices to establish that

$P(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = t) = \dfrac{h(\mathbf{x})}{\sum_{\mathbf{y} \in A_t} h(\mathbf{y})}$, (16)

from which (15) immediately follows. Rewrite (8) as

$P(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = t) = \dfrac{f(\mathbf{x} \mid \theta)}{P_{\theta}(T(\mathbf{X}) = t)}$, (17)

so from (17) and (2), then

$P(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = t) = \dfrac{g(T(\mathbf{x}) \mid \theta)\, h(\mathbf{x})}{\sum_{\mathbf{y} \in A_t} g(T(\mathbf{y}) \mid \theta)\, h(\mathbf{y})}$. (18)

However, $T(\mathbf{y}) = t$ for every $\mathbf{y} \in A_t$ in (18), so

$P(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = t) = \dfrac{g(t \mid \theta)\, h(\mathbf{x})}{g(t \mid \theta) \sum_{\mathbf{y} \in A_t} h(\mathbf{y})}$. (19)

Since $\mathbf{x} \in A_t$ and hence $g(t \mid \theta) > 0$, this term can be canceled on the right side of (19) to yield (16). Taking the negative $\log_2$ of (16) completes the proof. □
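Under the assumption of a sample space small enough to enumerate, formulas (15) and (16) can be evaluated directly. The following Python sketch does so for a binomial example with the FFT factor $h(\mathbf{x}) = \prod_i \binom{2}{x_i}$; the function names and data are ours, not the paper's.

```python
import itertools
import math

def lost_information(x, T, h, sample_space, n):
    """I_lost(x) = log2( sum_{y in A_t} h(y) / h(x) ) with t = T(x), per (15)-(16)."""
    t = T(x)
    partition_sum = sum(h(y) for y in itertools.product(sample_space, repeat=n)
                        if T(y) == t)
    return math.log2(partition_sum / h(x))

# Binomial(m = 2, p) observations; T = sum is sufficient with h(x) = prod C(2, x_i).
m, n = 2, 3
h = lambda y: math.prod(math.comb(m, yi) for yi in y)
print(lost_information((1, 0, 2), sum, h, range(m + 1), n))   # log2(20/2) = log2(10)
```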
Now consider Theorem 2 when each $A_t$ is a singleton in (16), i.e., when $T$ is a one-to-one function. In this extreme case, $\sum_{\mathbf{y} \in A_t} h(\mathbf{y}) = h(\mathbf{x})$ since $A_t = \{\mathbf{x}\}$ in the denominator of the right side of (16). Thus $P(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = t) = 1$, from which $I_{\mathrm{lost}}^{T}(\mathbf{x}) = 0$ for all $\mathbf{x}$ in $S^n$. Thus, the special case of a one-to-one $T$ justifies the identification of the lost information as $-\log_2 P(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = t)$. In other words, for all data samples, if no two data samples share the same value of $T$, then the information in $\mathbf{x}$ is not diminished by the reduction of the singleton $A_t = \{\mathbf{x}\}$ to the number $t$.

More generally, it is also true that $I_{\mathrm{lost}}^{T}(\mathbf{x} \mid \theta) = 0$ when $T$ is one-to-one but not sufficient for $\theta$. In this case, write $I_{\mathrm{lost}}^{T}(\mathbf{x} \mid \theta) = -\log_2 P_{\theta}(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = t)$. However, since $T$ is one-to-one, then $P_{\theta}(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = t) = 1$, and again $I_{\mathrm{lost}}^{T}(\mathbf{x} \mid \theta) = 0$.

Now consider the other extreme case where $T$ is constant on $S^n$. Thus $P_{\theta}(T(\mathbf{X}) = t) = 1$. However, $P_{\theta}(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = t) = f(\mathbf{x} \mid \theta)$, so $I_{\mathrm{red}}^{T}(t \mid \theta) = 0$ and $I_{\mathrm{lost}}^{T}(\mathbf{x} \mid \theta) = I(\mathbf{x} \mid \theta)$ on $S^n$. In this case, all the information is lost because the event $\{T(\mathbf{X}) = t\}$ gives no information about $\mathbf{x}$.

We also note that $I_{\mathrm{lost}}^{T}(\mathbf{x})$ could be used as a metric to compare sufficient statistics for a given data sample $\mathbf{x}$. For example, $T_1$ could be regarded as better than $T_2$ for $\mathbf{x}$ if $I_{\mathrm{lost}}^{T_1}(\mathbf{x}) < I_{\mathrm{lost}}^{T_2}(\mathbf{x})$. However, this comparison would be limited to the given $\mathbf{x}$. In Section 4, we propose but do not explore a metric based on entropy independent of a particular data sample. We next show that (16) can be simplified when $T$ is the likelihood function.
Corollary 1 (Information Loss for Likelihood Function).
Under the assumptions of Theorem 2, if $T(\mathbf{x}) = L(\theta \mid \mathbf{x})$ for fixed $\theta$, then

$I_{\mathrm{lost}}^{T}(\mathbf{x}) = \log_2 |A_t|$, (20)

where $|A_t|$ is the cardinality of the partition set $A_t$ for $t = L(\theta \mid \mathbf{x})$.
Proof.
For $T(\mathbf{x}) = L(\theta \mid \mathbf{x})$ in (2), let $g$ be the identity function and $h(\mathbf{x}) = 1$. Then substituting into (16) gives the denominator $|A_t|$ to yield (20). □
We next state a reproductive property of a statistic that is a one-to-one function of a sufficient statistic for $\theta$.
Theorem 3.
If there is a one-to-one function between a sufficient statistic $T$ for $\theta$ and an arbitrary real-valued statistic $T^{*}$ on $S^n$, then the following hold.
- (i) $T^{*}$ is also an SS.
- (ii) $T$ and $T^{*}$ partition the sample space into the same partition sets.
- (iii) $I_{\mathrm{lost}}^{T}(\mathbf{x}) = I_{\mathrm{lost}}^{T^{*}}(\mathbf{x})$ for all $\mathbf{x} \in S^n$.
Proof.
To prove (i), let be a real-valued one-to-one function of such that
Since is an SS, by Equation (2) there are real-valued functions on and on for which
By substituting from (21) in (22), we get
which can be rewritten as
Since in (24) satisfies the condition of Result 1 for , is an SS.
To prove (ii), we use Definition 2. Let partition the sample space into the mutually exclusive and collectively exhaustive sets . By Equation (21) we can also write as
Since is a one-to-one function, it has an inverse . Letting we apply to the right side of (25) and get
However, and the cardinalities satisfy so the right side of (26) is and
Finally, to get (iii) we use Theorem 2 to calculate information lost over two statistics and Since is the same in (22) and (24) and since Equation (27) holds, we sum over the same sets in the denominator of Equation (16) for both and to give
and complete the proof. □
We next compare the information loss of the sufficient statistic $L(\theta \mid \cdot)$ to that of other sufficient statistics. For the sufficient statistic $k(\theta \mid \cdot)$, a lemma is needed.
Lemma 1.
Let $\mathbf{x}$ be any data sample for a random sample $\mathbf{X}$ from the discrete random variable $X$ with real-valued parameter $\theta$. Then the likelihood kernel $k(\theta \mid \mathbf{x})$ is a function of the likelihood $L(\theta \mid \mathbf{x})$ and $\theta$.
Proof.
From ([3], p. 280), is a function of if and only if whenever For all data samples and we prove that if then Thus suppose that By Definition 4, we can decompose and into and respectively. Note that Otherwise, in contradiction to being sample data with a nonzero probability of occurring. Now write
Suppose that so that in (29). From Definition 4, every nonnumerical factor of and contains Moreover, neither nor is divisible by any positive number except the number 1. Hence, since does not contain the nonnumerical factors of and must cancel in (29) and the remaining numerical factors could not be identical. Thus at least one of these factors would be divisible by a positive number other than 1 in contradiction to Definition 4. It now follows that so is some function of Finally, since this function is surjective from onto its image . □
Lemma 2.
Under the conditions of Lemma 1, the sufficient statistics $L(\theta \mid \cdot)$ and $k(\theta \mid \cdot)$ satisfy

$I_{\mathrm{red}}^{L(\theta \mid \cdot)}(L(\theta \mid \mathbf{x}) \mid \theta) \geq I_{\mathrm{red}}^{k(\theta \mid \cdot)}(k(\theta \mid \mathbf{x}) \mid \theta)$. (30)
Proof.
Let and suppose that Then so it follows from Lemma 1 that and thus Hence and so
Taking the negative log of both sides of the inequality in (31) and using (13) gives (30). □
Theorem 4.
Let $\mathbf{x}$ be sample data for a random sample $\mathbf{X}$ from a discrete random variable $X$ with the real-valued parameter $\theta$. Then for all $\mathbf{x} \in S^n$,

$I_{\mathrm{lost}}^{L(\theta \mid \cdot)}(\mathbf{x}) \leq I_{\mathrm{lost}}^{k(\theta \mid \cdot)}(\mathbf{x})$. (32)
Proof.
Let Note that in (12) does not depend on the arbitrary sufficient statistic of (11). Hence
Then (32) follows immediately from (30) and (33). □
As a consequence of Theorem 3, Theorem 4 has an immediate corollary.
Corollary 2.
Under the conditions of Theorem 4, let $T$ be a sufficient statistic for $\theta$ for which there is a one-to-one function between $T$ and $k(\theta \mid \cdot)$. Then for all $\mathbf{x} \in S^n$,

$I_{\mathrm{lost}}^{L(\theta \mid \cdot)}(\mathbf{x}) \leq I_{\mathrm{lost}}^{T}(\mathbf{x})$. (34)
The question remains open as to whether (34) holds for all sufficient statistics for $\theta$. Regardless, the proofs of Lemma 2 and Theorem 4 illustrate the fact that the relation between the lost information for two statistics $T_1$ and $T_2$ is determined by the relation between their partition sets $A_{t_1}$ and $B_{t_2}$. For example, if for every $t_1$ there exists a $t_2$ for which $A_{t_1} \subseteq B_{t_2}$, then the partition of $S^n$ by the $B_{t_2}$ of $T_2$ is said to be coarser than the partition by the $A_{t_1}$ of $T_1$. In that case, $I_{\mathrm{lost}}^{T_1}(\mathbf{x}) \leq I_{\mathrm{lost}}^{T_2}(\mathbf{x})$ because each $B_{t_2}$ contains at least as many $\mathbf{y}$ with $T_2(\mathbf{y}) = t_2$ as there are $\mathbf{y}$ with $T_1(\mathbf{y}) = t_1$ in $A_{t_1}$. In other words, $t_2 = T_2(\mathbf{x})$ is at least as ambiguous as $t_1 = T_1(\mathbf{x})$ in determining the data sample giving the value of the respective statistics.
4. Entropic Loss for an SS
For a sufficient statistic $T$ for $\theta$, we now propose an entropy measure to characterize $T$ by the expected lost information incurred by the reduction of $\mathbf{x}$ to $t = T(\mathbf{x})$. This expectation is taken over all possible data sets $\mathbf{x} \in S^n$. This nonstandard entropy measure is called entropic loss, and it depends on neither a particular data set $\mathbf{x}$ nor the value of $\theta$. Before defining this measure, we need to determine the appropriate pmf to use in taking an expectation. The following results are used.
Result 3.
Under the assumptions of Theorem 2, for any data sample $\mathbf{x}$, let $t = T(\mathbf{x})$ and consider the partition set $A_t$. Then

$\sum_{\mathbf{y} \in A_t} P(\mathbf{X} = \mathbf{y} \mid T(\mathbf{X}) = t) = 1$. (35)
Proof.
Summing (16) over $\mathbf{y} \in A_t$ yields

$\sum_{\mathbf{y} \in A_t} \dfrac{h(\mathbf{y})}{\sum_{\mathbf{z} \in A_t} h(\mathbf{z})} = 1$ (36)

to give (35). □
Result 4.
Under the assumptions of Theorem 2, the sum

$\sum_{\mathbf{x} \in S^n} P(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = T(\mathbf{x})) = N_T$, (37)

where $N_T$ is the number of nonempty partition sets $A_t$, $t \in \mathcal{T}$.
Proof.
We perform the sum on the left of (37) by first summing over $\mathbf{x} \in A_t$ for fixed $t$ and then summing over each $t \in \mathcal{T}$ to give

$\sum_{t \in \mathcal{T}} \sum_{\mathbf{x} \in A_t} P(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = t)$. (38)

The inner series on the right side of (38) sums to one by Result 3. Hence, the outer sum yields $N_T$, the number of $t \in \mathcal{T}$ for which $A_t \neq \emptyset$. □
From (37), it follows that the left side of (37) is not a probability distribution on $S^n$ unless $N_T = 1$. Moreover, $P(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = T(\mathbf{x}))$, viewed as a function of $\mathbf{x}$, is not a conditional probability distribution even when it sums to one, since the condition varies with $\mathbf{x}$. However, we use Result 4 to normalize and obtain the appropriate pmf for calculating the expectation of $I_{\mathrm{lost}}^{T}(\mathbf{x})$.
Definition 8 (Entropic Loss).
Under the assumptions of Theorem 2, the entropic loss resulting from the data reduction by $T$ is defined as

$H_{\mathrm{lost}}^{T} = \sum_{\mathbf{x} \in S^n} \dfrac{P(\mathbf{X} = \mathbf{x} \mid T(\mathbf{X}) = T(\mathbf{x}))}{N_T}\, I_{\mathrm{lost}}^{T}(\mathbf{x})$, (39)

which from (15) and (16) can be rewritten as

$H_{\mathrm{lost}}^{T} = \dfrac{1}{N_T} \sum_{t \in \mathcal{T}} \sum_{\mathbf{x} \in A_t} \dfrac{h(\mathbf{x})}{\sum_{\mathbf{y} \in A_t} h(\mathbf{y})} \log_2 \dfrac{\sum_{\mathbf{y} \in A_t} h(\mathbf{y})}{h(\mathbf{x})}$. (40)
Observe that (39) and (40) are independent of both $\mathbf{x}$ and $\theta$. Indeed, for a given underlying random variable $X$ and sample size n, $H_{\mathrm{lost}}^{T}$ is a function only of $T$. Thus, $H_{\mathrm{lost}}^{T}$ could be used as a metric to compare sufficient statistics independent of the data sample. In particular, $T_1$ could be regarded as better than $T_2$ if $H_{\mathrm{lost}}^{T_1} < H_{\mathrm{lost}}^{T_2}$, i.e., if the expected information loss associated with $T_1$ is less than that for $T_2$. Moreover, for a given underlying random variable $X$ and sample size n, Definition 8 could be extended to non-sufficient statistics. In that case, the entropic loss would be a function of both $T$ and $\theta$. For a fixed $\theta$, a non-sufficient statistic $T_1$ could again be considered better than a non-sufficient $T_2$ if $H_{\mathrm{lost}}^{T_1} < H_{\mathrm{lost}}^{T_2}$; for a given statistic $T$, the numerical value $\theta_1$ could be considered a better numerical point estimate for $\theta$ than the value $\theta_2$ if $H_{\mathrm{lost}}^{T}(\theta_1) < H_{\mathrm{lost}}^{T}(\theta_2)$. Similarly, $H_{\mathrm{lost}}^{T}(\theta)$ could be minimized over $\theta$ to give a best numerical point estimate for $\theta$ based on the entropic loss criterion. However, we do not pursue these possibilities here. We next compute $H_{\mathrm{lost}}^{T}$ for the sufficient statistic $T(\mathbf{x}) = L(\theta \mid \mathbf{x})$.
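For a finite sample space, the entropic loss of (39) and (40), as reconstructed here, can be computed by brute force. The sketch below groups the sample space into partition sets, computes the expected lost information within each set from the normalized weights $h(\mathbf{x}) / \sum_{\mathbf{y} \in A_t} h(\mathbf{y})$, and averages over the $N_T$ sets; the function names and the binomial test case are our own illustration.

```python
import itertools
import math
from collections import defaultdict

def entropic_loss(T, h, sample_space, n):
    """Entropic loss per (39)-(40): expected lost information, averaged over the
    N_T nonempty partition sets using the normalized weights h(x) / sum h(y)."""
    partitions = defaultdict(list)
    for x in itertools.product(sample_space, repeat=n):
        partitions[T(x)].append(h(x))
    per_set = []
    for weights in partitions.values():
        s = sum(weights)
        per_set.append(sum((w / s) * math.log2(s / w) for w in weights))
    return sum(per_set) / len(partitions)      # divide by N_T, the number of sets

# Binomial(m = 2, p) sample of size n = 3 with the sufficient statistic T = sum.
m, n = 2, 3
h = lambda x: math.prod(math.comb(m, xi) for xi in x)
print(entropic_loss(sum, h, range(m + 1), n))
```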
Theorem 5 (Entropic Loss for Likelihood Function).
Under the assumptions of Theorem 2, the entropic loss resulting from the data reduction by $T(\mathbf{x}) = L(\theta \mid \mathbf{x})$ is

$H_{\mathrm{lost}}^{L(\theta \mid \cdot)} = \dfrac{1}{N_L} \sum_{t \in \mathcal{T}} \log_2 |A_t|$, (41)

where $N_L$ is the number of nonempty partition sets of $L(\theta \mid \cdot)$.
Proof.
From (20), write
We decompose the sum over $\mathbf{x} \in S^n$ in (42) into consecutive sums over $\mathbf{x} \in A_t$ and then over $t \in \mathcal{T}$ to get
Equation (41) now follows from (43). □
Result 5.
Suppose there is a one-to-one function between two sufficient statistics $T_1$ and $T_2$ for $\theta$. Then

$H_{\mathrm{lost}}^{T_1} = H_{\mathrm{lost}}^{T_2}$. (44)
Proof.
For all from Theorem 3, so
from which
Thus from (45) and (46),
Now summing (47) over yields
However, from Theorem 3, Thus dividing the left side of (48) by and the right side by yields (44). □
5. Examples and Computational Issues
In this section, we present examples involving the discrete Poisson, binomial, and geometric distributions [21]. For each distribution, three sufficient statistics for a parameter $\theta$ are analyzed. For each such statistic $T$, the lost information $I_{\mathrm{lost}}^{T}(\mathbf{x})$ does not involve $\theta$. However, calculating $I_{\mathrm{lost}}^{T}(\mathbf{x})$ can still present computational issues, some of which are discussed below. Our examples are simple in order to focus on the definitions and results of Section 3 and Section 4.
Example 1 (Poisson Distribution).
Consider the random sample $\mathbf{X} = (X_1, \ldots, X_n)$ with the data sample $\mathbf{x} = (x_1, \ldots, x_n)$ from a Poisson random variable $X$ with parameter $\lambda$. We consider three sufficient statistics for the parameter $\lambda$. These sufficient statistics are the sum $T_1(\mathbf{x}) = \sum_{i=1}^{n} x_i$, the likelihood kernel $k(\lambda \mid \mathbf{x})$ for fixed but arbitrary $\lambda$, and the likelihood function $L(\lambda \mid \mathbf{x})$. We use $T_1$ as a surrogate for the sample mean $\bar{x}$. Neither $T_1$ nor $\bar{x}$ involves $\lambda$ and can thus be used either to characterize $\mathbf{x}$ or to estimate $\lambda$. Moreover, there is an obvious one-to-one function relating $T_1$ and $\bar{x}$, so Theorem 3 and Result 5 establish that $I_{\mathrm{lost}}^{T_1}(\mathbf{x}) = I_{\mathrm{lost}}^{\bar{x}}(\mathbf{x})$ and $H_{\mathrm{lost}}^{T_1} = H_{\mathrm{lost}}^{\bar{x}}$, respectively. We analyze $T_1$ because it is also Poisson, whereas $\bar{x}$ is not Poisson since $\bar{x}$ is not necessarily a nonnegative integer. In contrast to $T_1$, both $k(\lambda \mid \mathbf{x})$ and $L(\lambda \mid \mathbf{x})$ contain $\lambda$ and can only be used to characterize $\mathbf{x}$. For each of these three sufficient statistics, we develop an expression for $I_{\mathrm{lost}}^{T}(\mathbf{x})$ and describe how to obtain a numerical value. We then illustrate previous results with a realistic Poisson data sample. We present further computational results in Table 1.
Table 1.
Poisson Example.
Case 1: Let $T_1(\mathbf{x}) = \sum_{i=1}^{n} x_i$. $T_1$ is a sufficient statistic for $\lambda$ from Result 1 since $f(\mathbf{x} \mid \lambda)$ can be factored in (2) into the functions $g(T_1(\mathbf{x}) \mid \lambda) = e^{-n\lambda} \lambda^{T_1(\mathbf{x})}$ and $h(\mathbf{x}) = 1 / \prod_{i=1}^{n} x_i!$. Next recall that the statistic $T_1(\mathbf{X}) = \sum_{i=1}^{n} X_i$ has a Poisson distribution with parameter $n\lambda$ [21]. Thus, $P_{\lambda}(T_1(\mathbf{X}) = t) = e^{-n\lambda} (n\lambda)^{t} / t!$, and so (8) becomes

$P(\mathbf{X} = \mathbf{x} \mid T_1(\mathbf{X}) = t) = \dbinom{t}{x_1, \ldots, x_n} \dfrac{1}{n^{t}}$, (49)

where the multinomial coefficient $\binom{t}{x_1, \ldots, x_n} = t! / (x_1! \cdots x_n!)$. It follows from (49) and (10) that

$I_{\mathrm{lost}}^{T_1}(\mathbf{x}) = t \log_2 n - \log_2 \dbinom{t}{x_1, \ldots, x_n}$, (50)

which is also $I_{\mathrm{lost}}^{\bar{x}}(\mathbf{x})$, as noted above.

For a data sample $\mathbf{x}$, the evaluation of $I_{\mathrm{lost}}^{T_1}(\mathbf{x})$ in (50) involves computing factorials [22]. For realistic data, the principal limitation to calculating them by direct multiplication is their magnitude. See [23] for a discussion. However, (50) can be approximated using either the well-known Stirling formula or the more accurate Ramanujan approximation [24]. The online multinomial coefficient calculator [25] can evaluate multinomial coefficients for $t$ and $x_1, \ldots, x_n$ when all the $x_i$ as well as $t$ are less than approximately 50, if any $x_i = 0$ is removed from $\mathbf{x}$. Such deletions do not affect the calculation since $0! = 1$.
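As an alternative to the calculator [25] or the Stirling and Ramanujan approximations, log-factorials computed with the log-gamma function evaluate (50) directly in double precision without forming any large factorial; the following Python sketch and its data are our own illustration, not the paper's procedure.

```python
import math

def poisson_lost_info_bits(x):
    """Lost information (50) for the Poisson sufficient statistic T = sum(x):
    t*log2(n) - log2(t! / (x_1! ... x_n!)), computed with log-gamma so that no
    large factorial is ever formed explicitly."""
    n, t = len(x), sum(x)
    log2_multinomial = (math.lgamma(t + 1)
                        - sum(math.lgamma(xi + 1) for xi in x)) / math.log(2)
    return t * math.log2(n) - log2_multinomial

# Hypothetical Poisson data; divide by 8 to express the loss in bytes.
x = [3, 1, 4, 1, 5, 9, 2, 6]
print(poisson_lost_info_bits(x), poisson_lost_info_bits(x) / 8, "bytes")
```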
As a numerical example, consider a data sample of size from a Poisson random variable with , where
Then, and the calculator at [25] gives 10123 in (49) and (50). Moreover, 518.915. Hence, from (50), bits. This value corresponds to 13.708 bytes at 8 bits per byte or to 0.013 kilobytes (KB) at 1024 bytes per KB. It follows from previous discussion in this example that the Shannon information lost by using the sample mean as a surrogate for itself is KB, which seems surprisingly small. Perhaps the small loss results partially from the fact that
Case 2: Let $T_2(\mathbf{x}) = k(\lambda \mid \mathbf{x})$ for fixed but arbitrary $\lambda$. For a data sample $\mathbf{x}$, write

$L(\lambda \mid \mathbf{x}) = f(\mathbf{x} \mid \lambda) = \dfrac{e^{-n\lambda} \lambda^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} x_i!}$, (52)

from which

$k(\lambda \mid \mathbf{x}) = e^{-n\lambda} \lambda^{\sum_{i=1}^{n} x_i}$ (53)

and $r(\mathbf{x}) = 1 / \prod_{i=1}^{n} x_i!$ in (4). Note that for all fixed $\lambda \neq 1$ there is an obvious one-to-one function between $k(\lambda \mid \mathbf{x})$ and $T_1(\mathbf{x}) = \sum_{i=1}^{n} x_i$ in (53). Hence, from Case 1, $I_{\mathrm{lost}}^{T_2}(\mathbf{x}) = I_{\mathrm{lost}}^{T_1}(\mathbf{x}) \approx 0.013$ KB from Theorem 3 for all $\lambda \neq 1$. For $\lambda = 1$, from (53), $k(1 \mid \mathbf{x}) = e^{-n}$ and is constant with respect to any data sample $\mathbf{x}$. Thus, $I_{\mathrm{lost}}^{T_2}(\mathbf{x}) = I(\mathbf{x} \mid \lambda)$ and $I_{\mathrm{red}}^{T_2}(t \mid \lambda) = 0$. It follows that $k(1 \mid \mathbf{x})$ provides no information about $\mathbf{x}$.
Case 3: Let $T_3(\mathbf{x}) = L(\lambda \mid \mathbf{x})$ for fixed but arbitrary $\lambda$. We attempt to obtain $I_{\mathrm{lost}}^{T_3}(\mathbf{x})$ for a data sample $\mathbf{x}$ by determining $|A_t|$ and using (20). From (52), note that for all fixed $\lambda$, $L(\lambda \mid \mathbf{y}) = L(\lambda \mid \mathbf{x})$ if and only if

$\dfrac{\lambda^{\sum_{i=1}^{n} y_i}}{\prod_{i=1}^{n} y_i!} = \dfrac{\lambda^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} x_i!}$. (54)
Thus, for any fixed $\lambda$, (54) is satisfied if both $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} x_i$ and $\prod_{i=1}^{n} y_i! = \prod_{i=1}^{n} x_i!$. However, for some $\lambda$ it is possible that (54) is satisfied when neither of these equalities holds, so the data samples $\mathbf{y}$ and $\mathbf{x}$ can have different sums and different factorial products yet still give the same likelihood value for that particular $\lambda$.
This complication suggests that an efficient implicit enumeration of the $\mathbf{y}$ satisfying (54) would be required to obtain $|A_t|$ for calculating $I_{\mathrm{lost}}^{T_3}(\mathbf{x})$ from (20). Using such an algorithm, a conventional computer could possibly compute $|A_t|$ for the numerical data of Case 1 and a given value of $\lambda$, since there is now a 250 petabyte, 200 petaflop conventional supercomputer [26]. Substantially larger problems, if not already tractable, will likely be so in the foreseeable future on quantum computers. Recently, the milestone of quantum supremacy was achieved where the various possible combinations of a certain randomly generated output were obtained in 110 s, whereas this task would have taken the above conventional supercomputer 10,000 years [27]. Regardless, for the data of Case 1, we have the upper bound $I_{\mathrm{lost}}^{T_3}(\mathbf{x}) \leq I_{\mathrm{lost}}^{T_2}(\mathbf{x}) \approx 0.013$ KB from (32).
Finally, we present some simple computational results to illustrate the relationships among the three lost informations with regard to the Poisson distribution. Table 1 below summarizes the results for a small sample data set. In particular, a complete enumeration of $A_t$ in (20) gives the value of $I_{\mathrm{lost}}^{T_3}(\mathbf{x})$ reported there.
Example 2 (Binomial Distribution).
Consider a random sample $\mathbf{X} = (X_1, \ldots, X_n)$ from a binomial random variable $X$ with parameters $m$ and $p$, where $p$ is the probability of success on any of the $m$ Bernoulli trials associated with the underlying $X$. Let $m$ be fixed, so the only parameter is $p$. Moreover, the sample space of the underlying random variable is now finite.
Case 1: $T_1(\mathbf{x}) = \sum_{i=1}^{n} x_i$. Again, $T_1$ is an SS for $p$. From [21], $T_1(\mathbf{X}) = \sum_{i=1}^{n} X_i$ has a binomial distribution with parameters $nm$ and $p$ for fixed $m$. Hence,

$P_p(T_1(\mathbf{X}) = t) = \dbinom{nm}{t} p^{t} (1-p)^{nm-t}$ (55)

and

$f(\mathbf{x} \mid p) = \left[ \prod_{i=1}^{n} \dbinom{m}{x_i} \right] p^{t} (1-p)^{nm-t}$, where $t = \sum_{i=1}^{n} x_i$. (56)

From (1), dividing (56) by (55) gives

$P(\mathbf{X} = \mathbf{x} \mid T_1(\mathbf{X}) = t) = \dfrac{\prod_{i=1}^{n} \binom{m}{x_i}}{\binom{nm}{t}}$. (57)

By taking the negative $\log_2$ of (57), the lost information is given as

$I_{\mathrm{lost}}^{T_1}(\mathbf{x}) = \log_2 \dbinom{nm}{t} - \sum_{i=1}^{n} \log_2 \dbinom{m}{x_i}$. (58)
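A direct Python evaluation of (58) is straightforward with exact binomial coefficients; the helper name and the data below are hypothetical.

```python
import math

def binomial_lost_info_bits(x, m):
    """Lost information (58) for T = sum(x) with X_i ~ Binomial(m, p);
    the result is independent of p."""
    n, t = len(x), sum(x)
    return math.log2(math.comb(n * m, t)) - sum(math.log2(math.comb(m, xi)) for xi in x)

# Three hypothetical observations of the number of heads in m = 2 coin flips.
print(binomial_lost_info_bits([1, 0, 2], m=2))   # log2(20) - log2(2) = log2(10)
```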
Case 2: $T_2(\mathbf{x}) = k(p \mid \mathbf{x})$ for fixed but arbitrary $p$. In this case, we use (16) as in Example 1. Write

$L(p \mid \mathbf{x}) = f(\mathbf{x} \mid p) = \left[ \prod_{i=1}^{n} \dbinom{m}{x_i} \right] p^{t} (1-p)^{nm-t}$, (59)

from which $k(p \mid \mathbf{x}) = p^{t} (1-p)^{nm-t}$ and $r(\mathbf{x}) = \prod_{i=1}^{n} \binom{m}{x_i}$ in (4), so that

$f(\mathbf{x} \mid p) = k(p \mid \mathbf{x}) \prod_{i=1}^{n} \dbinom{m}{x_i}$. (60)

To factor the right side of (60) as in (2), let $g$ be the identity function and $h(\mathbf{x}) = \prod_{i=1}^{n} \binom{m}{x_i}$. Hence, (16) and (60) give

$I_{\mathrm{lost}}^{T_2}(\mathbf{x}) = \log_2 \dfrac{\sum_{\mathbf{y} \in A_t} \prod_{i=1}^{n} \binom{m}{y_i}}{\prod_{i=1}^{n} \binom{m}{x_i}}$, (61)

where

$A_t = \left\{ \mathbf{y} \in S^n : k(p \mid \mathbf{y}) = k(p \mid \mathbf{x}) \right\}$. (62)

From (62), for any fixed $p$ satisfying $0 < p < 1$ and $p \neq 1/2$, it can easily be shown that $k(p \mid \mathbf{y}) = k(p \mid \mathbf{x})$ if and only if $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} x_i$. Thus, in general, for a given $\mathbf{x}$ and fixed $p$, determining $I_{\mathrm{lost}}^{T_2}(\mathbf{x})$ in Case 2 would require an enumeration of the $\mathbf{y}$ satisfying (62) to compute (61). We perform such an enumeration below for a simple example.
Case 3: $T_3(\mathbf{x}) = L(p \mid \mathbf{x})$ for fixed but arbitrary $p$. For a data sample $\mathbf{x}$, we now have

$f(\mathbf{x} \mid p) = L(p \mid \mathbf{x})$, (63)

with $g$ being the identity function and $h(\mathbf{x}) = 1$ in (2). For fixed $p$ satisfying $0 < p < 1$, from (63) we obtain that $L(p \mid \mathbf{y}) = L(p \mid \mathbf{x})$ if and only if

$\left[ \prod_{i=1}^{n} \dbinom{m}{y_i} \right] p^{\sum_i y_i} (1-p)^{nm - \sum_i y_i} = \left[ \prod_{i=1}^{n} \dbinom{m}{x_i} \right] p^{\sum_i x_i} (1-p)^{nm - \sum_i x_i}$. (64)

As in Case 3 of Example 1, developing an algorithm to use (64) and determine $|A_t|$ for calculating $I_{\mathrm{lost}}^{T_3}(\mathbf{x})$ from (20) is beyond the scope of this paper.
As a simple example, consider the experiment of flipping a possibly biased coin twice ($m = 2$). The total number of heads follows a binomial distribution with the parameter $p$, which is the probability of getting a head on any flip. By performing this experiment three times ($n = 3$), we generate the random variables $X_1$, $X_2$, $X_3$, each with possible values 0, 1, 2. Table 2 shows all the possibilities and the lost information for the three statistics. The small size of this example allows the computation of $I_{\mathrm{lost}}^{T}(\mathbf{x})$ in Cases 2 and 3 via total enumeration.
Table 2.
Binomial Example.
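The total enumeration mentioned above can be scripted. The sketch below treats the likelihood function of Case 3 as the statistic for a fixed $p$ and reports the lost information $\log_2 |A_t|$ of (20) for every possible sample; the choice $p = 1/3$ and the exact rational arithmetic (used to avoid floating-point ties) are our own, not the paper's.

```python
import itertools
import math
from fractions import Fraction

m, n = 2, 3
p = Fraction(1, 3)               # assumed fixed p; exact arithmetic avoids float ties

def likelihood(x):               # joint pmf of the sample, viewed as the statistic
    t = sum(x)
    return math.prod(math.comb(m, xi) for xi in x) * p**t * (1 - p)**(m * n - t)

samples = list(itertools.product(range(m + 1), repeat=n))
for x in samples:
    A_t = [y for y in samples if likelihood(y) == likelihood(x)]
    print(x, math.log2(len(A_t)))    # lost information (20) for the likelihood statistic
```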
Now, using (40), we give in Table 3 the entropic losses of Example 2 for fixed $p$. Note that the entropic loss is the same for the sum $T_1$ and the likelihood kernel $k(p \mid \cdot)$, which are related by a one-to-one function. Hence, Result 5 is corroborated. In addition, observe that the entropic loss is smallest for the likelihood function $L(p \mid \cdot)$.
Table 3.
Entropic loss over different statistics for a binomial distribution.
Example 3 (Geometric Distribution).
Consider a random sample $\mathbf{X} = (X_1, \ldots, X_n)$ with sample data $\mathbf{x} = (x_1, \ldots, x_n)$ from a geometric random variable $X$, where the parameter $p$ is the probability of success on any one of a series of independent Bernoulli trials and $X$ is the trial number on which the first success is obtained. It readily follows from [5] that

$f(\mathbf{x} \mid p) = p^{n} (1-p)^{t - n}$, where $t = \sum_{i=1}^{n} x_i$. (65)
Case 1: $T_1(\mathbf{x}) = \sum_{i=1}^{n} x_i$. For fixed $n$, $T_1(\mathbf{X})$ has a negative binomial distribution with parameters $n$ and $p$ [21]. Hence,

$P_p(T_1(\mathbf{X}) = t) = \dbinom{t-1}{n-1} p^{n} (1-p)^{t-n}$. (66)

Thus $T_1$ is an SS for $p$ since it satisfies (2) with $g(T_1(\mathbf{x}) \mid p) = p^{n} (1-p)^{T_1(\mathbf{x}) - n}$ and $h(\mathbf{x}) = 1$. Moreover, substitution of (65) and (66) into (8) gives

$P(\mathbf{X} = \mathbf{x} \mid T_1(\mathbf{X}) = t) = \dfrac{1}{\binom{t-1}{n-1}}$. (67)

Then from (14) and (67) we obtain that

$I_{\mathrm{lost}}^{T_1}(\mathbf{x}) = \log_2 \dbinom{t-1}{n-1}$. (68)
Case 2: $T_2(\mathbf{x}) = k(p \mid \mathbf{x})$ for fixed but arbitrary $p$. From (65) and (66), for all $\mathbf{x}$, $r(\mathbf{x}) = 1$ and

$k(p \mid \mathbf{x}) = p^{n} (1-p)^{t-n}$, where $t = \sum_{i=1}^{n} x_i$. (69)

Thus, for $0 < p < 1$ there is an obvious one-to-one function between $k(p \mid \mathbf{x})$ and $t$ in (69). Thus, from Theorem 3, $I_{\mathrm{lost}}^{T_2}(\mathbf{x}) = \log_2 \binom{t-1}{n-1}$, as given in (68).
Case 3: $T_3(\mathbf{x}) = L(p \mid \mathbf{x})$ for fixed but arbitrary $p$. Since $L(p \mid \mathbf{x}) = k(p \mid \mathbf{x})$ from (69), then

$I_{\mathrm{lost}}^{T_3}(\mathbf{x}) = \log_2 \dbinom{t-1}{n-1}$ (70)

from (68). However, there is an alternate derivation of (70). For $0 < p < 1$, it follows from (69) that $L(p \mid \mathbf{y}) = L(p \mid \mathbf{x})$ if and only if

$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} x_i = t$. (71)

But for fixed positive integers $t$ and $n$, we have from [28] that the number of solutions to (71) in positive integers $y_1, \ldots, y_n$ is

$\dbinom{t-1}{n-1}$. (72)

Thus, (70) follows for $T_3$ from (72) and (20), so $I_{\mathrm{lost}}^{T_1}(\mathbf{x}) = I_{\mathrm{lost}}^{T_2}(\mathbf{x}) = I_{\mathrm{lost}}^{T_3}(\mathbf{x})$ from Theorem 3.
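Formula (68), equivalently (70) with the count (72), is easy to evaluate and to verify by enumeration for small samples; the following Python sketch and its data are our own illustration.

```python
import math
from itertools import product

def geometric_lost_info_bits(x):
    """Lost information (68) for T = sum(x) under a geometric model:
    log2 C(t-1, n-1), the log of the number of positive-integer solutions of (71)."""
    n, t = len(x), sum(x)
    return math.log2(math.comb(t - 1, n - 1))

# Verify the count (72) by brute force for a small hypothetical sample.
x = (2, 1, 4)
n, t = len(x), sum(x)
count = sum(1 for y in product(range(1, t + 1), repeat=n) if sum(y) == t)
assert count == math.comb(t - 1, n - 1)        # 15 compositions of 7 into 3 parts
print(geometric_lost_info_bits(x))             # log2(15)
```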
As a numerical illustration, let the random variable $X$ denote the number of flips of a possibly biased coin until a head is obtained. Then, $X$ has a geometric distribution, where the parameter $p$ is now the probability of getting a head on any flip. Suppose this experiment is performed three times, yielding the sample data shown in Table 4. $I_{\mathrm{lost}}^{T}(\mathbf{x})$ is then calculated for each of the sufficient statistics for $p$ of Example 3. Observe that the individual statistics depend on $p$ while the lost information does not. Moreover, $I_{\mathrm{lost}}^{T}(\mathbf{x})$ is the same for the three statistics for all data samples, as established above.
Table 4.
Geometric Example.
6. Conclusions
In this paper, the Shannon information obtained for a random sample taken from a discrete random variable with a single parameter $\theta$ was decomposed into two components: (i) the reduced information associated with the value $t$ of a real-valued statistic $T$ evaluated at the data sample $\mathbf{x}$, and (ii) the information lost by using this value as a surrogate for $\mathbf{x}$. Information is lost because multiple data sets can give the same value of the statistic. In data analysis, the data uniquely determines the value of a statistic, but typically the value of the statistic does not uniquely determine the data yielding it. The lost information thus measures the knowledge unavailable about the data sample $\mathbf{x}$ when only the reduced data summary $t$ is known, but not $\mathbf{x}$ itself. To eliminate the effect of $\theta$, we focused on sufficient statistics for $\theta$ such as the sample mean. We then answered the question: how much Shannon information about a data sample $\mathbf{x}$ is lost to someone when only the value $t = T(\mathbf{x})$ is available but not the data itself? Our answer is independent of the parameter $\theta$ and does not require that $\theta$ be known. Our method generalizes the approach of [2] for analyzing the information contained in a sequence of Bernoulli trials.

More generally, we developed a metric associated with the value $t = T(\mathbf{x})$ used to summarize, represent, or characterize a given data set $\mathbf{x}$. Our approach and results are significant because such statistics are often communicated without the original data. One could argue that $I_{\mathrm{lost}}^{T}(\mathbf{x})$ should be communicated along with $t$, in a manner similar to providing the margin of error associated with the results of a poll. A small $I_{\mathrm{lost}}^{T}(\mathbf{x})$ would signify that $t$ is more informative about $\mathbf{x}$ than if $I_{\mathrm{lost}}^{T}(\mathbf{x})$ were large.

In addition, we defined the entropic loss associated with a sufficient statistic $T$ under consideration as the expected lost information over all possible samples, to give a value dependent only on $T$. We noted but did not explore the possibility that entropic loss could be used as a metric to compare different sufficient statistics. Moreover, if sufficient statistics were not required, entropic loss could provide metrics on either $T$ or $\theta$ if the other of these variables is fixed. Finally, numerical examples of our results were presented and some computational issues noted.
Author Contributions
M.M. suggested the topic after reading [2]. She shared in the development of the theory and examples of this paper and wrote early drafts. H.C. formulated the general decomposition used here. He shared in the development of the theory and examples of this paper and edited the final draft. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Landauer, R. Irreversibility and heat generation in the computing process. IBM J. Res. Dev. 1961, 5, 183–191.
- Hodge, S.E.; Vieland, V.J. Information loss in binomial data due to data compression. Entropy 2017, 19, 75.
- Casella, G.; Berger, R.L. Statistical Inference, 2nd ed.; Cengage Learning: Delhi, India, 2002.
- Pawitan, Y. All Likelihood: Statistical Modeling and Inference Using Likelihood, 1st ed.; The Clarendon Press: Oxford, UK, 2013.
- Rohatgi, V.K.; Saleh, A.K.E. An Introduction to Probability and Statistics, 2nd ed.; John Wiley & Sons, Inc.: New York, NY, USA, 2001.
- Shannon, C.E.; Weaver, W. The Mathematical Theory of Communication, 1st ed.; The University of Illinois Press: Urbana, IL, USA, 1964.
- Shannon, C. A mathematical theory of communication. Bell. Syst. Tech. J. 1948, 27, 379–423.
- Vigo, R. Representational information: A new general notion and measure of information. Inf. Sci. 2011, 181, 4847–4859.
- Vigo, R. Complexity over uncertainty in generalized representational information theory (GRIT): A structure-sensitive general theory of information. Information 2013, 4, 1–30.
- Klir, G.J. Uncertainty and Information: Foundations of Generalized Information Theory, 1st ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2006.
- Devlin, K. Logic and Information, 1st ed.; Cambridge University Press: Cambridge, UK, 1991.
- Luce, R.D. Whatever happened to information theory in psychology? Rev. Gen. Psychol. 2003, 7, 183–188.
- Floridi, L. The Philosophy of Information, 1st ed.; Oxford University Press: Oxford, UK, 2011.
- Garner, W.R. The Processing of Information and Structure, 1st ed.; Wiley: New York, NY, USA, 1974.
- Spellerberg, I.F.; Fedor, P.J. A tribute to Claude-Shannon (1916–2001) and a plea for more rigorous use of species richness, species diversity and the "Shannon-Wiener" index. Glob. Ecol. Biogeogr. 2003, 12, 177–179.
- Shamir, O.; Sabato, S.; Tishby, N. Learning and generalization with the information bottleneck. Theor. Comput. Sci. 2010, 411, 2696–2711.
- Csiszár, I. Axiomatic characterizations of information measures. Entropy 2008, 10, 261–273.
- Kapur, J.N.; Kesavan, H.K. Entropy Optimization Principles and Their Applications, 1st ed.; Water Science and Technology Library, Springer: Dordrecht, The Netherlands, 1992; Volume 9.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2006.
- Ibekwe-SanJuan, F.; Dousa, T. Theories of Information, Communication and Knowledge: A Multidisciplinary Approach, 1st ed.; Springer: Dordrecht, The Netherlands, 2014.
- Johnson, J.L. Probability and Statistics for Computer Science, 1st ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2008.
- Beeler, R.A. How to Count: An Introduction to Combinatorics and Its Applications, 1st ed.; Springer: Cham, Switzerland, 2015.
- Sridharan, S.; Balakrishnan, R. Foundations of Discrete Mathematics with Algorithms and Programming, 1st ed.; Chapman and Hall/CRC: New York, NY, USA, 2018.
- Mortici, C. Ramanujan formula for the generalized Stirling approximation. Appl. Math. Comput. 2010, 19, 2579–2585.
- Multinomial Coefficient Calculator. Available online: https://mathcracker.com/multinomial-coefficient-calculator.php (accessed on 27 July 2019).
- Wan, L.; Mehta, K.V.; Klasky, S.A.; Wolf, M.; Wang, H.Y.; Wang, W.H.; Li, J.C.; Lin, Z. Data management challenges of exascale scientific simulations: A case study with the Gyrokinetic Toroidal Code and ADIOS. In Proceedings of the 10th International Conference on Computational Methods, ICCM'19, Singapore, 9–13 July 2019.
- Arute, F.; Arya, K.; Martinis, J.M. Quantum supremacy using a programmable superconducting processor. Nature 2019, 574, 505–510.
- Mahmoudvand, R.; Hassani, H.; Farzaneh, A.; Howell, G. The exact number of nonnegative integer solutions for a linear Diophantine inequality. IAENG Int. J. Appl. Math. 2010, 40, 5.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).