Information Loss Due to the Data Reduction of Sample Data from Discrete Distributions

In this paper, we study the information lost when a real-valued statistic is used to reduce or summarize sample data from a discrete random variable with a one-dimensional parameter. We compare the probability that a random sample gives a particular data set to the probability of the statistic's value for this data set. We focus on sufficient statistics for the parameter of interest and develop a general formula, independent of the parameter, for the Shannon information lost when a data sample is reduced to such a summary statistic. We also develop a measure of entropy for this lost information that depends only on the real-valued statistic but neither the parameter nor the data. Our approach would also work for non-sufficient statistics, but the lost information and associated entropy would involve the parameter. The method is applied to three well-known discrete distributions to illustrate its implementation.


Introduction
We consider the data sample x = (x_1, ..., x_n) from a random sample X = (X_1, ..., X_n) for a discrete random variable X with sample space S and one-dimensional parameter θ. A statistic T(X) is a function of the random sample X for any fixed but arbitrary value of a parameter θ associated with the underlying X. Thus a statistic T(X) is a random variable itself. Here we consider only real-valued statistics that reduce the data sample x to a number T(x) that might be used to summarize X, to characterize X, or perhaps to estimate θ. However, data reduction is an irreversible process [1] and always involves some information loss. For instance, if T(X) is the sample mean X̄, the original measurements x cannot be reconstructed from the value of X̄, and some information about x is lost. Nonetheless, such data reduction is frequently used to make inferences.
More explicitly, our motivation for considering such situations is that T(x) is usually communicated in practice as a summary for the data x but without the actual data. The question then naturally arises: how much information is lost to someone about a data sample x when only the value of T is available for x, but not the data itself? To answer this question, we develop a theoretical framework for determining how much information is lost about a given data set x by knowing only the value of T(x) but neither x itself nor the parameter θ. Our information-theoretic approach to data reduction generalizes the observation in [2] that a binomial random variable loses all the information about the order of successes in the associated sequence of Bernoulli trials. In other words, for a series of n Bernoulli trials one cannot recreate the order of successes by only knowing the number that occurred.

Result 1 (FFT [3]).
The real-valued statistic T(X) is sufficient for θ if and only if there exist functions g : R^1 → R^1 and h : S^n → R^1 such that for any sample data x and for all values of θ the joint pmf f(x|θ) of X can be factored as

f(x|θ) = g[T(x)|θ] × h(x), (2)

for real-valued, nonnegative functions g on R^1 and h on S^n. The function h does not depend on θ, while g does depend on x, but only through T(x).
We focus on a sufficient statistic T for θ in Section 3, where we need the notion of a partition [5], defined next.

Definition 2 (Partition [3]). Let S be the denumerable sample space of the discrete random variable X, so that S^n is the denumerable sample space of the random sample X. For any statistic T : S^n → R^1, let τ_T be the denumerable set

τ_T = {t | ∃x ∈ S^n for which t = T(x)},

which is the range of T. Then T partitions the sample space S^n into the mutually exclusive and collectively exhaustive partition sets

A_t = {x ∈ S^n | T(x) = t}, ∀t ∈ τ_T.

We also use the well-known likelihood function.


Definition 3 (Likelihood Function [3-5]).
Let x be sample data from a random sample X from a discrete random variable X with sample space S and real-valued parameter θ, and let f(x|θ) denote the joint pmf of the random sample X. For any sample data x, the likelihood function of θ is defined as

L(θ|x) = f(x|θ). (3)

The likelihood function L(θ|x) in (3) is a function of the variable θ for given data x. However, the joint pmf f(x|θ) as a function of x for fixed θ is frequently called the likelihood function as well. In this case, we also write the joint pmf as L(x|θ). We distinguish the two cases since L(θ|x) is not a statistic but L(x|θ) is one that incorporates all available information about X. Moreover, L(x|θ) is an SS for θ [4] and uniquely determines an associated SS called the likelihood kernel to be used in subsequent examples.
We next define a new concept called the likelihood kernel. As a function of x for fixed θ, it is shown below to be a sufficient statistic for θ and is used in Section 5 to facilitate the computation of lost information associated with other sufficient statistics T. As a function of θ for fixed x, a possibility not considered here, the likelihood kernel may be useful in applying the likelihood principle [3,4] to make inferences about θ without resorting to the notion of equivalence classes. It would be the "simplest" factor of L(θ|x) that can be used in a likelihood ratio comparing two values of θ.

Definition 4 (Likelihood kernel).
Let S be the sample space of X. For fixed but arbitrary θ, suppose that L(x|θ) can be factored as

L(x|θ) = K(x|θ) × R(x), (4)

where K : S^n → R^1 and R : S^n → R^1 have the following properties.
(a) Every nonnumerical factor of K(x|θ) contains θ.
(b) R(x) does not contain θ.
(c) K(x|θ) ≥ 0 and R(x) ≥ 0, ∀x ∈ S^n.
(d) K(x|θ) is not divisible by any positive number except 1.
Then K(x|θ) is defined as the likelihood kernel of L(x|θ) and R(x) as the residue of L(x|θ).
Theorem 1. The likelihood kernel K(x|θ) has the following properties.
(i) K(x|θ) exists and is unique.
(ii) K(x|θ) is an SS for θ.
(iii) For any θ_1 and θ_2, the likelihood ratio satisfies

L(x|θ_1) / L(x|θ_2) = K(x|θ_1) / K(x|θ_2).

Proof. To prove (i), for fixed θ we first show that the likelihood kernel K(x|θ) of Definition 4 exists by construction. Since the formula for L(θ|x) = f(x|θ) must explicitly contain θ, the parameter θ cannot appear only in the range of x. Hence L(x|θ) as a function of x can be factored into K(x|θ) × R(x), satisfying (a) and (b) of Definition 4, where K(x|θ) ≥ 0, ∀x ∈ S^n, and the numerical factor of K(x|θ) is either +1 or −1. Then R(x) ≥ 0, ∀x ∈ S^n, since K(x|θ) ≥ 0, ∀x ∈ S^n, and K(x|θ) × R(x) = f(x|θ) ≥ 0. Thus (c) is satisfied. Finally, the only positive integer that evenly divides +1 or −1 is 1, so (d) holds. It follows that the likelihood kernel K(x|θ) and its associated residue R(x) in Definition 4 are well defined and exist. We next show that K(x|θ) as constructed above is unique. Let K_1(x|θ) with residue R_1(x) and K_2(x|θ) with residue R_2(x) both satisfy Definition 4. Thus for j = 1, 2, R_j(x) does not contain θ, while every nonnumerical factor of K_j(x|θ) does contain θ. It follows that K_1(x|θ) ≥ 0 and K_2(x|θ) ≥ 0 must be identical or else be a positive multiple of one another. Assume that K_2(x|θ) = λK_1(x|θ) for some λ > 0. If λ ≠ 1, K_2(x|θ) is divisible by a positive number other than 1, contradicting (d). Thus K(x|θ) is unique.
To prove (ii), we show that this unique K(x|θ) is an SS for θ. Since f(x|θ) = K(x|θ) × R(x), the factorization of Result 1 holds with T(x) = K(x|θ), g the identity function, and h(x) = R(x), so K(x|θ) is an SS by the FFT of Result 1. Finally, (iii) follows from Definition 4 and the fact that the joint pmf L(x|θ_2) ≠ 0 for x ∈ S^n. We next discuss the notion of information to be used. Actually, probability itself is a measure of information in the sense that it captures the surprise level of an event. An observer obtains more information, i.e., surprise, if an unlikely event occurs than if a likely one does. Instead of probability, however, we use the additive measure known as Shannon information [6,7], defined as follows.

Definition 5 (Shannon Information [6,7])
. Let x be sample data for the random sample X from the discrete random variable X with a one-dimensional parameter θ, and let f(x|θ) be the joint pmf of X at x. The Shannon information obtained from the sample data x is defined as

I(x|θ) = −log f(x|θ), (5)

where the units of I(x|θ) are bits if the base of the logarithm is 2, which is used here.
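As a concrete numerical sketch of Definition 5 (our own illustration; the Bernoulli sample and θ below are hypothetical, not from the paper), the Shannon information of a data sample is simply −log2 of its joint pmf:

```python
import math

def shannon_information(probability: float) -> float:
    """Shannon information I = -log2(p), in bits, of an event with probability p."""
    return -math.log2(probability)

def joint_pmf_bernoulli(x, theta):
    """Joint pmf f(x|theta) of an i.i.d. Bernoulli(theta) sample x of 0s and 1s."""
    k = sum(x)
    return theta ** k * (1 - theta) ** (len(x) - k)

x = (1, 0, 1, 1)   # hypothetical sample of n = 4 Bernoulli trials
theta = 0.5
info = shannon_information(joint_pmf_bernoulli(x, theta))
print(info)        # 4.0 bits: each fair trial contributes exactly one bit
```

For θ = 0.5 every length-4 sample has probability 1/16, so any such sample carries the same 4.0 bits; for θ ≠ 0.5 rarer samples carry more bits, matching the "surprise" interpretation above.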
Other definitions for information have been proposed. For example, Vigo [8,9] has defined a measure of representational information. Further details on different types of information can be found in [10-16]. For Shannon information, we use its expected value over all x ∈ S^n.
The general properties of Shannon entropy are given in [17-20], for example. Since entropy is the expected information over all possible random samples, it can be argued that entropy is a better measure of the available information about X than the Shannon information for a single data set x would be, since a single data set might not be typical [18]. We next give a method to obtain the information loss about X that occurs when a data set x is reduced to T(x). In our approach, we focus on a sufficient statistic T, so there will be no θ in (5) for the lost information below.

Information Decomposition under Data Reduction by a Real-Valued Statistic
We now develop a procedure to determine how much information about X contained in a data set x is lost when the data is reduced to T(x) by the sufficient statistic T. Consider the joint conditional probability

P_θ[X = x | T(X) = T(x)], (7)

which is identified with the probabilistic information lost about the event X = x by the data reduction of x to T(x). The notation P_θ refers to the fact that the discrete probability (7), in general, involves the parameter θ. We next express (7) using the definition of conditional probability to obtain the basis of our development. Result 2 is given in ([3], p. 273) and proven below to illustrate the reasoning.
Result 2 [3]. Let x be sample data for a random sample X from a discrete random variable X with sample space S and real-valued parameter θ, and let T(X) be any real-valued statistic. Then

P_θ[X = x | T(X) = T(x)] = P_θ[X = x] / P_θ[T(X) = T(x)]. (8)

Proof. Using the definition of conditional probability, rewrite (7) as

P_θ[X = x | T(X) = T(x)] = P_θ[X = x, T(X) = T(x)] / P_θ[T(X) = T(x)]. (9)

However, T(X) = T(x) whenever X = x, so P_θ[X = x, T(X) = T(x)] = P_θ[X = x] and (8) follows.
Observe that if T is an SS for θ, the left side of (8) is independent of θ by the FFT, and hence so is the right. Taking the negative logarithm of (8) and rearranging terms gives

−log P_θ[X = x] = −log P_θ[T(X) = T(x)] − log P_θ[X = x | T(X) = T(x)]. (10)

These facts suggest that the left side of (10) is the total Shannon information in bits about X contained in the sample data x. On the right side of (10), the term −log P_θ[T(X) = T(x)] is considered the information about X contained in the reduced data summary T(x), and the term −log P_θ[X = x | T(X) = T(x)] is identified as the information about X that has been lost as the result of the data reduction by T(x).
In particular, this lost information represents a combinatorial loss in the sense that multiple x's may give the same value T(x) = t, as depicted in Figure 1. In other words, the lost information −log P_θ[X = x | T(X) = T(x)] is a measure of the knowledge unavailable about the data sample x when only the reduced data summary T(x) is known but not x itself. For a sufficient statistic T(X) for θ, this lost information is independent of θ. It is a characteristic of T(X) for the given data sample x.
In terms of Figure 1, (10) may be described as follows. On the left of the figure is the sample space S^n ⊆ R^n over which probabilities on X are computed. On the right is the range τ_T ⊆ R^1 of T over which the probabilities of T(X) are computed. T reduces the data sample x into T(x), where multiple x's may give the same T(x) = t. In Figure 1, the distinct data samples x_1, x_2, and x_3 are all reduced into the same value t_1. However, knowing that T(x) = t_1 for some data sample x does not provide sufficient information to know unequivocally, for example, that x = x_1. Information is lost in the reduction. One can also say that the total information −log P_θ[X = x] obtained from the left side of Figure 1 is reduced to −log P_θ[T(X) = T(x)] obtained from the right. The reduction in information from the left to the right side is precisely the lost information −log P_θ[X = x | T(X) = T(x)] of (10). For fixed t, it is lost due to the ambiguity as to which data sample on the left actually gave t. There is no such ambiguity when T is one-to-one.
The general decomposition of information in (10) is next summarized in Definition 7, where T does not need to be sufficient for θ.

Definition 7 (I_total, I_reduced, I_lost). Let x be sample data for a random sample X from a discrete random variable X with sample space S and real-valued parameter θ. For any real-valued statistic T(X), the Shannon information about X obtained from the sample data x can be decomposed as

I_total(x|θ) = I_reduced(x|θ, T) + I_lost(x|θ, T), (11)

where

I_total(x|θ) = −log P_θ[X = x], (12)

I_reduced(x|θ, T) = −log P_θ[T(X) = T(x)], (13)

and

I_lost(x|θ, T) = −log P_θ[X = x | T(X) = T(x)]. (14)

Definition 7 generalizes the information decomposition of [2] for a data sample x of size n from a Bernoulli random variable X. In terms of this paper, the parameter θ in [2] is the probability 0.5 of success on a single Bernoulli trial, and T(x) = ∑_{i=1}^n x_i. It should be noted that the notation I_comp in [2], which refers to compressed information, corresponds to I_reduced in Equation (13). We use the term "data reduction" as described in [3] as opposed to "data compression" to prevent misinterpretation. In computer science, data compression refers to encoding information using fewer bits than the original representation and is often lossless.
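The decomposition in Definition 7 can be checked numerically by brute force for a small Bernoulli sample, the setting of [2]. In this sketch (our own illustration; the sample x and θ are hypothetical), T is the sum statistic and all probabilities are computed by enumerating S^n:

```python
import math
from itertools import product

theta, n = 0.3, 4
x = (1, 0, 1, 1)                      # hypothetical data sample

def pmf(sample):                      # joint Bernoulli pmf f(x|theta)
    k = sum(sample)
    return theta ** k * (1 - theta) ** (n - k)

T = sum                               # the statistic T(x) = sum of the x_i
p_T = sum(pmf(y) for y in product((0, 1), repeat=n) if T(y) == T(x))

I_total   = -math.log2(pmf(x))        # (12)
I_reduced = -math.log2(p_T)           # (13)
I_lost    = -math.log2(pmf(x) / p_T)  # (14), using Result 2

assert abs(I_total - (I_reduced + I_lost)) < 1e-9   # decomposition (11)
print(round(I_lost, 4))               # log2 C(4,3) = 2 bits
```

Here four samples share the sum 3, so the reduction to T(x) costs log2 4 = 2 bits regardless of θ, previewing the sufficiency results of the next section.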
Both Result 2 and Definition 7 are valid for any real-valued statistic for X. The notation I_total(x|θ) indicates that I_total is a function of the sample data x for a fixed but arbitrary parameter value θ. Similarly, both I_reduced(x|θ, T) and I_lost(x|θ, T) are functions of x for fixed θ and T. However, in this paper we focus on sufficient statistics, which provide a simpler expression for I_lost(x|θ, T) that does not involve θ. For a sufficient statistic T for θ, we use the notation I_lost(x|T) for the lost information, though I_total(x|θ) and I_reduced(x|θ, T) still require θ. The next result is an application of the FFT of Result 1.

Theorem 2 (Lost Information for an SS)
. Let x be sample data for a random sample X from a discrete random variable X with sample space S and real-valued parameter θ. Let T be an SS for θ, let f(x|θ) be the joint pmf of X, and write f(x|θ) = g[T(x)|θ] × h(x) as in Result 1. Then for all x ∈ S^n,

I_lost(x|T) = −log [ h(x) / ∑_{y∈A_T(x)} h(y) ]. (15)

Proof. Let x ∈ S^n. Then f(x|θ) > 0 since x is a realization of X. Because T is an SS, we write (7) without θ. It now suffices to establish that

P[X = x | T(X) = T(x)] = h(x) / ∑_{y∈A_T(x)} h(y), (16)

from which (15) immediately follows. Rewrite (8) as

P[X = x | T(X) = T(x)] = f(x|θ) / P_θ[T(X) = T(x)], (17)

so from (17) and (2),

P[X = x | T(X) = T(x)] = g[T(x)|θ] h(x) / ∑_{y∈A_T(x)} g[T(y)|θ] h(y). (18)

However, T(y) = T(x), ∀y ∈ A_T(x) in (18), so

P[X = x | T(X) = T(x)] = g[T(x)|θ] h(x) / ( g[T(x)|θ] ∑_{y∈A_T(x)} h(y) ). (19)

Since f(x|θ) > 0 and hence g[T(x)|θ] ≠ 0, this term can be canceled on the right side of (19) to yield (16). Taking the −log of (16) completes the proof.
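Theorem 2 is easy to exercise for a Bernoulli sample with T(x) = ∑ x_i, where the factorization of Result 1 has h(x) = 1. The following sketch (our own illustration with a hypothetical sample) computes (15) directly from the partition set; note that θ never appears:

```python
import math
from itertools import product

# For a Bernoulli sample with T(x) = sum(x), Result 1 holds with h(x) = 1,
# so (15) gives I_lost(x|T) = -log2( 1 / |A_{T(x)}| ) = log2 |A_{T(x)}|.
n = 5
x = (1, 1, 0, 1, 0)                      # hypothetical data sample
h = lambda y: 1.0                        # the h of the factorization (2)
partition_set = [y for y in product((0, 1), repeat=n) if sum(y) == sum(x)]
i_lost = -math.log2(h(x) / sum(h(y) for y in partition_set))
print(round(i_lost, 4))                  # log2 C(5,3) = log2 10 ≈ 3.3219
```

The value depends only on the partition set A_T(x) of Definition 2, illustrating why the lost information for a sufficient statistic is free of θ.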
Now consider Theorem 2 when each A_t is a singleton in (16), i.e., when T is a one-to-one function. In this extreme case, ∑_{y∈A_T(x)} h(y) = h(x) in the denominator of the right side of (16). Thus I_lost(x|T) = 0, from which I_reduced(x|θ, T) = I_total(x|θ) for all x in S^n. Thus, the special case of a one-to-one T justifies the identification of the lost information as −log P[X = x | T(X) = T(x)]. In other words, for all data samples x, y ∈ S^n, if T(x) ≠ T(y) whenever x ≠ y, then P[X = x | T(X) = T(x)] is not diminished by the reduction of the singleton A_T(x) to the number T(x).
More generally, it is also true that I_lost(x|θ, T) = 0 when T is one-to-one but not sufficient for θ. At the other extreme is a constant statistic T(x) = c, for which A_c = S^n and I_lost(x|θ, T) = I_total(x|θ) on S^n. In this case, I_reduced(x|θ, T) = 0 because the event T(x) = c gives no information about x.
We also note that I_lost(x|T) could be used as a metric to compare sufficient statistics for a given data sample x. For example, T_1 could be regarded as better than T_2 for x if I_lost(x|T_1) < I_lost(x|T_2). However, this comparison would be limited to the given x. In Section 4, we propose but do not explore a metric based on entropy that is independent of a particular data sample. We next show that (16) can be simplified when T is the likelihood function.

Corollary 1 (Information Loss for Likelihood Function).
Under the assumptions of Theorem 2, if T(x) = L(x|θ), then

I_lost(x|L(x|θ)) = log |A_L(x|θ)|, (20)

where |A_L(x|θ)| is the cardinality of the partition set A_t for t = L(x|θ). We next state a reproductive property of a statistic T′ that is a one-to-one function of a sufficient statistic T for θ.

Theorem 3.
If there is a one-to-one function between a sufficient statistic T for θ and an arbitrary real-valued statistic T′ on S^n, then the following hold.
(i) T′ is also an SS.
(ii) T and T′ partition the sample space S^n into the same partition sets.
(iii) I_lost(x|T) = I_lost(x|T′), ∀x ∈ S^n.

Proof.
To prove (i), let u be a real-valued one-to-one function such that

T(x) = u[T′(x)], ∀x ∈ S^n. (21)

Since T is an SS, by Equation (2) there are real-valued functions g on R^1 and h on S^n for which

f(x|θ) = g[T(x)|θ] × h(x). (22)

By substituting T(x) from (21) in (22), we get

f(x|θ) = g[u(T′(x))|θ] × h(x), (23)

which can be rewritten as

f(x|θ) = (g ∘ u)[T′(x)|θ] × h(x). (24)

Since T′ in (24) satisfies the condition of Result 1 for g′ = g ∘ u, T′ is an SS.
To prove (ii), we use Definition 2. Let T partition the sample space S^n into the mutually exclusive and collectively exhaustive sets A_t = {x | T(x) = t}, ∀t ∈ τ_T. By Equation (21) we can also write A_t as

A_t = {x | u[T′(x)] = t}, ∀t ∈ τ_T. (25)

Since u is a one-to-one function, it has an inverse u^{−1}. Letting u^{−1}(t) = t′, we apply u^{−1} inside the right side of (25) and get

A_t = {x | T′(x) = t′}, ∀t′ ∈ u^{−1}(τ_T). (26)

However, u^{−1}(τ_T) = τ_T′ and the cardinalities satisfy |τ_T| = |τ_T′|, so the right side of (26) is A_t′ and

A_t = A_t′. (27)

Finally, to get (iii) we use Theorem 2 to calculate the lost information for the two statistics T and T′. Since h(x) is the same in (22) and (24) and since Equation (27) holds, we sum h(x) over the same sets in the denominator of Equation (16) for both T and T′ to give

I_lost(x|T) = I_lost(x|T′), ∀x ∈ S^n, (28)

and complete the proof.
We next compare the information loss of the sufficient statistic L(x|θ) to that of other sufficient statistics. For the sufficient statistic K(x|θ), a lemma is needed.

Lemma 1. Let x be any data sample for a random sample X from the discrete random variable X with real-valued parameter θ. Then K(x|θ) is a function of L(x|θ) and |τ_L| ≥ |τ_K|.
Proof. From ([3], p. 280), K(x|θ) is a function of L(x|θ) if and only if K(x|θ) = K(y|θ) whenever L(x|θ) = L(y|θ). For all data samples x and y, we prove that if L(x|θ) = L(y|θ), then K(x|θ) = K(y|θ). Thus suppose that L(x|θ) = L(y|θ). By Definition 4, we can decompose L(x|θ) and L(y|θ) into K(x|θ)R(x) and K(y|θ)R(y), respectively. Note that K(y|θ) ≠ 0. Otherwise, L(y|θ) = 0, in contradiction to y being sample data with a nonzero probability of occurring. Now write

K(x|θ) / K(y|θ) = R(y) / R(x). (29)

Suppose that K(x|θ) ≠ K(y|θ), so that K(x|θ)/K(y|θ) ≠ 1 in (29). From Definition 4, every nonnumerical factor of K(x|θ) and K(y|θ) contains θ. Moreover, neither K(x|θ) nor K(y|θ) is divisible by any positive number except the number 1. Hence, since R(y)/R(x) does not contain θ, the nonnumerical factors of K(x|θ) and K(y|θ) must cancel in (29), and the remaining numerical factors could not be identical. Thus at least one of these factors would be divisible by a positive number other than 1, in contradiction to Definition 4. It now follows that K(x|θ) = K(y|θ), so K(x|θ) is some function u of L(x|θ). Finally, |τ_L| ≥ |τ_K| since this function u is surjective from τ_L onto its image u(τ_L) = τ_K.

Lemma 2.
Under the conditions of Lemma 1, the sufficient statistics L and K satisfy

I_lost(x|L) ≤ I_lost(x|K), ∀x ∈ S^n. (30)

Proof. Let x ∈ S^n and suppose that y ∈ A_L(x). Then L(y|θ) = L(x|θ), so it follows from Lemma 1 that K(y|θ) = K(x|θ) and thus y ∈ A_K(x). Hence A_L(x) ⊆ A_K(x), and so

P[X = x | L(X|θ) = L(x|θ)] ≥ P[X = x | K(X|θ) = K(x|θ)]. (31)

Taking the negative log of both sides of the inequality in (31) and using (14) gives (30).
Theorem 4. Let x be sample data for a random sample X from a discrete random variable X with the real-valued parameter θ. Then for all x ∈ S^n,

I_lost(x|L) ≤ I_lost(x|K). (32)
As a consequence of Theorem 3, Theorem 4 has an immediate corollary.

Corollary 2.
Under the conditions of Theorem 4, let T be a sufficient statistic for θ for which there is a one-to-one function between T and K. Then for all x ∈ S^n,

I_lost(x|L) ≤ I_lost(x|T). (34)

The question remains open as to whether (34) holds for all sufficient statistics T for θ. Regardless, the proofs of Lemma 2 and Theorem 4 illustrate the fact that the relation between the lost information for two statistics T and T′ is determined by the relation between their partition sets A_t = {x | T(x) = t} and B_t′ = {x | T′(x) = t′}. For example, if for every A_t there exists a B_t′ for which A_t ⊂ B_t′, then the partition of S^n by the B_t′ of T′ is said to be coarser than the partition by the A_t of T. In that case, I_lost(x|θ, T) ≤ I_lost(x|θ, T′) because each x ∈ S^n has more y ∈ S^n with T′(y) = T′(x) than there are with T(y) = T(x). In other words, T′(y) = t′ is at least as ambiguous as T(y) = t in determining the data sample giving the value of the respective statistic.

Entropic Loss for an SS
For a sufficient statistic T for θ, we now propose an entropy measure to characterize T by the expected lost information incurred by the reduction of X to T(X). This expectation is taken over all possible data sets x. This nonstandard entropy measure is called entropic loss, and it depends on neither a particular data set x nor the value of θ. Before defining this measure, we need to determine the appropriate pmf to use in taking an expectation. The following results are used.

Result 3.
Under the assumptions of Theorem 2, for any data sample x, let t = T(x) and consider the partition set A_t. Then

∑_{x∈A_t} P[X = x | T(X) = t] = 1. (35)

Proof. From (16),

∑_{x∈A_t} P[X = x | T(X) = t] = ∑_{x∈A_t} h(x) / ∑_{y∈A_t} h(y) = 1, (36)

to give (35).

Result 4. Under the assumptions of Theorem 2, the sum

∑_{x∈S^n} P[X = x | T(X) = T(x)] = |τ_T|. (37)

Proof. We perform the sum on the left of (37) by first summing over x ∈ A_t for fixed t and then summing over each t ∈ τ_T to give

∑_{x∈S^n} P[X = x | T(X) = T(x)] = ∑_{t∈τ_T} ∑_{x∈A_t} P[X = x | T(X) = t]. (38)

The inner series on the right side of (38) sums to one by Result 3. Hence, the outer sum yields |τ_T| for τ_T = {t | ∃x ∈ S^n for which t = T(x)}.
From (37), it follows that P[X = x | T(X) = T(x)], as a function of x, is not a probability distribution on S^n unless |τ_T| = 1. Moreover, P[X = x | T(X) = T(x)] is not a conditional probability distribution even if |τ_T| = 1, since the condition T(X) = T(x) varies with x. However, we use Result 4 to normalize P[X = x | T(X) = T(x)] and obtain the appropriate pmf for calculating the expectation of I_lost(X|T).
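Result 4 and the normalization it justifies can be checked by enumeration. In this sketch (our illustration, for Bernoulli samples with the sum statistic, where h = 1 so P[X = x | T(X) = T(x)] = 1/|A_T(x)|):

```python
from itertools import product

n = 3
samples = list(product((0, 1), repeat=n))   # S^n for a Bernoulli variable
T = sum
# With h = 1, P[X = x | T(X) = T(x)] = 1 / |A_{T(x)}| by (16).
sizes = {t: sum(1 for y in samples if T(y) == t) for t in {T(y) for y in samples}}
total = sum(1 / sizes[T(x)] for x in samples)
print(round(total, 6))   # |tau_T| = 4 (the possible sums are 0, 1, 2, 3)
```

Each partition set contributes exactly 1 to the sum, so the total equals the number of distinct values of T, not 1, which is why the 1/|τ_T| normalization of Definition 8 below is needed.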

Definition 8 (Entropic Loss). Under the assumptions of Theorem 2, the entropic loss resulting from the data reduction by T is defined as

H_lost(X, T) = −(1/|τ_T|) ∑_{x∈S^n} P[X = x | T(X) = T(x)] log P[X = x | T(X) = T(x)], (39)

which from (15) and (16) can be rewritten as

H_lost(X, T) = −(1/|τ_T|) ∑_{x∈S^n} [ h(x) / ∑_{y∈A_T(x)} h(y) ] log [ h(x) / ∑_{y∈A_T(x)} h(y) ]. (40)

Observe that (39) and (40) are independent of both x and θ. Indeed, for a given underlying random variable X and sample size n, H_lost(X, T) is a function only of T. Thus, H_lost(X, T) could be used as a metric to compare sufficient statistics independent of the data sample. In particular, T_1 could be regarded as better than T_2 if H_lost(X, T_1) < H_lost(X, T_2), i.e., if the expected information loss associated with T_1 is less than that for T_2. Moreover, for a given underlying random variable X and sample size n, Definition 8 could be extended to non-sufficient statistics. In that case, the entropic loss H_lost(X, T, θ) would be a function of both T and θ. For a fixed θ, a non-sufficient statistic T_1 could again be considered as better than a non-sufficient T_2 if H_lost(X, T_1, θ) < H_lost(X, T_2, θ). Furthermore, for a given statistic T, the numerical value θ_1 could be considered as a better numerical point estimate for θ than the value θ_2 if H_lost(X, T, θ_1) < H_lost(X, T, θ_2). Similarly, H_lost(X, T, θ) could be minimized over θ to give a best numerical point estimate for θ based on the entropic loss criterion. However, we do not pursue these possibilities here. We next compute H_lost(X, T) for the sufficient statistic T(X) = L(X|θ).

Theorem 5 (Entropic Loss for Likelihood Function).
Under the assumptions of Theorem 2, the entropic loss resulting from the data reduction by T(x) = L(x|θ) is

H_lost(X, L) = (1/|τ_L|) ∑_{t∈τ_L} log |A_t|. (41)

Proof. From (20), write

H_lost(X, L) = (1/|τ_L|) ∑_{x∈S^n} (1/|A_L(x|θ)|) log |A_L(x|θ)|. (42)

We decompose the sum over x ∈ S^n in (42) into consecutive sums over x ∈ A_t and then t ∈ τ_L to get

H_lost(X, L) = (1/|τ_L|) ∑_{t∈τ_L} ∑_{x∈A_t} (1/|A_t|) log |A_t| = (1/|τ_L|) ∑_{t∈τ_L} log |A_t|. (43)

Equation (41) now follows from (43).
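The equality of the normalized expectation in Definition 8 and the closed form (41) is easy to confirm numerically. A sketch for Bernoulli samples (our illustration; for θ ≠ 1/2 the likelihood θ^s(1−θ)^{n−s} is strictly monotone in s = ∑x_i, so L partitions S^n by the sum):

```python
import math
from itertools import product

n = 4
samples = list(product((0, 1), repeat=n))
# Likelihood partition sets A_t, indexed here by the sum s (theta != 1/2).
groups = {}
for y in samples:
    groups.setdefault(sum(y), []).append(y)

tau = len(groups)                     # |tau_L|
# Definition 8: average of P * I_lost with P = 1/|A_t| and I_lost = log2|A_t| ...
h_direct = sum((1 / len(A)) * math.log2(len(A)) for A in groups.values() for _ in A) / tau
# ... versus Theorem 5's closed form (41).
h_closed = sum(math.log2(len(A)) for A in groups.values()) / tau
assert abs(h_direct - h_closed) < 1e-9
print(round(h_closed, 4))             # (0 + 2 + log2 6 + 2 + 0)/5 ≈ 1.317
```

The inner sum over each A_t collapses to log2|A_t| exactly as in (43), which is the whole content of the proof above.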

Result 5.
Suppose there is a one-to-one function between two sufficient statistics T and T′ for θ. Then H_lost(X, T) = H_lost(X, T′).

Examples and Computational Issues
In this section, we present examples involving the discrete Poisson, binomial, and geometric distributions [21]. For each distribution, three sufficient statistics for a parameter θ are analyzed. For each such T, the quantity I_lost(x|T) does not involve θ. However, calculating I_lost(x|T) can still present computational issues, some of which are discussed below. Our examples are simple in order to focus on the definitions and results of Sections 3 and 4.

Example 1 (Poisson Distribution).
Consider the random sample X = (X_1, ..., X_n) with the data sample x = (x_1, ..., x_n) from a Poisson random variable X. We consider three sufficient statistics for the parameter θ > 0. These sufficient statistics are T_1(X) = ∑_{i=1}^n X_i, the likelihood kernel T_2(X) = K(X|θ) for fixed but arbitrary θ, and the likelihood function T_3(X) = L(X|θ) for fixed but arbitrary θ. We use T_1(X) as a surrogate for the sample mean T̄_1(X) = ∑_{i=1}^n X_i / n. Neither T_1(X) nor T̄_1(X) involves θ, and either can thus be used to characterize X or to estimate θ. Moreover, there is an obvious one-to-one function relating ∑_{i=1}^n X_i / n and ∑_{i=1}^n X_i, so Theorems 3 and 5 establish that I_lost(x|T̄_1) = I_lost(x|T_1) and H_lost(X, T̄_1) = H_lost(X, T_1), respectively. We analyze T_1(X) because it is also Poisson, whereas T̄_1(X) is not Poisson since ∑_{i=1}^n X_i / n is not necessarily a nonnegative integer. In contrast to T_1(X), both T_2(X) and T_3(X) contain θ and can only be used to characterize X. For each of these three sufficient statistics, we develop an expression for I_lost(x|T) and describe how to obtain a numerical value. We then illustrate previous results with a realistic Poisson data sample. We present further computational results in Table 1.

Case 1: T_1(X) = ∑_{i=1}^n X_i, which has a Poisson distribution with parameter nθ [21].
Hence, for a data sample (x_1, ..., x_n),

f(x|θ) = e^{−nθ} θ^{∑_{i=1}^n x_i} / ∏_{i=1}^n x_i!   and   P_θ[T_1(X) = T_1(x)] = e^{−nθ} (nθ)^{∑_{i=1}^n x_i} / (∑_{i=1}^n x_i)!,

and so (8) becomes

P[X = x | T_1(X) = T_1(x)] = C(∑_{i=1}^n x_i; x_1, ..., x_n) / n^{∑_{i=1}^n x_i}, (49)

where the multinomial coefficient is

C(∑_{i=1}^n x_i; x_1, ..., x_n) = (∑_{i=1}^n x_i)! / (x_1! ··· x_n!).

It follows from (49) and (10) that

I_lost(x|T_1) = (∑_{i=1}^n x_i) log n − log C(∑_{i=1}^n x_i; x_1, ..., x_n), (50)

which is also I_lost(x|T̄_1) for the sample mean, as noted above. For a data sample (x_1, ..., x_n), the evaluation of I_lost(x|T_1) in (50) involves computing factorials [22]. For realistic data, the principal limitation to calculating them by direct multiplication is their magnitude. See [23] for a discussion. However, (50) can be approximated using either the well-known Stirling formula or the more accurate Ramanujan approximation [24]. The online multinomial coefficient calculator [25] can evaluate multinomial coefficients when all x_i as well as n are less than approximately 50, provided any x_i = 0 is removed from the coefficient. For the numerical data sample used in this example, evaluating (50) gives I_lost(x|T_1) ≈ 109.66 bits. This value corresponds to 13.708 bytes at 8 bits per byte or to 0.013 kilobytes (KB) at 1024 bytes per KB. It follows from the previous discussion in this example that the Shannon information lost by using the sample mean T̄_1 as a surrogate for x itself is I_lost(x|T̄_1) = I_lost(x|T_1) ≈ 0.013 KB, which seems surprisingly small. Perhaps the small loss results partially from the fact that T̄_1(x) = x̄ = θ = 3 exactly.

Case 2: Let T_2(X) = K(X|θ) for fixed but arbitrary θ > 0. For a data sample (x_1, ..., x_n), write

L(x|θ) = e^{−nθ} θ^{∑_{i=1}^n x_i} × (1 / ∏_{i=1}^n x_i!), (52)

from which

K(x|θ) = e^{−nθ} θ^{∑_{i=1}^n x_i} (53)

and R(x) = 1 / ∏_{i=1}^n x_i! in (4). Note that for all fixed θ > 0 except θ = 1, there is an obvious one-to-one function between T_1(x) = ∑_{i=1}^n x_i and (53). Hence, from Case 1, I_lost(x|K(x|θ)) = I_lost(x|T_1) ≈ 0.013 KB from Theorem 3 for all θ > 0 except θ = 1. For θ = 1, K(x|θ) = e^{−n} from (53) and is constant with respect to any data sample x. Thus, I_reduced(x|1, K) = 0 and I_lost(x|K(x|1)) = I_total(x|1). It follows that K(x|1) provides no information about X.

Case 3: Let T_3(X) = L(X|θ) for fixed but arbitrary θ > 0. From (52), y ∈ A_L(x|θ) if and only if

θ^{∑_{i=1}^n y_i} / ∏_{i=1}^n y_i! = θ^{∑_{i=1}^n x_i} / ∏_{i=1}^n x_i!, (54)

which, for a generic fixed θ, holds exactly when ∑_{i=1}^n y_i = ∑_{i=1}^n x_i and ∏_{i=1}^n y_i! = ∏_{i=1}^n x_i!.
This complication suggests that an efficient implicit enumeration of the y satisfying (54) would be required to obtain A_L(x|θ) for calculating I_lost(x|L(x|θ)) from (20). Using such an algorithm, a conventional computer could possibly compute I_lost(x|L(x|θ)) for the numerical data and value of θ in Case 1, since there is now a 250 petabyte, 200 petaflop conventional computer [26]. Substantially larger problems, if not already tractable, will likely be so in the foreseeable future on quantum computers. Recently, the milestone of quantum supremacy was achieved, where the various possible combinations of a certain randomly generated output were obtained in 110 s, whereas this task would have taken the above conventional supercomputer 10,000 years [27]. Regardless, for the data of Case 1, we have the upper bound I_lost(x|L(x|θ)) ≤ 0.013 KB from (32).
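The factorial-magnitude issue noted in Case 1 can also be sidestepped in floating point by evaluating (50) through the log-gamma function rather than by direct multiplication. A sketch (our own illustration with a hypothetical data sample; math.lgamma(k + 1) returns ln k!):

```python
import math

def i_lost_poisson_sum(x):
    """Lost information (50), in bits, when a Poisson sample x is reduced to
    sum(x): sum(x)*log2(n) - log2 of the multinomial coefficient, via lgamma."""
    n, s = len(x), sum(x)
    log2_multinomial = (math.lgamma(s + 1)
                        - sum(math.lgamma(xi + 1) for xi in x)) / math.log(2)
    return s * math.log2(n) - log2_multinomial

x = (2, 4, 3, 3, 1, 5)          # hypothetical Poisson counts, n = 6
print(round(i_lost_poisson_sum(x), 3))
```

Because only logarithms of factorials are formed, the same function handles counts far beyond the range where exact factorials overflow, at the cost of tiny floating-point error.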
Finally, we present some simple computational results to illustrate the relationships among T_1, T_2, and T_3 with regard to the Poisson distribution. Table 1 below summarizes the results for sample data with x_i ≤ 2. In particular, a complete enumeration of A_L(x|θ) in (20) gives I_lost(x|L(x|θ)).
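For small samples, the complete enumeration behind Table 1 is straightforward. Since the Poisson likelihood is e^{−nθ} θ^{∑x_i} / ∏x_i!, two samples share a likelihood value for generic θ exactly when their sums and their products of factorials agree. A sketch (our illustration; the sample x is hypothetical):

```python
import math
from itertools import product

# Complete enumeration of A_{L(x|theta)} for a small Poisson sample:
# y is in the set iff sum(y) = sum(x) and prod(y_i!) = prod(x_i!) (generic theta).
x = (0, 1, 2)
s = sum(x)
prod_fact = lambda y: math.prod(math.factorial(v) for v in y)
A_L = [y for y in product(range(s + 1), repeat=len(x))
       if sum(y) == s and prod_fact(y) == prod_fact(x)]
print(len(A_L), round(math.log2(len(A_L)), 4))  # 6 permutations; I_lost = log2 6 ≈ 2.585
```

Only the 3! permutations of (0, 1, 2) qualify here, so (20) gives I_lost(x|L) = log2 6, which is also the value obtained from the multinomial formula for T_1 in this case.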

Example 2 (Binomial Distribution).
Consider a random sample X = (X_1, ..., X_n) from a binomial random variable X with parameters m and θ, where θ is the probability of success on any of the m Bernoulli trials associated with the X_i, i = 1, ..., n. Let m be fixed, so the only parameter is θ. Moreover, the sample space of the underlying random variable X is now finite.
Case 1: T_1(X) = ∑_{i=1}^n X_i is an SS for θ. From [21], ∑_{i=1}^n X_i has a binomial distribution with parameter θ for fixed nm. Hence,

f(x|θ) = ∏_{i=1}^n C(m, x_i) θ^{x_i} (1 − θ)^{m−x_i} = [∏_{i=1}^n C(m, x_i)] θ^{∑x_i} (1 − θ)^{nm−∑x_i} (55)

and

P_θ[T_1(X) = T_1(x)] = C(nm, ∑_{i=1}^n x_i) θ^{∑x_i} (1 − θ)^{nm−∑x_i}. (56)

From (1), dividing (55) by (56) gives

P[X = x | T_1(X) = T_1(x)] = ∏_{i=1}^n C(m, x_i) / C(nm, ∑_{i=1}^n x_i). (57)

By taking the −log of (57), the lost information is given as

I_lost(x|T_1) = log C(nm, ∑_{i=1}^n x_i) − log ∏_{i=1}^n C(m, x_i). (58)

Case 2: T_2(X) = K(X|θ). In this case, we use (16) as in Example 1. Write

L(x|θ) = θ^{∑x_i} (1 − θ)^{nm−∑x_i} × ∏_{i=1}^n C(m, x_i), (60)

so that K(x|θ) = θ^{∑x_i} (1 − θ)^{nm−∑x_i} and R(x) = ∏_{i=1}^n C(m, x_i) in (4). To factor the right side of (60) as in (2), let g be the identity function and h(x) = ∏_{i=1}^n C(m, x_i). Hence, (16) and (60) give

P[X = x | K(X|θ) = K(x|θ)] = h(x) / ∑_{y∈A_K(x|θ)} h(y), (61)

where

A_K(x|θ) = {y ∈ S^n | θ^{∑y_i} (1 − θ)^{nm−∑y_i} = θ^{∑x_i} (1 − θ)^{nm−∑x_i}}. (62)

From (62), for any fixed θ satisfying 0 < θ < 1 and θ ≠ 1/2, it can easily be shown that y ∈ A_K(x|θ) if and only if ∑_{i=1}^n y_i = ∑_{i=1}^n x_i. Thus, in general, for a given x and fixed θ, determining A_K(x|θ) in Case 2 would require an enumeration of the y satisfying (62) to compute (61). We perform such an enumeration below for a simple example.

Case 3: T_3(X) = L(X|θ). For a data sample x = (x_1, ..., x_n), we now have

L(x|θ) = [∏_{i=1}^n C(m, x_i)] θ^{∑x_i} (1 − θ)^{nm−∑x_i}, (63)

with g being the identity function and h(x) = 1 in (2). For fixed θ satisfying 0 < θ < 1 and θ ≠ 1/2, from (63) we obtain that y ∈ A_L(x|θ) if and only if

∏_{i=1}^n C(m, y_i) = ∏_{i=1}^n C(m, x_i) and ∑_{i=1}^n y_i = ∑_{i=1}^n x_i. (64)

As in Case 3 of Example 1, developing an algorithm to use (64) and determine A_L(x|θ) for calculating I_lost(x|L(x|θ)) from (20) is beyond the scope of this paper.
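Equation (58) is cheap to evaluate directly with exact binomial coefficients, and θ drops out entirely. A sketch (our own illustration; the sample and m are hypothetical):

```python
import math

def i_lost_binomial_sum(x, m):
    """Lost information (58), in bits, for a Binomial(m, theta) sample x
    reduced to sum(x): log2 C(nm, s) - sum of log2 C(m, x_i)."""
    n, s = len(x), sum(x)
    return math.log2(math.comb(n * m, s)) - sum(math.log2(math.comb(m, xi)) for xi in x)

x = (0, 1, 2)   # three hypothetical observations of a Binomial(m = 2, theta) variable
print(round(i_lost_binomial_sum(x, m=2), 4))   # log2(20/2) = log2(10) ≈ 3.3219
```

For this sample, C(6, 3) = 20 total arrangements share the sum 3, of which the weights C(2,0)C(2,1)C(2,2) = 2 belong to x, so about 3.32 bits are lost.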
As a simple example, consider the experiment of flipping a possibly biased coin twice (m = 2). The total number of heads follows a binomial distribution with the parameter θ, which is the probability of getting a head on any flip. By doing this experiment three times, we generate the random variables X_1, X_2, X_3 with possible values 0, 1, 2. Table 2 shows all the possibilities and the lost information for the statistics. The small size of this example allows the computation of I_lost in Cases 2 and 3 via total enumeration.

Example 3 (Geometric Distribution). Consider a random sample X = (X_1, ..., X_n) from a geometric random variable X with parameter θ, where 0 < θ < 1. For a data sample x = (x_1, ..., x_n),

f(x|θ) = ∏_{i=1}^n θ (1 − θ)^{x_i−1} = θ^n (1 − θ)^{∑x_i − n}. (65)

Case 1: T_1(X) = ∑_{i=1}^n X_i, which has a negative binomial distribution with parameter θ [21]. Hence,

P_θ[T_1(X) = T_1(x)] = C(∑_{i=1}^n x_i − 1, n − 1) θ^n (1 − θ)^{∑x_i − n}. (66)

Moreover, substitution of (65) and (66) into (8) gives

P[X = x | T_1(X) = T_1(x)] = 1 / C(∑_{i=1}^n x_i − 1, n − 1). (67)

Then from (14) and (67) we obtain that

I_lost(x|T_1) = log C(∑_{i=1}^n x_i − 1, n − 1). (68)

Case 2: T_2(X) = K(X|θ). From (65), for all x ∈ S^n, R(x) = 1 and

K(x|θ) = θ^n (1 − θ)^{∑_{i=1}^n x_i − n}. (69)

Thus, for 0 < θ < 1, there is an obvious one-to-one function between T_1(x) = ∑_{i=1}^n x_i and T_2(x) = K(x|θ) in (69). Thus, from Theorem 3, I_lost(x|T_2) = I_lost(x|T_1) as given in (68).

Case 3: T_3(X) = L(X|θ). Since K(X|θ) = L(X|θ) from (69), then

I_lost(x|T_3) = I_lost(x|T_2) = log C(∑_{i=1}^n x_i − 1, n − 1) (70)

from (68). However, there is an alternate derivation of (70). For 0 < θ < 1, it follows from (69) that y ∈ A_L(x|θ) if and only if

∑_{i=1}^n y_i = ∑_{i=1}^n x_i. (71)

But for fixed positive integers x_1, ..., x_n we have from [28] that the number of solutions |A_L(x|θ)| to (71) in positive integers y_1, ..., y_n is

|A_L(x|θ)| = C(∑_{i=1}^n x_i − 1, n − 1). (72)

Thus, (70) follows for L(X|θ) from (72) and (20), so I_lost(x|T_1) = I_lost(x|T_2) = I_lost(x|T_3) from Theorem 3.
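The stars-and-bars count in (72) can be confirmed against a brute-force enumeration of the partition set. A sketch (our illustration with a hypothetical geometric sample):

```python
import math
from itertools import product

# For a geometric sample, y is in A_{L(x|theta)} iff sum(y) = sum(x)
# with every y_i >= 1, and the count of such y is C(sum(x) - 1, n - 1).
x = (2, 1, 4)                  # hypothetical geometric data, n = 3, sum = 7
n, s = len(x), sum(x)
enumerated = sum(1 for y in product(range(1, s + 1), repeat=n) if sum(y) == s)
assert enumerated == math.comb(s - 1, n - 1)          # 15 compositions of 7 into 3 parts
print(round(math.log2(math.comb(s - 1, n - 1)), 4))   # I_lost = log2 15 ≈ 3.9069
```

The same value is the lost information for all three statistics T_1, T_2, and T_3, as established above.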
As a numerical illustration, let the random variable X denote the number of flips of a possibly biased coin until a head is obtained. Then X has a geometric distribution, where the parameter θ is now the probability of getting a head on any flip. Suppose this experiment is performed three times, yielding the sample data $x = (x_1, x_2, x_3)$ shown in Table 4. $I_{lost}(x|T)$ is then calculated for each of the sufficient statistics for θ of Example 3. Observe that the individual statistics depend on θ while the lost information does not. Moreover, $I_{lost}(x|T_1) = I_{lost}(x|T_2) = I_{lost}(x|T_3)$ for all data samples, as established above.
Table 4. Geometric Example.
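Since the entries of Table 4 are not reproduced here, the following sketch generates Table-4-style rows for a hypothetical geometric sample (the sample values and function name are our own), showing that the kernel $K(x|\theta)$ varies with θ while the lost information does not.

```python
from math import comb, log2

def geometric_row(x, theta):
    """One Table-4-style row for a geometric sample x (hypothetical
    values): T1 = sum(x), the kernel T2 = K(x|theta) = L(x|theta)
    (since h(x) = 1 here), and the theta-free lost information."""
    n, t = len(x), sum(x)
    kernel = theta**n * (1 - theta)**(t - n)   # depends on theta
    lost = log2(comb(t - 1, n - 1))            # does not
    return t, kernel, lost

x = (3, 1, 2)   # hypothetical sample of n = 3 geometric waiting times
for theta in (0.25, 0.5):
    print(geometric_row(x, theta))
```

Running this prints two rows with the same $T_1$ and lost-information columns but different kernel values, mirroring the observation in the text.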

Conclusions
In this paper, the Shannon information obtained for a random sample X taken from a discrete random variable X with a single parameter θ was decomposed into two components: (i) the reduced information associated with the value of a real-valued statistic T(X) evaluated at the data sample x, and (ii) the information lost by using this value as a surrogate for x. Information is lost because multiple data sets can give the same value of the statistic. In data analysis, the data uniquely determines the value of a statistic, but typically the value of the statistic does not uniquely determine the data yielding it. The lost information thus measures the knowledge unavailable about the data sample x when only the reduced data summary T(x) is known, but not x itself. To eliminate the effect of θ, we focused on sufficient statistics for θ such as the sample mean. We then answered the question: how much Shannon information is lost to someone about a data sample x when only the value of T(x) is available but not the data x itself? Our answer is independent of the parameter θ and does not require that θ be known. Our method generalizes the approach of [2] for analyzing the information contained in a sequence of Bernoulli trials.
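The decomposition described above can be verified numerically for the binomial example. In this sketch (the function name and the values m = 2, θ = 0.3 are illustrative assumptions), the total Shannon information $-\log_2 P(X = x)$ splits exactly into the reduced information $-\log_2 P(T_1 = \sum x_i)$ plus the θ-free lost part.

```python
from math import comb, log2

def information_decomposition(x, m, theta):
    """Split -log2 P(X = x) into -log2 P(T1 = sum(x)) plus the
    theta-free lost information, for Binomial(m, theta) observations."""
    n, t = len(x), sum(x)
    p_x = 1.0
    for xi in x:
        p_x *= comb(m, xi) * theta**xi * (1 - theta)**(m - xi)
    p_t = comb(n * m, t) * theta**t * (1 - theta)**(n * m - t)
    lost = log2(comb(n * m, t)) - sum(log2(comb(m, xi)) for xi in x)
    return -log2(p_x), -log2(p_t), lost

total, reduced, lost = information_decomposition((1, 0, 2), 2, 0.3)
assert abs(total - (reduced + lost)) < 1e-9   # total = reduced + lost
```

Although both information components depend on θ, their difference, the lost information, does not; repeating the check with another θ changes `total` and `reduced` but not `lost`.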
More generally, we developed a metric associated with the value T(x) used to summarize, represent, or characterize a given data set. Our approach and results are significant because such statistics are often communicated without the original data. One could argue that $I_{lost}(x|T)$ should be communicated along with T(x), in a manner similar to providing the margin of error associated with the results of a poll. A small $I_{lost}(x|T)$ would signify that T(x) is more informative than if $I_{lost}(x|T)$ were large.
In addition, we defined the entropic loss associated with a sufficient statistic T under consideration as the expected lost information over all possible samples, giving a value that depends only on T. We noted but did not explore the possibility that entropic loss could be used as a metric to compare different sufficient statistics. Moreover, if sufficient statistics were not required, entropic loss could provide metrics on either θ or T when the other is fixed. Finally, numerical examples of our results were presented and some computational issues noted.

Conflicts of Interest:
The authors declare no conflict of interest.