Article Pushing for the Extreme: Estimation of Poisson Distribution from Low Count Unreplicated Data—How Close Can We Get?

Studies of learning algorithms typically concentrate on situations where potentially ever growing training sample is available. Yet, there can be situations (e.g., detection of differentially expressed genes on unreplicated data or estimation of time delay in non-stationary gravitationally lensed photon streams) where only extremely small samples can be used in order to perform an inference. On unreplicated data, the inference has to be performed on the smallest sample possible—sample of size 1. We study whether anything useful can be learnt in such extreme situations by concentrating on a Bayesian approach that can account for possible prior information on expected counts. We perform a detailed information theoretic study of such Bayesian estimation and quantify the effect of Bayesian averaging on its first two moments. Finally, to analyze potential benefits of the Bayesian approach, we also consider Maximum Likelihood (ML) estimation as a baseline approach. We show both theoretically and empirically that the Bayesian model averaging can be potentially beneficial.


Introduction
Studies in (computational) learning theory mostly tend to concentrate on situations where potentially ever increasing number of training examples is available. While such results can lead to deep insights into the workings of learning algorithms, e.g., linking together characteristics of the data generating distributions, learning machines and sample sizes, there can be situations where, by very nature of the problem, only extremely small samples are available. In such situations it is of utmost importance to theoretically analyze exactly what and under what circumstances can be learnt. One example of such a scenario in count data is detection of differentially expressed genes, where even subtle changes in gene expression levels can be indicators of biologically crucial processes [1]. When replicas are costly to obtain one can attempt to use the limited data at one's disposal to make the relevant inferences, as for example in the Audic and Claverie approach [2][3][4][5][6]. Another situation where available count data can be extremely sparse is estimation of time delay in non-stationary gravitationally lensed photon streams. When the scale of variability of the source is of order, say, of more than tens of days and observation gaps are not too long, one can resolve the time delay between lensed images of the same source by working directly with daily measurements of fluxes in the radio, optical or X-ray range [7][8][9][10]. However, when the variability scale is of the order of hours, one must turn to photon streams in the lensed images. One possibility of time delay detection in such cases is through comparing counts in relatively short and time-shifted moving time windows in the lensed photon streams.
In this paper we theoretically study what happens in the extreme situation of unreplicated data when the inference has to be performed on the smallest sample possible-sample of size 1. We consider a model-based Bayesian approach that averages over possible Poisson models with weighting determined by the posterior over the models, given the single observation. In fact, such a Bayesian approach has been considered in the bioinformatics literature under the assumption of flat improper prior over the Poisson rate parameter [2][3][4][5][6]. One can, of course, be excused for being highly sceptical about the relevance of such inferences, yet the methodology has apparently been used in a number of successful studies. In an attempt to build theoretical foundations behind such inference schemes, we proved a rather surprising result [11]: The expected Kullback-Leibler divergence from the true unknown Poisson distribution to its model learnt from a single realization never exceeds 1/2 bit.
Even though the field of bioinformatics is moving fast and better procedures for detection of differentially expressed genes have been introduced (e.g., not relying on the Poisson assumption, specifically taking into account potential dependencies among the genes, etc.), the primary focus of this study is different. Irrespective of the application domain, we theoretically investigate how reliably can a model for count data be build from a single count observation, under the assumption of a Poisson source. There are two issues that need careful consideration: 1. Equal a-priori weighting (flat prior) over possible (unknown) Poisson sources is unrealistic.
Typical values of observed counts are usually bounded by the nature of the problem (e.g., gene magnification setting used in the experiments or time window on the photon streams). One may have a good initial (a-priori) guess as to what ranges of typical observed counts might be reasonably expected. In particular, we are interested in the low count regimes. In such cases, it is desirable to incorporate such prior knowledge into the inference mechanism. In this study, we do this in the Bayesian framework through prior distribution over the expected counts. 2. To understand potential benefits of the proposed learning/inference method (in our case Bayesian approach), it is important to compare it with a simple straightforward baseline (here maximum likelihood estimation). We contrast the expected Kullback-Leibler divergences from the true unknown Poisson distribution to its Bayesian and maximum likelihood estimates, inferred from a single realization.
The paper has the following organization. In Section 2 we introduce the maximum likelihood and Bayesian (with flat prior over mean rates) approaches to inferring predictive distribution over counts based on a single count observation. We also briefly review past work on information theoretic properties of the two approaches. Section 3 contains derivation of a more general Bayesian approach with gamma prior on the mean count parameter. In Section 4 we calculate the first two central moments of our generalized model. This enables us to better understand the influence of the prior on the inferred model and highlight the differences with the previous approach using the flat (improper) prior. In Section 5 we perform an information theoretic study of learning capabilities of the generalized model. Empirical investigations are presented in Section 6 and the main findings are discussed and summarized in Section 7.

Single Count Data-Bayesian and Maximum Likelihood Approaches
In this section we will briefly review the original Audic-Claverie [2] and maximum likelihood approaches outside the bioinformatics context.

Bayesian Averaging in the Audic-Claverie Approach
Let x be an observed count in an experiment. When repeating the experiment, possibly under different conditions, we observe a (possibly different) count y. The quantity of interest is the probability of observing y given that we already observed x, not knowing the identity of the generating Poisson source where λ ≥ 0 is the (unknown) parameter representing the mean count value. Under the null hypothesis (not differentially expressed genes), both counts x and y come from the same underlying Poisson distribution P (·|λ). The key instrument in the Audic-Claverie approach is a distribution P AC (y|x) over counts y informed by the observed count x, under the null hypothesis. P AC (y|x) is obtained by Bayesian averaging (infinite mixture) of all possible Poisson distributions P (y|λ ′ ) with mixing proportions equal to the posteriors p(λ ′ |x) under the flat prior over λ. Formally, the probability of count y, given the observed count x from the same (unknown) Poisson distribution is: Imposing the flat (improper) prior p(λ) over the Poisson parameter λ results in where Γ(a) = ∫ ∞ 0 u a−1 e −u du is the Gamma function, we have which, since x and y are integers (i.e., Γ(x) = (x − 1)!), can be rewritten as P AC (·|x) can then be used, e.g., for principled inferences, construction of confidence intervals or statistical testing.

Information Theory of P AC (y|x)
Consider a "true" underlying Poisson distribution P (y|λ) (1) over possible counts y ≥ 0. We first use P (·|λ) to generate a count x and then employ P AC (y|x) (5) as a model distribution over y, given the already observed count x. We ask: If we repeated the process above, how different, in terms of Kullback-Leibler divergence, are on average the two distributions over y? One would naturally hope that P AC (y|x) is sufficiently representative of the true unknown distribution P (y|λ).
In [11] we proved that, given an underlying Poisson distribution P (x|λ), if we repeatedly generated a "representative" count x from P (x|λ), the average divergence E(λ) of P AC (y|x) from the truth P (y|λ) would never exceed 1/2 bit.

P AC (y|x) vs. Maximum Likelihood
In this section we will briefly recall information theoretic analysis of the maximum likelihood estimate P M L (y|x) in place of P AC (y|x) [12]. First note that Poisson distribution P (y|λ) is only defined for positive λ. In the case of observing zero count x = 0, we cannot directly use the "maximum likelihood estimate" P (y|0). One option for dealing with zero observed counts is to allow for some form of model regularization, e.g., infer a Poisson model P (y|ϵ), for some small ϵ > 0. In other words, if a count x ≥ 1 is observed, follow the standard maximum likelihood procedure and infer P M L (y|x) = P (y|x) as the Poisson model; if a zero count is observed, x = 0, infer P M L (y|0) = P (y|ϵ) for some fixed ϵ ∈ (0, 1]. This is the route taken in [12] and adopted in this paper. Only a minimum amount of necessary regularization due to zero observed counts is employed in the otherwise straightforward ML approach. Theorem 2 [12] Consider an underlying Poisson distribution P (·|λ) parameterized by some λ > 0 and a regularization constant ϵ ∈ (0, 1]. The expected divergence in bits Υ(λ, ϵ) between the true Poisson source and its (regularized) maximum likelihood estimate based on a single observation, is equal to Note that the expected divergence Υ(λ, ϵ) can get prohibitively large when regularizing with small ϵ > 0. As an illustration, in Figure 1 we show expected divergence Υ(λ, ϵ = 1) of the ML estimation (zero count regularized with ϵ = 1) for a range of mean parameter values λ of the underlying Poisson source (solid line). Also shown is the expected divergence E(λ) of P AC (y|x) (dashed line). Except for very small Poisson source rates λ, P AC (y|x) is clearly benefitting from the stabilizing effect of Bayesian averaging, given the extremely small sample size.

Generalized P AC (y|x) with Gamma Prior
In this section we will generalize P AC (y|x) through the use of (conjugate) gamma prior on the Poisson mean parameter λ. The positive parameters α, β determine the overall shape of the prior. Given a single observation x, the posterior is the gamma distribution with parameters α + x and β + 1, The mean of P (λ|x, α, β) is equal to (α + x)/(β + 1). A loose intuitive interpretation of the prior parameters α, β (assuming they are integers) is that prior to seeing the current data (in our case only one observation (count) x), we have seen β "observations", . As in the case of P AC (y|x), having observed a count x, we build a predictive distribution over future counts y by integrating out the mean parameter λ with respect to the posterior P (λ|x, α, β), From normalization of the gamma distribution we get It can be easily verified that the original P AC (y|x) is obtained as a special case of P G (y|x, α, β) when α = 1 and β → 0. If Jeffrey's prior were used instead of the flat prior in P AC (y|x), we would obtain P G (y|x, α, β) with α = 1/2 and β → 0 etc.
If α is an integer, we have where x ′ = x + α − 1 is the observed count including prior observations. This expression generalizes P AC (y|x) (5), While P G (y|x, α, β) (9) can be used with any appropriate setting of α, β (e.g., given a prior knowledge of the range of counts one may reasonably expect), in this contribution we concentrate on using the gamma prior to mitigate for the unrealistic equal weighting of all λ > 0 in the flat prior behind P AC (y|x). Indeed, the observed counts are typically bounded by the nature of the problem and one can represent this through setting α = 1 and varying β > 0 in the gamma prior P (λ|α, β) underlying P G (y|x, α, β). Some examples of such priors are shown in Figure 2. Decreasing β leads to weaker emphasis on low λ, eventually recovering the flat (improper) prior for β = 0. In Section 2.3 maximum likelihood estimation was regularized at zero count by imposing a non-zero "count" ϵ instead of the observed zero one. The generalized form of P AC (y|x), P R (y|x, β) = P G (y|x, α = 1, β) can be also viewed as an alternative "soft" form of regularization of the maximum likelihood approach at zero counts.
Parameter β in the Gamma prior can be set in a data driven manner, e.g., using the following strategy: Given the observed count x, we require that the area up to x + 1 covered by the prior is equal to θ, for some threshold θ ∈ (0, 1) (e.g., θ = 1/4). In other words, F (x + 1|β) = θ, where F (λ|β) = 1 − e −βλ is the cumulative distribution function of P (λ|α = 1, β). This leads to For zero observed count x = 0, β(0) = − ln(1 − θ) and the prior gets more concentrated on smaller values of λ as likely candidates for the mean count of the underlying Poisson source. With increasing count values x > 0 the parameter β(x) decreases to 0 and the prior gradually approaches the flat prior of P AC (y|x).
Finally, we contrast P G (y|x, α, β) with the negative binomial distribution with parameters r > 0 and q ∈ [0, 1]. One interpretation of the negative binomial distribution P N B (y|r, q) is that it corresponds to a Gamma-Poisson mixture that one obtains by imposing a Gamma prior P (λ|r, (1 − q)/q) on the mean count parameter λ of the Poisson distribution P (y|λ) and integrating out λ. In our context it is natural to identify r and (1 − q)/q with hyperparameters α and β used in P G (y|x, α, β). It follows that q = (β + 1) −1 . Hence, we rewrite (12) as Direct comparison of (13) with (9) leads to an intuitive insight: The β prior measurements of total count α introduced by the gamma prior P (λ|α, β) are in the case of P G (y|x, α, β) extended with a single observation x, resulting in β + 1 observations of total count α + x. This can be represented by It follows that P G (y|x, α, β) P N B (y|α + x, (β + 2) −1 ) = (β + 1) x+α−y .
Bayesian averaging in P G (y|x, α, β) with respect to the posterior over λ, given a count x, differs from the corresponding negative binomial distribution P N B (y|α + x, (β + 2) −1 ) by the factor (β + 1) x+α−y that depends on the difference between the prior+observed count α + x and y.

First and Second Moments of the Generalized P AC (y|x)
In [11] we showed that P AC (y|x) and the underlying Poisson distribution are quite similar in their nature: for any (integer) mean rate λ ≥ 1, the Poisson distribution P (·|λ) has two neighboring modes located at λ and λ − 1, with P (λ|λ) = P (λ − 1|λ). Analogously, given a count x ≥ 1, P AC (·|x) has two neighboring modes, one located at x, the other at x − 1, with P AC (x|x) = P AC (x − 1|x). As in Poisson distribution, the values of P AC (y|x) decrease as one moves away from the modes in both directions. In this section we derive the first two moments of the generalized P AC (y|x), P G (y|x, α, β). As a special case, we will show that as a result of Bayesian averaging, the variance of P AC (y|x) is double that of the underlying (unobserved) Poisson distribution.
Theorem 3 Consider a non-negative integer x and the associated generalized model P G (y|x, α, β). Then, Proof: Let us evaluate In the third equality we have used substitution y ′ = y − 1 and the last equality follows from Γ(z + 1) = z · Γ(z). By (15), Solving (17) we obtain For the variance of P G (y|x, α, β) we have Now, Using (16), (18) and (20), we obtain which can be solved as Given an observation x, the maximum likelihood estimate of the underlying Poisson distribution is the Poisson distribution with mean x, After observing x, the mean of the maximum likelihood and P AC (·|x) estimates is x and x + 1, respectively. Hence, Bayesian averaging in P AC (·|x) induced by the flat improper prior over the mean rate λ results in increased expected value x + 1 of the next count from the same underlying source, given that the current count x. However, a much more marked consequence of using the flat prior can be seen in the variance of P AC (·|x): while variance of the maximum likelihood is x, it is 2(x + 1) in P AC (·|x). Theorem 3 illustrates the role of more concentrated prior over λ on the generalized model. The mean expected count, after seeing x, is equal to the mean of the posterior P (λ|x, α, β) over λ, namely (α + x)/(β + 1). As explained earlier, observed single count x with prior β counts of cumulative value α results in β + 1 counts of cumulative value α + x. Hence the mean count per observation is (α + x)/(β + 1). As with Poisson distribution, the variance of the generalized model is closely related to its mean and approaches the mean with increasing number of prior counts β.
As for the soft regularization P R (y|x, β) = P G (y|x, α = 1, β), its mean is, as expected, biased towards values smaller than the observed count x, provided β > 1/x. Increased values of β result in smaller variance of P R (y|x, β). But how do such prior parameter modifications manifest themselves in terms of accuracy of estimation of the underlying source? This question is investigated in the next section.

Expected Divergence of the Generalized P AC (y|x) from the True Underlying Poisson Distribution
Consider an underlying Poisson source P (x|λ) generating counts x. In this section we would like to quantify the average divergence of the corresponding generalized P AC (y|x), P R (y|x, β) = P G (y|x, α = 1, β) ("softly" regularized ML), from the truth P (y|λ), if we repeatedly generated a "representative" count x from P (x|λ). The same question was considered in the context of maximum likelihood estimation in Section 2.3. In particular, we are interested in specifying under what circumstances is the generalized form of P AC (y|x), P R (y|x, β) = P G (y|x, α = 1, β), preferable to the original P AC (y|x) = P G (y|x, α = 1, β → 0) and how it fares with the maximum likelihood estimation P M L (y|x) of Section 2.3.
Note that for β → 0 we recover our original result [11] that the expected divergence E(λ) of the original P AC (y|x) from the "truth" P (y|λ) is (up to terms of order λ −1 ) never greater than 1/2 bit. The soft regularization in P R (y|x, β) (using prior P (λ|α = 1, β) with β > 0) can result in larger expected divergence from the underlying source than is the case for P AC (y|x) (using improper flat prior over λ). Moreover, (unlike in P AC (y|x)) such a regularization causes linear divergence of E G (λ; β) for large λ. The next theorem specifies for which underlying Poisson sources the soft regularization approach of P R (y|x, β) is preferable to the original P AC (y|x).

Theorem 5 For Poisson sources with mean rates
it holds E(λ) > E G (λ; β) and hence P R (y|x, β) is on average guaranteed to approximate (in the Kullback-Leibler divergence sense) the underlying source better than the original P AC (y|x).
Proof: It was shown in [11] that for the original P AC (y|x), From (27) and (29) we have that the difference between the expected divergences of the original and generalized forms of P AC (y|x) is The result follows from solving for E(λ) > E G (λ; β). The graph (in log-log scale) of κ(β) is shown in Figure 3. An alternative way of data-driven setting of parameter β is suggested by the fact that κ(β) is lower bounded by β −1 . If the experimental setting is such that most counts are expected not to exceed some x max , β can be set to β = 1/x max , so that P R (y|x, β) is preferable to P AC (y|x).
In Figure 4 we present the expected divergences E G (λ; β) (solid line) and E(λ) (dashed line) for β = 0.2 (left) and β = 0.01 (right). As expected, for underlying sources with small mean counts λ the advantage of using the regularized form P R (y|x, β) (as opposed to the original P AC (y|x)) is more pronounced. However, for larger λ there is a heavy price to be paid in terms of inaccurate modelling by P R (y|x, β).

Figure 3. Graph of κ(β). For Poisson sources with mean rates λ < κ(β), E(λ) > E G (λ; β)
and hence P R (y|x, β) is on average guaranteed to approximate the underlying source better than the original P AC (y|x).

Empirical Investigations
To investigate potential value of the more sophisticated Bayesian approach in the original and the generalized Audic-Claverie frameworks (Sections 2.1 and 3, respectively) against the baseline of simple (regularized) maximum likelihood estimation (Section 2.3), we conducted a series of simple illustrative experiments. In the generalized Audic-Claverie framework developed in this study, we used the two schemes for setting the regularization parameter β suggested in Sections 3 and 5. In the regularized maximum likelihood approach P M L (y|x) we set ϵ = 1. From Figure 1, it appears that the biggest difference between the expected divergences from the true underlying Poisson source P (x|λ) to the original P AC (·|x) and the maximum likelihood estimate occurs for small mean rates λ roughly around λ = 5. We therefore run the experiments with λ = 5.
For illustration purposes, we follow the data generation mechanism used in [13] to compare methods for distinguishing between differential expression of genes associated with two treatment regimes. We stress that in no way we suggest that our experiments have strong relevance for bioinformatics, nor do we claim that the framework of [13] is the best test bed for assessing differential gene expression detection algorithms. We use the framework of [13] merely to illustrate whether the sophistication of the Bayesian approach (as opposed to simple (regularized) maximum likelihood) can bring benefits in a practical situation with low-count data.
Gene counts are simulated across the two treatment groups T 1 and T 2 . The tests are assessed by comparing false positive and true positive rates. In each experiment 10,000 gene pair counts (x 1,j , x 2,j ), j = 1, 2, ..., 10, 000, were produced, counts x 1,j and x 2,j associated with regimes T 1 and T 2 , respectively. As specified above, the sampling rate for T 1 was fixed at λ 1 = 5 throughout the experiment. We varied the mean log 2 fold change (LFC) between T 1 and T 2 from −2 to 2. Each gene pair count (x 1,j , x 2,j ), j = 1, 2, ..., 10, 000, was obtained through a generative process specified in [13] and described in detail in Appendix A.
Having generated the gene pair counts, we used methods considered in this study to make a decision for each j = 1, 2, ..., 10, 000, whether the counts x 1,j , x 2,j originated from the same underlying source, i.e., whether when generating y 1,j and y 2,j , the mean rates in the two regimes T 1 and T 2 were identical (LF C j = 0). Given the "test distribution" Q(y|x) and a confidence level ϑ ∈ [0, 1], we guess that x 1,j , x 2,j originated from the same source if the (1 − ϑ)-quantile around the mean of Q(y|x 1,j ) contains x 2,j and vice-versa, i.e., if the (1 − ϑ)-quantile around the mean of Q(y|x 2,j ) contains x 1,j . In place of Q(y|x) we used P AC (y|x), its regularized form P R (y|x, β) and the regularized maximum likelihood estimate P M L (y|x) with ϵ = 1.
For a given confidence level ϑ ∈ [0, 1] and test statistic Q(y|x) we calculate the false positive rate (type I error rate) as the proportion of times a gene count pair (x 1,j , x 2,j ) was declared to have originated from two different underlying sources (differentially expressed gene) when in fact LF C j was zero. The true positive rate (statistical power) was determined as the proportion of times a gene was correctly declared differentially expressed -(x 1,j , x 2,j ) declared to have originated come from two different underlying sources and LF C j ̸ = 0.
Plot of false positive rate vs. true positive rate obtained for different values of ϑ constitutes a receiver operating characteristic (ROC) curve. If the ROC curve for one test distribution is always above another, this suggests its superiority in classifying genes as differentially expressed. Trivial classification of genes as differentially expressed using a completely random guess would yield the identity (diagonal) ROC curve. ROC curves for the maximum likelihood method (ϵ = 1, red dashed line) and the soft regularization model P R (y|x, β), β = 1/50, 1/100 (solid lines) are plotted in Figure 5. Not surprisingly, the Bayesian approach (solid lines) outperforms the penalized maximum likelihood one (red dashed line). However, the original P AC (y|x) (β = 0, black line) and the soft regularization model (color solid lines) achieve almost identical performances. In this challenging setting (single observations at low mean rate with additional noise), the scheme for setting the regularization parameter β suggested in Section 5 has little effect on the resulting classification performance. We also ran experiments to test the "dynamic" scheme for setting β introduced in Section 3, but no significant performance improvements were achieved. Finally, we devised yet another scheme for determining the hyper-parameters α and β of the prior P (λ|α, β) from the data. In the spirit of type II maximum likelihood, we find the most likely values of α, β, given the observed counts Using this method, we first optimize the prior hyperparameters on the observed data. The "optimized" prior P (x i |α * , β * ) now reflects the possible ranges of mean counts λ one can expect given the data. We then repeated the experiments using the generalized model P G (y|x, α * , β * ) derived from the optimized prior. In this way we can assess to what degree the relatively minor performance differences between the generalized and maximum likelihood models in Figure 5 are due to constraining α to α = 1 (in P R (y|x, β)), or due to inherent difficulty of learning from single counts. The resulting ROC analysis is shown in Figure 6. The data driven setting of hyperparameters α, β leads to slight improvement over P AC (y|x) and P R (y|x, β).

Discussion and Conclusion
Studies of learning algorithms traditionally concentrate on situations where potentially ever increasing number of training examples is available. However, there are situations where only extremely small samples can be used in order to perform an inference. In this contribution we concentrated on extreme case of low count data governed by Poisson distribution, where only a single observation is available. We performed a rigorous theoretical investigation of the appropriateness of various model estimators, based on the single observation. We considered a Bayesian approach along the lines of [2], where the model built on the basis of a single observed count is no longer Poisson, even though we know that the generating source is Poisson (but do not know the mean rate).
We showed that the Bayesian approach is more optimal than the regularized maximum likelihood, in the sense that the expected Kullback-Leibler divergence from the source to the model is smaller for the Bayesian approach. Furthermore, we generalized the original model of [2] to account for possible prior information on expected expression counts. Detailed information theoretic study of learning capabilities of such a generalized model was conducted for the case of low count data. We also quantified the effect of Bayesian averaging on its first two moments.