Bayes in Wonderland! Predictive supervised classification inference hits unpredictability

The marginal Bayesian predictive classifiers (mBpc), as opposed to the simultaneous Bayesian predictive classifiers (sBpc), handle each data point separately and hence tacitly assume independence of the observations. However, as learning of the generative model parameters saturates, the adverse effect of this false assumption on the accuracy of mBpc wears off in the face of increasing amounts of training data, guaranteeing the convergence of these two classifiers under de Finetti type exchangeability. This result, however, is far from trivial for sequences generated under partition exchangeability (PE), where even an arbitrarily large amount of training data does not rule out the possibility of an unobserved outcome (Wonderland!). We provide a computational scheme that allows the generation of sequences under PE. Based on it, with a controlled increase of the training data, we show the convergence of the sBpc and mBpc. This underwrites the use of the simpler yet computationally more efficient marginal classifiers instead of the simultaneous ones. We also provide parameter estimation for the generative model giving rise to partition exchangeable sequences, as well as a testing paradigm for the equality of this parameter across different samples. The package for Bayesian predictive supervised classification, parameter estimation and hypothesis testing under the Ewens sampling formula generative model is deposited on CRAN as the PEkit package and is freely available from https://github.com/AmiryousefiLab/PEkit.


Introduction
Under the broad realm of inductive inference, the goal of supervised classification is to assign test objects to an a priori defined number of classes learned from the training data [14]. One of the most widely applicable machineries that can optimally handle these scenarios is Bayesian inference, which, given prior information and accruing observed data, gradually enhances the precision of the inferred population parameters [6]. We consider here the general supervised classification case where the sets of species observed for the features are not closed a priori, leaving the probability of observing a new species at any stage non-negative. The de Finetti type of exchangeability [1] seems intractable in these cases. Nevertheless, one solution is to adhere to a form of partition exchangeability due to Kingman [10]. Assuming this type of exchangeability for each competing class, the derivation here shows that, given an infinite amount of data, the simultaneous and marginal predictive classifiers converge asymptotically. This is congruent with the similar study under de Finetti exchangeability by Corander [2]. Due to the marginal dependency between the data points, the simultaneous and marginal classifiers are not necessarily equal. On the other hand, their convergence is not intuitive, owing to the complication posed by an a priori unfixed set of observable species. Upon availability of abundant data, however, the proof presented here justifies replacing the computationally expensive simultaneous classifiers with the marginal ones.
The following section introduces partition exchangeability and a probability distribution related to it. Hypothesis tests for its parameter are also introduced. Finally, the predictive classifiers under partition exchangeability are derived, and the algorithms used to implement them are presented along with classification results on data sets simulated from partition exchangeable distributions.

Partition exchangeability
Assume that the number of species related to our feature is unfixed a priori. Upon availability of the vector of test labels S, under the partition exchangeability framework we can deduce the sufficient statistic for each subset of the data. To define this statistic, consider the assignment of an arbitrary permutation of the integers 1, ..., n_c to the items in s_c, where n_c = |s_c| is the size of a given class c. Introducing the indicator function I(·) and n_cl = Σ_{i∈s_c} I(x_i = l) as the frequency of items in class c having value l ∈ χ, the sufficient statistic can be written in terms of the counts in x^(c) as

ρ_ct = Σ_{l∈χ} I(n_cl = t), t = 1, ..., n_c. (1)

The vector of sufficient statistics ρ_c = (ρ_ct)_{t=1}^{n_c} indicates a partition of the integer n_c such that ρ_ct is the number of distinct feature values that have been observed exactly t times in class c of the test data. Given the above formulation [9], the random partition is exchangeable if and only if any two sequences having the same vector of sufficient statistics have the same probability of occurring. According to Kingman's representation theorem [10], the probability distribution of the vector of sufficient statistics under partition exchangeability follows the Poisson-Dirichlet(ψ) (PD) distribution, known also as the Ewens sampling formula [4]:

P(ρ) = n! / (ψ(ψ+1)···(ψ+n−1)) Π_{t=1}^{n} (ψ/t)^{ρ_t} (1/ρ_t!), (2)

where ρ = (ρ_1, ..., ρ_n) is the partition of a sample of size n. For a comprehensive review of the Ewens sampling formula, its history and further applications, see [3].
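As a small concrete illustration, the sufficient statistic and the Ewens sampling formula probability in (2) can be computed directly. This is an illustrative sketch in Python with function names of our own choosing, not the PEkit implementation:

```python
from collections import Counter
from math import prod, factorial

def partition_vector(sample):
    """Sufficient statistic rho: rho[t-1] = number of distinct values seen exactly t times."""
    n = len(sample)
    counts = Counter(sample)                 # frequency n_l of each observed value l
    freq_of_freqs = Counter(counts.values())
    return [freq_of_freqs.get(t, 0) for t in range(1, n + 1)]

def esf_probability(rho, psi):
    """Ewens sampling formula: probability of the partition rho under PD(psi)."""
    n = len(rho)
    rising = prod(psi + i for i in range(n))  # psi(psi+1)...(psi+n-1)
    num = factorial(n) * prod((psi / t) ** r / factorial(r) for t, r in enumerate(rho, 1))
    return num / rising
```

For n = 2, the two possible partitions have probabilities ψ/(ψ+1) (two distinct values) and 1/(ψ+1) (one value twice), which sum to one as expected.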

Parameter estimation
The parameter ψ in (2) is called a dispersal parameter: the higher its value, the more distinct species will likely be observed in a sample from this distribution. A maximum likelihood estimate (MLE) for ψ can be derived from a sample as demonstrated in [4]. The estimate turns out to be the root of the equation

Σ_{i=1}^{n} ψ/(ψ + i − 1) = Σ_{t=1}^{n} ρ_t. (3)

The proof is provided in the Appendix. The sum of sufficient statistics on the right side of the equation equals the observed number of distinct species in the sample. The left side equals, as shown by Ewens in [4], the expected number of distinct species observed given parameter ψ. Thus, the MLE of ψ is the value for which the observed number of distinct species equals the expected number of species observed in a sample of size n. There is no closed-form solution for arbitrary n, so the MLE of ψ has to be searched for numerically. As the left side of the equation is a strictly increasing function of ψ for ψ > 0, a binary search algorithm can find the root. In supervised classification, an estimate of ψ can be calculated for each class in the training data.
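The binary search just described can be sketched in a few lines. The helper names below are ours; PEkit ships its own implementation:

```python
def expected_distinct(psi, n):
    """Expected number of distinct species under PD(psi): sum_{i=1}^{n} psi/(psi+i-1)."""
    return sum(psi / (psi + i) for i in range(n))

def psi_mle(k, n, lo=1e-9, hi=1e9):
    """MLE of psi as the root of expected_distinct(psi, n) = k, where k is the
    observed number of distinct species (interior root requires 1 < k < n)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if expected_distinct(mid, n) < k:
            lo = mid   # expected count too low -> need larger psi
        else:
            hi = mid
    return (lo + hi) / 2
```

Because the expected count is strictly increasing in ψ, bisection converges regardless of the starting bracket, as noted in the text.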

Hypothesis testing
A Lagrange multiplier test as defined in [13] can be used for statistical testing of a hypothesized parameter ψ_0 under the null hypothesis H_0 : ψ = ψ_0 for a single sample. The test statistic is constructed as

S = U(ψ_0)^2 / I(ψ_0), (4)

where U is the gradient of the log-likelihood l(ψ) and I is the Fisher information of the distribution. Under the null hypothesis the test statistic S follows the χ²_1 distribution. In the case of the PD distribution, these quantities become

U(ψ) = k/ψ − Σ_{i=1}^{n} 1/(ψ + i − 1), (5)

I(ψ) = Σ_{i=1}^{n} [ 1/(ψ(ψ + i − 1)) − 1/(ψ + i − 1)^2 ], (6)

where k = Σ_{t=1}^{n} ρ_t is the observed number of distinct species. The proof is provided in the Appendix.
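The statistic in (4) with the score and information in (5) and (6) amounts to the following sketch (our naming; the resulting value is compared against a χ²_1 quantile):

```python
def lm_test_stat(rho, psi0):
    """Lagrange multiplier statistic S = U(psi0)^2 / I(psi0) for H0: psi = psi0,
    with U the score and I the Fisher information of the ESF log-likelihood."""
    n = len(rho)
    k = sum(rho)  # observed number of distinct species
    U = k / psi0 - sum(1.0 / (psi0 + i) for i in range(n))
    I = sum(1.0 / (psi0 * (psi0 + i)) - 1.0 / (psi0 + i) ** 2 for i in range(n))
    return U * U / I
```

Each Fisher information term equals (i − 1)/(ψ(ψ + i − 1)²), so I(ψ) is positive for any sample of size n ≥ 2, and S grows as ψ_0 moves away from the MLE.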
For a multiple-sample test to infer whether there is a statistically significant difference in the ψ of each sample, we have devised a likelihood ratio test (LRT) [11]. The null hypothesis of the test is that there is no difference in the ψ of each of the s samples, H_0 : ψ_1 = ψ_2 = ... = ψ_s, and consequently the alternative hypothesis is the inequality of at least two ψ. The test statistic Λ is constructed as

Λ = −2 log [ sup L(θ_0) / sup L(θ̂) ], (7)

where L(θ_0) is the likelihood function of the data under the null hypothesis and L(θ̂) is the unrestricted likelihood of the model. The sup refers to supremum, so each likelihood is evaluated at the MLE of its parameters. Λ asymptotically converges in distribution to χ²_d, where d equals the difference in the number of parameters between the models. When testing the ψ of s different samples from the PD distribution with possibly different sample sizes n_j, the model under the null hypothesis has one shared ψ, while the unrestricted model has s different dispersal parameters (ψ_1, ψ_2, ..., ψ_s), so d = s − 1. The likelihood L of multiple independent samples from the PD distribution is the product of the density functions of the partitions ρ_j of those samples, and under H_1 the MLE of ψ for each sample is evaluated as in (3) from that sample alone, as the other samples have no effect on the ψ of a single sample. Under the null hypothesis, however, the samples share a common ψ, the MLE of which, according to Section 2.1, is estimated by solving

Σ_{j=1}^{s} Σ_{i=1}^{n_j} ψ/(ψ + i − 1) = Σ_{j=1}^{s} Σ_{t=1}^{n_j} ρ_jt.

As the likelihood of multiple independent samples with identical ψ under the null hypothesis is just the product of their likelihoods, the derivative of the log-likelihood is the sum of the derivatives of each sample's log-likelihood. Again, ψ has to be obtained as the root of the above equation, which can be found with a small modification of the same binary search algorithm used to determine the MLE for a single PD distribution. Having found the MLEs of ψ under both hypotheses, the likelihood ratio in the test statistic (7) can be expressed as

Π_{j=1}^{s} L(ρ_j | ψ̂_0) / Π_{j=1}^{s} L(ρ_j | ψ̂_j),

where L(ρ_j | ψ̂_j) is the likelihood of the PD distribution for the partition ρ_j of the j-th sample. As the restricted model in the numerator can never have a larger likelihood than the unrestricted model in the denominator, and both are positive real numbers, this ratio is bounded between 0 and 1.
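The whole procedure can be sketched end to end; the bisection mirrors the single-sample MLE search, and ψ-free terms of the log-likelihood cancel in the ratio. Function names are again ours:

```python
from math import log

def loglik(rho, psi):
    """ESF log-likelihood in psi, dropping psi-free terms."""
    return sum(rho) * log(psi) - sum(log(psi + i) for i in range(len(rho)))

def mle_shared(rhos):
    """Bisection for the common-psi MLE: total expected distinct = total observed distinct."""
    total_k = sum(sum(r) for r in rhos)
    lo, hi = 1e-9, 1e9
    for _ in range(200):
        mid = (lo + hi) / 2
        expected = sum(sum(mid / (mid + i) for i in range(len(r))) for r in rhos)
        lo, hi = (mid, hi) if expected < total_k else (lo, mid)
    return (lo + hi) / 2

def lrt_stat(rhos):
    """Lambda = -2 log(L(psi_hat_0) / prod_j L(psi_hat_j)); approx chi2 with s-1 df."""
    psi0 = mle_shared(rhos)                              # restricted (shared) MLE
    l0 = sum(loglik(r, psi0) for r in rhos)
    l1 = sum(loglik(r, mle_shared([r])) for r in rhos)   # per-sample MLEs
    return -2 * (l0 - l1)
```

Two identical samples yield identical restricted and unrestricted MLEs, so Λ = 0, while heterogeneous samples give Λ > 0.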

A note on two-parameter PD
A two-parameter formulation of the distribution was presented in [12]. The added parameter in PD(α, ψ) is called the discount parameter. The role of the parameters is discussed at length in [3]. In short, α is defined to fall in the interval [−1, 1]. When it is positive, it increases the probability of observing new species in the future in proportion to the number of already discovered species, while decreasing the probability of seeing a newly observed species again. When α is negative, the opposite holds and the number of new species to be discovered is bounded. The single-parameter PD(ψ) is the special case of the two-parameter distribution with α set to 0, PD(0, ψ). The two-parameter distribution is not considered further here, as the estimation of its parameters is a daunting task compared to the simple one-parameter formalization.

Supervised classifiers
Denote the set of m available training items by M and, correspondingly, the set of n test items by N. For each item we observe only one feature that can take values in the species set χ = {1, 2, ..., r}. A training item i ∈ M is characterized by a feature z_i ∈ χ; similarly, a test item i ∈ N has a feature x_i ∈ χ. Note that each number in χ represents one species, such that the first species observed is represented by the integer 1, the second by the integer 2, and so on. On the other hand, r is not known a priori, reflecting the fact that we are uninformative about all of the species possible in our population. Collections of the training and test data features are denoted by the vectors z and x, respectively. The training data are allocated into k distinct classes, and T is the joint labeling of all the training items into these classes. Simultaneous supervised classification assigns labels to all the test data in N in a joint manner: analogously to T, we partition the N test elements into k classes such that S = (s_1, ..., s_k), s_c ⊆ N, c = 1, ..., k, is the joint labeling of this partition. The structures T and S induce partitions of the training and test feature vectors, with z^(c) and x^(c) representing the subsets of training and test items in class c = 1, ..., k, respectively. Let S denote the space of possible simultaneous classifications for a given N, so that S ∈ S.
The predictive probability of observing feature value j given a set of n prior observations from the PD distribution is

p(j) = n_j / (ψ + n), (10)

where n_j is the frequency of species j among the n observed values. However, if the value j is of a previously unobserved species, the predictive probability is

p(j) = ψ / (ψ + n). (11)

The proof of this arises both from the mechanics of an urn model that [7] showed to generate the Ewens sampling formula, as well as from [8]. The predictive probability of a previously unseen feature value is thus higher for a population with a larger parameter ψ than for a population of equal size with a smaller ψ. The product predictive distribution for all test data, which the marginal classifier assumes to be i.i.d. under the partition exchangeability framework, then becomes

p_M(x | z, S, T) = Π_{c=1}^{k} Π_{i∈s_c} p(x_i | z^(c)). (12)

Note that under a maximum a posteriori rule with training data of equal size for each class, a newly observed feature value that has not previously been seen in the training data of any class will always be classified into the class with the highest estimated dispersal parameter ψ. However, each class still has a positive probability of including previously unseen values, based on the variety of distinct values previously observed within the class.
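The two cases (10) and (11) collapse into one small function (a minimal sketch; `predictive_prob` is our illustrative name):

```python
def predictive_prob(value, observed, psi):
    """p(next = value | observed): n_j/(psi+n) for a seen species, psi/(psi+n) for a new one."""
    n = len(observed)
    n_j = observed.count(value)   # frequency of the species among the n observations
    return (n_j if n_j > 0 else psi) / (psi + n)
```

With observed = [1, 1, 2] and ψ = 1, the probabilities of seeing species 1, species 2, or any new species are 2/4, 1/4 and 1/4, which sum to one.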
Analogously to the product predictive probability for the case of general exchangeability in [2], the product predictive distribution of the simultaneous classifier under partition exchangeability, fitting the learning algorithm described in a later section, is defined as

p_S(x | z, S, T) = Π_{c=1}^{k} p(x^(c) | z^(c)), (13)

where the test items within each class are predicted jointly rather than independently. An asymptotic relationship between these classifiers is immediately apparent from these predictive probabilities. As the amount of training data in each class, m_c, increases, the impact of the class-wise test data n_c becomes negligible in comparison, and the difference in the predictive probabilities approaches zero asymptotically. As the classifiers search for classification structures S that optimize the test data predictive probability given the training data, and the predictive probabilities converge asymptotically, the classifiers are searching for the same optimal labeling. However, the classifiers handle values unseen in the training data differently; this is the situation where m_cl = 0 in the predictive probabilities p_M and p_S. The marginal classifier's predictive probability for the test data is always maximized by assigning such a value to the class with the highest ψ. The simultaneous classifier, however, considers the assignment of other instances of an unseen value as well. This can lead to optimal classification structures where different instances of an unseen value are classified into different classes. Thus, the convergence of the test data predictive probabilities of the marginal and simultaneous classifiers is not certain in the presence of unseen values. In practice, though, as the amount of training data m tends to infinity, the probability of observing new values from the PD(ψ) distribution given in equation (11) tends to 0: ψ/(ψ + m) → 0 as m → ∞. Unexpected values would be very rare and would have a minimal effect on the classifications made by the two classifiers.

Algorithms for the predictive classifiers
In this section the learning algorithms defined in [2] are described. The predictive probabilities used in these algorithms are defined above up to an unknown normalizing constant which, however, can be omitted as it does not affect the probability maximization step in the classification.
The marginal classifier is computationally attractive, as each test data point i is classified individually according to a maximum a posteriori rule:

p_M : label(i) = argmax_c p(x_i | z^(c)).

For the simultaneous classification algorithm, define the classification structure S^(i)_c to be identical to S except that item i is reclassified into class c. The greedy deterministic algorithm p_S is then defined as:

1. Set an initial S^0 with the marginal classifier algorithm p_M.
2. Until S remains unchanged between iterations, do for each test item i ∈ N:
S ← argmax_c p_S(x | z, S^(i)_c, T).

The simultaneous classifier thus works like the marginal classifier, but for each test item the potential labeling of the other test items is also taken into account.
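Both algorithms can be sketched directly from the predictive probabilities. This is a simplified illustration under our own naming, with ψ supplied per class rather than estimated, not the PEkit code:

```python
from math import prod

def class_predictive(x_c, train_c, psi):
    """Joint (sequential urn) predictive probability of the test items x_c
    assigned to one class, given that class's training items train_c."""
    seen = list(train_c)
    p = 1.0
    for v in x_c:
        n, n_j = len(seen), seen.count(v)
        p *= (n_j if n_j > 0 else psi) / (psi + n)
        seen.append(v)   # the simultaneous view: earlier test items count as observed
    return p

def marginal_classify(test, train, psis):
    """p_M: each item goes independently to the class maximizing its predictive probability."""
    return [max(train, key=lambda c: class_predictive([v], train[c], psis[c]))
            for v in test]

def simultaneous_classify(test, train, psis):
    """p_S: greedy reassignment, initialized with the marginal labeling."""
    labels = marginal_classify(test, train, psis)
    changed = True
    while changed:
        changed = False
        for i in range(len(test)):
            def joint(c, i=i):
                trial = labels[:i] + [c] + labels[i + 1:]
                return prod(class_predictive([v for v, l in zip(test, trial) if l == cl],
                                             train[cl], psis[cl]) for cl in train)
            best = max(train, key=joint)
            if best != labels[i]:
                labels[i], changed = best, True
    return labels
```

The inner loop evaluates the full product (13) for every tentative reassignment, which is what makes the simultaneous classifier costly relative to the marginal one.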

Numerical illustrations of classifier performance and convergence
To study the classification performance and convergence of the two classifiers, we simulated both training and test data sets from the PD distribution with a generative urn model described in [15], called the De Morgan process. We varied the amount of training and test data, the values of the distribution parameter ψ, and the number of classes k. A large training data set of 4 million data points was created for each ψ. A small sample of size 1000 was used to train the model first, and more data were subsequently added until finally half of the whole data set was used as training data. A test data set created with the same parameters was kept constant for classification with all training data samples. An excerpt of the results can be found in Table 1, as well as Figure 1. The table and the figures, as well as results not presented here, show that the simultaneous classifier performs better than the marginal classifier, especially when the number of classes is higher and the training data are small. As the amount of training data increases, the predictions of the classifiers converge, as the extra information in the test data used by the simultaneous classifier becomes negligible. This behaviour is illustrated in Figure 1. Additionally, the first row of the table shows that the classifiers fail on the binary classification task with data sets created with ψ of 1 and 2, respectively. This is because classes created with ψ of such similar magnitude are distributed very similarly, so the classification results are no better than random guessing. The data are in fact so homogeneous in this case that the labelings of the two classifiers already converge with test data of size 1000.
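A PD(ψ) sample can be generated with the classic urn scheme that [7] showed to produce the Ewens sampling formula; we use it here as a stand-in sketch without claiming it coincides step by step with the De Morgan process of [15]:

```python
import random

def sample_pd(n, psi, rng=None):
    """Draw a sequence of n species labels whose partition follows PD(psi):
    item i founds a new species with probability psi/(psi + i - 1),
    otherwise it copies a uniformly chosen earlier item."""
    rng = rng or random.Random(0)
    seq, next_label = [], 1
    for i in range(n):                       # i past items have been drawn so far
        if rng.random() < psi / (psi + i):   # the first item is always a new species
            seq.append(next_label)
            next_label += 1
        else:
            seq.append(rng.choice(seq))
    return seq
```

Consistent with the role of the dispersal parameter, a sample drawn with a large ψ contains many more distinct species than one of equal size drawn with a small ψ.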
In conclusion, the results support the hypothesis that the marginal and simultaneous classifiers converge in their labelings given enough training data. This justifies the use of the marginal classifier in place of the more accurate simultaneous one with large data. Still, the simultaneous classifier is more accurate with smaller training data, as it benefits from the information in the test data.

Discussion
Previously unseen or unanticipated feature values are challenging scenarios in standard Bayesian inductive inference. Under general exchangeability as formalized by de Finetti, upon such an observation the entire alphabet of anticipated feature values must be retrospectively changed. The classifiers introduced in this article are, however, equipped to update their predictions automatically. The superiority in classification accuracy of labeling the test data simultaneously instead of one by one was also shown under partition exchangeability. The assumption that the test data are i.i.d. is obviously an unrealistic one, and the two classifiers considered here only converge in prediction as the amount of training data approaches infinity. However, the computational cost of the simultaneous classifier increases exponentially with the amount of test data, making it unfeasible for large-scale prediction. Additionally, the algorithm presented here is only capable of arriving at local optima. Further research could be directed at the Gibbs sampler-assisted algorithm presented in [2], although the convergence of such algorithms with large test data sets is also uncertain. An implementation of the supervised classifiers considered here could also be devised for the two-parameter Poisson-Dirichlet(α, ψ) distribution.

Table 1: The item-wise 0-1 classification error for the marginal and simultaneous classifiers, as well as the 0-1 difference between the predicted labels of the two classifiers.

Fig. 1: The classification error of the marginal and simultaneous classifiers with data sets from the PD distribution with ψ ∈ {1, 10, 50}, as well as the convergence of their labelings. Rows 2 and 3 of the table are included in the figure.

Appendix

Derivation of the MLE (3). The log-likelihood of the Ewens sampling formula (2) is

l(ψ) = log(n!) − Σ_{i=1}^{n} log(ψ + i − 1) + Σ_{t=1}^{n} { ρ_t log ψ − ρ_t log t − log(ρ_t!) }.

Dropping the terms that are constant in ψ and writing k = Σ_{t=1}^{n} ρ_t for the observed number of distinct values,

l(ψ) = k log ψ − Σ_{i=1}^{n} log(ψ + i − 1) + const,

so the MLE is found as the root of

l′(ψ) = k/ψ − Σ_{i=1}^{n} 1/(ψ + i − 1) = 0,

that is, Σ_{i=1}^{n} ψ/(ψ + i − 1) = k. According to Ewens [4], the left side of this equation equals the expected number of distinct values observed with this ψ and sample size n, while the right side is the observed number of distinct values; this proves (3).

Fisher information for the Lagrange multiplier test. The second derivative of l(ψ) is

l″(ψ) = −k/ψ² + Σ_{i=1}^{n} 1/(ψ + i − 1)².

The Fisher information then becomes

I(ψ) = −E[ l″(k; ψ) | ψ ] = E[k]/ψ² − Σ_{i=1}^{n} 1/(ψ + i − 1)² = Σ_{i=1}^{n} [ 1/(ψ(ψ + i − 1)) − 1/(ψ + i − 1)² ],

where the expectation of k is E[k] = Σ_{i=1}^{n} ψ/(ψ + i − 1), as described above.